In Search of the Lost Schema

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1540)

Abstract

We study the problem of rediscovering the schema of nested relations that have been encoded as strings for storage purposes. We consider various classes of encoding functions, in particular mark-up encodings, which allow the schema to be found without knowledge of the encoding function, under reasonable assumptions on the input data. Depending on how empty sets are encoded, we propose two polynomial on-line algorithms (with different buffer sizes) that solve the schema finding problem. We also prove that, with high probability, both algorithms find the schema after examining a fixed number of tuples, which in practice leads to linear-time behavior with respect to the database size for wrapping the data. Finally, we show that the proposed techniques are well suited to practical applications, such as structuring and wrapping HTML pages and Web sites.
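
To make the setting concrete, here is a minimal Python sketch of a toy mark-up encoding and of schema recovery from the encoded string. It is not one of the two algorithms of the paper: the delimiters `<s>`, `<t>`, `<a>` and the functions `encode`, `merge` and `find_schema` are hypothetical names chosen for illustration, and the sketch assumes the delimiters are known, whereas the paper addresses the harder case where the encoding function is unknown. It also assumes well-formed input and atoms that contain no `<` character.

```python
import re

# Hypothetical toy model: a nested relation is a list of tuples whose
# fields are either atoms (str) or nested relations (list of tuples).

def encode(rel):
    """Encode a nested relation as a string with paired delimiters."""
    out = ["<s>"]
    for row in rel:
        out.append("<t>")
        for field in row:
            out.append("<a>")
            out.append(field if isinstance(field, str) else encode(field))
            out.append("</a>")
        out.append("</t>")
    out.append("</s>")
    return "".join(out)

TOKEN = re.compile(r"</?[sta]>|[^<]+")

def merge(a, b):
    """Combine two partial schemas.

    "D" is an atomic attribute, a list is the attribute list of a set,
    and None is a set that has only been seen empty so far.
    """
    if a == "D" or b == "D":
        if a != b:
            raise ValueError("atom and set in the same attribute")
        return "D"
    if a is None:
        return b
    if b is None:
        return a
    if len(a) != len(b):
        raise ValueError("tuples of different arity in the same set")
    return [merge(x, y) for x, y in zip(a, b)]

def find_schema(text):
    """Recover the nesting schema from a string produced by encode()."""
    tokens = TOKEN.findall(text)
    pos = 0

    def expect(tok):
        nonlocal pos
        if tokens[pos] != tok:
            raise ValueError(f"expected {tok!r}, found {tokens[pos]!r}")
        pos += 1

    def parse_set():
        nonlocal pos
        expect("<s>")
        schema = None  # unknown until a non-empty instance is seen
        while tokens[pos] == "<t>":
            pos += 1
            attrs = []
            while tokens[pos] == "<a>":
                pos += 1
                if tokens[pos] == "<s>":
                    attrs.append(parse_set())
                else:
                    attrs.append("D")
                    if tokens[pos] != "</a>":  # skip the atom text, if any
                        pos += 1
                expect("</a>")
            expect("</t>")
            schema = merge(schema, attrs)
        expect("</s>")
        return schema

    return parse_set()
```

For example, `find_schema(encode([("Jones", [])]))` leaves the second attribute undetermined (`None`) because its only instance is an empty set, while adding the tuple `("Smith", [("db",), ("ai",)])` resolves it to a set of unary tuples. This illustrates, in a simplified form, why the treatment of empty sets matters and why examining a few more tuples can settle the schema.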

Work supported by Università di Roma Tre, MURST and Consiglio Nazionale delle Ricerche.

Copyright information

© 1999 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Grumbach, S., Mecca, G. (1999). In Search of the Lost Schema. In: Beeri, C., Buneman, P. (eds) Database Theory — ICDT’99. ICDT 1999. Lecture Notes in Computer Science, vol 1540. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-49257-7_20

  • DOI: https://doi.org/10.1007/3-540-49257-7_20

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-65452-0

  • Online ISBN: 978-3-540-49257-3

  • eBook Packages: Springer Book Archive
