Processing XML Streams with Deterministic Automata

  • Conference paper
Database Theory — ICDT 2003 (ICDT 2003)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2572)

Abstract

We consider the problem of evaluating a large number of XPath expressions on an XML stream. Our main contribution is to show that Deterministic Finite Automata (DFAs) can be used effectively for this problem: in our experiments we achieve a throughput of about 5.4 MB/s, independent of the number of XPath expressions (up to 1,000,000 in our tests). The major problem we face is the size of the DFA. Since the number of states grows exponentially with the number of XPath expressions, it was previously believed that DFAs cannot be used to process large sets of expressions. We analyze theoretically the number of states in the DFA resulting from XPath expressions, considering both the case when it is constructed eagerly and the case when it is constructed lazily. Our analysis indicates that, when the automaton is constructed lazily and under certain assumptions about the structure of the input XML data, the number of states in the lazy DFA is manageable. We also validate our findings experimentally, on both synthetic and real XML data sets.
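
The lazy construction mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a restricted XPath fragment (absolute linear paths using only the child axis `/`, the descendant axis `//`, name tests, and `*`), and the names `parse_xpath` and `LazyDfa` are hypothetical. Each DFA state is a set of NFA states, and a transition is computed by the subset construction only the first time the stream actually requires it, so the table holds only states reachable on the data seen so far.

```python
from typing import Dict, FrozenSet, List, Tuple

Step = Tuple[bool, str]      # (uses descendant axis '//', name test)
NfaState = Tuple[int, int]   # (query index, number of steps already matched)
DfaState = FrozenSet[NfaState]


def parse_xpath(expr: str) -> List[Step]:
    """Parse an absolute linear path such as '/a//b/*' (assumed to start with '/')."""
    steps: List[Step] = []
    i = 0
    while i < len(expr):
        descendant = expr.startswith("//", i)
        i += 2 if descendant else 1
        j = i
        while j < len(expr) and expr[j] != "/":
            j += 1
        steps.append((descendant, expr[i:j]))
        i = j
    return steps


class LazyDfa:
    """DFA over element labels whose transitions are materialized on demand."""

    def __init__(self, queries: List[str]) -> None:
        self.queries = [parse_xpath(q) for q in queries]
        self.initial: DfaState = frozenset((q, 0) for q in range(len(self.queries)))
        self.table: Dict[Tuple[DfaState, str], DfaState] = {}
        self.stack: List[DfaState] = [self.initial]  # one DFA state per open element

    def _move(self, state: DfaState, label: str) -> DfaState:
        """One subset-construction step over the union of the per-query NFAs."""
        nxt = set()
        for q, k in state:
            steps = self.queries[q]
            if k == len(steps):
                continue  # accepting NFA state: no outgoing transitions
            descendant, name = steps[k]
            if descendant:
                nxt.add((q, k))  # '//' may skip this element and match deeper
            if name == "*" or name == label:
                nxt.add((q, k + 1))
        return frozenset(nxt)

    def start_element(self, label: str) -> List[int]:
        """Consume an opening tag; return indices of queries matching this element."""
        key = (self.stack[-1], label)
        if key not in self.table:  # lazy step: build the transition on first use
            self.table[key] = self._move(self.stack[-1], label)
        nxt = self.table[key]
        self.stack.append(nxt)
        return sorted(q for q, k in nxt if k == len(self.queries[q]))

    def end_element(self) -> None:
        """Consume a closing tag by returning to the parent's DFA state."""
        self.stack.pop()

    def num_states(self) -> int:
        """Number of distinct DFA states materialized so far."""
        return len({self.initial, *self.table.values()})


dfa = LazyDfa(["/a/b", "//b", "/a/*/c"])
for event in ["a", "b", None, "d", "c", None, None, "b", None, None]:
    if event is None:
        dfa.end_element()
    else:
        print(event, dfa.start_element(event))
print("materialized states:", dfa.num_states())
```

Once a transition is cached, processing an opening tag is a single dictionary lookup regardless of how many queries are loaded, which is the intuition behind a throughput that does not depend on the number of expressions. An eager construction would instead enumerate all reachable subsets up front, including those that never occur in the data.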

Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Green, T.J., Miklau, G., Onizuka, M., Suciu, D. (2003). Processing XML Streams with Deterministic Automata. In: Calvanese, D., Lenzerini, M., Motwani, R. (eds) Database Theory — ICDT 2003. ICDT 2003. Lecture Notes in Computer Science, vol 2572. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36285-1_12

  • DOI: https://doi.org/10.1007/3-540-36285-1_12

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-00323-6

  • Online ISBN: 978-3-540-36285-2

  • eBook Packages: Springer Book Archive
