Abstract
We consider the problem of evaluating a large number of XPath expressions on an XML stream. Our main contribution is to show that Deterministic Finite Automata (DFA) can be used effectively for this problem: in our experiments we achieve a throughput of about 5.4 MB/s, independent of the number of XPath expressions (up to 1,000,000 in our tests). The main obstacle is the size of the DFA. Since the number of states grows exponentially with the number of XPath expressions, it was previously believed that DFAs cannot be used to process large sets of expressions. We give a theoretical analysis of the number of states in the DFA resulting from XPath expressions, and consider both the case when it is constructed eagerly and the case when it is constructed lazily. Our analysis indicates that, when the automaton is constructed lazily, and under certain assumptions about the structure of the input XML data, the number of states in the lazy DFA is manageable. We also validate our findings experimentally on both synthetic and real XML data sets.
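The lazy construction mentioned in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes a restricted XPath fragment (linear paths built from `/`, `//`, name tests and `*`, with no predicates), and the names `parse_xpath`, `LazyDFA` and `run` are hypothetical. Each DFA state is a set of NFA states, and a transition is computed only the first time the stream actually requires it.

```python
import re
from typing import Dict, FrozenSet, List, Tuple

Step = Tuple[bool, str]        # (uses the descendant axis "//", name test or "*")
NfaState = Tuple[int, int]     # (query index, number of steps matched so far)
DfaState = FrozenSet[NfaState]


def parse_xpath(expr: str) -> List[Step]:
    """Parse a linear path such as /a//b/* into a list of steps (assumed fragment)."""
    return [(sep == "//", name) for sep, name in re.findall(r"(//|/)([\w*]+)", expr)]


class LazyDFA:
    """DFA over element tags whose transitions are built on demand."""

    def __init__(self, queries: List[str]) -> None:
        self.queries = [parse_xpath(q) for q in queries]
        self.start: DfaState = frozenset((q, 0) for q in range(len(self.queries)))
        # Memoized transition table: only entries reached by the input exist.
        self.table: Dict[Tuple[DfaState, str], DfaState] = {}

    def step(self, state: DfaState, tag: str) -> DfaState:
        key = (state, tag)
        if key not in self.table:
            nxt = set()
            for q, i in state:
                steps = self.queries[q]
                if i == len(steps):
                    continue                  # accepting NFA state: no outgoing edges
                descendant, name = steps[i]
                if descendant:
                    nxt.add((q, i))           # "//" may skip any number of elements
                if name == "*" or name == tag:
                    nxt.add((q, i + 1))       # this step matches the current tag
            self.table[key] = frozenset(nxt)
        return self.table[key]

    def matches(self, state: DfaState) -> List[int]:
        """Indices of the queries that select the element just opened."""
        return sorted(q for q, i in state if i == len(self.queries[q]))


def run(dfa: LazyDFA, events: List[Tuple[str, str]]) -> List[Tuple[str, List[int]]]:
    """Process SAX-style ("start" | "end", tag) events with a stack of DFA states."""
    stack = [dfa.start]
    out = []
    for kind, tag in events:
        if kind == "start":
            stack.append(dfa.step(stack[-1], tag))
            out.append((tag, dfa.matches(stack[-1])))
        else:
            stack.pop()
    return out
```

In this sketch the per-element cost after warm-up is one dictionary lookup regardless of how many queries are loaded, and the table only ever holds transitions that the data has exercised, which is the intuition behind the lazy DFA staying small on data with regular structure.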
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Green, T.J., Miklau, G., Onizuka, M., Suciu, D. (2003). Processing XML Streams with Deterministic Automata. In: Calvanese, D., Lenzerini, M., Motwani, R. (eds) Database Theory — ICDT 2003. ICDT 2003. Lecture Notes in Computer Science, vol 2572. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36285-1_12
DOI: https://doi.org/10.1007/3-540-36285-1_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-00323-6
Online ISBN: 978-3-540-36285-2
eBook Packages: Springer Book Archive