A Data Parallel Algorithm for XML DOM Parsing

  • Conference paper
Database and XML Technologies (XSym 2009)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5679)

Abstract

The Extensible Markup Language (XML) has become the de facto standard for information representation and interchange on the Internet. Parsing is the core operation that must be performed on an XML document before it can be accessed and manipulated, and it is known to cause performance bottlenecks in applications and systems that process large volumes of XML data. We believe that parallelism is a natural way to boost performance. Leveraging multicore processors can offer a cost-effective solution, because future multicore processors will support hundreds of cores and will offer a high degree of parallelism in hardware. We propose a data parallel algorithm for XML DOM parsing, called ParDOM, that builds an in-memory tree structure for an XML document. ParDOM has two phases. In the first phase, an XML document is partitioned into chunks and the chunks are parsed in parallel. In the second phase, the partial DOM node tree structures created during the first phase are linked together (in parallel) to build a complete DOM node tree. ParDOM offers fine-grained parallelism by adopting a flexible chunking scheme: each chunk can contain an arbitrary number of start and end XML tags that are not necessarily matched. ParDOM can be conveniently implemented using a data parallel programming model that supports map and sort operations. Through empirical evaluation, we show that ParDOM yields better scalability than PXP [23], a recently proposed parallel DOM parsing algorithm, on commodity multicore processors. Furthermore, ParDOM can process a wide variety of XML datasets with complex structures that PXP fails to parse.
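
The two-phase idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' ParDOM implementation. All names (`split_chunks`, `parse_chunk`, `link`, `Node`) are hypothetical. Only element tags are handled; text, attributes, comments, and CDATA are ignored. The linking step here is a simple sequential pass, not the parallel, sort-based linking the abstract describes. Threads are used only to show the map-style structure; in CPython they do not give real CPU parallelism.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Matches start, end, and self-closing tags; attributes are skipped, not parsed.
TAG = re.compile(r"<(/?)([A-Za-z_][\w.-]*)[^>]*?(/?)>")


class Node:
    def __init__(self, name):
        self.name = name
        self.children = []


def split_chunks(xml, n):
    """Cut xml into at most n chunks, moving each cut forward to the next '<'."""
    cuts = [0]
    for i in range(1, n):
        pos = xml.find("<", max(cuts[-1], i * len(xml) // n))
        if pos > cuts[-1]:
            cuts.append(pos)
    cuts.append(len(xml))
    return [xml[a:b] for a, b in zip(cuts, cuts[1:])]


def parse_chunk(chunk):
    """Phase 1: parse one chunk with no knowledge of the others.

    Returns (closed, levels, open_nodes):
      closed     - names of end tags whose start tag lies in an earlier chunk
      levels     - levels[k] holds the top-level subtrees seen before the k-th
                   unmatched end tag; levels[-1] holds those after the last one
      open_nodes - elements still open at the end of the chunk, outermost first
    """
    closed, levels, stack = [], [[]], []
    for m in TAG.finditer(chunk):
        is_end, name, self_closing = m.group(1), m.group(2), m.group(3)
        if is_end:
            if stack:
                if stack.pop().name != name:
                    raise ValueError(f"mismatched </{name}>")
            else:  # the start tag is in an earlier chunk; defer to phase 2
                closed.append(name)
                levels.append([])
            continue
        node = Node(name)
        (stack[-1].children if stack else levels[-1]).append(node)
        if not self_closing:
            stack.append(node)
    return closed, levels, stack


def link(partials):
    """Phase 2 (sequential simplification): stitch partial trees in document order."""
    root = Node("#document")
    open_path = [root]
    for closed, levels, open_nodes in partials:
        for name, nodes in zip(closed, levels):
            if open_path[-1].name != name:
                raise ValueError(f"mismatched </{name}>")
            open_path.pop().children.extend(nodes)
        open_path[-1].children.extend(levels[-1])
        open_path.extend(open_nodes)
    if len(open_path) != 1:
        raise ValueError("unclosed element")
    return root


def parse(xml, n_chunks=4):
    chunks = split_chunks(xml, n_chunks)
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(parse_chunk, chunks))  # the "map" step
    return link(partials)


def shape(node):
    """Nested (name, children) tuples, convenient for comparing trees."""
    return (node.name, [shape(c) for c in node.children])
```

For example, `split_chunks("<a><b><c/></b><d><e></e></d></a>", 3)` cuts the document into `<a><b><c/>`, `</b><d><e></e>`, and `</d></a>`. None of these chunks has balanced tags, yet `parse` returns the same tree for any chunk count, because unmatched start and end tags are resolved only in the linking pass.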

References

  1. Intel XML Software Suite Performance Paper, http://intel.com/software/xmlsoftwaresuite

  2. Microsoft XML Core Services (MSXML), http://msdn.microsoft.com/en-us/xml/

  3. Xerces-C++ XML Parser, http://xerces.apache.org/xerces-c/

  4. Asanovic, K., Bodik, R., Catanzaro, B.C., Gebis, J.J., Husbands, P., Keutzer, K., Patterson, D.A., Plishker, W.L., Shalf, J., Williams, S.W., Yelick, K.A.: The landscape of parallel computing research: A view from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley (December 2006)

  5. Berglund, A., Boag, S., Chamberlin, D., Fernandez, M.F., Kay, M., Robie, J., Simon, J.: XML path language (XPath) 2.0 W3C working draft 16. Technical Report WD-xpath20-20020816, World Wide Web Consortium (August 2002)

  6. Cable, L., Chow, T.: JSR 173: Streaming API for XML (2007), http://jcp.org/en/jsr/detail?id=173

  7. Cameron, R.D., Herdy, K.S., Lin, D.: High performance XML parsing using parallel bit stream technology. In: CASCON 2008: Proc. of the 2008 conference of the center for advanced studies on collaborative research, New York, pp. 222–235 (2008)

  8. Chakravarty, M.M.T., Leshchinskiy, R., Jones, S.P., Keller, G., Marlow, S.: Data Parallel Haskell: a status report. In: Proc. of the 2007 Workshop on Declarative Aspects of Multicore Programming, Nice, France, January 2007, pp. 10–18 (2007)

  9. Dean, J., Ghemawat, S.: MapReduce: Simplified Data Processing on Large Clusters. In: Proc. of the OSDI 2004, San Francisco, CA (December 2004)

  10. Engelen, R.A.V.: A framework for service-oriented computing with C and C++ Web service components. ACM Transactions on Internet Technology 8(3), 1–25 (2008)

  11. Gao, Z., Pan, Y., Zhang, Y., Chiu, K.: A high performance schema-specific XML parser. In: IEEE Intl. Conf. on e-Science and Grid Computing, December 2007, pp. 245–252 (2007)

  12. Ghuloum, A., Smith, T., Wu, G., Zhou, X., Fang, J., Guo, P., So, B., Rajagopalan, M., Chen, Y., Chen, B.: Future-proof data parallel algorithms and software on Intel multi-core architecture. Intel Technology Journal 11(4), 333–348 (2007)

  13. Ghuloum, A., Sprangle, E., Fang, J., Wu, G., Zhou, X.: Ct: A Flexible Parallel Programming Model for Tera-scale Architectures, 2007. Intel White Paper (2007)

  14. Goldman, O., Lenkov, D.: XML Binary Characterization. Technical report, World Wide Web Consortium (March 2005)

  15. Grohoski, G.: Niagara 2: A highly threaded server-on-a-chip. In: 18th Hot Chips Symposium (August 2006)

  16. Huhns, M., Singh, M.P.: Service-Oriented Computing: Key Concepts and Principles. IEEE Internet Computing 9(1), 75–81 (2005)

  17. Isard, M., Budiu, M., Yu, Y., Birrell, A., Fetterly, D.: Dryad: distributed data-parallel programs from sequential building blocks. In: Proc. of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, pp. 59–72 (2007)

  18. Kay, M.: SAXON: The XSLT and XQuery Processor, http://saxon.sourceforge.net

  19. Kostoulas, M.G., Matsa, M., Mendelsohn, N., Perkins, E., Heifets, A., Mercaldi, M.: XML screamer: an integrated approach to high performance XML parsing, validation and deserialization. In: Proc. of the 15th International Conference on World Wide Web, New York, pp. 93–102 (2006)

  20. Li, Q., Moon, B.: Indexing and querying XML data for regular path expressions. In: Proc. of the 27th VLDB Conference, Rome, Italy, September 2001, pp. 361–370 (2001)

  21. Megginson, D.: Simple API for XML, http://sax.sourceforge.net/

  22. Nicola, M., John, J.: XML parsing: a threat to database performance. In: Proc. of the 12th International Conference on Information and Knowledge Management, pp. 175–178 (2003)

  23. Pan, Y., Lu, W., Zhang, Y., Chiu, K.: A Static Load-Balancing Scheme for Parallel XML Parsing on Multicore CPUs. In: Proc. of the 7th International Symposium on Cluster Computing and the Grid (CCGRID), Washington D.C., May 2007, pp. 351–362 (2007)

  24. Pan, Y., Zhang, Y., Chiu, K.: Simultaneous transducers for data-parallel XML parsing. In: Proc. of Intl. Symposium on Parallel and Distributed Processing, April 2008, pp. 1–12 (2008)

  25. Ranger, C., Raghuraman, R., Penmetsa, A., Bradski, G., Kozyrakis, C.: Evaluating MapReduce for Multi-core and Multiprocessor Systems. In: Proceedings of the 13th International Symposium on High-Performance Computer Architecture (HPCA), Phoenix, AZ (February 2007)

  26. Seiler, L., Carmean, D., Sprangle, E., Forsyth, T., Abrash, M., Dubey, P., Junkins, S., Lake, A., Sugerman, J., Cavin, R., Espasa, R., Grochowski, E., Juan, T., Hanrahan, P.: Larrabee: a many-core x86 architecture for visual computing. ACM Trans. Graph. 27(3), 1–15 (2008)

  27. Tatarinov, I., Viglas, S.D., Beyer, K., Shanmugasundaram, J., Shekita, E., Zhang, C.: Storing and Querying Ordered XML Using a Relational Database System. In: Proc. of the 2002 ACM-SIGMOD Conference, June 2002, pp. 204–215 (2002)

  28. TPC. TPC-H (2002), http://www.tpc.org/tpch/

  29. UW XML Repository (2001), http://www.cs.washington.edu/research/xmldatasets

  30. W3C. The document object model (1998), http://www.w3.org/DOM

  31. Wu, Y., Zhang, Q., Yu, Z., Li, J.: A Hybrid Parallel Processing for XML Parsing and Schema Validation. In: Proceedings of Balisage Markup Conference (2008)

  32. Zhang, J., Lovette, K.: XimpleWare W3C Position Paper. In: W3C Workshop on Binary Interchange of XML Information Item Sets (2003)

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Shah, B., Rao, P.R., Moon, B., Rajagopalan, M. (2009). A Data Parallel Algorithm for XML DOM Parsing. In: Bellahsène, Z., Hunt, E., Rys, M., Unland, R. (eds) Database and XML Technologies. XSym 2009. Lecture Notes in Computer Science, vol 5679. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03555-5_7

  • DOI: https://doi.org/10.1007/978-3-642-03555-5_7

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-03554-8

  • Online ISBN: 978-3-642-03555-5

  • eBook Packages: Computer Science (R0)
