Abstract
The Extensible Markup Language (XML) has become the de facto standard for information representation and interchange on the Internet. XML parsing is a core operation that must be performed on an XML document before it can be accessed and manipulated. This operation is known to cause performance bottlenecks in applications and systems that process large volumes of XML data. We believe that parallelism is a natural way to boost performance. Leveraging multicore processors can offer a cost-effective solution, because future multicore processors will support hundreds of cores and will offer a high degree of parallelism in hardware. We propose a data parallel algorithm called ParDOM for XML DOM parsing, which builds an in-memory tree structure for an XML document. ParDOM has two phases. In the first phase, an XML document is partitioned into chunks that are parsed in parallel. In the second phase, the partial DOM node tree structures created during the first phase are linked together (in parallel) to build a complete DOM node tree. ParDOM offers fine-grained parallelism by adopting a flexible chunking scheme: each chunk can contain an arbitrary number of start and end XML tags that are not necessarily matched. ParDOM can be conveniently implemented using a data parallel programming model that supports map and sort operations. Through empirical evaluation, we show that ParDOM yields better scalability than PXP [23], a recently proposed parallel DOM parsing algorithm, on commodity multicore processors. Furthermore, ParDOM can process a wide variety of XML datasets with complex structures that PXP fails to parse.
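The two-phase idea described in the abstract can be illustrated with a minimal Python sketch. This is not the ParDOM implementation: the function names (`split_chunks`, `parse_chunk`, `link`, `parse`), the dictionary-based node representation, and the regular expression are all assumptions made for illustration. The sketch handles elements only (no text nodes, comments, CDATA, or processing instructions), assumes well-formed input, places chunk boundaries at a `<` character so that no tag is split, and links the partial trees sequentially, whereas the abstract states that ParDOM performs the linking in parallel. The thread pool stands in for a data parallel map; in CPython it would not be expected to give a real speedup.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical, simplified tag pattern: groups are (closing slash, name, self-closing slash).
TAG = re.compile(r"<(/?)([A-Za-z_][\w.:-]*)[^<>]*?(/?)>")


def split_chunks(xml, n):
    """Cut xml into roughly n chunks, moving each cut forward to the next '<'."""
    step = max(1, len(xml) // n)
    cuts, pos = [0], step
    while pos < len(xml):
        nxt = xml.find("<", pos)
        if nxt == -1:
            break
        cuts.append(nxt)
        pos = nxt + step
    cuts.append(len(xml))
    return [xml[a:b] for a, b in zip(cuts, cuts[1:])]


def parse_chunk(chunk):
    """Phase 1: parse one chunk on its own, tolerating unmatched tags.

    Returns (tops, still_open): tops lists, in document order, the chunk's
    top-level partial trees and its unmatched end tags; still_open is the
    path of elements opened in this chunk but not closed in it.
    """
    tops, stack = [], []
    for m in TAG.finditer(chunk):
        closing, name, self_closing = m.groups()
        if closing:
            if stack:
                stack.pop()                 # matched inside this chunk
            else:
                tops.append(("end", name))  # closes an element from an earlier chunk
            continue
        node = {"tag": name, "children": []}
        if stack:
            stack[-1]["children"].append(node)
        else:
            tops.append(("node", node))
        if not self_closing:
            stack.append(node)
    return tops, stack


def link(partials):
    """Phase 2 (sequential in this sketch): stitch partial trees into one tree."""
    root, open_nodes = None, []
    for tops, still_open in partials:
        for kind, item in tops:
            if kind == "end":
                open_nodes.pop()
            elif open_nodes:
                open_nodes[-1]["children"].append(item)
            else:
                root = item
        open_nodes.extend(still_open)
    return root


def parse(xml, n_chunks=4):
    """Map parse_chunk over the chunks, then link the results in order."""
    chunks = split_chunks(xml, n_chunks)
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(parse_chunk, chunks))
    return link(partials)
```

For a well-formed, elements-only document such as `<a><b><c/></b><d><e></e><f/></d></a>`, `parse(doc, n)` should return the same nested dictionary for any chunk count `n`, even though individual chunks such as `</b><d><e>` contain unmatched start and end tags.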
References
Intel XML Software Suite Performance Paper, http://intel.com/software/xmlsoftwaresuite
Microsoft XML Core Services (MSXML), http://msdn.microsoft.com/en-us/xml/
Xerces-C++ XML Parser, http://xerces.apache.org/xerces-c/
Asanovic, K., Bodik, R., Catanzaro, B.C., Gebis, J.J., Husbands, P., Keutzer, K., Patterson, D.A., Plishker, W.L., Shalf, J., Williams, S.W., Yelick, K.A.: The landscape of parallel computing research: A view from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley (December 2006)
Berglund, A., Boag, S., Chamberlin, D., Fernandez, M.F., Kay, M., Robie, J., Simon, J.: XML path language (XPath) 2.0 W3C working draft 16. Technical Report WD-xpath20-20020816, World Wide Web Consortium (August 2002)
Cable, L., Chow, T.: JSR 173: Streaming API for XML (2007), http://jcp.org/en/jsr/detail?id=173
Cameron, R.D., Herdy, K.S., Lin, D.: High performance XML parsing using parallel bit stream technology. In: CASCON 2008: Proc. of the 2008 conference of the center for advanced studies on collaborative research, New York, pp. 222–235 (2008)
Chakravarty, M.M.T., Leshchinskiy, R., Jones, S.P., Keller, G., Marlow, S.: Data Parallel Haskell: a status report. In: Proc. of the 2007 Workshop on Declarative Aspects of Multicore Programming, Nice, France, January 2007, pp. 10–18 (2007)
Dean, J., Ghemawat, S.: MapReduce: Simplified Data Processing on Large Clusters. In: Proc. of the OSDI 2004, San Francisco, CA (December 2004)
Engelen, R.A.V.: A framework for service-oriented computing with C and C++ Web service components. ACM Transactions on Internet Technology 8(3), 1–25 (2008)
Gao, Z., Pan, Y., Zhang, Y., Chiu, K.: A high performance schema-specific XML parser. In: IEEE Intl. Conf. on e-Science and Grid Computing, December 2007, pp. 245–252 (2007)
Ghuloum, A., Smith, T., Wu, G., Zhou, X., Fang, J., Guo, P., So, B., Rajagopalan, M., Chen, Y., Chen, B.: Future-proof data parallel algorithms and software on Intel multi-core architecture. Intel Technology Journal 11(4), 333–348 (2007)
Ghuloum, A., Sprangle, E., Fang, J., Wu, G., Zhou, X.: Ct: A Flexible Parallel Programming Model for Tera-scale Architectures. Intel White Paper (2007)
Goldman, O., Lenkov, D.: XML Binary Characterization. Technical report, World Wide Web Consortium (March 2005)
Grohoski, G.: Niagara 2: A highly threaded server-on-a-chip. In: 18th Hot Chips Symposium (August 2006)
Huhns, M., Singh, M.P.: Service-Oriented Computing: Key Concepts and Principles. IEEE Internet Computing 9(1), 75–81 (2005)
Isard, M., Budiu, M., Yu, Y., Birrell, A., Fetterly, D.: Dryad: distributed data-parallel programs from sequential building blocks. In: Proc. of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, pp. 59–72 (2007)
Kay, M.: SAXON: The XSLT and XQuery Processor, http://saxon.sourceforge.net
Kostoulas, M.G., Matsa, M., Mendelsohn, N., Perkins, E., Heifets, A., Mercaldi, M.: XML screamer: an integrated approach to high performance XML parsing, validation and deserialization. In: Proc. of the 15th International Conference on World Wide Web, New York, pp. 93–102 (2006)
Li, Q., Moon, B.: Indexing and querying XML data for regular path expressions. In: Proc. of the 27th VLDB Conference, Rome, Italy, September 2001, pp. 361–370 (2001)
Megginson, D.: Simple API for XML, http://sax.sourceforge.net/
Nicola, M., John, J.: XML parsing: a threat to database performance. In: Proc. of the 12th International Conference on Information and Knowledge Management, pp. 175–178 (2003)
Pan, Y., Lu, W., Zhang, Y., Chiu, K.: A Static Load-Balancing Scheme for Parallel XML Parsing on Multicore CPUs. In: Proc. of the 7th International Symposium on Cluster Computing and the Grid (CCGRID), Washington D.C., May 2007, pp. 351–362 (2007)
Pan, Y., Zhang, Y., Chiu, K.: Simultaneous transducers for data-parallel XML parsing. In: Proc. of Intl. Symposium on Parallel and Distributed Processing, April 2008, pp. 1–12 (2008)
Ranger, C., Raghuraman, R., Penmetsa, A., Bradski, G., Kozyrakis, C.: Evaluating MapReduce for Multi-core and Multiprocessor Systems. In: Proceedings of the 13th International Symposium on High-Performance Computer Architecture (HPCA), Phoenix, AZ (February 2007)
Seiler, L., Carmean, D., Sprangle, E., Forsyth, T., Abrash, M., Dubey, P., Junkins, S., Lake, A., Sugerman, J., Cavin, R., Espasa, R., Grochowski, E., Juan, T., Hanrahan, P.: Larrabee: a many-core x86 architecture for visual computing. ACM Trans. Graph. 27(3), 1–15 (2008)
Tatarinov, I., Viglas, S.D., Beyer, K., Shanmugasundaram, J., Shekita, E., Zhang, C.: Storing and Querying Ordered XML Using a Relational Database System. In: Proc. of the 2002 ACM-SIGMOD Conference, June 2002, pp. 204–215 (2002)
TPC. TPC-H (2002), http://www.tpc.org/tpch/
UW XML Repository (2001), http://www.cs.washington.edu/research/xmldatasets
W3C. The document object model (1998), http://www.w3.org/DOM
Wu, Y., Zhang, Q., Yu, Z., Li, J.: A Hybrid Parallel Processing for XML Parsing and Schema Validation. In: Proceedings of Balisage Markup Conference (2008)
Zhang, J., Lovette, K.: XimpleWare W3C Position Paper. In: W3C Workshop on Binary Interchange of XML Information Item Sets (2003)
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Shah, B., Rao, P.R., Moon, B., Rajagopalan, M. (2009). A Data Parallel Algorithm for XML DOM Parsing. In: Bellahsène, Z., Hunt, E., Rys, M., Unland, R. (eds) Database and XML Technologies. XSym 2009. Lecture Notes in Computer Science, vol 5679. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03555-5_7
DOI: https://doi.org/10.1007/978-3-642-03555-5_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-03554-8
Online ISBN: 978-3-642-03555-5
eBook Packages: Computer Science, Computer Science (R0)