Document Decomposition for XML Compression: A Heuristic Approach

Choi, Byron

doi:10.1007/11733836_16

Byron Choi

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 3882))

Included in the following conference series:

International Conference on Database Systems for Advanced Applications

Abstract

Sharing of common subtrees has been reported to be useful not only for XML compression but also for main-memory XML query processing. This method compresses subtrees only when they exhibit identical structure. Even slight irregularities among subtrees dramatically reduce the performance of compression algorithms of this kind. Furthermore, when XML documents are large, the chance of having a large number of identical subtrees is inherently low. In this paper, we propose a method of decomposing XML documents for better compression. We propose a heuristic method of locating minor irregularities in XML documents. The irregularities are then projected out from the original XML document. We refer to this process as document decomposition. We demonstrate that better compression can be achieved by compressing the decomposed documents separately. Experimental results demonstrate that the compressed skeletons of all real-world datasets known to us fit comfortably into the main memory of today's commodity computers. Preliminary results on querying compressed skeletons validate the effectiveness of our approach.
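The effect described in the abstract can be illustrated with a small sketch. It is not the paper's algorithm: the tuple-based tree model, the helper names (`node`, `dag_size`, `project_out`), and the hand-picked label to project out are all assumptions made for illustration, and the paper's heuristic for locating irregularities is not reproduced here. The sketch counts distinct subtrees (the size of the skeleton after identical subtrees are shared) before and after an irregular element is projected out.

```python
from typing import Dict, List, Tuple

# Hypothetical tree model: a node is (label, (child, child, ...)).
# Text values are ignored; only the element structure (skeleton) is kept.
Tree = Tuple[str, tuple]


def node(label: str, *children: Tree) -> Tree:
    return (label, tuple(children))


def dag_size(root: Tree) -> int:
    """Number of distinct subtrees, i.e. nodes left after sharing identical subtrees."""
    ids: Dict[tuple, int] = {}

    def visit(t: Tree) -> int:
        label, children = t
        # Two subtrees are identical iff label and ordered child ids match.
        key = (label, tuple(visit(c) for c in children))
        if key not in ids:
            ids[key] = len(ids)
        return ids[key]

    visit(root)
    return len(ids)


def project_out(t: Tree, label: str) -> Tuple[Tree, List[Tree]]:
    """Remove every subtree rooted at `label`; return (remaining tree, removed subtrees)."""
    removed: List[Tree] = []

    def strip(n: Tree) -> Tree:
        kept = []
        for c in n[1]:
            if c[0] == label:
                removed.append(c)
            else:
                kept.append(strip(c))
        return (n[0], tuple(kept))

    return strip(t), removed


# Four records that differ only in an optional, irregularly placed "note" child.
doc = node(
    "lib",
    node("book", node("title"), node("author")),
    node("book", node("title"), node("note"), node("author")),
    node("book", node("title"), node("author"), node("note")),
    node("book", node("note"), node("title"), node("author")),
)

skeleton, removed = project_out(doc, "note")
notes_doc = node("notes", *removed)

print(dag_size(doc))        # distinct subtrees in the original document
print(dag_size(skeleton))   # distinct subtrees once "note" is projected out
print(dag_size(notes_doc))  # distinct subtrees in the projected-out part
```

In this toy input the four `book` subtrees are all structurally different, so none of them can be shared; once `note` is projected out they become identical and collapse into one. A real decomposition would also have to record where each removed subtree belonged so the original document can be reconstructed, which this sketch omits.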

Author information

Authors and Affiliations

Nanyang Technological University, Singapore
Byron Choi

Editor information

Editors and Affiliations

Department of Computer Science, National University of Singapore, Singapore
Mong Li Lee
School of Computing, National University of Singapore, Singapore
Kian-Lee Tan
School of Engineering and Technology, Asian Institute of Technology, P.O. Box 4, 12120, Klong Luang, Pathum Thani, Thailand
Vilas Wuwongse

About this paper

Cite this paper

Choi, B. (2006). Document Decomposition for XML Compression: A Heuristic Approach. In: Li Lee, M., Tan, KL., Wuwongse, V. (eds) Database Systems for Advanced Applications. DASFAA 2006. Lecture Notes in Computer Science, vol 3882. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11733836_16

DOI: https://doi.org/10.1007/11733836_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-33337-1
Online ISBN: 978-3-540-33338-8
eBook Packages: Computer Science (R0)
