
Document Decomposition for XML Compression: A Heuristic Approach

  • Conference paper
Database Systems for Advanced Applications (DASFAA 2006)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3882)

Abstract

Sharing of common subtrees has been reported to be useful not only for XML compression but also for main-memory XML query processing. This method compresses subtrees only when they exhibit identical structure. Even slight irregularities among subtrees dramatically reduce the performance of compression algorithms of this kind. Furthermore, when XML documents are large, the chance of having a large number of identical subtrees is inherently low. In this paper, we propose a method of decomposing XML documents for better compression. We propose a heuristic method for locating minor irregularities in XML documents. The irregularities are then projected out of the original XML document. We refer to this process as document decomposition. We demonstrate that better compression can be achieved by compressing the decomposed documents separately. Experimental results demonstrate that the compressed skeletons of all real-world datasets known to us fit comfortably into the main memory of today's commodity computers. Preliminary results on querying compressed skeletons validate the effectiveness of our approach.
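The effect described in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not specify the heuristic, so the irregular tag (`note`) is picked by hand here, and the function names `compress` and `project_out` and the sample document are assumptions made for illustration. The sketch shares structurally identical subtrees of the tag skeleton by hashing their signatures, then shows how projecting out a single irregular element lets more subtrees be shared.

```python
import copy
import xml.etree.ElementTree as ET


def compress(root):
    """Share identical subtrees of the tag skeleton (text is ignored).

    Returns (root_id, table), where table maps a signature
    (tag, tuple of child ids) to a node id. Structurally identical
    subtrees get the same id, so len(table) is the size of the
    compressed skeleton.
    """
    table = {}

    def visit(elem):
        sig = (elem.tag, tuple(visit(child) for child in elem))
        if sig not in table:
            table[sig] = len(table)
        return table[sig]

    return visit(root), table


def project_out(root, tag):
    """Return (remainder, removed): a copy of the tree without elements
    named `tag`, and the list of removed subtrees.

    Hypothetical helper: the tag to remove is supplied by the caller,
    not located by a heuristic.
    """
    remainder = copy.deepcopy(root)
    removed = []
    # Materialize the traversal first so removals do not disturb it.
    for parent in list(remainder.iter()):
        for child in list(parent):
            if child.tag == tag:
                parent.remove(child)
                removed.append(child)
    return remainder, removed


# Three <book> subtrees; the second has one minor irregularity (<note/>).
doc = ET.fromstring(
    "<lib>"
    "<book><title/><author/></book>"
    "<book><title/><author/><note/></book>"
    "<book><title/><author/></book>"
    "</lib>"
)

_, before = compress(doc)
remainder, removed = project_out(doc, "note")
_, after = compress(remainder)

print("tree nodes:", len(list(doc.iter())))           # 11
print("shared skeleton before:", len(before))         # 6
print("shared skeleton after projection:", len(after))  # 4
print("projected-out subtrees:", len(removed))        # 1
```

In this toy document the single `<note/>` element forces a second, distinct `book` node in the shared skeleton. Once it is projected out, all three `book` subtrees collapse into one, and the removed part can be stored and compressed separately.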

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Choi, B. (2006). Document Decomposition for XML Compression: A Heuristic Approach. In: Li Lee, M., Tan, KL., Wuwongse, V. (eds) Database Systems for Advanced Applications. DASFAA 2006. Lecture Notes in Computer Science, vol 3882. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11733836_16

  • DOI: https://doi.org/10.1007/11733836_16

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-33337-1

  • Online ISBN: 978-3-540-33338-8

  • eBook Packages: Computer Science, Computer Science (R0)
