Permutation Based XML Compression

  • Conference paper

Part of the book series: Lecture Notes in Business Information Processing ((LNBIP,volume 226))

Abstract

An XML document D often has a regular structure, i.e., it is composed of many similarly named and structured subtrees. Therefore, the entropy of a tree's structuredness should be relatively low, and such trees should be highly compressible once they are transformed to an intermediate form. This idea underlies permutation based XML-conscious compressors. One such compressor is XSAQCT, whose compressible form is called an annotated tree. While XSAQCT has proved useful for various applications, it had never been shown to be a lossless compressor. This paper provides the formal background for the definition of an annotated tree and a formal proof that the compression is lossless. It also shows properties of annotated trees that are useful for various applications and discusses a measure of compressibility under this approach, followed by experimental results showing the compressibility of annotated trees.
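To make the idea of merging similarly named paths concrete, the following is a minimal sketch and not the XSAQCT algorithm itself. It assumes a simplified reading of an annotated tree: every distinct root-to-node label path becomes one node, annotated with the number of matching children for each occurrence of its parent, in document order. The function name `annotated_tree` and the string path notation are hypothetical. Text, attributes, and the sibling-ordering cases that a lossless scheme must handle are ignored here.

```python
import xml.etree.ElementTree as ET


def annotated_tree(xml_text):
    """Map each distinct label path to a list of child counts.

    For a path P/c, the i-th entry is the number of children labeled c
    under the i-th occurrence (in document order) of path P.
    """
    root = ET.fromstring(xml_text)
    root_path = "/" + root.tag

    # Pass 1: for every path, collect its child labels in first-seen order.
    child_labels = {}

    def collect(elem, path):
        labels = child_labels.setdefault(path, [])
        for child in elem:
            if child.tag not in labels:
                labels.append(child.tag)
            collect(child, path + "/" + child.tag)

    collect(root, root_path)

    # Pass 2: for every occurrence of a path, record one count per child
    # label, including zeros for labels absent under this occurrence.
    annotations = {root_path: [1]}

    def count(elem, path):
        for label in child_labels[path]:
            n = sum(1 for child in elem if child.tag == label)
            annotations.setdefault(path + "/" + label, []).append(n)
        for child in elem:
            count(child, path + "/" + child.tag)

    count(root, root_path)
    return annotations


doc = (
    "<lib>"
    "<book><title>A</title><author>x</author><author>y</author></book>"
    "<book><title>B</title></book>"
    "</lib>"
)
result = annotated_tree(doc)
for path, counts in result.items():
    print(path, counts)
```

In this toy document the two `book` subtrees collapse into a single `/lib/book` node, and the differences between them survive only as short integer lists, which is the kind of regularity a back-end compressor can exploit.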

Notes

  1. gzip -9 FILE

  2. xz -9 -e FILE

  3. bzip2 -9 FILE

  4. ppmonstr -m1700 -o64 FILE

  5. zpaq add FILE.zpaq FILE -method 69 -noattributes

  6. paq8pxd_v7 -8 FILE
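For readers who want to try comparable settings without the command-line tools, the sketch below uses Python's standard library to approximate notes 1 to 3 (gzip, xz, bzip2 at their highest levels). It is an assumption that these library settings correspond closely to the listed commands; output sizes will not match the CLI tools byte for byte. The sample input and the helper name `compress_all` are illustrative only, and ppmonstr, zpaq, and paq8pxd have no standard-library counterpart.

```python
import bz2
import gzip
import lzma


def compress_all(data):
    """Compress data with stdlib settings that approximate notes 1 to 3."""
    return {
        "gzip -9": gzip.compress(data, compresslevel=9),
        "xz -9 -e": lzma.compress(data, preset=9 | lzma.PRESET_EXTREME),
        "bzip2 -9": bz2.compress(data, compresslevel=9),
    }


# Matching decompressors, used to check that each round trip is lossless.
DECOMPRESS = {
    "gzip -9": gzip.decompress,
    "xz -9 -e": lzma.decompress,
    "bzip2 -9": bz2.decompress,
}

# Synthetic, highly regular XML (illustrative input, not a benchmark file).
sample = (
    b"<lib>"
    + b"<book><title>T</title><author>A</author></book>" * 500
    + b"</lib>"
)

results = compress_all(sample)
for name, blob in results.items():
    assert DECOMPRESS[name](blob) == sample
    print(f"{name:10s} {len(sample)} -> {len(blob)} bytes")
```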

Acknowledgements

The work of the first and third authors is partially supported by an NSERC RGPIN grant and an NSERC CGS-M (Canada Graduate Scholarship-Master's) grant, respectively.

Author information

Corresponding author

Correspondence to Tomasz Müldner.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Müldner, T., Miziołek, J.K., Corbin, T. (2015). Permutation Based XML Compression. In: Monfort, V., Krempels, KH. (eds) Web Information Systems and Technologies. WEBIST 2014. Lecture Notes in Business Information Processing, vol 226. Springer, Cham. https://doi.org/10.1007/978-3-319-27030-2_5

  • DOI: https://doi.org/10.1007/978-3-319-27030-2_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-27029-6

  • Online ISBN: 978-3-319-27030-2

  • eBook Packages: Computer Science (R0)