Searchable Compression of Office Documents by XML Schema Subtraction

  • Stefan Böttcher
  • Rita Hartel
  • Christian Messinger
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6309)


Starting with Microsoft Office 2007, the Office Open XML file formats have become the default file formats of Microsoft Office. Because large numbers of office documents are stored and transferred every day, reducing document size benefits both storage and transfer. We present a compressed format for XML-based office documents that omits from an office document the data that is already defined by the Office Open XML format. Our evaluation shows that our compressed format reduces the (already compressed) office documents down to 41% of the original document size. Furthermore, for the search operations tested in our evaluation, searching is faster on our compressed office documents than on the original documents.
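The general idea of leaving out whatever the schema already determines can be illustrated with a small sketch. This is not the authors' algorithm or format. It is a toy example under stated assumptions: the `SCHEMA` dictionary, the element names, and the functions `compress` and `decompress` are all hypothetical, and the schema language is reduced to ordered children with the occurrences "exactly once", "optional" and "zero or more". Element names are never stored, only the decisions the schema leaves open and the text content.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy schema: element name -> ordered list of (child name, occurrence).
# Occurrence is "1" (exactly once), "?" (optional) or "*" (zero or more).
# An empty list marks a leaf element that carries only text.
SCHEMA = {
    "doc": [("title", "1"), ("author", "?"), ("para", "*")],
    "title": [],
    "author": [],
    "para": [],
}


def compress(elem, decisions, texts):
    """Record only what the schema leaves open: occurrence counts and text."""
    model = SCHEMA[elem.tag]
    if not model:
        texts.append(elem.text or "")
        return
    children = list(elem)
    pos = 0
    for name, occ in model:
        start = pos
        while pos < len(children) and children[pos].tag == name:
            pos += 1
        count = pos - start
        if occ == "1":
            if count != 1:
                raise ValueError(f"<{name}> must occur exactly once")
        else:
            if occ == "?" and count > 1:
                raise ValueError(f"<{name}> may occur at most once")
            decisions.append(count)  # the only structural information stored
        for child in children[start:pos]:
            compress(child, decisions, texts)
    if pos != len(children):
        raise ValueError(f"unexpected child in <{elem.tag}>")


def decompress(tag, decisions, texts):
    """Rebuild the element tree from the schema plus the recorded decisions."""
    elem = ET.Element(tag)
    model = SCHEMA[tag]
    if not model:
        elem.text = next(texts)
        return elem
    for name, occ in model:
        count = 1 if occ == "1" else next(decisions)
        for _ in range(count):
            elem.append(decompress(name, decisions, texts))
    return elem


xml = "<doc><title>Report</title><para>One</para><para>Two</para></doc>"
decisions, texts = [], []
compress(ET.fromstring(xml), decisions, texts)
restored = decompress("doc", iter(decisions), iter(texts))
assert ET.tostring(restored, encoding="unicode") == xml
```

In this sketch the whole element structure of the sample document shrinks to the two numbers `[0, 2]` (no author, two paragraphs), and a plain text search only has to scan the `texts` list without rebuilding the tree. A real Office Open XML schema additionally involves choices, attributes, namespaces and nested content models, none of which are covered here.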


XML compression; Microsoft Office document compression; efficient search on compressed Office Open XML documents





Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Stefan Böttcher (1)
  • Rita Hartel (1)
  • Christian Messinger (1)
  1. Computer Science, University of Paderborn, Paderborn, Germany
