
Visually Lossless HTML Compression

  • Conference paper
Web Information Systems Engineering - WISE 2009 (WISE 2009)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5802)


Abstract

The verbosity of the Hypertext Markup Language (HTML) remains one of its main weaknesses. This problem can be addressed with compression algorithms specialized for HTML. In this work, we describe a visually lossless HTML transform that, combined with commonly used compression algorithms, attains high compression ratios. Its core is a transform that substitutes words in an HTML document using a static English dictionary and effectively encodes dictionary indexes, numbers, and specific patterns.
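
The word-substitution idea can be sketched as follows. This is a minimal illustration, not the transform described in the paper: the ten-word dictionary, the one-byte codeword scheme (bytes 0x80 and above), and the names DICTIONARY, encode, and decode are assumptions made for the example, and the input is assumed to be pure ASCII so that codeword bytes cannot collide with document bytes.

```python
import re

# Hypothetical toy dictionary; a real static English dictionary would be far larger.
DICTIONARY = ["the", "of", "and", "to", "in", "is", "that", "for", "with", "compression"]
INDEX = {word: i for i, word in enumerate(DICTIONARY)}


def encode(text: str) -> bytes:
    """Replace dictionary words with one-byte codewords (0x80 + index)."""
    data = text.encode("ascii")  # raises UnicodeEncodeError on non-ASCII input
    out = bytearray()
    # Split into alternating runs of letters and non-letters; together they cover all bytes.
    for token in re.findall(rb"[A-Za-z]+|[^A-Za-z]+", data):
        idx = INDEX.get(token.decode("ascii"))
        if idx is not None:
            out.append(0x80 + idx)  # codeword for a dictionary word
        else:
            out += token  # tags, punctuation, and unknown words stay literal
    return bytes(out)


def decode(data: bytes) -> str:
    """Invert encode(): expand codewords back into dictionary words."""
    out = bytearray()
    for b in data:
        if b >= 0x80:
            out += DICTIONARY[b - 0x80].encode("ascii")
        else:
            out.append(b)
    return out.decode("ascii")


sample = "<p>The verbosity of the markup is a target for compression</p>"
encoded = encode(sample)
restored = decode(encoded)
```

Here the matching is case-sensitive, so "The" is left as literal text while "the" becomes a single byte. The encoded bytes would then be handed to a general-purpose compressor such as gzip.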

Visually lossless compression means that the layout of the HTML document is modified, but the document displayed in a browser is rendered identically to the original. The experimental results show that the proposed transform improves the HTML compression efficiency of general-purpose compressors, on average by 21% in the case of gzip, while achieving comparable processing speed. Moreover, we show that the compression ratio of gzip can be improved by up to 32% at the price of higher memory requirements and much slower processing.
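
One plausible way to arrive at an improvement figure like those quoted above is to compare gzip output sizes with and without a preprocessing step. The sketch below assumes that definition (relative reduction in compressed size); the paper may define compression efficiency differently, and the function names are hypothetical.

```python
import gzip


def relative_improvement(baseline_size: int, new_size: int) -> float:
    """Percentage by which new_size is smaller than baseline_size."""
    if baseline_size <= 0:
        raise ValueError("baseline_size must be positive")
    return 100.0 * (baseline_size - new_size) / baseline_size


def gzip_improvement(original: bytes, transformed: bytes) -> float:
    """Compare the gzip size of a raw document with that of its preprocessed form."""
    return relative_improvement(
        len(gzip.compress(original)),
        len(gzip.compress(transformed)),
    )
```

For example, relative_improvement(100, 79) evaluates to 21.0: a 100-byte baseline that shrinks to 79 bytes is a 21% improvement under this definition.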




Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Skibiński, P. (2009). Visually Lossless HTML Compression. In: Vossen, G., Long, D.D.E., Yu, J.X. (eds) Web Information Systems Engineering - WISE 2009. WISE 2009. Lecture Notes in Computer Science, vol 5802. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04409-0_23


  • DOI: https://doi.org/10.1007/978-3-642-04409-0_23

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-04408-3

  • Online ISBN: 978-3-642-04409-0

  • eBook Packages: Computer Science, Computer Science (R0)
