Knowledgeable Chunking

  • Bertil Chapuis
  • Benoît Garbinato
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9466)


Chunking algorithms are often used by storage solutions to factorize and deduplicate data. Such algorithms assume that consecutive versions of a file share many similarities. Unfortunately, file formats often use compression algorithms, and minor changes can completely reorganize the internal layout of a file. As a consequence, chunking algorithms become less efficient at factorizing data. In this paper, we evaluate content-defined chunking on file formats that use data compression. We show how content-defined chunking algorithms can take the file format into account. Finally, we demonstrate that adding file format knowledge to a popular chunking algorithm significantly improves its performance.
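
As a rough illustration of what content-defined chunking means in general, the sketch below cuts a byte string wherever a rolling hash over the last few bytes matches a bit pattern, so boundaries follow the content rather than fixed offsets. It is not the algorithm evaluated in the paper: the polynomial rolling hash, the names `chunk_boundaries` and `chunk_digests`, and every constant (`WINDOW`, `MASK`, `BASE`, `MOD`, the size limits) are assumptions chosen only to keep the example small.

```python
import hashlib

# All constants below are illustrative assumptions, not values from the paper.
WINDOW = 16            # number of trailing bytes covered by the rolling hash
MASK = (1 << 6) - 1    # cut when the low 6 bits of the hash are zero
BASE = 257             # polynomial base of the rolling hash
MOD = (1 << 31) - 1    # modulus of the rolling hash


def chunk_boundaries(data: bytes, min_size: int = 16, max_size: int = 256):
    """Yield (start, end) offsets of content-defined chunks of `data`."""
    # Weight of the byte that leaves the window on each slide.
    top = pow(BASE, WINDOW - 1, MOD)
    start = 0
    h = 0
    for i, b in enumerate(data):
        pos = i - start  # index of this byte within the current chunk
        if pos >= WINDOW:
            # Drop the oldest byte so the hash covers only the last WINDOW bytes.
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + b) % MOD
        size = pos + 1
        # Cut on a hash match (once the chunk is large enough) or at max_size.
        if (size >= min_size and (h & MASK) == 0) or size >= max_size:
            yield start, i + 1
            start = i + 1
            h = 0
    if start < len(data):
        # Whatever remains after the last cut forms the final chunk.
        yield start, len(data)


def chunk_digests(data: bytes):
    """Return the SHA-256 hex digest of each chunk, in order."""
    return [hashlib.sha256(data[s:e]).hexdigest()
            for s, e in chunk_boundaries(data)]
```

A storage system built on this idea would store each distinct digest once. Comparing `set(chunk_digests(v1))` with `set(chunk_digests(v2))` for two versions of an uncompressed file typically shows many shared chunks, whereas compressing both versions first tends to leave few in common, which is the effect the abstract describes.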


Keywords: File Format, Compression Algorithm, Storage Solution, Cryptographic Hash Function, Specific Header



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. University of Lausanne, Lausanne, Switzerland
