Abstract

Progress in sequencing technologies and the increasing availability of DNA sequences from extant and extinct organisms are shaping our knowledge of species origin and development, and are driving improvements in the computational methods used for storage and analysis. Given the large volume of DNA sequences, there is a clear need for computational models that represent diverse DNA sequences efficiently while using few computational resources. Currently, there is no standard corpus for benchmarking compression algorithms that enables a wide and fair comparison. Such a corpus should reflect the main domains and kingdoms without being excessive in size or in number of sequences. In this paper, we provide such a DNA sequence corpus, give an overview of its elements, and compare some of the algorithms for DNA sequence compression on it. The corpus is available at https://tinyurl.com/DNAcorpus.

Acknowledgments

This work was partially funded by FEDER (Programa Operacional Factores de Competitividade - COMPETE) and by National Funds through the FCT, in the context of the projects UID/CEC/00127/2013 & PTCD/EEI-SII/6608/2014.

Author information

Corresponding author

Correspondence to Diogo Pratas.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Pratas, D., Pinho, A.J. (2019). A DNA Sequence Corpus for Compression Benchmark. In: Fdez-Riverola, F., Mohamad, M., Rocha, M., De Paz, J., González, P. (eds) Practical Applications of Computational Biology and Bioinformatics, 12th International Conference. PACBB2018 2018. Advances in Intelligent Systems and Computing, vol 803. Springer, Cham. https://doi.org/10.1007/978-3-319-98702-6_25
