Encyclopedia of Big Data Technologies

2019 Edition
Editors: Sherif Sakr, Albert Y. Zomaya

Genomic Data Compression

  • Kaiyuan Zhu
  • Ibrahim Numanagić
  • S. Cenk Sahinalp
Reference work entry
DOI: https://doi.org/10.1007/978-3-319-77525-8_55

Abstract

Genomic sequence data obtained through high-throughput sequencing (HTS) technologies are commonly stored either as raw sequencing reads in FASTQ format or as reads mapped to a reference genome in SAM format. Both formats have large storage footprints. The worldwide growth of HTS data has prompted the development of specialized compression methods that aim to significantly reduce HTS data size. This entry gives a comparative overview of the available lossless genomic data compression approaches, including their advantages and pitfalls.
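
As a rough illustration of why format-aware compression can help, the following minimal Python sketch packs the bases of a single FASTQ read at 2 bits per base instead of one text byte per base. It is not the method of any particular tool. The function names (parse_fastq_record, pack_bases, unpack_bases) and the sample read are hypothetical, and the sketch assumes reads contain only A, C, G, and T. Read names and quality scores are left untouched.

```python
# Illustrative sketch only: not the method of any tool surveyed in this entry.
# Assumption: reads contain only A, C, G, T (no N or other ambiguity codes).

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def parse_fastq_record(lines):
    """Split one 4-line FASTQ record into (name, bases, qualities)."""
    name, bases, plus, quals = lines
    if not name.startswith("@") or not plus.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(bases) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    return name[1:], bases, quals


def pack_bases(bases):
    """Pack bases at 2 bits each, four per byte (last byte zero-padded)."""
    out = bytearray((len(bases) + 3) // 4)
    for i, b in enumerate(bases):
        out[i // 4] |= CODE[b] << (2 * (i % 4))
    return bytes(out)


def unpack_bases(packed, n):
    """Recover the first n bases from a 2-bit packed byte string."""
    return "".join(BASES[(packed[i // 4] >> (2 * (i % 4))) & 3] for i in range(n))


# Hypothetical sample record, made up for this example.
record = ["@read1", "ACGTACGTTTGA", "+", "IIIIIIIIIIII"]
name, bases, quals = parse_fastq_record(record)
packed = pack_bases(bases)
assert unpack_bases(packed, len(bases)) == bases  # lossless round trip
print(len(bases), "bytes as text ->", len(packed), "bytes packed")
```

The read length must be stored alongside the packed bytes, because the padding in the last byte is otherwise indistinguishable from trailing A bases.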

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Kaiyuan Zhu (1)
  • Ibrahim Numanagić (2)
  • S. Cenk Sahinalp (1, corresponding author)

  1. Department of Computer Science, Indiana University Bloomington, Bloomington, USA
  2. Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology, Cambridge, USA