Traversing the k-mer Landscape of NGS Read Datasets for Quality Score Sparsification

  • Conference paper
Research in Computational Molecular Biology (RECOMB 2014)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 8394)

Abstract

It is becoming increasingly impractical to store raw sequencing data indefinitely in an uncompressed state for later processing. In this paper, we describe a scalable compressive framework, Read-Quality-Sparsifier (RQS), which substantially outperforms other de novo quality score compression methods in both compression ratio and speed while maintaining SNP-calling accuracy. Surprisingly, RQS also improves SNP-calling accuracy on a gold-standard, real-life sequencing dataset (NA12878) using a k-mer density profile constructed from 77 other individuals from the 1000 Genomes Project. This improvement in downstream accuracy emerges from the observation that quality score values within NGS datasets are inherently encoded in the k-mer landscape of the genomic sequences. To our knowledge, RQS is the first scalable sequence-based quality compression method that can efficiently compress quality scores of terabyte-sized and larger sequencing datasets.

Availability: An implementation of our method, RQS, is available for download at: http://rqs.csail.mit.edu/
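
To make the idea of quality scores being "encoded in the k-mer landscape" concrete, the following is a minimal sketch, not the RQS implementation. It assumes a simplified scheme: k-mers seen at least `min_count` times in a read corpus form a dictionary, and any base covered by a dictionary k-mer has its quality replaced by one fixed high-confidence symbol, so that only uncovered bases keep their original scores. The function names, the exact-match lookup, the threshold, and the default symbol `"I"` are all illustrative assumptions.

```python
from collections import Counter


def build_kmer_dictionary(reads, k, min_count):
    """Return the set of k-mers occurring at least min_count times (assumed profile)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return {kmer for kmer, c in counts.items() if c >= min_count}


def sparsify_qualities(read, quals, dictionary, k, default_qual="I"):
    """Replace the quality of every base covered by a dictionary k-mer.

    Bases not covered by any dictionary k-mer keep their original quality
    symbol; covered bases all receive default_qual (a hypothetical choice).
    """
    covered = [False] * len(read)
    for i in range(len(read) - k + 1):
        if read[i:i + k] in dictionary:
            for j in range(i, i + k):
                covered[j] = True
    return "".join(default_qual if c else q for c, q in zip(covered, quals))


# Toy corpus: the final base of the second read appears only in a rare k-mer.
corpus = ["ACGTACGT", "ACGTACGA", "ACGTACGT"]
kmer_dict = build_kmer_dictionary(corpus, k=4, min_count=2)
sparse = sparsify_qualities("ACGTACGA", "ABCDEFGH", kmer_dict, k=4)
```

In this toy run only the last base of the second read retains its original quality symbol, because the k-mer containing it is too rare to enter the dictionary. Long runs of a single symbol are what make the sparsified quality string highly compressible by a general-purpose compressor.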

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Yu, Y.W., Yorukoglu, D., Berger, B. (2014). Traversing the k-mer Landscape of NGS Read Datasets for Quality Score Sparsification. In: Sharan, R. (eds) Research in Computational Molecular Biology. RECOMB 2014. Lecture Notes in Computer Science, vol 8394. Springer, Cham. https://doi.org/10.1007/978-3-319-05269-4_31

  • DOI: https://doi.org/10.1007/978-3-319-05269-4_31

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-05268-7

  • Online ISBN: 978-3-319-05269-4

  • eBook Packages: Computer Science, Computer Science (R0)
