QCluster: Extending Alignment-Free Measures with Quality Values for Reads Clustering

  • Conference paper
Algorithms in Bioinformatics (WABI 2014)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 8701)

Abstract

The data volume generated by Next-Generation Sequencing (NGS) technologies is growing at a pace that now challenges the storage and data processing capacities of modern computer systems. In this context, an important task is reducing data complexity by collapsing redundant reads into a single cluster, which improves the run time, memory requirements, and quality of post-processing steps such as assembly and error correction. Several alignment-free measures based on k-mer counts have been used to cluster reads.

Quality scores produced by NGS platforms are fundamental for various analyses of NGS data, such as read mapping and error detection. Moreover, future-generation sequencing platforms will produce long reads but with a large number of erroneous bases (up to 15%). It will therefore be fundamental to exploit quality value information within the alignment-free framework.

In this paper we present a family of alignment-free measures, called D^q-type, that incorporate quality value information and k-mer counts for the comparison of read data. A set of experiments on simulated and real read data confirms that the new measures are superior to other classical alignment-free statistics, especially when erroneous reads are considered. These measures are implemented in a software tool called QCluster (http://www.dei.unipd.it/~ciompin/main/qcluster.html).
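
To make the idea of combining quality values with k-mer counts concrete, the following is a minimal sketch and not the QCluster implementation. It assumes FASTQ-style Phred+33 quality strings, weights each k-mer occurrence by the product of the per-base probabilities of being correct, and compares two reads with the classical D2 statistic (the sum over all k-mers of the product of their counts). The function names `phred_to_prob_correct`, `weighted_kmer_counts`, and `d2` are hypothetical, and the weighting scheme is an illustrative assumption; the exact definitions of the D^q-type measures are given in the paper itself.

```python
from collections import defaultdict


def phred_to_prob_correct(qual, offset=33):
    """Convert a Phred quality string into per-base probabilities of being correct.

    Assumes Phred+offset ASCII encoding: Q = ord(char) - offset,
    and error probability 10 ** (-Q / 10).
    """
    return [1.0 - 10 ** (-(ord(c) - offset) / 10.0) for c in qual]


def weighted_kmer_counts(seq, qual, k, offset=33):
    """Count k-mers in a read, weighting each occurrence by its quality.

    Illustrative assumption: an occurrence contributes the product of the
    correctness probabilities of its k bases instead of a full count of 1.
    """
    probs = phred_to_prob_correct(qual, offset)
    counts = defaultdict(float)
    for i in range(len(seq) - k + 1):
        weight = 1.0
        for p in probs[i:i + k]:
            weight *= p
        counts[seq[i:i + k]] += weight
    return dict(counts)


def d2(counts_x, counts_y):
    """Classical D2 statistic: inner product of two k-mer count vectors."""
    return sum(v * counts_y.get(kmer, 0.0) for kmer, v in counts_x.items())


if __name__ == "__main__":
    high = weighted_kmer_counts("ACGTACGT", "IIIIIIII", k=3)  # Q40 bases
    low = weighted_kmer_counts("ACGTACGT", "++++++++", k=3)   # Q10 bases
    print(d2(high, high), d2(high, low))
```

In this sketch, low-quality bases shrink the contribution of the k-mers that contain them, so a pair of reads that agree only on unreliable positions scores lower than a pair that agrees on high-quality positions.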

© 2014 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Comin, M., Leoni, A., Schimd, M. (2014). QCluster: Extending Alignment-Free Measures with Quality Values for Reads Clustering. In: Brown, D., Morgenstern, B. (eds) Algorithms in Bioinformatics. WABI 2014. Lecture Notes in Computer Science, vol 8701. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-44753-6_1

  • DOI: https://doi.org/10.1007/978-3-662-44753-6_1

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-662-44752-9

  • Online ISBN: 978-3-662-44753-6

  • eBook Packages: Computer Science, Computer Science (R0)
