QCluster: Extending Alignment-Free Measures with Quality Values for Reads Clustering

Comin, Matteo; Leoni, Andrea; Schimd, Michele

doi:10.1007/978-3-662-44753-6_1

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 8701)

Included in the following conference series:

International Workshop on Algorithms in Bioinformatics

Abstract

The volume of data generated by Next-Generation Sequencing (NGS) technologies is growing at a pace that now challenges the storage and processing capacity of modern computer systems. In this context, an important step is reducing data complexity by collapsing redundant reads into a single cluster, which improves the run time, memory requirements, and quality of post-processing steps such as assembly and error correction. Several alignment-free measures based on k-mer counts have been used to cluster reads.
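
As a rough illustration of what a k-mer count measure looks like, the sketch below computes the classical D2 statistic, the inner product of the k-mer count vectors of two reads. The function names `kmer_counts` and `d2` are chosen here for illustration and are not taken from the QCluster software.

```python
from collections import Counter


def kmer_counts(read: str, k: int) -> Counter:
    """Count every overlapping k-mer in a read."""
    return Counter(read[i:i + k] for i in range(len(read) - k + 1))


def d2(read_a: str, read_b: str, k: int) -> int:
    """D2 statistic: inner product of the two k-mer count vectors."""
    counts_a = kmer_counts(read_a, k)
    counts_b = kmer_counts(read_b, k)
    # Counter returns 0 for k-mers that are absent from read_b.
    return sum(count * counts_b[kmer] for kmer, count in counts_a.items())
```

Reads that share many k-mers score high and reads with no k-mer in common score zero, so a clustering procedure can use the score as a similarity without aligning the reads.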

Quality scores produced by NGS platforms are fundamental to various analyses of NGS data, such as read mapping and error detection. Moreover, future-generation sequencing platforms will produce long reads with a large fraction of erroneous bases (up to 15%). It will therefore be essential to exploit quality value information within the alignment-free framework.

In this paper we present a family of alignment-free measures, called D^q-type, that incorporate quality value information and k-mer counts for the comparison of read data. A set of experiments on simulated and real read data confirms that the new measures are superior to other classical alignment-free statistics, especially when erroneous reads are considered. These measures are implemented in a software tool called QCluster (http://www.dei.unipd.it/~ciompin/main/qcluster.html).
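
The abstract does not spell out how quality values enter the measures, so the following is only a sketch of one plausible weighting, not the definition used in the paper or in QCluster. It assumes quality values are Phred scores, where a score Q corresponds to an error probability of 10^(-Q/10), and it weights each k-mer occurrence by the probability that all of its bases were called correctly. All function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Sequence


def phred_to_correct_prob(q: int) -> float:
    """Probability that a base with Phred score q was called correctly."""
    return 1.0 - 10.0 ** (-q / 10.0)


def quality_weighted_counts(read: str, quals: Sequence[int], k: int) -> Dict[str, float]:
    """Sum, for each k-mer, the probability that each occurrence is error-free.

    Assumption: base errors are independent, so the weight of an occurrence
    is the product of the per-base probabilities of a correct call.
    """
    if len(read) != len(quals):
        raise ValueError("read and quality string must have the same length")
    probs = [phred_to_correct_prob(q) for q in quals]
    counts: Dict[str, float] = defaultdict(float)
    for i in range(len(read) - k + 1):
        weight = 1.0
        for p in probs[i:i + k]:
            weight *= p
        counts[read[i:i + k]] += weight
    return dict(counts)


def d2_quality(read_a: str, quals_a: Sequence[int],
               read_b: str, quals_b: Sequence[int], k: int) -> float:
    """Inner product of the quality-weighted k-mer count vectors."""
    counts_a = quality_weighted_counts(read_a, quals_a, k)
    counts_b = quality_weighted_counts(read_b, quals_b, k)
    return sum(v * counts_b.get(kmer, 0.0) for kmer, v in counts_a.items())
```

Under this weighting, k-mers covered by low-quality bases contribute less to the score, while reads with uniformly high quality give nearly the same value as the unweighted inner product of k-mer counts.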

Author information

Authors and Affiliations

Department of Information Engineering, University of Padova, Padova, Italy
Matteo Comin, Andrea Leoni & Michele Schimd

Editor information

Editors and Affiliations

David R. Cheriton School of Computer Science, University of Waterloo, ON, Canada
Dan Brown
Institute of Microbiology and Genetics, Department of Bioinformatics, University of Göttingen, Germany, Goldschmidtstr. 1, 37077, Göttingen, Germany
Burkhard Morgenstern

About this paper

Cite this paper

Comin, M., Leoni, A., Schimd, M. (2014). QCluster: Extending Alignment-Free Measures with Quality Values for Reads Clustering. In: Brown, D., Morgenstern, B. (eds) Algorithms in Bioinformatics. WABI 2014. Lecture Notes in Computer Science, vol 8701. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-44753-6_1

DOI: https://doi.org/10.1007/978-3-662-44753-6_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-44752-9
Online ISBN: 978-3-662-44753-6
eBook Packages: Computer Science, Computer Science (R0)