Estimating sequence similarity from read sets for clustering next-generation sequencing data
Computing the mutual similarity of biological sequences such as DNA molecules is essential for important biological tasks such as hierarchical clustering of genomes. Current sequencing technologies do not provide the content of entire biological sequences; rather, they identify a large number of small substrings called reads, sampled at random positions of the target sequence. To estimate the similarity of two sequences from their read-set representations, one may first try to reconstruct each sequence from its read set, and then apply conventional (dis)similarity measures such as the edit distance to the assembled sequences. Due to the nature of the data, sequence assembly often cannot provide a single putative sequence that matches the true DNA. Therefore, we propose instead to estimate the similarities directly from the read sets. Our approach is based on an adaptation of the Monge-Elkan similarity known from the field of databases, and it avoids the sequence assembly step. For low-coverage (i.e. small) read-set samples, it yields a better approximation of the true sequence similarities. This in turn results in better clustering than the first-assemble-then-cluster approach. Put differently, for a fixed estimation accuracy, our approach requires smaller read sets and thus entails reduced wet-lab costs.
Keywords: Read sets, Similarity, Hierarchical clustering, Biological sequences
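To make the idea of comparing read sets without assembly more concrete, the following is a minimal Python sketch of the classical Monge-Elkan similarity applied to two sets of reads. It is not the adaptation proposed in the paper, whose details are not given in the abstract. The inner measure (a normalized edit similarity), the symmetrization by averaging both directions, and all function names are assumptions made for illustration only.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def read_similarity(a: str, b: str) -> float:
    """Assumed inner measure: edit distance rescaled to a similarity in [0, 1]."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def monge_elkan(reads_a: Sequence[str], reads_b: Sequence[str]) -> float:
    """Classical Monge-Elkan: mean over reads in A of the best match found in B."""
    if not reads_a or not reads_b:
        raise ValueError("both read sets must be non-empty")
    best_matches = (max(read_similarity(a, b) for b in reads_b) for a in reads_a)
    return sum(best_matches) / len(reads_a)


def symmetric_monge_elkan(reads_a: Sequence[str], reads_b: Sequence[str]) -> float:
    """Assumed symmetrization: average of the two directed Monge-Elkan scores."""
    return 0.5 * (monge_elkan(reads_a, reads_b) + monge_elkan(reads_b, reads_a))


if __name__ == "__main__":
    # Toy read sets, far shorter than real sequencing reads.
    reads_x = ["ACGT", "TTTT"]
    reads_y = ["ACGT"]
    print(monge_elkan(reads_x, reads_y))
    print(symmetric_monge_elkan(reads_x, reads_y))
```

The directed score is asymmetric by construction, which is why the sketch averages both directions before the value would be turned into a dissimilarity for clustering. The naive double loop costs one edit-distance computation per pair of reads, so it is only meant to show the structure of the measure on toy inputs.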
The authors acknowledge the support of the OP VVV project CZ.02.1.01/0.0/0.0/16_019/0000765 “Research Center for Informatics”. Access to computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum, provided under the programme “Projects of Large Research, Development, and Innovations Infrastructures” (CESNET LM2015042), is greatly appreciated.
- 1000 Genomes Project Consortium et al. (2015) A global reference for human genetic variation. Nature 526(7571):68–74
- Jalovec K, Železný F (2014) Binary classification of metagenomic samples using discriminative DNA superstrings. In: MLSB 2014: 8th International workshop on machine learning in systems biology, pp 44–47
- Kchouk M, Elloumi M (2016) A clustering approach for denovo assembly using next generation sequencing data. In: 2016 IEEE international conference on bioinformatics and biomedicine (BIBM), IEEE, pp 1909–1911
- Leinonen R, Akhtar R, Birney E, Bower L, Cerdeño-Tárraga A, Cheng Y, Cleland I, Faruque N, Goodgame N, Gibson R, Hoad G, Jang M, Pakseresht N, Plaister S, Radhakrishnan R, Reddy K, Sobhany S, Ten Hoopen P, Vaughan R, Zalunin V, Cochrane G (2011) The European Nucleotide Archive. Nucl Acids Res 39(suppl 1):D28–D31
- Malhotra R, Elleder D, Bao L, Hunter DR, Acharya R, Poss M (2014) Clustering pipeline for determining consensus sequences in targeted next-generation sequencing. arXiv preprint
- Monge AE, Elkan CP (1996) The field matching problem: algorithms and applications. In: Proceedings of the second international conference on knowledge discovery and data mining, KDD’96, AAAI Press, pp 267–270
- Nurk S, Bankevich A, et al. (2013) Assembling genomes and mini-metagenomes from highly chimeric reads. In: Deng M, Jiang R, Sun F, Zhang X (eds) 17th Annual international conference on research in computational molecular biology, RECOMB 2013, Beijing, China, April 7–10, 2013, Proceedings. Springer, Berlin, Heidelberg, pp 158–170
- Ryšavý P, Železný F (2016) Estimating sequence similarity from read sets for clustering sequencing data. In: Boström H, Knobbe A, Soares C, Papapetrou P (eds) 15th International symposium on advances in intelligent data analysis XV, IDA 2016, Stockholm, Sweden, October 13–15, 2016, Proceedings. Springer International Publishing, Cham, pp 204–214
- Saitou N, Nei M (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 4(4):406–425
- Sokal RR, Michener CD (1958) A statistical method for evaluating systematic relationships. Univ Kans Sci Bull 38:1409–1438
- Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Edgar R, Federhen S, Feolo M, Geer LY, Helmberg W, Kapustin Y, Khovayko O, Landsman D, Lipman DJ, Madden TL, Maglott DR, Miller V, Ostell J, Pruitt KD, Schuler GD, Shumway M, Sequeira E, Sherry ST, Sirotkin K, Souvorov A, Starchenko G, Tatusov RL, Tatusova TA, Wagner L, Yaschenko E (2008) Database resources of the National Center for Biotechnology Information. Nucl Acids Res 36(suppl 1):D13–D21
- Železný F, Jalovec K, Tolar J (2014) Learning meets sequencing: a generality framework for read-sets. In: ILP 2014: 24th International conference on inductive logic programming, Late-Breaking Papers