Abstract
Long-read sequencing of transcripts with PacBio Iso-Seq and Oxford Nanopore Technologies has proven to be central to the study of complex isoform landscapes in many organisms. However, current de novo transcript reconstruction algorithms from long-read data are limited, leaving the potential of these technologies unfulfilled. A common bottleneck is the dearth of scalable and accurate algorithms for clustering long reads according to their gene family of origin. To address this challenge, we develop isONclust, a clustering algorithm that is greedy (in order to scale) and makes use of quality values (in order to handle variable error rates). We test isONclust on three simulated and five biological datasets, across a breadth of organisms, technologies, and read depths. Our results demonstrate that isONclust is a substantial improvement over previous approaches, both in terms of overall accuracy and/or scalability to large datasets. Our tool is available at https://github.com/ksahlin/isONclust.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Byrne, A., et al.: Nanopore long-read RNAseq reveals widespread transcriptional variation among the surface receptors of individual B cells. Nature Commun. 8, 16027 (2017)
Sahlin, K., Tomaszkiewicz, M., Makova, K.D., Medvedev, P.: Deciphering highly similar multigene family transcripts from Iso-Seq data with IsoCon. Nature Commun. 9(1), 4601 (2018)
Tseng, E., Tang, H.T., AlOlaby, R.R., Hickey, L., Tassone, F.: Altered expression of the FMR1 splicing variants landscape in premutation carriers. Biochimica et Biophys. Acta (BBA)-Gene Regul. Mech. 1860(11), 1117–1126 (2017)
Nattestad, M., et al.: Complex rearrangements and oncogene amplifications revealed by long-read DNA and RNA sequencing of a breast cancer cell line. Genome Res. 28(8), 1126–1135 (2018)
Kuo, R.I., Tseng, E., Eory, L., Paton, I.R., Archibald, A.L., Burt, D.W.: Normalized long read RNA sequencing in chicken reveals transcriptome complexity similar to human. BMC Genomics 18(1), 323 (2017)
Hoang, N.V., et al.: A survey of the complex transcriptome from the highly polyploid sugarcane genome using full-length isoform sequencing and de novo assembly from short read sequencing. BMC Genomics 18(1), 395 (2017)
Gordon, S.P., et al.: Widespread polycistronic transcripts in fungi revealed by single-molecule mRNA sequencing. PloS One 10(7), e0132628 (2015)
Tombácz, D., et al.: Long-read isoform sequencing reveals a hidden complexity of the transcriptional landscape of herpes simplex virus type 1. Front. Microbiol. 8, 1079 (2017)
Marchet, C., et al.: De novo clustering of long reads by gene from transcriptomics data. Nucleic Acids Res. 47(1), e1 (2018). https://doi.org/10.1093/nar/gky834
Workman, R.E., Myrka, A.M., Wong, G.W., Tseng, E., Welch Jr., K.C., Timp, W.: Single-molecule, full-length transcript sequencing provides insight into the extreme metabolism of the ruby-throated hummingbird Archilochus colubris. GigaScience 7(3), 1–12 (2018). https://doi.org/10.1093/gigascience/giy009
Li, J., et al.: Long read reference genome-free reconstruction of a full-length transcriptome from astragalus membranaceus reveals transcript variants involved in bioactive compound biosynthesis. Cell Discov. 3, 17031 (2017)
Liu, X., Mei, W., Soltis, P.S., Soltis, D.E., Barbazuk, W.B.: Detecting alternatively spliced transcript isoforms from single-molecule long-read sequences without a reference genome. Mol. Ecol. Resour. 17(6), 1243–1256 (2017)
Edgar, R.C.: Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26(19), 2460–2461 (2010)
Li, W., Godzik, A.: Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22(13), 1658–1659 (2006)
James, B.T., Luczak, B.B., Girgis, H.Z.: MeShClust: an intelligent tool for clustering DNA sequences. Nucleic Acids Res. 46(14), e83 (2018)
Ghodsi, M., Liu, B., Pop, M.: DNACLUST: accurate and efficient clustering of phylogenetic marker genes. BMC Bioinform. 12(1), 271 (2011)
Paccanaro, A., Casbon, J.A., Saqi, M.A.: Spectral clustering of protein sequences. Nucleic Acids Res. 34(5), 1571–1580 (2006)
Steinegger, M., Söding, J.: Clustering huge protein sequence sets in linear time. Nat. Commun. 9(1), 2542 (2018)
Steinegger, M., Söding, J.: MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35(11), 1026–1028 (2017)
Zorita, E., Cusco, P., Filion, G.J.: Starcode: sequence clustering based on all-pairs search. Bioinformatics 31(12), 1913–1919 (2015)
Bevilacqua, V., et al.: EasyCluster2: an improved tool for clustering and assembling long transcriptome reads. BMC Bioinform. 15, S7 (2014)
Dost, B., Wu, C., Su, A., Bafna, V.: TCLUST: a fast method for clustering genome-scale expression data. IEEE/ACM Trans. Comput. Biol. Bioinform. (TCBB) 8(3), 808–818 (2011)
Christoffels, A., Gelder, A.V., Greyling, G., Miller, R., Hide, T., Hide, W.: STACK: sequence tag alignment and consensus knowledgebase. Nucleic Acids Res. 29(1), 234–238 (2001)
Burke, J., Davison, D., Hide, W.: d2\_cluster: a validated method for clustering EST and full-length cDNA sequences. Genome Res. 9(11), 1135 (1999)
Chong, Z., Ruan, J., Wu, C.I.: Rainbow: an integrated tool for efficient clustering and assembling RAD-seq reads. Bioinformatics 28(21), 2732–2737 (2012)
Solovyov, A., Lipkin, W.I.: Centroid based clustering of high throughput sequencing reads based on n-mer counts. BMC Bioinform. 14(1), 268 (2013)
Bao, E., Jiang, T., Kaloshian, I., Girke, T.: SEED: efficient clustering of next-generation sequences. Bioinformatics 27(18), 2502–2509 (2011)
Fu, L., Niu, B., Zhu, Z., Wu, S., Li, W.: CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28(23), 3150–3152 (2012)
Shimizu, K., Tsuda, K.: SlideSort: all pairs similarity search for short reads. Bioinformatics 27(4), 464–470 (2010)
Comin, M., Leoni, A., Schimd, M.: Clustering of reads with alignment-free measures and quality values. Algorithms Mol. Biol. 10(1), 4 (2015)
Alanko, J., Cunial, F., Belazzougui, D., Mäkinen, V.: A framework for space-efficient read clustering in metagenomic samples. BMC Bioinform. 18(3), 59 (2017)
Orabi, B., et al.: Alignment-free clustering of UMI tagged DNA molecules. Bioinformatics (2018). https://doi.org/10.1093/bioinformatics/bty888
Ondov, B.D., et al.: Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 17(1), 132 (2016)
Davidson, N.M., Oshlack, A.: Corset: enabling differential gene expression analysis for de novoassembled transcriptomes. Genome Biol. 15(7), 410 (2014). https://doi.org/10.1186/s13059-014-0410-6
Malik, L., Almodaresi, F., Patro, R.: Grouper: graph-based clustering and annotation for improved de novo transcriptome analysis. Bioinformatics 34(19), 3265–3272 (2018)
Krishnakumar, R., et al.: Systematic and stochastic influences on the performance of the MinION nanopore sequencer across a range of nucleotide bias. Sci. Rep. 8(1), 3159 (2018)
Tseng, E.: Cogent: coding genome reconstruction using Iso-Seq data (2018). https://github.com/Magdoll/Cogent
Roberts, M., Hayes, W., Hunt, B.R., Mount, S.M., Yorke, J.A.: Reducing storage requirements for biological sequence comparison. Bioinformatics 20(18), 3363–3369 (2004)
Au, K.F., Underwood, J.G., Lee, L., Wong, W.H.: Improving PacBio long read accuracy by short read alignment. PloS one 7(10), e46679 (2012)
Li, H.: Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34(18), 3094–3100 (2018)
Daily, J.: Parasail: SIMD C library for global, semi-global, and local pairwise sequence alignments. BMC Bioinform. 17(1), 81 (2016)
Sahlin, K., Medvedev, P.: Exprimental details appendix to “de novo clustering of long-read transcriptome data using a greedy, quality-value based algorithm” (2019). https://github.com/ksahlin/isONclust/wiki/Paper-Appendix
Stöcker, B.K., Köster, J., Rahmann, S.: SimLoRD: simulation of long read data. Bioinformatics 32(17), 2704–2706 (2016)
Iso-Seq in house datasets. https://github.com/PacificBiosciences/IsoSeq_SA3nUP/wiki/Iso-Seq-in-house-datasets. Accessed 24 Oct 2018
Direct RNA and cDNA sequencing of a human transcriptome on Oxford Nanopore MinION and GridION. https://github.com/nanopore-wgs-consortium/NA12878/blob/master/RNA.md. Accessed 24 Oct 2018
Korlach, J., et al.: De novo PacBio long-read and phased avian genome assemblies correct and add to reference genes generated with intermediate and short reads. GigaScience 6(10) (2017). https://doi.org/10.1093/gigascience/gix085
Rosenberg, A., Hirschberg, J.: V-measure: a conditional entropy-based external cluster evaluation measure. In: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL) (2007)
Hubert, L., Arabie, P.: Comparing partitions. J. Classif. 2(1), 193–218 (1985)
Acknowledgements
This work has been supported in part by NSF awards DBI-1356529, CCF-551439057, IIS-1453527, and IIS-1421908 to PM.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Sahlin, K., Medvedev, P. (2019). De Novo Clustering of Long-Read Transcriptome Data Using a Greedy, Quality-Value Based Algorithm. In: Cowen, L. (eds) Research in Computational Molecular Biology. RECOMB 2019. Lecture Notes in Computer Science(), vol 11467. Springer, Cham. https://doi.org/10.1007/978-3-030-17083-7_14
Download citation
DOI: https://doi.org/10.1007/978-3-030-17083-7_14
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-17082-0
Online ISBN: 978-3-030-17083-7
eBook Packages: Computer ScienceComputer Science (R0)