De Novo Clustering of Long-Read Transcriptome Data Using a Greedy, Quality-Value Based Algorithm

Sahlin, Kristoffer; Medvedev, Paul

doi:10.1007/978-3-030-17083-7_14

Kristoffer Sahlin¹⁵ &
Paul Medvedev^15,16,17

Part of the book series: Lecture Notes in Computer Science ((LNBI,volume 11467))

Included in the following conference series:

International Conference on Research in Computational Molecular Biology

2552 Accesses
5 Citations
7 Altmetric

Abstract

Long-read sequencing of transcripts with PacBio Iso-Seq and Oxford Nanopore Technologies has proven to be central to the study of complex isoform landscapes in many organisms. However, current de novo transcript reconstruction algorithms from long-read data are limited, leaving the potential of these technologies unfulfilled. A common bottleneck is the dearth of scalable and accurate algorithms for clustering long reads according to their gene family of origin. To address this challenge, we develop isONclust, a clustering algorithm that is greedy (in order to scale) and makes use of quality values (in order to handle variable error rates). We test isONclust on three simulated and five biological datasets, across a breadth of organisms, technologies, and read depths. Our results demonstrate that isONclust is a substantial improvement over previous approaches, both in terms of overall accuracy and/or scalability to large datasets. Our tool is available at https://github.com/ksahlin/isONclust.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 59.99; Price excludes VAT (USA)

Softcover Book: USD 74.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

Byrne, A., et al.: Nanopore long-read RNAseq reveals widespread transcriptional variation among the surface receptors of individual B cells. Nature Commun. 8, 16027 (2017)
Article Google Scholar
Sahlin, K., Tomaszkiewicz, M., Makova, K.D., Medvedev, P.: Deciphering highly similar multigene family transcripts from Iso-Seq data with IsoCon. Nature Commun. 9(1), 4601 (2018)
Article Google Scholar
Tseng, E., Tang, H.T., AlOlaby, R.R., Hickey, L., Tassone, F.: Altered expression of the FMR1 splicing variants landscape in premutation carriers. Biochimica et Biophys. Acta (BBA)-Gene Regul. Mech. 1860(11), 1117–1126 (2017)
Article Google Scholar
Nattestad, M., et al.: Complex rearrangements and oncogene amplifications revealed by long-read DNA and RNA sequencing of a breast cancer cell line. Genome Res. 28(8), 1126–1135 (2018)
Article Google Scholar
Kuo, R.I., Tseng, E., Eory, L., Paton, I.R., Archibald, A.L., Burt, D.W.: Normalized long read RNA sequencing in chicken reveals transcriptome complexity similar to human. BMC Genomics 18(1), 323 (2017)
Article Google Scholar
Hoang, N.V., et al.: A survey of the complex transcriptome from the highly polyploid sugarcane genome using full-length isoform sequencing and de novo assembly from short read sequencing. BMC Genomics 18(1), 395 (2017)
Article Google Scholar
Gordon, S.P., et al.: Widespread polycistronic transcripts in fungi revealed by single-molecule mRNA sequencing. PloS One 10(7), e0132628 (2015)
Article Google Scholar
Tombácz, D., et al.: Long-read isoform sequencing reveals a hidden complexity of the transcriptional landscape of herpes simplex virus type 1. Front. Microbiol. 8, 1079 (2017)
Article Google Scholar
Marchet, C., et al.: De novo clustering of long reads by gene from transcriptomics data. Nucleic Acids Res. 47(1), e1 (2018). https://doi.org/10.1093/nar/gky834
Article Google Scholar
Workman, R.E., Myrka, A.M., Wong, G.W., Tseng, E., Welch Jr., K.C., Timp, W.: Single-molecule, full-length transcript sequencing provides insight into the extreme metabolism of the ruby-throated hummingbird Archilochus colubris. GigaScience 7(3), 1–12 (2018). https://doi.org/10.1093/gigascience/giy009
Article Google Scholar
Li, J., et al.: Long read reference genome-free reconstruction of a full-length transcriptome from astragalus membranaceus reveals transcript variants involved in bioactive compound biosynthesis. Cell Discov. 3, 17031 (2017)
Article Google Scholar
Liu, X., Mei, W., Soltis, P.S., Soltis, D.E., Barbazuk, W.B.: Detecting alternatively spliced transcript isoforms from single-molecule long-read sequences without a reference genome. Mol. Ecol. Resour. 17(6), 1243–1256 (2017)
Article Google Scholar
Edgar, R.C.: Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26(19), 2460–2461 (2010)
Article Google Scholar
Li, W., Godzik, A.: Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22(13), 1658–1659 (2006)
Article Google Scholar
James, B.T., Luczak, B.B., Girgis, H.Z.: MeShClust: an intelligent tool for clustering DNA sequences. Nucleic Acids Res. 46(14), e83 (2018)
Article Google Scholar
Ghodsi, M., Liu, B., Pop, M.: DNACLUST: accurate and efficient clustering of phylogenetic marker genes. BMC Bioinform. 12(1), 271 (2011)
Article Google Scholar
Paccanaro, A., Casbon, J.A., Saqi, M.A.: Spectral clustering of protein sequences. Nucleic Acids Res. 34(5), 1571–1580 (2006)
Article Google Scholar
Steinegger, M., Söding, J.: Clustering huge protein sequence sets in linear time. Nat. Commun. 9(1), 2542 (2018)
Article Google Scholar
Steinegger, M., Söding, J.: MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35(11), 1026–1028 (2017)
Google Scholar
Zorita, E., Cusco, P., Filion, G.J.: Starcode: sequence clustering based on all-pairs search. Bioinformatics 31(12), 1913–1919 (2015)
Article Google Scholar
Bevilacqua, V., et al.: EasyCluster2: an improved tool for clustering and assembling long transcriptome reads. BMC Bioinform. 15, S7 (2014)
Article Google Scholar
Dost, B., Wu, C., Su, A., Bafna, V.: TCLUST: a fast method for clustering genome-scale expression data. IEEE/ACM Trans. Comput. Biol. Bioinform. (TCBB) 8(3), 808–818 (2011)
Article Google Scholar
Christoffels, A., Gelder, A.V., Greyling, G., Miller, R., Hide, T., Hide, W.: STACK: sequence tag alignment and consensus knowledgebase. Nucleic Acids Res. 29(1), 234–238 (2001)
Article Google Scholar
Burke, J., Davison, D., Hide, W.: d2\_cluster: a validated method for clustering EST and full-length cDNA sequences. Genome Res. 9(11), 1135 (1999)
Article Google Scholar
Chong, Z., Ruan, J., Wu, C.I.: Rainbow: an integrated tool for efficient clustering and assembling RAD-seq reads. Bioinformatics 28(21), 2732–2737 (2012)
Article Google Scholar
Solovyov, A., Lipkin, W.I.: Centroid based clustering of high throughput sequencing reads based on n-mer counts. BMC Bioinform. 14(1), 268 (2013)
Article Google Scholar
Bao, E., Jiang, T., Kaloshian, I., Girke, T.: SEED: efficient clustering of next-generation sequences. Bioinformatics 27(18), 2502–2509 (2011)
Article Google Scholar
Fu, L., Niu, B., Zhu, Z., Wu, S., Li, W.: CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28(23), 3150–3152 (2012)
Article Google Scholar
Shimizu, K., Tsuda, K.: SlideSort: all pairs similarity search for short reads. Bioinformatics 27(4), 464–470 (2010)
Article Google Scholar
Comin, M., Leoni, A., Schimd, M.: Clustering of reads with alignment-free measures and quality values. Algorithms Mol. Biol. 10(1), 4 (2015)
Article Google Scholar
Alanko, J., Cunial, F., Belazzougui, D., Mäkinen, V.: A framework for space-efficient read clustering in metagenomic samples. BMC Bioinform. 18(3), 59 (2017)
Article Google Scholar
Orabi, B., et al.: Alignment-free clustering of UMI tagged DNA molecules. Bioinformatics (2018). https://doi.org/10.1093/bioinformatics/bty888
Ondov, B.D., et al.: Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 17(1), 132 (2016)
Article Google Scholar
Davidson, N.M., Oshlack, A.: Corset: enabling differential gene expression analysis for de novoassembled transcriptomes. Genome Biol. 15(7), 410 (2014). https://doi.org/10.1186/s13059-014-0410-6
Article Google Scholar
Malik, L., Almodaresi, F., Patro, R.: Grouper: graph-based clustering and annotation for improved de novo transcriptome analysis. Bioinformatics 34(19), 3265–3272 (2018)
Article Google Scholar
Krishnakumar, R., et al.: Systematic and stochastic influences on the performance of the MinION nanopore sequencer across a range of nucleotide bias. Sci. Rep. 8(1), 3159 (2018)
Article Google Scholar
Tseng, E.: Cogent: coding genome reconstruction using Iso-Seq data (2018). https://github.com/Magdoll/Cogent
Roberts, M., Hayes, W., Hunt, B.R., Mount, S.M., Yorke, J.A.: Reducing storage requirements for biological sequence comparison. Bioinformatics 20(18), 3363–3369 (2004)
Article Google Scholar
Au, K.F., Underwood, J.G., Lee, L., Wong, W.H.: Improving PacBio long read accuracy by short read alignment. PloS one 7(10), e46679 (2012)
Article Google Scholar
Li, H.: Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34(18), 3094–3100 (2018)
Article Google Scholar
Daily, J.: Parasail: SIMD C library for global, semi-global, and local pairwise sequence alignments. BMC Bioinform. 17(1), 81 (2016)
Article MathSciNet Google Scholar
Sahlin, K., Medvedev, P.: Exprimental details appendix to “de novo clustering of long-read transcriptome data using a greedy, quality-value based algorithm” (2019). https://github.com/ksahlin/isONclust/wiki/Paper-Appendix
Stöcker, B.K., Köster, J., Rahmann, S.: SimLoRD: simulation of long read data. Bioinformatics 32(17), 2704–2706 (2016)
Article Google Scholar
Iso-Seq in house datasets. https://github.com/PacificBiosciences/IsoSeq_SA3nUP/wiki/Iso-Seq-in-house-datasets. Accessed 24 Oct 2018
Direct RNA and cDNA sequencing of a human transcriptome on Oxford Nanopore MinION and GridION. https://github.com/nanopore-wgs-consortium/NA12878/blob/master/RNA.md. Accessed 24 Oct 2018
Korlach, J., et al.: De novo PacBio long-read and phased avian genome assemblies correct and add to reference genes generated with intermediate and short reads. GigaScience 6(10) (2017). https://doi.org/10.1093/gigascience/gix085
Rosenberg, A., Hirschberg, J.: V-measure: a conditional entropy-based external cluster evaluation measure. In: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL) (2007)
Google Scholar
Hubert, L., Arabie, P.: Comparing partitions. J. Classif. 2(1), 193–218 (1985)
Article Google Scholar

Download references

Acknowledgements

This work has been supported in part by NSF awards DBI-1356529, CCF-551439057, IIS-1453527, and IIS-1421908 to PM.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Pennsylvania State University, State College, USA
Kristoffer Sahlin & Paul Medvedev
Department of Biochemistry and Molecular Biology, Pennsylvania State University, State College, USA
Paul Medvedev
Center for Computational Biology and Bioinformatics, Pennsylvania State University, State College, USA
Paul Medvedev

Authors

Kristoffer Sahlin
View author publications
You can also search for this author in PubMed Google Scholar
Paul Medvedev
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Kristoffer Sahlin .

Editor information

Editors and Affiliations

Tufts University, Cambridge, MA, USA
Lenore J. Cowen

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Sahlin, K., Medvedev, P. (2019). De Novo Clustering of Long-Read Transcriptome Data Using a Greedy, Quality-Value Based Algorithm. In: Cowen, L. (eds) Research in Computational Molecular Biology. RECOMB 2019. Lecture Notes in Computer Science(), vol 11467. Springer, Cham. https://doi.org/10.1007/978-3-030-17083-7_14

Download citation

DOI: https://doi.org/10.1007/978-3-030-17083-7_14
Published: 02 April 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-17082-0
Online ISBN: 978-3-030-17083-7
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics