Skip to main content

De Novo Clustering of Long-Read Transcriptome Data Using a Greedy, Quality-Value Based Algorithm

  • Conference paper
  • First Online:
Book cover Research in Computational Molecular Biology (RECOMB 2019)

Part of the book series: Lecture Notes in Computer Science ((LNBI,volume 11467))

Abstract

Long-read sequencing of transcripts with PacBio Iso-Seq and Oxford Nanopore Technologies has proven to be central to the study of complex isoform landscapes in many organisms. However, current de novo transcript reconstruction algorithms from long-read data are limited, leaving the potential of these technologies unfulfilled. A common bottleneck is the dearth of scalable and accurate algorithms for clustering long reads according to their gene family of origin. To address this challenge, we develop isONclust, a clustering algorithm that is greedy (in order to scale) and makes use of quality values (in order to handle variable error rates). We test isONclust on three simulated and five biological datasets, across a breadth of organisms, technologies, and read depths. Our results demonstrate that isONclust is a substantial improvement over previous approaches, both in terms of overall accuracy and/or scalability to large datasets. Our tool is available at https://github.com/ksahlin/isONclust.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 59.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 74.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

  1. Byrne, A., et al.: Nanopore long-read RNAseq reveals widespread transcriptional variation among the surface receptors of individual B cells. Nature Commun. 8, 16027 (2017)

    Article  Google Scholar 

  2. Sahlin, K., Tomaszkiewicz, M., Makova, K.D., Medvedev, P.: Deciphering highly similar multigene family transcripts from Iso-Seq data with IsoCon. Nature Commun. 9(1), 4601 (2018)

    Article  Google Scholar 

  3. Tseng, E., Tang, H.T., AlOlaby, R.R., Hickey, L., Tassone, F.: Altered expression of the FMR1 splicing variants landscape in premutation carriers. Biochimica et Biophys. Acta (BBA)-Gene Regul. Mech. 1860(11), 1117–1126 (2017)

    Article  Google Scholar 

  4. Nattestad, M., et al.: Complex rearrangements and oncogene amplifications revealed by long-read DNA and RNA sequencing of a breast cancer cell line. Genome Res. 28(8), 1126–1135 (2018)

    Article  Google Scholar 

  5. Kuo, R.I., Tseng, E., Eory, L., Paton, I.R., Archibald, A.L., Burt, D.W.: Normalized long read RNA sequencing in chicken reveals transcriptome complexity similar to human. BMC Genomics 18(1), 323 (2017)

    Article  Google Scholar 

  6. Hoang, N.V., et al.: A survey of the complex transcriptome from the highly polyploid sugarcane genome using full-length isoform sequencing and de novo assembly from short read sequencing. BMC Genomics 18(1), 395 (2017)

    Article  Google Scholar 

  7. Gordon, S.P., et al.: Widespread polycistronic transcripts in fungi revealed by single-molecule mRNA sequencing. PloS One 10(7), e0132628 (2015)

    Article  Google Scholar 

  8. Tombácz, D., et al.: Long-read isoform sequencing reveals a hidden complexity of the transcriptional landscape of herpes simplex virus type 1. Front. Microbiol. 8, 1079 (2017)

    Article  Google Scholar 

  9. Marchet, C., et al.: De novo clustering of long reads by gene from transcriptomics data. Nucleic Acids Res. 47(1), e1 (2018). https://doi.org/10.1093/nar/gky834

    Article  Google Scholar 

  10. Workman, R.E., Myrka, A.M., Wong, G.W., Tseng, E., Welch Jr., K.C., Timp, W.: Single-molecule, full-length transcript sequencing provides insight into the extreme metabolism of the ruby-throated hummingbird Archilochus colubris. GigaScience 7(3), 1–12 (2018). https://doi.org/10.1093/gigascience/giy009

    Article  Google Scholar 

  11. Li, J., et al.: Long read reference genome-free reconstruction of a full-length transcriptome from astragalus membranaceus reveals transcript variants involved in bioactive compound biosynthesis. Cell Discov. 3, 17031 (2017)

    Article  Google Scholar 

  12. Liu, X., Mei, W., Soltis, P.S., Soltis, D.E., Barbazuk, W.B.: Detecting alternatively spliced transcript isoforms from single-molecule long-read sequences without a reference genome. Mol. Ecol. Resour. 17(6), 1243–1256 (2017)

    Article  Google Scholar 

  13. Edgar, R.C.: Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26(19), 2460–2461 (2010)

    Article  Google Scholar 

  14. Li, W., Godzik, A.: Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22(13), 1658–1659 (2006)

    Article  Google Scholar 

  15. James, B.T., Luczak, B.B., Girgis, H.Z.: MeShClust: an intelligent tool for clustering DNA sequences. Nucleic Acids Res. 46(14), e83 (2018)

    Article  Google Scholar 

  16. Ghodsi, M., Liu, B., Pop, M.: DNACLUST: accurate and efficient clustering of phylogenetic marker genes. BMC Bioinform. 12(1), 271 (2011)

    Article  Google Scholar 

  17. Paccanaro, A., Casbon, J.A., Saqi, M.A.: Spectral clustering of protein sequences. Nucleic Acids Res. 34(5), 1571–1580 (2006)

    Article  Google Scholar 

  18. Steinegger, M., Söding, J.: Clustering huge protein sequence sets in linear time. Nat. Commun. 9(1), 2542 (2018)

    Article  Google Scholar 

  19. Steinegger, M., Söding, J.: MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35(11), 1026–1028 (2017)

    Google Scholar 

  20. Zorita, E., Cusco, P., Filion, G.J.: Starcode: sequence clustering based on all-pairs search. Bioinformatics 31(12), 1913–1919 (2015)

    Article  Google Scholar 

  21. Bevilacqua, V., et al.: EasyCluster2: an improved tool for clustering and assembling long transcriptome reads. BMC Bioinform. 15, S7 (2014)

    Article  Google Scholar 

  22. Dost, B., Wu, C., Su, A., Bafna, V.: TCLUST: a fast method for clustering genome-scale expression data. IEEE/ACM Trans. Comput. Biol. Bioinform. (TCBB) 8(3), 808–818 (2011)

    Article  Google Scholar 

  23. Christoffels, A., Gelder, A.V., Greyling, G., Miller, R., Hide, T., Hide, W.: STACK: sequence tag alignment and consensus knowledgebase. Nucleic Acids Res. 29(1), 234–238 (2001)

    Article  Google Scholar 

  24. Burke, J., Davison, D., Hide, W.: d2\_cluster: a validated method for clustering EST and full-length cDNA sequences. Genome Res. 9(11), 1135 (1999)

    Article  Google Scholar 

  25. Chong, Z., Ruan, J., Wu, C.I.: Rainbow: an integrated tool for efficient clustering and assembling RAD-seq reads. Bioinformatics 28(21), 2732–2737 (2012)

    Article  Google Scholar 

  26. Solovyov, A., Lipkin, W.I.: Centroid based clustering of high throughput sequencing reads based on n-mer counts. BMC Bioinform. 14(1), 268 (2013)

    Article  Google Scholar 

  27. Bao, E., Jiang, T., Kaloshian, I., Girke, T.: SEED: efficient clustering of next-generation sequences. Bioinformatics 27(18), 2502–2509 (2011)

    Article  Google Scholar 

  28. Fu, L., Niu, B., Zhu, Z., Wu, S., Li, W.: CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28(23), 3150–3152 (2012)

    Article  Google Scholar 

  29. Shimizu, K., Tsuda, K.: SlideSort: all pairs similarity search for short reads. Bioinformatics 27(4), 464–470 (2010)

    Article  Google Scholar 

  30. Comin, M., Leoni, A., Schimd, M.: Clustering of reads with alignment-free measures and quality values. Algorithms Mol. Biol. 10(1), 4 (2015)

    Article  Google Scholar 

  31. Alanko, J., Cunial, F., Belazzougui, D., Mäkinen, V.: A framework for space-efficient read clustering in metagenomic samples. BMC Bioinform. 18(3), 59 (2017)

    Article  Google Scholar 

  32. Orabi, B., et al.: Alignment-free clustering of UMI tagged DNA molecules. Bioinformatics (2018). https://doi.org/10.1093/bioinformatics/bty888

  33. Ondov, B.D., et al.: Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 17(1), 132 (2016)

    Article  Google Scholar 

  34. Davidson, N.M., Oshlack, A.: Corset: enabling differential gene expression analysis for de novoassembled transcriptomes. Genome Biol. 15(7), 410 (2014). https://doi.org/10.1186/s13059-014-0410-6

    Article  Google Scholar 

  35. Malik, L., Almodaresi, F., Patro, R.: Grouper: graph-based clustering and annotation for improved de novo transcriptome analysis. Bioinformatics 34(19), 3265–3272 (2018)

    Article  Google Scholar 

  36. Krishnakumar, R., et al.: Systematic and stochastic influences on the performance of the MinION nanopore sequencer across a range of nucleotide bias. Sci. Rep. 8(1), 3159 (2018)

    Article  Google Scholar 

  37. Tseng, E.: Cogent: coding genome reconstruction using Iso-Seq data (2018). https://github.com/Magdoll/Cogent

  38. Roberts, M., Hayes, W., Hunt, B.R., Mount, S.M., Yorke, J.A.: Reducing storage requirements for biological sequence comparison. Bioinformatics 20(18), 3363–3369 (2004)

    Article  Google Scholar 

  39. Au, K.F., Underwood, J.G., Lee, L., Wong, W.H.: Improving PacBio long read accuracy by short read alignment. PloS one 7(10), e46679 (2012)

    Article  Google Scholar 

  40. Li, H.: Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34(18), 3094–3100 (2018)

    Article  Google Scholar 

  41. Daily, J.: Parasail: SIMD C library for global, semi-global, and local pairwise sequence alignments. BMC Bioinform. 17(1), 81 (2016)

    Article  MathSciNet  Google Scholar 

  42. Sahlin, K., Medvedev, P.: Exprimental details appendix to “de novo clustering of long-read transcriptome data using a greedy, quality-value based algorithm” (2019). https://github.com/ksahlin/isONclust/wiki/Paper-Appendix

  43. Stöcker, B.K., Köster, J., Rahmann, S.: SimLoRD: simulation of long read data. Bioinformatics 32(17), 2704–2706 (2016)

    Article  Google Scholar 

  44. Iso-Seq in house datasets. https://github.com/PacificBiosciences/IsoSeq_SA3nUP/wiki/Iso-Seq-in-house-datasets. Accessed 24 Oct 2018

  45. Direct RNA and cDNA sequencing of a human transcriptome on Oxford Nanopore MinION and GridION. https://github.com/nanopore-wgs-consortium/NA12878/blob/master/RNA.md. Accessed 24 Oct 2018

  46. Korlach, J., et al.: De novo PacBio long-read and phased avian genome assemblies correct and add to reference genes generated with intermediate and short reads. GigaScience 6(10) (2017). https://doi.org/10.1093/gigascience/gix085

  47. Rosenberg, A., Hirschberg, J.: V-measure: a conditional entropy-based external cluster evaluation measure. In: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL) (2007)

    Google Scholar 

  48. Hubert, L., Arabie, P.: Comparing partitions. J. Classif. 2(1), 193–218 (1985)

    Article  Google Scholar 

Download references

Acknowledgements

This work has been supported in part by NSF awards DBI-1356529, CCF-551439057, IIS-1453527, and IIS-1421908 to PM.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Kristoffer Sahlin .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Sahlin, K., Medvedev, P. (2019). De Novo Clustering of Long-Read Transcriptome Data Using a Greedy, Quality-Value Based Algorithm. In: Cowen, L. (eds) Research in Computational Molecular Biology. RECOMB 2019. Lecture Notes in Computer Science(), vol 11467. Springer, Cham. https://doi.org/10.1007/978-3-030-17083-7_14

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-17083-7_14

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-17082-0

  • Online ISBN: 978-3-030-17083-7

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics