On Optimal Read Trimming in Next Generation Sequencing and Its Complexity

Hedtke, Ivo; Lemnian, Ioana; Müller-Hannemann, Matthias; Grosse, Ivo

doi:10.1007/978-3-319-07953-0_7

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 8542)

Included in the following conference series:

International Conference on Algorithms for Computational Biology

Abstract

Read trimming is a fundamental first step of the analysis of next generation sequencing (NGS) data. Traditionally, read trimming is performed heuristically, and algorithmic work in this area has been neglected. Here, we address this topic and formulate three constrained optimization problems for block-based trimming, i.e., truncating the same low-quality positions at both ends for all reads and removing low-quality truncated reads. We find that the three problems are \(\mathcal{NP}\)-hard. However, the non-random distribution of quality scores in NGS data sets makes it tempting to speculate that quality constraints for read positions are typically satisfied by fulfilling quality constraints for reads. Based on this speculation, we propose three relaxed problems and develop efficient polynomial-time algorithms for them. We find that (i) the omitted constraints are indeed almost always satisfied and (ii) the algorithms for the relaxed problems typically yield a higher number of untrimmed bases than traditional heuristics.
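
The abstract does not state the three optimization problems formally, so the following is only a minimal sketch of the general idea of block-based trimming under a read-level quality constraint. It assumes that all reads have the same length, that a read is kept when its mean quality over the retained block reaches a threshold, and that the objective is the number of retained bases. The function name `best_block_trim`, the parameter `min_mean_q`, and the mean-quality criterion are illustrative assumptions, not the authors' definitions or algorithms.

```python
from typing import List, Tuple


def best_block_trim(
    quals: List[List[int]], min_mean_q: float
) -> Tuple[int, int, List[int], int]:
    """Exhaustively search for a common block [start, end) for all reads.

    Assumption (not from the paper): a read is kept if its mean quality
    inside the block is at least min_mean_q, and the objective is the
    total number of retained bases (block width times kept reads).
    Returns (start, end, kept_read_indices, retained_bases).
    """
    n_reads = len(quals)
    read_len = len(quals[0]) if n_reads else 0

    # Per-read prefix sums so any block sum is an O(1) lookup.
    prefix = []
    for row in quals:
        acc = [0]
        for q in row:
            acc.append(acc[-1] + q)
        prefix.append(acc)

    best: Tuple[int, int, List[int], int] = (0, 0, [], 0)
    for start in range(read_len):
        for end in range(start + 1, read_len + 1):
            width = end - start
            # Keep reads whose block sum meets the mean-quality threshold.
            kept = [
                i
                for i in range(n_reads)
                if prefix[i][end] - prefix[i][start] >= min_mean_q * width
            ]
            bases = width * len(kept)
            if bases > best[3]:
                best = (start, end, kept, bases)
    return best


if __name__ == "__main__":
    # Toy quality matrix: two good reads with poor ends, one poor read.
    toy = [
        [2, 30, 30, 30, 2],
        [2, 32, 28, 31, 3],
        [2, 5, 6, 4, 2],
    ]
    print(best_block_trim(toy, min_mean_q=25))
```

The search enumerates all O(m²) blocks for reads of length m and checks each of the n reads per block, so it runs in polynomial time for this simplified read-only constraint. Per-position quality constraints, which the abstract mentions for the full problems, are deliberately left out of this sketch.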

Author information

Authors and Affiliations

Department of Mathematics and Computer Science, Osnabrück University, Albrechtstrasse 28, 49076, Osnabrück, Germany
Ivo Hedtke
Institute of Computer Science, Martin-Luther-University Halle-Wittenberg, Von-Seckendorff-Platz 1, 06120, Halle, Germany
Ivo Hedtke, Ioana Lemnian, Matthias Müller-Hannemann & Ivo Grosse
German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Deutscher Platz 5e, 04103, Leipzig, Germany
Ivo Grosse

Editor information

Editors and Affiliations

Research Group on Mathematical Linguistics, Rovira i Virgili University, Avinguda Catalunya, 35, 43002, Tarragona, Spain
Adrian-Horia Dediu & Carlos Martín-Vide
Fachbereich 07, Institut für Informatik, Justus-Liebig-Universität, Arndtstraße 2, 35392, Gießen, Germany
Bianca Truthe

About this paper

Cite this paper

Hedtke, I., Lemnian, I., Müller-Hannemann, M., Grosse, I. (2014). On Optimal Read Trimming in Next Generation Sequencing and Its Complexity. In: Dediu, A.-H., Martín-Vide, C., Truthe, B. (eds) Algorithms for Computational Biology. AlCoB 2014. Lecture Notes in Computer Science, vol 8542. Springer, Cham. https://doi.org/10.1007/978-3-319-07953-0_7

DOI: https://doi.org/10.1007/978-3-319-07953-0_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-07952-3
Online ISBN: 978-3-319-07953-0
eBook Packages: Computer Science, Computer Science (R0)
