Abstract
Read trimming is a fundamental first step in the analysis of next-generation sequencing (NGS) data. Traditionally, read trimming is performed heuristically, and algorithmic work in this area has been neglected. Here, we address this topic and formulate three constrained optimization problems for block-based trimming, i.e., truncating the same low-quality positions at both ends of all reads and removing truncated reads that remain of low quality. We find that all three problems are \(\mathcal{NP}\)-hard. However, the non-random distribution of quality scores in NGS data sets makes it tempting to speculate that quality constraints on read positions are typically satisfied once the quality constraints on reads are fulfilled. Based on this speculation, we propose three relaxed problems and develop efficient polynomial-time algorithms for them. We find that (i) the omitted constraints are indeed almost always satisfied and (ii) the algorithms for the relaxed problems typically yield a higher number of untrimmed bases than traditional heuristics.
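To make the idea of block-based trimming concrete, the following is a minimal sketch and not the authors' algorithm. The abstract does not state the exact constraints, so the sketch assumes a single read-level constraint: a truncated read is kept only if its mean quality is at least `min_read_mean`. The function name `block_trim`, the parameter `min_read_mean`, and the toy quality matrix are all hypothetical. It searches every window of positions shared by all reads by brute force and returns the window that retains the most bases.

```python
from typing import List, Tuple


def block_trim(
    quals: List[List[int]], min_read_mean: float
) -> Tuple[int, int, List[int], int]:
    """Brute-force block-based trimming (illustrative assumption only).

    quals[i][j] is the quality score of read i at position j; all reads
    are assumed to have the same length. Returns (start, end, kept, bases)
    where [start, end) is the window applied to every read, kept lists the
    indices of reads whose mean quality inside the window is at least
    min_read_mean, and bases = (end - start) * len(kept).
    """
    n_pos = len(quals[0])
    best: Tuple[int, int, List[int], int] = (0, 0, [], 0)
    for start in range(n_pos):
        for end in range(start + 1, n_pos + 1):
            width = end - start
            # Keep only reads that satisfy the assumed read-level constraint.
            kept = [
                i
                for i, q in enumerate(quals)
                if sum(q[start:end]) / width >= min_read_mean
            ]
            retained = width * len(kept)
            # Strict comparison: the first window reaching the maximum wins.
            if retained > best[3]:
                best = (start, end, kept, retained)
    return best


# Toy data: low-quality first and last positions, plus one poor read.
toy_quals = [
    [2, 30, 30, 30, 2],
    [2, 32, 31, 30, 2],
    [2, 5, 4, 6, 2],
]
print(block_trim(toy_quals, min_read_mean=25))  # (1, 4, [0, 1], 6)
```

On the toy data, the sketch cuts one position from each end of every read and discards the third read, retaining six bases. The exhaustive search is only meant to illustrate the objective; it makes no claim about the polynomial-time algorithms developed in the paper.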
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Hedtke, I., Lemnian, I., Müller-Hannemann, M., Grosse, I. (2014). On Optimal Read Trimming in Next Generation Sequencing and Its Complexity. In: Dediu, AH., Martín-Vide, C., Truthe, B. (eds) Algorithms for Computational Biology. AlCoB 2014. Lecture Notes in Computer Science, vol 8542. Springer, Cham. https://doi.org/10.1007/978-3-319-07953-0_7
DOI: https://doi.org/10.1007/978-3-319-07953-0_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-07952-3
Online ISBN: 978-3-319-07953-0
eBook Packages: Computer Science (R0)