Base-Calling Using a Random Effects Mixture Model on Next-Generation Sequencing Data
The emergence of next-generation sequencing technology has greatly influenced research in biology and clinical applications. This technology allows millions of DNA fragments to be sequenced in parallel, reducing costs and increasing throughput. One of the most widely used DNA sequencing platforms is Illumina, which uses a sequencing-by-synthesis method involving a series of chemical reactions and image processing. However, it suffers from biases inherent in the complex chemical processes involved. The process of converting the fluorescence intensity output of sequencing-by-synthesis into nucleotide bases is known as base-calling. The resulting DNA sequences are used in downstream analyses such as genome assembly and variant detection, where the accuracy and quality of the called bases affect the results. In this paper, we introduce a random effects mixture model that captures the sequencing process and compare its performance to that of a model with fixed effects.
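To make the notion of base-calling concrete, the following is a minimal sketch of the simplest possible caller: at each sequencing cycle it picks the channel with the highest intensity. It is not the random effects mixture model introduced in the paper, and it ignores the chemical biases mentioned above. The function name `naive_base_call`, the channel order A, C, G, T, and the toy intensity values are all assumptions made for illustration.

```python
import numpy as np

# Assumed channel order; real instruments may order channels differently.
BASES = np.array(list("ACGT"))


def naive_base_call(intensities):
    """Call one base per cycle by taking the brightest of four channels.

    `intensities` is expected to have shape (n_cycles, 4), one row per
    sequencing cycle and one column per channel in the order given by BASES.
    """
    intensities = np.asarray(intensities, dtype=float)
    if intensities.ndim != 2 or intensities.shape[1] != 4:
        raise ValueError("expected an array of shape (n_cycles, 4)")
    # Index of the largest intensity in each row (cycle).
    brightest = intensities.argmax(axis=1)
    return "".join(BASES[brightest])


# Toy intensities for three cycles (made-up values, not real data).
toy = [
    [0.90, 0.10, 0.05, 0.02],
    [0.10, 0.20, 0.80, 0.10],
    [0.00, 0.10, 0.10, 0.70],
]
read = naive_base_call(toy)
```

A model-based caller would replace the per-cycle maximum with an estimate that accounts for the biases of the chemistry, which is where the accuracy differences between callers arise.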
Keywords: Base-calling, Illumina, Random effects, MCEM, DNA sequencing
The authors thank the Institute for Integrative Genome Biology Bioinformatics Facility at the University of California, Riverside, for providing the bioinformatics cluster. This material is based upon work partially supported by the National Science Foundation (DMS ATD-1222718) and the University of California, Riverside (AES-CE RSAP A01869) for X.C. and A.C.