Statistics in Biosciences, Volume 10, Issue 1, pp 3–19

Base-Calling Using a Random Effects Mixture Model on Next-Generation Sequencing Data

  • Ashley Cacho
  • Weixin Yao
  • Xinping Cui

Abstract

The emergence of next-generation sequencing technology has greatly influenced research in biology and clinical applications. This technology allows millions of DNA fragments to be sequenced in parallel, reducing costs and increasing throughput. One of the most widely used DNA sequencing machines is the Illumina platform, which uses a sequencing-by-synthesis method involving a series of chemical reactions and image processing. However, it suffers from biases inherent in the complex chemical processes involved. The process of converting the fluorescence intensity output of the sequencing-by-synthesis technology to nucleotide bases is known as base-calling. The resulting DNA sequences are used in downstream analyses such as genome assembly and variant detection, where the accuracy and quality of the called bases affect the results. In this paper, we introduce a random effects mixture model that captures the sequencing process and compare its performance to that of a model with fixed effects.

Keywords

Base-calling, Illumina, Random effects, MCEM, DNA sequencing

Notes

Acknowledgements

The authors thank the Institute for Integrative Genome Biology Bioinformatics Facility at University of California, Riverside, for providing the bioinformatics cluster. This material was based upon work partially supported by the National Science Foundation (DMS ATD-1222718) and the University of California, Riverside (AES-CE RSAP A01869) for X.C. and A.C.

Copyright information

© International Chinese Statistical Association 2017

Authors and Affiliations

  1. University of California Riverside, Riverside, USA
