Improving DNA Sequencing Accuracy and Throughput

  • David O. Nelson
Conference paper
Part of the IMA Volumes in Mathematics and its Applications book series (IMA, volume 81)


Lawrence Livermore National Laboratory (LLNL) is beginning to explore statistical approaches to the problem of determining the DNA sequence underlying data obtained from fluorescence-based gel electrophoresis. Features of this problem that make it interesting to statisticians include:
  • the underlying mechanics of electrophoresis is quite complex and still not completely understood;
  • the yield of fragments of any given size can be quite small and variable;
  • the mobility of fragments of a given size can depend on the terminating base;
  • the data consist of samples from one or more continuous, non-stationary signals;
  • boundaries between segments generated by distinct elements of the underlying sequence are ill-defined or nonexistent in the signal; and
  • the sampling rate of the signal greatly exceeds the rate of evolution of the underlying discrete sequence.

Current approaches to base calling address only some of these issues, and usually in a heuristic, ad hoc way. In this article we describe some of our initial efforts towards increasing base calling accuracy and throughput by providing a rational, statistical foundation to the process of deducing sequence from signal.
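
Several of the signal features listed above can be made concrete with a toy simulation. The sketch below is not the model used in the paper: it simply assumes one Gaussian peak per base, a random peak height (variable yield), and a small random shift in arrival time (variable mobility). The function name `simulate_trace` and all parameter values are hypothetical choices for illustration.

```python
import math
import random

def simulate_trace(sequence, samples_per_base=12, peak_width=5.0, seed=0):
    """Toy four-channel trace: one Gaussian peak per base, in that base's channel.

    All parameter names and values are illustrative assumptions, not taken
    from the paper.
    """
    rng = random.Random(seed)
    n_samples = samples_per_base * (len(sequence) + 2)
    trace = {base: [0.0] * n_samples for base in "ACGT"}
    for i, base in enumerate(sequence):
        # Nominal arrival time plus a small random mobility shift.
        center = samples_per_base * (i + 1) + rng.gauss(0.0, 1.0)
        # Variable yield: peak heights differ from fragment to fragment.
        height = rng.uniform(0.3, 1.0)
        for t in range(n_samples):
            z = (t - center) / peak_width
            trace[base][t] += height * math.exp(-0.5 * z * z)
    return trace

trace = simulate_trace("GATTACA")
# Many signal samples are recorded for each base of the discrete sequence.
print(len(trace["A"]), "samples for", len("GATTACA"), "bases")
```

With these assumed settings the peak width is a sizeable fraction of the peak spacing, so neighbouring peaks overlap and the trace has no clean boundary between one base and the next. That is the situation in which a base caller has to infer a short discrete sequence from a much longer sampled signal.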







Copyright information

© Springer Science+Business Media New York 1996

Authors and Affiliations

  • David O. Nelson (1, 2)
  1. Lawrence Livermore National Laboratory, Livermore, USA
  2. Statistics Department, University of California, Berkeley, USA
