A Faster Reliable Algorithm to Estimate the p-Value of the Multinomial llr Statistic

  • Uri Keich
  • Niranjan Nagarajan
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3240)

Abstract

Estimating the p-value of the log-likelihood ratio statistic for the multinomial distribution has been studied extensively in the statistical literature. Nevertheless, bioinformatics has posed new challenges for that research by often focusing on the “thin tail” of the distribution, where classical statistical approximations typically fail. Hence, some of the more recent developments in this area have come from the bioinformatics community ([5], [3]).

Since algorithms for computing the exact p-value have an exponential complexity, the only generally applicable algorithms for reliably estimating the p-value are lattice based. In particular, Hertz and Stormo have a dynamic programming algorithm whose complexity is O(QKN²), where Q is the size of the lattice, K is the size of the alphabet and N is the size of the sample. We present a new algorithm that is practically as reliable as Hertz and Stormo’s and has a complexity of O(QKN log N). An interesting feature of our algorithm is that it can guarantee the quality of its estimated p-value.
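
To make the lattice idea concrete, here is a minimal Python sketch of a lattice-based dynamic programme with the O(QKN²) cost mentioned above. It is an illustration only: it is not the O(QKN log N) algorithm of this paper, and it only follows the general shape of a Hertz and Stormo style recursion rather than their actual code. The function names, the default lattice size `q`, and the example null distribution are all assumptions made for this sketch. Each letter's contribution to the statistic is rounded to a multiple of the lattice step, so the lattice score differs from the true score by at most K half-steps, which is what lets the sketch return a bracket around the p-value instead of a single number.

```python
import math

def llr(counts, probs):
    """Log-likelihood ratio statistic: sum_k n_k * log(n_k / (N * pi_k))."""
    n_total = sum(counts)
    return sum(c * math.log(c / (n_total * p))
               for c, p in zip(counts, probs) if c > 0)

def lattice_pvalue_bounds(s0, n_total, probs, q=2000):
    """Bracket P(llr >= s0) under the null `probs` (all entries > 0).

    Hypothetical helper: scores are rounded to a lattice of roughly q points.
    """
    s_max = n_total * math.log(1.0 / min(probs))  # largest possible llr
    delta = s_max / q                             # lattice step
    # states[(n, j)] accumulates prod_k pi_k^{n_k} / n_k! over partial count
    # vectors that use n samples and have rounded score j * delta.
    states = {(0, 0): 1.0}
    for p in probs:
        # Rounded contribution of assigning m samples to this letter.
        steps = [0] + [round(m * math.log(m / (n_total * p)) / delta)
                       for m in range(1, n_total + 1)]
        weights = [p ** m / math.factorial(m) for m in range(n_total + 1)]
        new_states = {}
        for (n, j), w in states.items():
            for m in range(n_total - n + 1):
                key = (n + m, j + steps[m])
                new_states[key] = new_states.get(key, 0.0) + w * weights[m]
        states = new_states
    scale = math.factorial(n_total)
    # Each of the K roundings is off by at most delta / 2.
    slack = len(probs) * delta / 2 + 1e-9
    lower = upper = 0.0
    for (n, j), w in states.items():
        if n != n_total:
            continue  # only complete samples of size N count
        if j * delta >= s0 + slack:
            lower += scale * w
        if j * delta >= s0 - slack:
            upper += scale * w
    return lower, upper

# Example with an assumed null distribution and observed counts.
probs = [0.2, 0.3, 0.5]
s0 = llr([9, 2, 1], probs)
lo, hi = lattice_pvalue_bounds(s0, 12, probs)
print(f"p-value lies in [{lo:.3e}, {hi:.3e}]")
```

Increasing `q` narrows the bracket at a proportional increase in work, which is the trade-off the lattice size Q controls in the complexity bounds above.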

Keywords

Numerical Error · Dynamic Programming Algorithm · Multinomial Distribution · Roundoff Error · Runtime Comparison
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

References

  1. Baglivo, J., Olivier, D., Pagano, M.: Methods for exact goodness-of-fit tests. Journal of the American Statistical Association 87(418), 464–469 (1992)
  2. Bailey, T.L., Elkan, C.: Fitting a mixture model by expectation maximization to discover motifs in biopolymers. In: Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, Menlo Park, California, pp. 28–36 (1994)
  3. Bejerano, G.: Efficient exact p-value computation and applications to biosequence analysis. In: Vingron, M., Istrail, S., Pevzner, P.A., Waterman, M.S. (eds.) Proceedings of the Seventh Annual International Conference on Computational Molecular Biology (RECOMB 2003), Berlin, Germany, pp. 38–47. ACM Press, New York (2003)
  4. Cressie, N., Read, T.R.C.: Pearson’s χ² and the loglikelihood ratio statistic G²: A comparative review. International Statistical Review 57(1), 19–43 (1989)
  5. Hertz, G.Z., Stormo, G.D.: Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics 15, 563–577 (1999)
  6. Hirji, K.A.: A comparison of algorithms for exact goodness-of-fit tests for multinomial data. Communications in Statistics - Simulation and Computation 26(3), 1197–1227 (1997)
  7. Hoeffding, W.: Asymptotically optimal tests for multinomial distributions. Annals of Mathematical Statistics 36, 369–408 (1965)
  8. Kallenberg, W.C.M.: On moderate and large deviations in multinomial distributions. Annals of Statistics 13(4), 1554–1580 (1985)
  9. Keich, U.: Efficiently computing the p-value of the entropy score. Journal of Computational Biology (in press)
  10. Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.P.: Numerical recipes in C. The art of scientific computing, 2nd edn. Cambridge University Press, Cambridge (1992)
  11. Rahmann, S.: Dynamic programming algorithms for two statistical problems in computational biology. In: Benson, G., Page, R.D.M. (eds.) WABI 2003. LNCS (LNBI), vol. 2812, pp. 151–164. Springer, Heidelberg (2003)
  12. Rice, J.A.: Mathematical Statistics and Data Analysis, 2nd edn. Duxbury Press, Boston (1995)
  13. Stormo, G.D.: DNA binding sites: representation and discovery. Bioinformatics 16(1), 16–23 (2000)

Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Uri Keich (1)
  • Niranjan Nagarajan (1)
  1. Department of Computer Science, Cornell University, Ithaca, USA