Estimating Pairwise Statistical Significance of Protein Local Alignments Using a Clustering-Classification Approach Based on Amino Acid Composition

  • Ankit Agrawal
  • Arka Ghosh
  • Xiaoqiu Huang
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4983)


A central question in pairwise sequence comparison is assessing the statistical significance of an alignment. For ungapped alignments with a single substitution matrix, the alignment score is known to follow an extreme value distribution whose parameters K and λ can be calculated analytically. No statistical theory is currently available for the gapped case or for alignments using multiple scoring matrices, although their score distributions are known to closely follow an extreme value distribution whose parameters can be estimated by simulation. Ideal estimation would require a simulation for each sequence pair, which is impractical. In this paper, we present a simple clustering-classification approach based on amino acid composition to estimate K and λ for a given sequence pair and scoring scheme, including schemes that use multiple parameter sets. The resulting values of K and λ vary widely across cluster pairs even for the same scoring scheme, underscoring the heavy dependence of K and λ on amino acid composition. The proposed approach is an attempt to separate out the influence of amino acid composition in estimating the statistical significance of pairwise protein alignments. Experiments and analysis of other approaches to estimating the statistical parameters also indicate that the methods used in this work estimate statistical significance with good accuracy.
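The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the standard Karlin-Altschul form of the extreme value P-value, P(S ≥ x) ≈ 1 − exp(−K·m·n·e^(−λx)), for sequences of lengths m and n. The nearest-centroid rule with Euclidean distance, the symmetric lookup table keyed by cluster pair, and all function and parameter names (`composition`, `nearest_cluster`, `evd_pvalue`, `pairwise_pvalue`, `centroids`, `params`) are hypothetical choices made for illustration; the abstract does not specify the clustering or classification method.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(seq):
    """Return the 20-dimensional amino acid frequency vector of seq."""
    counts = Counter(seq.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[a] / total for a in AMINO_ACIDS]


def nearest_cluster(vec, centroids):
    """Index of the centroid closest to vec (Euclidean distance).

    Assumption: a nearest-centroid classifier stands in for whatever
    classification rule the paper actually uses.
    """
    def dist2(c):
        return sum((v - x) ** 2 for v, x in zip(vec, c))
    return min(range(len(centroids)), key=lambda i: dist2(centroids[i]))


def evd_pvalue(score, m, n, K, lam):
    """P(S >= score) = 1 - exp(-K * m * n * exp(-lam * score))."""
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for tiny x.
    return -math.expm1(-K * m * n * math.exp(-lam * score))


def pairwise_pvalue(seq1, seq2, score, centroids, params):
    """Classify both sequences, look up (K, lambda), return the P-value.

    params maps a sorted cluster-index pair (i, j), i <= j, to a
    (K, lambda) tuple estimated offline by simulation. Treating the
    table as symmetric is an assumption of this sketch.
    """
    i = nearest_cluster(composition(seq1), centroids)
    j = nearest_cluster(composition(seq2), centroids)
    K, lam = params[(min(i, j), max(i, j))]
    return evd_pvalue(score, len(seq1), len(seq2), K, lam)
```

In this sketch the expensive step, simulating score distributions to fit K and λ, is done once per cluster pair rather than once per sequence pair, so scoring a new pair costs only two classifications and a table lookup.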


Keywords: Clustering, Classification, Pairwise local alignment, Statistical significance





Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Ankit Agrawal (1)
  • Arka Ghosh (2)
  • Xiaoqiu Huang (1)

  1. Department of Computer Science, Iowa State University, Ames, USA
  2. Department of Statistics, Iowa State University, Ames, USA
