Pairwise Statistical Significance Versus Database Statistical Significance for Local Alignment of Protein Sequences

  • Ankit Agrawal
  • Volker Brendel
  • Xiaoqiu Huang
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4983)


An important aspect of pairwise sequence comparison is assessing the statistical significance of the alignment. Most of the currently popular alignment programs report the statistical significance of an alignment in the context of a database search. This database statistical significance depends on the database, and hence the same alignment of a pair of sequences may be assigned different statistical significance values in different databases. In this paper, we explore the use of pairwise statistical significance, which is independent of any database and can be useful when we have only a pair of sequences and want to comment on their relatedness. We compared different methods and determined that censored maximum likelihood fitting of the score distribution to the right of the peak is the most accurate method for estimating pairwise statistical significance. We evaluated this method in an experiment with a subset of CATH2.3, which had previously been used by other authors as a benchmark data set for protein comparison. Comparison of the results with the database statistical significance reported by popular programs like SSEARCH and PSI-BLAST indicates that the results of pairwise statistical significance are comparable to, and sometimes significantly better than, those of database statistical significance (with SSEARCH). However, PSI-BLAST performs best, presumably due to its use of query-specific substitution matrices.
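As a rough illustration of what censored maximum likelihood fitting to the right of the peak can look like, the sketch below fits an extreme value (Gumbel) distribution to a sample of scores, treating every score below the estimated peak as censored, and then converts a score into a P-value. Several things here are assumptions rather than details taken from the abstract: the Gumbel form (the standard model for local alignment scores), the crude integer-histogram peak estimate, the search bracket for lambda, the synthetic scores used in the demo, and all function names. How the empirical score distribution for a sequence pair is generated is not specified here and is left out.

```python
import math
import random
from collections import Counter


def peak_cutoff(scores):
    """Crude peak estimate (assumption): most frequent score after rounding."""
    return Counter(round(s) for s in scores).most_common(1)[0][0]


def fit_censored_gumbel(scores, cutoff, lam_lo=1e-3, lam_hi=2.0, tol=1e-9):
    """Left-censored ML fit of a Gumbel distribution; returns (lam, mu).

    Scores below `cutoff` contribute only their count; scores at or above
    it contribute their exact values. The bracket [lam_lo, lam_hi] is an
    assumed search range for the scale parameter lam.
    """
    observed = [s for s in scores if s >= cutoff]
    n, z = len(observed), len(scores) - len(observed)
    if n == 0:
        raise ValueError("no scores at or above the cutoff")
    total = sum(observed)

    def tail_sum(lam):
        return sum(math.exp(-lam * x) for x in observed) + z * math.exp(-lam * cutoff)

    def profile_loglik(lam):
        # Log-likelihood with mu maximised out analytically:
        # mu(lam) = log(n / tail_sum(lam)) / lam
        return n * math.log(lam) - lam * total + n * math.log(n / tail_sum(lam)) - n

    # Golden-section search for the lam that maximises the profile likelihood.
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lam_lo, lam_hi
    p, q = b - ratio * (b - a), a + ratio * (b - a)
    fp, fq = profile_loglik(p), profile_loglik(q)
    while b - a > tol:
        if fp > fq:
            b, q, fq = q, p, fp
            p = b - ratio * (b - a)
            fp = profile_loglik(p)
        else:
            a, p, fp = p, q, fq
            q = a + ratio * (b - a)
            fq = profile_loglik(q)
    lam = (a + b) / 2.0
    mu = math.log(n / tail_sum(lam)) / lam
    return lam, mu


def gumbel_pvalue(score, lam, mu):
    """P(S >= score) under the fitted Gumbel distribution."""
    return -math.expm1(-math.exp(-lam * (score - mu)))


def sample_gumbel(rng, lam, mu):
    """Draw one synthetic score by inverse-CDF sampling (demo helper only)."""
    u = max(rng.random(), 1e-300)
    return mu - math.log(-math.log(u)) / lam


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic stand-in for an empirical score distribution.
    scores = [sample_gumbel(rng, 0.25, 30.0) for _ in range(2000)]
    lam, mu = fit_censored_gumbel(scores, peak_cutoff(scores))
    print("lambda:", lam, "mu:", mu, "P(S >= 60):", gumbel_pvalue(60.0, lam, mu))
```

Censoring the left side of the distribution keeps the fit focused on the right tail, which is the region that determines the significance of high-scoring alignments. The golden-section search is simply a convenient one-dimensional optimiser for this sketch; a Newton-Raphson iteration would serve equally well.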


Keywords: Database statistical significance; Homologs; Pairwise local alignment; Pairwise statistical significance





Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Ankit Agrawal (1)
  • Volker Brendel (2)
  • Xiaoqiu Huang (1)

  1. Department of Computer Science, Iowa State University, Ames, USA
  2. Department of Genetics, Development, and Cell Biology and Department of Statistics, Iowa State University, Ames, USA
