Abstract
Amino acid sequence comparison to find similarities between proteins is a fundamental analysis for inferring protein structure and function. In this study, we improve amino acid substitution matrices for identifying distantly related proteins. We systematically sampled and benchmarked substitution matrices generated from a principal component analysis (PCA) subspace built from a set of typical existing matrices. Based on the benchmark results, we identified a region of highly sensitive matrices in the PCA subspace using kernel density estimation (KDE). Using the PCA subspace, we deduced a novel sensitive matrix, called MIQS, which detects distantly related proteins better than existing matrices do. This approach to deriving an efficient amino acid substitution matrix might influence many fields of protein sequence analysis. MIQS is available at http://csas.cbrc.jp/Ssearch/.
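The workflow summarized in the abstract (PCA over existing matrices, sampling from the subspace, benchmarking, and KDE over the good samples) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the input matrices are random placeholders rather than real substitution matrices, `toy_benchmark` is a hypothetical stand-in for a real sensitivity benchmark, and the grid sampling, the top-10% cutoff, and the bandwidth are arbitrary choices made only for the example.

```python
import numpy as np

N_AA = 20
IU = np.triu_indices(N_AA)  # 210 unique entries of a symmetric 20x20 matrix


def to_vector(matrix):
    """Flatten a symmetric matrix to its upper triangle (including diagonal)."""
    return matrix[IU]


def to_matrix(vector):
    """Rebuild the symmetric matrix from its upper-triangle vector."""
    m = np.zeros((N_AA, N_AA))
    m[IU] = vector
    return m + np.triu(m, 1).T


def fit_pca(vectors, n_components=2):
    """PCA via SVD; returns the mean vector and the leading principal axes."""
    mean = vectors.mean(axis=0)
    _, _, vt = np.linalg.svd(vectors - mean, full_matrices=False)
    return mean, vt[:n_components]


def gaussian_kde(points, queries, bandwidth):
    """Plain Gaussian kernel density estimate of `points` evaluated at `queries`."""
    d2 = ((queries[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    dim = points.shape[1]
    norm = len(points) * (2 * np.pi * bandwidth**2) ** (dim / 2)
    return np.exp(-d2 / (2 * bandwidth**2)).sum(axis=1) / norm


def toy_benchmark(matrix):
    """HYPOTHETICAL score; a real study would run a homology-detection benchmark."""
    return -np.abs(matrix.trace() - 80.0)


rng = np.random.default_rng(0)
# PLACEHOLDER input: random symmetric matrices, not real substitution matrices.
toy = rng.normal(size=(12, N_AA, N_AA))
existing = np.stack([to_vector(m + m.T) for m in toy])

mean, axes = fit_pca(existing)
coords = (existing - mean) @ axes.T  # existing matrices in PC space

# Sample candidate points on a grid spanning the existing matrices (assumption).
lo, hi = coords.min(axis=0), coords.max(axis=0)
g1, g2 = np.meshgrid(np.linspace(lo[0], hi[0], 15), np.linspace(lo[1], hi[1], 15))
grid = np.column_stack([g1.ravel(), g2.ravel()])

# Map each sampled point back to a full matrix and score it.
candidates = [to_matrix(mean + p @ axes) for p in grid]
scores = np.array([toy_benchmark(m) for m in candidates])

# Density of the top-scoring 10% of samples; its peak marks the dense "good" region.
top = grid[scores >= np.quantile(scores, 0.9)]
density = gaussian_kde(top, grid, bandwidth=0.5 * (hi - lo).mean())
best_point = grid[np.argmax(density)]
best_matrix = to_matrix(mean + best_point @ axes)
```

Taking the density peak of the well-performing samples, rather than the single best-scoring sample, is one way to favor a region that is consistently good over an isolated lucky point; whether that matches the original selection rule is an assumption here.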
Abbreviations
- AUC: Area under the ROC curve
- BLOSUM: Block substitution matrix
- KDE: Kernel density estimation
- MIQS: Matrix to improve quality in similarity search
- PCA: Principal component analysis
- ROC: Receiver operating characteristic
- VTML: Variable time maximum likelihood
Acknowledgments
This work was partially supported by the Platform Project for Supporting in Drug Discovery and Life Science Research (Platform for Drug Discovery, Informatics, and Structural Life Science) from the Ministry of Education, Culture, Sports, Science, and Technology (MEXT) and the Japan Agency for Medical Research and Development (AMED). We thank Drs. Somlata Gupta, Kumiko Nakada-Tsukui, and Tomoyoshi Nozaki of NIID for discussions related to IMD/I-BAR domains in E. histolytica. We thank Toshiyuki Oda for conducting the HHblits search.
Copyright information
© 2016 Springer Science+Business Media New York
About this protocol
Cite this protocol
Tomii, K., Yamada, K. (2016). Systematic Exploration of an Efficient Amino Acid Substitution Matrix: MIQS. In: Carugo, O., Eisenhaber, F. (eds) Data Mining Techniques for the Life Sciences. Methods in Molecular Biology, vol 1415. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-3572-7_11
DOI: https://doi.org/10.1007/978-1-4939-3572-7_11
Published:
Publisher Name: Humana Press, New York, NY
Print ISBN: 978-1-4939-3570-3
Online ISBN: 978-1-4939-3572-7
eBook Packages: Springer Protocols