Decoding the Structural Keywords in Protein Structure Universe
Although the protein sequence-structure gap continues to widen owing to the development of high-throughput sequencing tools, the protein structure universe appears to be nearing completeness, with no proteins with novel structural folds deposited in the Protein Data Bank (PDB) recently. In this work, we identify a protein structural dictionary (Frag-K) composed of backbone fragments ranging from 4 to 20 residues that serve as structural “keywords” and can effectively distinguish between major protein folds. We first apply randomized spectral clustering and random forest algorithms to construct representative and sensitive protein fragment libraries from a large set of high-quality, non-homologous protein structures available in the PDB. We then analyze the impact of clustering cut-offs on the performance of the fragment libraries. Finally, the Frag-K fragments are employed as structural features to classify protein structures into the major protein folds defined by SCOP (Structural Classification of Proteins). Our results show that a structural dictionary of ~400 Frag-K fragments, each 4 to 20 residues long, can classify major SCOP folds with high accuracy.
Keywords: protein fragment, fold recognition, protein structure universe
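The idea of using dictionary fragments as structural features can be illustrated with a minimal sketch. It rests on assumptions that the abstract does not confirm: fragments are compared by C-alpha RMSD after Kabsch superposition, a window matches a fragment at an arbitrary 1.0 Å cutoff, and a structure is summarized by per-fragment match counts. The helpers `ideal_helix` and `ideal_strand` and the two-fragment toy library are hypothetical stand-ins for real PDB-derived fragments; none of this is the authors' implementation.

```python
import numpy as np


def kabsch_rmsd(P, Q):
    """RMSD between two (n, 3) coordinate sets after optimal rigid superposition."""
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    # Singular values of the covariance matrix give the best achievable overlap.
    U, S, Vt = np.linalg.svd(P.T @ Q)
    if np.linalg.det(U @ Vt) < 0:  # disallow mirror-image superpositions
        S[-1] = -S[-1]
    sq = (P ** 2).sum() + (Q ** 2).sum() - 2.0 * S.sum()
    return float(np.sqrt(max(sq, 0.0) / len(P)))


def fragment_features(ca_coords, library, cutoff=1.0):
    """Count the chain windows within `cutoff` angstroms RMSD of each library fragment.

    The cutoff value is an assumption chosen for illustration only.
    """
    counts = np.zeros(len(library), dtype=int)
    for j, frag in enumerate(library):
        k = len(frag)
        for i in range(len(ca_coords) - k + 1):
            if kabsch_rmsd(ca_coords[i:i + k], frag) <= cutoff:
                counts[j] += 1
    return counts


def ideal_helix(n):
    """Hypothetical helper: idealized alpha-helix C-alpha trace."""
    t = np.deg2rad(100.0) * np.arange(n)
    return np.column_stack([2.3 * np.cos(t), 2.3 * np.sin(t), 1.5 * np.arange(n)])


def ideal_strand(n):
    """Hypothetical helper: extended trace idealized as a straight line."""
    return np.column_stack([3.3 * np.arange(n), np.zeros(n), np.zeros(n)])


# Toy two-"keyword" dictionary and a translated 10-residue helical chain.
library = [ideal_helix(4), ideal_strand(4)]
chain = ideal_helix(10) + np.array([5.0, -3.0, 2.0])
features = fragment_features(chain, library)
print(features)
```

Count vectors of this kind, one per structure, could then be passed to any standard classifier, such as a random forest, with SCOP fold labels as targets. That step is omitted here.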