Abstract
The concept of structural clusters defining the vocabulary of protein structure is central to the modern theory of protein folding. Typically, clusters are found by a variation of the K-means or K-NN algorithm. In this paper we study approaches to estimating the number of clusters in data. The optimal number of clusters is taken to be the one that yields a reliable clustering, and stability with respect to bootstrap sampling was adapted as the cluster validation measure for identifying it. To test this algorithm, six random subsets were drawn from the unique chains in the PDB. In each case the algorithm converged to a unique set of reliable clusters. Since these clusters were derived from subsets drawn randomly from the total current set of chains, counting the number of coincidences and using basic sampling theory provides a rigorous statistical estimate of the number of unique clusters in the dataset.
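The abstract does not say which sampling-theory estimator is applied to the coincidence counts. As a minimal sketch of the general idea, the code below uses the classical two-sample capture-recapture (Lincoln-Petersen) estimate and its bias-corrected Chapman variant, which are assumptions here and not necessarily the authors' method. Each random subset is represented by a hypothetical set of cluster labels, and two clusters "coincide" when they share a label; how clusters from different subsets are matched in practice is not specified in the abstract.

```python
from itertools import combinations


def lincoln_petersen(a, b):
    """Estimate total population size from two independent samples.

    a, b: sets of cluster labels found in two random subsets (hypothetical
    representation). The estimate is |a| * |b| / |a & b|.
    """
    shared = len(a & b)  # number of coincidences between the two samples
    if shared == 0:
        raise ValueError("no coincidences: the estimate is undefined")
    return len(a) * len(b) / shared


def chapman(a, b):
    """Chapman's bias-corrected variant; defined even with zero coincidences."""
    shared = len(a & b)
    return (len(a) + 1) * (len(b) + 1) / (shared + 1) - 1


def pairwise_estimates(samples, estimator=chapman):
    """Apply the estimator to every pair of samples (15 pairs for 6 subsets)."""
    return [estimator(a, b) for a, b in combinations(samples, 2)]


if __name__ == "__main__":
    # Toy labels only; these are not results from the paper.
    s1 = {1, 2, 3, 4}
    s2 = {3, 4, 5, 6}
    s3 = {1, 4, 6, 7}
    print(lincoln_petersen(s1, s2))
    print(pairwise_estimates([s1, s2, s3]))
```

With more than two subsets, the spread of the pairwise estimates gives a rough indication of how stable the estimated vocabulary size is.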
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Fu, X., Chen, B., Pan, Y., Harrison, R.W. (2007). Statistical Estimate for the Size of the Protein Structural Vocabulary. In: Măndoiu, I., Zelikovsky, A. (eds) Bioinformatics Research and Applications. ISBRA 2007. Lecture Notes in Computer Science, vol 4463. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-72031-7_48
DOI: https://doi.org/10.1007/978-3-540-72031-7_48
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-72030-0
Online ISBN: 978-3-540-72031-7
eBook Packages: Computer Science, Computer Science (R0)