Statistical Estimate for the Size of the Protein Structural Vocabulary

Fu, Xuezheng; Chen, Bernard; Pan, Yi; Harrison, Robert W.

doi:10.1007/978-3-540-72031-7_48

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 4463)

Abstract

The concept of structural clusters defining the vocabulary of protein structure is one of the central concepts in the modern theory of protein folding. Typically, clusters are found by a variation of the K-means or K-NN algorithm. In this paper we study approaches to estimating the number of clusters in data. The optimal number of clusters is believed to be the one that results in a reliable clustering. Stability with respect to bootstrap sampling was adapted as the cluster validation measure for identifying a reliable clustering. To test this algorithm, six random subsets were drawn from the unique chains in the PDB. In each case the algorithm converged to a unique set of reliable clusters. Since these clusters were drawn randomly from the total current set of chains, counting the number of coincidences and applying basic sampling theory provides a rigorous statistical estimate of the number of unique clusters in the dataset.
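
The abstract does not state which sampling-theory formula is used to turn coincidence counts into a vocabulary size. As a rough illustration only, the sketch below uses a capture-recapture estimate (Chapman's bias-corrected form of the Lincoln-Petersen estimator), which may differ from the authors' method. It assumes that clusters found in different random subsets have already been matched and given shared identifiers; the function names and the toy identifiers are hypothetical.

```python
from itertools import combinations
from statistics import mean


def chapman_estimate(sample_a, sample_b):
    """Estimate the total number of distinct clusters from two random samples.

    Assumption: each sample is a collection of cluster identifiers, and the
    same identifier in both samples means the same structural cluster.
    """
    a, b = set(sample_a), set(sample_b)
    overlap = len(a & b)  # coincidences: clusters seen in both samples
    # Chapman's bias-corrected Lincoln-Petersen estimator; it stays finite
    # even when the two samples share no clusters.
    return (len(a) + 1) * (len(b) + 1) / (overlap + 1) - 1


def pairwise_estimate(samples):
    """Average the two-sample estimate over every pair of samples."""
    return mean(chapman_estimate(a, b) for a, b in combinations(samples, 2))


if __name__ == "__main__":
    # Toy data: two samples of 50 cluster ids that share 25 ids.
    first = range(0, 50)
    second = range(25, 75)
    print(chapman_estimate(first, second))
```

With several random subsets, as in the six-subset experiment described above, `pairwise_estimate` averages the estimate over all pairs; a small overlap between samples implies a large vocabulary, and a large overlap implies a small one.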

Author information

Authors and Affiliations

Department of Computer Science, Georgia State University, Atlanta, GA, 30303
Xuezheng Fu, Bernard Chen, Yi Pan & Robert W. Harrison
Department of Biology, Georgia State University, Atlanta, GA, 30303
Robert W. Harrison

Editor information

Ion Măndoiu, Alexander Zelikovsky

About this paper

Cite this paper

Fu, X., Chen, B., Pan, Y., Harrison, R.W. (2007). Statistical Estimate for the Size of the Protein Structural Vocabulary. In: Măndoiu, I., Zelikovsky, A. (eds) Bioinformatics Research and Applications. ISBRA 2007. Lecture Notes in Computer Science, vol 4463. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-72031-7_48

DOI: https://doi.org/10.1007/978-3-540-72031-7_48
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-72030-0
Online ISBN: 978-3-540-72031-7
eBook Packages: Computer Science, Computer Science (R0)
