Statistical Estimate for the Size of the Protein Structural Vocabulary

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 4463)

Abstract

The concept of structural clusters defining the vocabulary of protein structure is one of the central concepts in the modern theory of protein folding. Typically, clusters are found by a variation of the K-means or K-NN algorithm. In this paper we study approaches to estimating the number of clusters in data. The optimal number of clusters is believed to result in a reliable clustering. Stability with respect to bootstrap sampling was adapted as the cluster validation measure for estimating the reliable clustering. To test this algorithm, six random subsets were drawn from the unique chains in the PDB. The algorithm converged in each case to a unique set of reliable clusters. Since these clusters were drawn randomly from the total current set of chains, counting the number of coincidences and using basic sampling theory provides a rigorous statistical estimate of the number of unique clusters in the dataset.
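
The abstract does not name the sampling estimator it uses. As an illustration only, one standard way to turn coincidence counts between independently drawn samples into a population-size estimate is the two-sample Lincoln-Petersen (capture-recapture) formula, N ≈ n1 · n2 / m, where n1 and n2 are the numbers of distinct clusters found in two subsets and m is the number found in both. The sketch below assumes that estimator and assumes clusters from different subsets have already been matched to shared labels; the function names and the toy label sets are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def lincoln_petersen(n1, n2, shared):
    """Two-sample capture-recapture estimate of the total number of clusters.

    n1, n2 : number of distinct clusters found in each subset
    shared : number of clusters found in both subsets (coincidences)
    """
    if shared <= 0:
        raise ValueError("need at least one coincidence to form an estimate")
    return n1 * n2 / shared


def pairwise_estimates(cluster_sets):
    """Apply the estimator to every pair of subsets that share a cluster."""
    estimates = []
    for a, b in combinations(cluster_sets, 2):
        shared = len(a & b)
        if shared > 0:
            estimates.append(lincoln_petersen(len(a), len(b), shared))
    return estimates


# Toy cluster labels (assumption: integers stand in for matched clusters).
subset_a = set(range(0, 60))
subset_b = set(range(40, 100))
subset_c = set(range(20, 80))

estimates = pairwise_estimates([subset_a, subset_b, subset_c])
mean_estimate = sum(estimates) / len(estimates)
print(estimates, mean_estimate)
```

The toy sets are not random draws, so they only exercise the arithmetic. With six subsets, as in the abstract, the same function would return up to fifteen pairwise estimates, whose spread gives a rough indication of the estimator's variability.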

Editor information

Ion Măndoiu, Alexander Zelikovsky

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Fu, X., Chen, B., Pan, Y., Harrison, R.W. (2007). Statistical Estimate for the Size of the Protein Structural Vocabulary. In: Măndoiu, I., Zelikovsky, A. (eds) Bioinformatics Research and Applications. ISBRA 2007. Lecture Notes in Computer Science, vol 4463. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-72031-7_48

  • DOI: https://doi.org/10.1007/978-3-540-72031-7_48

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-72030-0

  • Online ISBN: 978-3-540-72031-7

  • eBook Packages: Computer Science, Computer Science (R0)
