Searching Protein 3-D Structures in Linear Time

  • Tetsuo Shibuya
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5541)


Finding similar structures in 3-D structure databases of proteins is becoming an increasingly important issue in post-genomic molecular biology. To compare the 3-D structures of two molecules, biologists mostly use the RMSD (root mean square deviation) as the similarity measure. We propose new theoretically and practically fast algorithms for the fundamental problem of finding all substructures of the structures in a database of chain molecules (such as proteins) whose RMSDs to the query are within a given constant threshold. We first propose a breakthrough linear-expected-time algorithm for the problem, whereas the previous best-known time complexity was O(N log m), where N is the database size and m is the query size. For the expected-time analysis, we propose to use the random-walk model (or the ideal chain model) as the model of average protein structures. We furthermore propose a series of preprocessing algorithms that enable faster queries. We checked the performance of our linear-expected-time algorithm through computational experiments over the entire PDB (Protein Data Bank). According to the experiments, our algorithm is 3.6 to 28 times faster than previously known algorithms for ordinary queries. Moreover, the experimental results support the validity of our theoretical analyses.


Keywords (machine-generated, not supplied by the authors): Time Complexity, Chain Molecule, Query Size, Fast Query, Searching Protein





Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Tetsuo Shibuya (1)
  1. Human Genome Center, Institute of Medical Science, University of Tokyo, Tokyo, Japan
