GCG: Database Searching

  • Reinhard Dölz
Part of the Methods in Molecular Biology book series (MIMB, volume 24)


Database searches demand efficiency and speed, which the sequence-comparison methods described in the previous chapters cannot deliver: calculating alignment path matrices between the query sequence and every database sequence would take far too long. Precision is still needed, however, because even a search of a “small” database of 10,000 sequences can no longer be checked interactively by the researcher. The computer should therefore be able to separate statistical noise from real “similarity.” In practice, this goal cannot be fully achieved. Figure 1A shows a typical distribution of alignment scores between a query sequence and the database sequences. Identities are clearly separated. Interspecies homologies may be clearly visible, but the interesting hits, the distantly related sequences, may well be hidden in the statistical noise. The “noise” is marked with arrows on top of the score bars to indicate that these bars are extremely large. The problem is even greater if you are trying to identify distantly related sequences. You will then lack identity matches and interspecies homology matches, and the resulting plot will show only a very broad band of statistical noise (see Fig. 1B). The following considerations will guide you in searching a database for a sequence without being easily trapped.
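To give a feeling for why full alignment path matrices are impractical for searching, the following rough estimate counts matrix cells for a database of 10,000 sequences. Only the database size comes from the text; the query length, the average entry length, and the cell throughput are illustrative assumptions.

```python
# Rough cost estimate for aligning one query against every database entry
# with a full alignment path matrix (one cell per residue pair).
# All numeric inputs except the database size are illustrative assumptions.

def matrix_cells(query_len: int, entry_len: int, n_entries: int) -> int:
    """Total number of path-matrix cells for a full pairwise scan."""
    return query_len * entry_len * n_entries

N_ENTRIES = 10_000       # the "small" database mentioned in the text
QUERY_LEN = 500          # assumed query length
ENTRY_LEN = 1_000        # assumed average entry length
CELLS_PER_SECOND = 1e6   # assumed throughput, purely hypothetical

cells = matrix_cells(QUERY_LEN, ENTRY_LEN, N_ENTRIES)
hours = cells / CELLS_PER_SECOND / 3600
print(f"{cells:.1e} cells, about {hours:.1f} h at the assumed rate")
```

The cost grows linearly with the number of entries, so any per-entry saving from a faster scoring scheme is multiplied across the whole database.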
Fig. 1.

Scoring histograms of typical database searches. The number of hits is plotted vs the “score” each hit obtains during the searching procedure. Subsequent alignment might change these scores because of gaps and homologies. A. Result of searching human calmodulin DNA in the EMBL database. The related protein, troponin C, is found in the steep descent of the statistical noise. B. Result of searching a randomized sequence (again, calmodulin) under precisely the same conditions. Note the random hits with low scores, and the change of scale in the X axis. C. Result of searching human calmodulin protein sequence with tfastu. Note the difference in scores relative to A. D. Result of searching an alignment of calmodulins using the profilesearch method. The reading frames of 10 calmodulins were extracted from the database and aligned as described in Chapter 9. Note the difference in the scores relative to A.
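The idea behind panels A and B, a score histogram for the real query next to one for a randomized control, can be sketched in a few lines. This is a toy illustration only: the shared-word score below is a stand-in chosen for brevity and is not the scoring used by the GCG programs, the database is random DNA rather than EMBL, and all function names are hypothetical.

```python
import random
from collections import Counter

K = 6  # word length; an arbitrary choice for this toy example

def words(seq: str, k: int = K) -> set:
    """All distinct overlapping words (k-tuples) in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def word_score(query: str, entry: str, k: int = K) -> int:
    """Toy score: number of distinct k-tuples shared by query and entry."""
    return len(words(query, k) & words(entry, k))

def score_histogram(query: str, database: list, k: int = K) -> Counter:
    """Map each score to the number of database entries obtaining it."""
    return Counter(word_score(query, entry, k) for entry in database)

def shuffled(seq: str, rng: random.Random) -> str:
    """Randomized control with the same composition as seq."""
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)

def random_dna(length: int, rng: random.Random) -> str:
    """Random DNA string used as an unrelated database entry."""
    return "".join(rng.choice("ACGT") for _ in range(length))

rng = random.Random(0)  # fixed seed so repeated runs give the same plot
query = random_dna(200, rng)

# Toy database: unrelated random entries plus one exact copy of the query.
database = [random_dna(200, rng) for _ in range(300)]
database.append(query)

real = score_histogram(query, database)
control = score_histogram(shuffled(query, rng), database)

# Text histogram: one row per score, bar length = number of hits (capped).
for label, hist in (("query", real), ("shuffled control", control)):
    print(label)
    for score in sorted(hist):
        print(f"  {score:4d} {'#' * min(hist[score], 60)}")
```

In this toy setup the identical entry stands far apart from the noise, while the shuffled control produces only low-scoring random hits. Distantly related sequences, which this sketch does not model, are the ones expected to fall inside the noise band.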





Copyright information

© Humana Press Inc., Totowa, NJ 1994

Authors and Affiliations

  • Reinhard Dölz
    • 1
  1. Biocomputing, Biozentrum der Universität Basel, Switzerland
