GCG: Database Searching
Part of the Methods in Molecular Biology book series (MIMB, volume 24)
Searches in databases require efficiency and speed. This cannot be achieved with the methods described in the previous chapters on sequence comparison: it would take much too long to calculate alignment path matrices between the query sequence and every database sequence. However, calculation precision is still needed, because a search of even a “small” database of 10,000 sequences can no longer be controlled interactively by the researcher. The computer should still be able to separate statistical noise from real “similarity.” This goal, however, cannot be fully achieved in practice. Fig. 1A shows a typical distribution of alignment scores between a query sequence and the database sequences. The identities will be clearly separated. Interspecies homologies might be clearly visible, but the interesting sequences, the distantly related ones, might well be hidden in the statistical noise. The “noise” is drawn with arrows on top of the score bars to indicate that these bars are extremely large. The problem is even greater if you are trying to identify distantly related sequences. In that case there will be no identity matches and no interspecies homology matches, and the resulting plot will show only a very broad band of statistical noise (see Fig. 1B). The following considerations will guide you in searching for a sequence in the database without being easily misled.
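The idea of separating statistical noise from real similarity can be illustrated with a minimal sketch. It is not the method used by the GCG programs: it simply assumes that each database sequence has already received a numeric score and expresses every score as a z-score, that is, as a number of standard deviations above the mean of all scores. The function names `z_scores` and `flag_candidates`, the threshold of 3.0, and the example scores are hypothetical and chosen only for illustration.

```python
from statistics import mean, stdev


def z_scores(scores):
    """Return (score - mean) / standard deviation for every score.

    Assumes at least two scores; statistics.stdev raises otherwise.
    """
    mu = mean(scores)
    sigma = stdev(scores)  # sample standard deviation
    if sigma == 0:
        # All scores identical: nothing stands out from the background.
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]


def flag_candidates(scores, threshold=3.0):
    """Return indices of scores lying more than `threshold` standard
    deviations above the mean (hypothetical cutoff, not a GCG default)."""
    return [i for i, z in enumerate(z_scores(scores)) if z > threshold]


# Hypothetical scores: 30 background ("noise") values and one strong match.
background = [20, 22, 19, 21, 23, 20, 18, 22, 21, 20] * 3
scores = background + [95]
print(flag_candidates(scores))
```

A strong match such as an identity stands out clearly under this scheme. A distantly related sequence that scores only slightly above the background would not pass the cutoff, which is the situation described above. Note also that high-scoring matches inflate the standard deviation, so a simple cutoff of this kind becomes less sensitive as more true matches are present.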
© Humana Press Inc., Totowa, NJ 1994