BLAST and FASTA Similarity Searching for Multiple Sequence Alignment

Part of the book series: Methods in Molecular Biology (MIMB, volume 1079)

Abstract

BLAST, FASTA, and other similarity searching programs seek to identify homologous proteins and DNA sequences based on excess sequence similarity. If two sequences share much more similarity than expected by chance, the simplest explanation for the excess similarity is common ancestry, that is, homology. The most effective similarity searches compare protein sequences, rather than DNA sequences, for sequences that encode proteins, and use expectation values, rather than percent identity, to infer homology. The BLAST and FASTA packages of sequence comparison programs provide programs for comparing protein and DNA sequences to protein databases (the most sensitive searches). Protein and translated-DNA comparisons to protein databases routinely allow evolutionary look-back times of 1 to 2 billion years; DNA:DNA searches are 5–10-fold less sensitive. BLAST and FASTA can be run on popular web sites, but can also be downloaded and installed on local computers. With local installation, target databases can be customized for the sequence data being characterized. With today’s very large protein databases, search sensitivity can also be improved by searching smaller comprehensive databases, for example, a complete protein set from an evolutionarily neighboring model organism. By default, BLAST and FASTA use scoring strategies targeted at distant evolutionary relationships; for comparisons involving short domains or queries, or searches that seek relatively close homologs (e.g., mouse–human), shallower scoring matrices will be more effective. Both BLAST and FASTA provide very accurate statistical estimates, which can be used to reliably identify protein sequences that diverged more than 2 billion years ago.
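
The link between expectation values and database size can be illustrated with a small calculation. The sketch below is not taken from the chapter; it assumes the standard bit-score relationship E = m × N × 2^(−S′), where m is the query length, N the combined length of the database sequences, and S′ the bit score, and it ignores the effective-length corrections that the real programs apply. The second function is a simplified entry-count form (number of entries times a pairwise estimate), included only to show why the two formulations in Note 5 agree for average-length proteins. All function names and numbers are hypothetical.

```python
def evalue_by_total_length(bit_score, query_len, db_total_len):
    """E = m * N * 2^(-S'), with N the combined length of all database sequences."""
    return query_len * db_total_len * 2.0 ** (-bit_score)


def evalue_by_entry_count(bit_score, query_len, subject_len, n_entries):
    """E = D * (m * n * 2^(-S')), with D the number of database entries.

    This is a simplification used for illustration, not the exact
    calculation performed by any particular program.
    """
    return n_entries * query_len * subject_len * 2.0 ** (-bit_score)


if __name__ == "__main__":
    bits, query_len = 40.0, 300  # hypothetical alignment score and query length

    # The same alignment scored against a large and a 1000-fold smaller database.
    large = evalue_by_total_length(bits, query_len, db_total_len=10_000_000_000)
    small = evalue_by_total_length(bits, query_len, db_total_len=10_000_000)
    print(f"large database: E = {large:.2g}")
    print(f"small database: E = {small:.2g}")

    # For a subject of exactly average length, both formulations coincide.
    n_entries, avg_len = 1000, 350
    by_entries = evalue_by_entry_count(bits, query_len, avg_len, n_entries)
    by_length = evalue_by_total_length(bits, query_len, n_entries * avg_len)
    print(f"entry-count form: {by_entries:.3g}, total-length form: {by_length:.3g}")
```

Under these assumptions, a 40-bit alignment with a 300-residue query gives an expectation value of about 2.7 in the larger database and about 0.0027 in the smaller one, which is the sense in which a smaller comprehensive database can improve sensitivity.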

Notes

  1. The programs listed in Tables 1 and 2 are part of the NCBI BLAST+ distribution [1]. An earlier version of the BLAST distribution used the blastall program, with the -p blastp option selecting the specific program.

  2. Previous versions of BLAST provided bl2seq to compare two sequences in FASTA format. The BLAST programs now provide the “Blast2Sequences” mode by using the -subject option, rather than the -db option.

  3. The FASTA programs provide a variable scoring matrix option that shifts the scoring matrix for shorter query sequences. The BLAST programs provide the -task blastp-short and -task blastn-short options for short protein:protein and DNA:DNA searches, respectively.

  4. BLAST’s percent positive counts aligned residues with a score > 0; FASTA’s fraction similar includes aligned residues with scores ≥ 0.

  5. The BLAST programs use a slightly different formulation of the Expect value; rather than using the number of entries in the database, BLAST uses the combined length of all the sequences in the database. For average-length proteins, the two calculations give the same result.

  6. There are more than 2.5 million E. coli protein sequences from 200+ genomes available from the NCBI protein databases.

  7. Similar results are found with blastp with the PAM70 matrix, though the less stringent gap penalties used by blastp produce longer alignments.
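
The command-line options mentioned in Notes 1–3 can be sketched as argument lists. The Python example below is a minimal illustration, not part of the protocol: the helper names and the file and database names are hypothetical, and the commands are only printed, since running them assumes that BLAST+ (or the older blastall distribution) is installed and that a formatted database exists.

```python
import shlex


def blastp_database_search(query, db, evalue=0.001):
    """BLAST+ protein:protein search against a formatted database (Note 1)."""
    return ["blastp", "-query", query, "-db", db, "-evalue", str(evalue)]


def legacy_blastall_search(query, db, evalue=0.001):
    """Older distribution: blastall selects the program with -p (Note 1)."""
    return ["blastall", "-p", "blastp", "-i", query, "-d", db, "-e", str(evalue)]


def blastp_two_sequences(query, subject):
    """Blast2Sequences mode: -subject takes the place of -db (Note 2)."""
    return ["blastp", "-query", query, "-subject", subject]


def blastp_short_query(query, db):
    """Short protein queries: -task blastp-short (Note 3)."""
    return ["blastp", "-task", "blastp-short", "-query", query, "-db", db]


if __name__ == "__main__":
    # Hypothetical file and database names, for illustration only.
    commands = [
        blastp_database_search("query.fa", "my_proteins"),
        legacy_blastall_search("query.fa", "my_proteins"),
        blastp_two_sequences("query.fa", "subject.fa"),
        blastp_short_query("peptide.fa", "my_proteins"),
    ]
    for cmd in commands:
        print(shlex.join(cmd))
```

Each list can be passed to subprocess.run(cmd, check=True) on a machine where the programs are available; building the command as a list avoids shell-quoting problems with file names.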

References

  1. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL (2009) BLAST+: architecture and applications. BMC Bioinformatics 10:421

  2. Smith TF, Waterman MS (1981) Identification of common molecular subsequences. J Mol Biol 147:195–197

  3. Li W, McWilliam H, Goujon M, Cowley A, Lopez R, Pearson WR (2012) PSI-Search: iterative HOE-reduced profile SSEARCH searching. Bioinformatics 28:1650–1651

  4. Huang X, Hardison RC, Miller W (1990) A space-efficient algorithm for local similarities. Comput Appl Biosci 6:373–381

  5. Waterman MS, Eggert M (1987) A new algorithm for best subsequences alignment with application to tRNA–rRNA comparisons. J Mol Biol 197:723–728

  6. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) A basic local alignment search tool. J Mol Biol 215:403–410

  7. Karlin S, Altschul SF (1990) Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc Natl Acad Sci USA 87:2264–2268

  8. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389–3402

  9. Wootton JC, Federhen S (1993) Statistics of local complexity in amino acid sequences and sequence databases. Comput Chem 17:149–163

  10. Yu Y, Wootton JC, Altschul SF (2003) The compositional adjustment of amino acid substitution matrices. Proc Natl Acad Sci USA 100:15688–15693

  11. Krampis K, Booth T, Chapman B, Tiwari B, Bicak M, Field D, Nelson KE (2012) Cloud BioLinux: pre-configured and on-demand bioinformatics computing for the genomics community. BMC Bioinformatics 13:42

  12. Schaffer AA, Aravind L, Madden TL, Shavirin S, Spouge JL, Wolf YI, Koonin EV, Altschul SF (2001) Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res 29:2994–3005

  13. Altschul SF, Wootton JC, Gertz EM, Agarwala R, Morgulis A, Schaffer AA, Yu Y (2005) Protein database searches using compositionally adjusted substitution matrices. FEBS J 272:5101–5109

  14. Henikoff S, Henikoff JG (1992) Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA 89:10915–10919

  15. Lipman DJ, Pearson WR (1985) Rapid and sensitive protein similarity searches. Science 227:1435–1441

  16. Bateman A, Birney E, Cerruti L, Durbin R, Etwiller L, Eddy SR, Griffiths-Jones S, Howe KL, Marshall M, Sonnhammer ELL (2002) The Pfam protein families database. Nucleic Acids Res 30:276–280

  17. Gonzalez MW, Pearson WR (2010) Homologous over-extension: a challenge for iterative similarity searches. Nucleic Acids Res 38:2177–2189

  18. Zhang Z, Berman P, Miller W (1998) Alignments without low-scoring regions. J Comput Biol 5:197–210

  19. UniProt Consortium (2011) Ongoing and future developments at the universal protein resource. Nucleic Acids Res 39:D214–D219

  20. Altschul SF (1991) Amino acid substitution matrices from an information theoretic perspective. J Mol Biol 219:555–565

  21. Mueller T, Spang R, Vingron M (2002) Estimating amino acid substitution models: a comparison of Dayhoff’s estimator, the resolvent approach and a maximum likelihood method. Mol Biol Evol 19:8–13

  22. Reese JT, Pearson WR (2002) Empirical determination of effective gap penalties for sequence comparison. Bioinformatics 18:1500–1507

  23. Pearson WR, Lipman DJ (1988) Improved tools for biological sequence comparison. Proc Natl Acad Sci USA 85:2444–2448

  24. Pearson WR (1996) Effective protein sequence comparison. Methods Enzymol 266:227–258

  25. Pearson WR, Wood TC, Zhang Z, Miller W (1997) Comparison of DNA sequences with protein sequences. Genomics 46:24–36

  26. Huang X, Miller W (1991) A time-efficient, linear-space local similarity algorithm. Adv Appl Math 12:337–357

  27. Mackey AJ, Haystead TAJ, Pearson WR (2002) Getting more from less: algorithms for rapid protein identification with multiple short peptide sequences. Mol Cell Proteomics 1:139–147

  28. Damer CK, Partridge J, Pearson WR, Haystead TAJ (1998) Rapid identification of protein phosphatase 1-binding proteins by mixed peptide sequencing and data base searching. Characterization of a novel holoenzymic form of protein phosphatase 1. J Biol Chem 273:24396–24405

Copyright information

© 2014 Springer Science+Business Media, LLC

About this protocol

Cite this protocol

Pearson, W.R. (2014). BLAST and FASTA Similarity Searching for Multiple Sequence Alignment. In: Russell, D. (eds) Multiple Sequence Alignment Methods. Methods in Molecular Biology, vol 1079. Humana Press, Totowa, NJ. https://doi.org/10.1007/978-1-62703-646-7_5

  • DOI: https://doi.org/10.1007/978-1-62703-646-7_5

  • Publisher Name: Humana Press, Totowa, NJ

  • Print ISBN: 978-1-62703-645-0

  • Online ISBN: 978-1-62703-646-7

  • eBook Packages: Springer Protocols
