Abstract
What is an e-value for ungapped and gapped BLAST? What are the Karlin-Altschul parameters that affect e-value calculation? How nucleotide frequencies and match-mismatch matrices affect such parameters? What are the key algorithms for FASTA and BLAST? How do their differences affect sensitivity of sequence search? This chapter addresses these questions and illustrates applications of string matching in genomics, transcriptomics, and proteomics, as well as in drug discovery.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Abraham EP, Chain E (1940) An enzyme from bacteria able to destroy penicillin. Rev Infect Dis 10(4):677–678
Abraham EP, Chain E, Fletcher CM, Florey HW, Gardner AD, Heatley NG, Jennings MA (1941) Further observations on penicillin. Lancet 238(6155):177–189
Alderwick LJ, Seidel M, Sahm H, Besra GS, Eggeling L (2006) Identification of a novel arabinofuranosyltransferase (AftA) involved in cell wall arabinan biosynthesis in Mycobacterium tuberculosis. J Biol Chem 281(23):15653–15661
Altschul SF (1996) Local alignment statistics. Meth Enzymol 274:460–480
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215(3):403–410
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389–3402
Arava Y, Wang Y, Storey JD, Liu CL, Brown PO, Herschlag D (2003) Genome-wide analysis of mRNA translation profiles in Saccharomyces cerevisiae. Proc Natl Acad Sci USA 100(7):3889–3894
Bastianelli G, Bouillon A, Nguyen C, Crublet E, Petres S, Gorgette O, Le-Nguyen D, Barale JC, Nilges M (2011) Computational reverse-engineering of a spider-venom derived peptide active against Plasmodium falciparum SUB1. PLoS One 6(7):e21812
Bennetzen JL, Hall BD (1982) Codon selection in yeast. J Biol Chem 257(6):3026–3031
Bergsten E, Uutela M, Li X, Pietras K, Ostman A, Heldin CH, Alitalo K, Eriksson U (2001) PDGF-D is a specific, protease-activated ligand for the PDGF beta-receptor. Nat Cell Biol 3(5):512–516
Bhatia B, Ponia SS, Solanki AK, Dixit A, Garg LC (2014) Identification of glutamate ABC-transporter component in Clostridium perfringens as a putative drug target. Bioinformation 10(7):401–405
Birney E, Stamatoyannopoulos JA, Dutta A, Guigo R, Gingeras TR, Margulies EH, Weng Z, Snyder M, Dermitzakis ET, Thurman RE et al (2007) Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447(7146):799–816
Blanchette M, Tompa M (2002) Discovery of regulatory elements by a computational method for phylogenetic footprinting. Genome Res 12(5):739–748
Blanchette M, Bataille AR, Chen X, Poitras C, Laganiere J, Lefebvre C, Deblois G, Giguere V, Ferretti V, Bergeron D et al (2006) Genome-wide computational prediction of transcriptional regulatory modules reveals new insights into human gene expression. Genome Res 6(5):656–668
Burge C, Karlin S (1997) Prediction of complete gene structures in human genomic DNA. J Mol Biol 268:78–94
Chuang SE, Daniels DL, Blattner FR (1993) Global regulation of gene expression in Escherichia coli. J Bacteriol 175(7):2026–2036
Cox SS, van der Giezen M, Tarr SJ, Crompton MR, Tovar J (2006) Evidence from bioinformatics, expression and inhibition studies of phosphoinositide-3 kinase signalling in Giardia intestinalis. BMC Microbiol 6:45
David E, Tramontin T, Zemmel R (2009) Pharmaceutical R&D: the road to positive returns. Nat Rev Drug Discov 8(8):609–610
Deng W, Lee J, Wang H, Miller J, Reik A, Gregory PD, Dean A, Blobel GA (2012) Controlling long-range genomic interactions at a native locus by targeted tethering of a looping factor. Cell 149(6):1233–1244
Deng W, Rupon JW, Krivega I, Breda L, Motta I, Jahn KS, Reik A, Gregory PD, Rivella S, Dean A et al (2014b) Reactivation of developmentally silenced globin genes by forced chromatin looping. Cell 158(4):849–860
Doolittle RF, Hunkapiller MW, Hood LE, Devare SG, Robbins KC, Aaronson SA, Antoniades HN (1983) Simian sarcoma virus onc gene, v-sis, is derived from the gene (or genes) encoding a platelet-derived growth factor. Science 221(4607):275–277
Drews J, Ryser S (1997) The role of innovation in drug development. Nat Biotechnol 15(13):1318–1319
Ehnman M, Missiaglia E, Folestad E, Selfe J, Strell C, Thway K, Brodin B, Pietras K, Shipley J, Ostman A et al (2013) Distinct effects of ligand-induced PDGFRalpha and PDGFRbeta signaling in the human rhabdomyosarcoma tumor cell and stroma cell compartments. Cancer Res 73(7):2139–2149
Ezzell C (2002) Proteins rule. Sci Am 286(4):40–47
Fernandez-Pinar R, Lo Sciuto A, Rossi A, Ranucci S, Bragonzi A, Imperi F (2015) In vitro and in vivo screening for novel essential cell-envelope proteins in Pseudomonas aeruginosa. Sci Rep 5:17593
Figeys D (2002) Adapting arrays and lab-on-a-chip technology for proteomics. Proteomics 2(4):373–382
Figeys D (2003a) Novel approaches to map protein interactions. Curr Opin Biotechnol 14(1):119–125
Figeys D (2003b) Proteomics in 2002: a year of technical development and wide-ranging applications. Anal Chem 75(12):2891–2905
Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, Kerlavage AR, Bult CJ, Tomb JF, Dougherty BA, Merrick JM et al (1995) Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269(5223):496–512
Fraser CM, Gocayne JD, White O, Adams MD, Clayton RA, Fleischmann RD, Bult CJ, Kerlavage AR, Sutton G, Kelley JM et al (1995) The minimal gene complement of Mycoplasma genitalium. Science 270(5235):397–403
Frishman D, Mironov A, Mewes HW, Gelfand M (1998) Combining diverse evidence for gene recognition in completely sequenced bacterial genomes. Nucleic Acids Res 26(12):2941–2947
Gal-Mor O, Finlay BB (2006) Pathogenicity islands: a molecular toolbox for bacterial virulence. Cell Microbiol 8(11):1707–1719
Gibbs JB (2000) Mechanism-based target identification and drug discovery in cancer research. Science 287(5460):1969–1973
Gilbert WV, Zhou K, Butler TK, Doudna JA (2007) Cap-independent translation is required for starvation-induced differentiation in yeast. Science 317(5842):1224–1227
Gumbel EJ (1958) Statistics of extremes. Columbia University Press, New York
Hacker J, Kaper JB (2000) Pathogenicity islands and the evolution of microbes. Annu Rev Microbiol 54:641–679
Hacker J, Blum-Oehler G, Muhldorfer I, Tschape H (1997) Pathogenicity islands of virulent bacteria: structure, function and impact on microbial evolution. Mol Microbiol 23(6):1089–1097
Hayes WS, Borodovsky M (1998) How to interpret an anonymous bacterial genome: machine learning approach to gene identification. Genome Res 8(11):1154–1171
Heath JR, Ribas A, Mischel PS (2016) Single-cell analysis tools for drug discovery and development. Nat Rev Drug Discov 15(3):204–216
Hofer A, Steverding D, Chabes A, Brun R, Thelander L (2001) Trypanosoma brucei CTP synthetase: a target for the treatment of African sleeping sickness. Proc Natl Acad Sci U S A 98(11):6412–6416
Hou C, Zhao H, Tanimoto K, Dean A (2008) CTCF-dependent enhancer-blocking by alternative chromatin loop formation. Proc Natl Acad Sci U S A 105(51):20398–20403
Hyatt D, Chen GL, Locascio PF, Land ML, Larimer FW, Hauser LJ (2010) Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinform 11:119
Ingolia NT, Ghaemmaghami S, Newman JRS, Weissman JS (2009) Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling. Science 324(5924):218–223
Ingolia NT, Lareau LF, Weissman JS (2011) Ribosome profiling of mouse embryonic stem cells reveals the complexity and dynamics of mammalian proteomes. Cell 147(4):789–802
Ingram VM (1956) A specific chemical difference between the globins of normal human and sickle-cell anaemia haemoglobin. Nature 178(4537):792–794
Ingram VM (1957) Gene mutations in human haemoglobin: the chemical difference between normal and sickle cell haemoglobin. Nature 180(4581):326–328
Jacob F, Monod J (1961) Genetic regulatory mechanisms in the synthesis of proteins. J Mol Biol 3:318–356
Kaneko T, Tanaka A, Sato S, Kotani H, Sazuka T, Miyajima N, Sugiura M, Tabata S (1995) Sequence analysis of the genome of the unicellular cyanobacterium Synechocystis sp. strain PCC6803. I. Sequence features in the 1 Mb region from map positions 64% to 92% of the genome. DNA Res 2(4):153–166. 191-8
Kaneko T, Sato S, Kotani H, Tanaka A, Asamizu E, Nakamura Y, Miyajima N, Hirosawa M, Sugiura M, Sasamoto S et al (1996) Sequence analysis of the genome of the unicellular cyanobacterium Synechocystis sp. strain PCC6803. II. Sequence determination of the entire genome and assignment of potential protein-coding regions. DNA Res 3(3):109–136
Kioussis D, Vanin E, deLange T, Flavell RA, Grosveld FG (1983) Beta-globin gene inactivation by DNA translocation in gamma beta-thalassaemia. Nature 306(5944):662–666
Kozak M (1981) Possible role of flanking nucleotides in recognition of the AUG initiator codon by eukaryotic ribosomes. Nucleic Acids Res 9(20):5233–5252
Kozak M (1991) Effects of long 5′ leader sequences on initiation by eukaryotic ribosomes in vitro. Gene Expr 1(2):117–125
Kozak M (1999) Initiation of translation in prokaryotes and eukaryotes. Gene 234(2):187–208
Krasemann EW, Meier V, Korenke GC, Hunneman DH, Hanefeld F (1996) Identification of mutations in the ALD-gene of 20 families with adrenoleukodystrophy/adrenomyeloneuropathy. Hum Genet 97(2):194–197
Kutlar A (2007) Sickle cell disease: a multigenic perspective of a single gene disorder. Hemoglobin 31(2):209–224
Laemmli UK (1970) Cleavage of structural proteins during the assembly of the head of bacteriophage T4. Nat Biotechnol 227:680–685
Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W et al (2001) Initial sequencing and analysis of the human genome. Nature 409(6822):860–921
Lieberman-Aiden E, van Berkum NL, Williams L, Imakaev M, Ragoczy T, Telling A, Amit I, Lajoie BR, Sabo PJ, Dorschner MO et al (2009) Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 326(5950):289–293
Lipman DJ, Pearson WR (1985) Rapid and sensitive protein similarity searches. Science 227(4693):1435–1441
Liu X, Jiang H, Gu Z, Roberts JW (2013) High-resolution view of bacteriophage lambda gene expression by ribosome profiling. Proc Natl Acad Sci U S A 110(29):11928–11933
MacKay VL, Li X, Flory MR, Turcott E, Law GL, Serikawa KA, Xu XL, Lee H, Goodlett DR, Aebersold R et al (2004) Gene expression analyzed by high-resolution state array analysis and quantitative proteomics: response of yeast to mating pheromone. Mol Cell Proteomics 3(5):478–489
Madden SL, Galella EA, Zhu J, Bertelsen AH, Beaudry GA (1997) SAGE transcript profiles for p53-dependent growth regulation. Oncogene 15(9):1079–1085
Meyer IM, Durbin R (2004) Gene structure conservation aids similarity based gene prediction. Nucleic Acids Res 32(2):776–783
Moffat JG, Rudolph J, Bailey D (2014) Phenotypic screening in cancer drug discovery – past, present and future. Nat Rev Drug Discov 13(8):588–602
Morita M, Shimozawa N, Kashiwayama Y, Suzuki Y, Imanaka T (2011) ABC subfamily D proteins and very long chain fatty acid metabolism as novel targets in adrenoleukodystrophy. Curr Drug Targets 12(5):694–706
Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B (2008) Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 5(7):621–628
Needleman SB, Wunsch CD (1970) A general method applicable to the search of similarities in the amino acid sequence of two proteins. J Mol Biol 48:443–453
Noedl H, Se Y, Schaecher K, Smith BL, Socheat D, Fukuda MM (2008) Evidence of artemisinin-resistant malaria in western Cambodia. N Engl J Med 359(24):2619–2620
Noedl H, Socheat D, Satimai W (2009) Artemisinin-resistant malaria in Asia. N Engl J Med 361(5):540–541
Noedl H, Se Y, Sriwichai S, Schaecher K, Teja-Isavadharm P, Smith B, Rutvisuttinunt W, Bethell D, Surasri S, Fukuda MM et al (2010) Artemisinin resistance in Cambodia: a clinical trial designed to address an emerging problem in Southeast Asia. Clin Infect Dis 51(11):e82–e89
Palstra RJ, Tolhuis B, Splinter E, Nijmeijer R, Grosveld F, de Laat W (2003) The beta-globin nuclear compartment in development and erythroid differentiation. Nat Genet 35(2):190–194
Pauling L, Itano HA, Singer SJ, Wells IC (1949) Sickle cell anemia a molecular disease. Science 110(2865):543–548
Pearson WR (1990) Rapid and sensitive sequence comparison with FASTP and FASTA. Methods Enzymol 183:63–98
Pearson WR (1994) Using the FASTA program to search protein and DNA sequence databases. Methods Mol Biol 24:307–331
Pearson WR (1998) Empirical statistical estimates for sequence similarity searches. J Mol Biol 276(1):71–84
Pearson WR, Lipman DJ (1988) Improved tools for biological sequence comparison. Proc Natl Acad Sci U S A 85:2444–2448
Pietras K, Sjoblom T, Rubin K, Heldin CH, Ostman A (2003) PDGF receptors as cancer drug targets. Cancer Cell 3(5):439–443
Poulos MG, Batra R, Charizanis K, Swanson MS (2011) Developments in RNA splicing and disease. Cold Spring Harb Perspect Biol 3(1):a000778
Press WH, Teukolsky SA, Tetterling WT, Flannery BP (1992) Numerical recipes in C: the art of scientifi computing. Cambridge University Press, Cambridge
Robertson G, Hirst M, Bainbridge M, Bilenky M, Zhao Y, Zeng T, Euskirchen G, Bernier B, Varhol R, Delaney A et al (2007) Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nat Methods 4(8):651–657
Saadatpour A, Lai S, Guo G, Yuan GC (2015) Single-cell analysis in cancer genomics. Trends Genet 31(10):576–586
Saha S, Sparks AB, Rago C, Akmaev V, Wang CJ, Vogelstein B, Kinzler KW, Velculescu VE (2002) Using the transcriptome to annotate the genome. Nat Biotechnol 20(5):508–512
Salzberg SL, Delcher AL, Kasif S, White O (1998) Microbial gene identification using interpolated Markov models. Nucleic Acids Res 26(2):544–548
Schena M (1996) Genome analysis with gene expression microarrays. BioEssays 18(5):427–431
Schena M (2003) Microarray analysis. Wiley-Liss, New York
Segurel L, Bon C (2017) On the evolution of lactase persistence in humans. Annu Rev Genomics Hum Genet 18:297–319
Shine J, Dalgarno L (1974a) The 3′-terminal sequence of Escherichia coli 16S ribosomal RNA: complementarity to nonsense triplets and ribosome binding sites. Proc Natl Acad Sci U S A 71(4):1342–1346
Shine J, Dalgarno L (1974b) Identical 3′-terminal octanucleotide sequence in 18S ribosomal ribonucleic acid from different eukaryotes. A proposed role for this sequence in the recognition of terminator codons. Biochem J 141(3):609–615
Shine J, Dalgarno L (1975) Determinant of cistron specificity in bacterial ribosomes. Nature 254(5495):34–38
Shoemaker RH (2006) The NCI60 human tumour cell line anticancer drug screen. Nat Rev Cancer 6(10):813–823
Sloane AJ, Duff JL, Wilson NL, Gandhi PS, Hill CJ, Hopwood FG, Smith PE, Thomas ML, Cole RA, Packer NH et al (2002) High throughput peptide mass fingerprinting and protein macroarray analysis using chemical printing strategies. Mol Cell Proteomics 1(7):490–499
Smircich P, Eastman G, Bispo S, Duhagon MA, Guerra-Slompo EP, Garat B, Goldenberg S, Munroe DJ, Dallagiovanna B, Holetz F et al (2015) Ribosome profiling reveals translation control as a key mechanism generating differential gene expression in Trypanosoma cruzi. BMC Genomics 16:443
Smith TF, Waterman MS (1981) Identification of common molecular subsequences. J Mol Biol 147(1):195–197
Smyth RP, Davenport MP, Mak J (2012) The origin of genetic diversity in HIV-1. Virus Res 169(2):415–429
Smyth RP, Schlub TE, Grimm AJ, Waugh C, Ellenberg P, Chopra A, Mallal S, Cromer D, Mak J, Davenport MP (2014) Identifying recombination hot spots in the HIV-1 genome. J Virol 88(5):2891–2902
Steinberg MH, Rodgers GP (2001) Pathophysiology of sickle cell disease: role of cellular and genetic modifiers. Semin Hematol 38(4):299–306
Steitz JA, Jakes K (1975) How ribosomes select initiator regions in mRNA: base pair formation between the 3′ terminus of 16S rRNA and the mRNA during initiation of protein synthesis in Escherichia coli. Proc Natl Acad Sci U S A 72(12):4734–4738
Stormo GD, Schneider TD, Gold L, Ehrenfeucht A (1982a) Use of the ‘Perceptron’ algorithm to distinguish translational initiation sites in E. coli. Nucleic Acids Res 10(9):2997–3011
Taniguchi T, Weissmann C (1978) Inhibition of Qbeta RNA 70S ribosome initiation complex formation by an oligonucleotide complementary to the 3′ terminal region of E. coli 16S ribosomal RNA. Nature 275(5682):770–772
Tao H, Bausch C, Richmond C, Blattner FR, Conway T (1999) Functional genomics: expression analysis of Escherichia coli growing on minimal and rich media. J Bacteriol 181(20):6425–6440
Taramelli R, Kioussis D, Vanin E, Bartram K, Groffen J, Hurst J, Grosveld FG (1986) Gamma delta beta-thalassaemias 1 and 2 are the result of a 100 kbp deletion in the human beta-globin cluster. Nucleic Acids Res 14(17):7017–7029
Tech M, Merkl R (2003) YACOP: enhanced gene prediction obtained by a combination of existing methods. In Silico Biol 3(4):441–451
Tolhuis B, Palstra RJ, Splinter E, Grosveld F, de Laat W (2002) Looping and interaction between hypersensitive sites in the active beta-globin locus. Mol Cell 10(6):1453–1465
Trudel MV, Vincent AT, Attere SA, Labbe M, Derome N, Culley AI, Charette SJ (2016) Diversity of antibiotic-resistance genes in Canadian isolates of Aeromonas salmonicida subsp. salmonicida: dominance of pSN254b and discovery of pAsa8. Sci Rep 6:35617
Vasilescu J, Figeys D (2006) Mapping protein-protein interactions by mass spectrometry. Curr Opin Biotechnol 17(4):394–399
Velculescu VE, Zhang L, Vogelstein B, Kinzler KW (1995) Serial analysis of gene expression. Science 270(5235):484–487
Velculescu VE, Zhang L, Zhou W, Vogelstein J, Basrai MA, Bassett DE Jr, Hieter P, Vogelstein B, Kinzler KW (1997) Characterization of the yeast transcriptome. Cell 88(2):243–251
Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M, Evans CA, Holt RA et al (2001) The sequence of the human genome. Science 291(5507):1304–1351
Vlasschaert C, Xia X, Coulombe J, Gray DA (2015) Evolution of the highly networked deubiquitinating enzymes USP4, USP15, and USP11. BMC Evol Biol 15:230
Washburn MP, Wolters D, Yates JR 3rd (2001) Large-scale analysis of the yeast proteome by multidimensional protein identification technology. Nat Biotechnol 19(3):242–247
Waterfield MD, Scrace GT, Whittle N, Stroobant P, Johnsson A, Wasteson A, Westermark B, Heldin CH, Huang JS, Deuel TF (1983) Platelet-derived growth factor is structurally related to the putative transforming protein p28sis of simian sarcoma virus. Nature 304(5921):35–39
Waterman MS, Vingron M (1994) Rapid and accurate estimates of statistical significance for sequence data base searches. Proc Natl Acad Sci U S A 91(11):4625–4628
Weigert MG, Garen A (1965) Base composition of nonsense codons in E. coli. evidence from amino-acid substitutions at a tryptophan site in alkaline phosphatase. Nature 206(988):992–994
Wilson DS, Nock S (2002) Functional protein microarrays. Curr Opin Chem Biol 6(1):81–85
Wu J, Tzanakakis ES (2013) Deconstructing stem cell population heterogeneity: single-cell analysis and modeling approaches. Biotechnol Adv 31(7):1047–1062
Xia X (2013) DAMBE5: a comprehensive software package for data analysis in molecular biology and evolution. Mol Biol Evol 30:1720–1728
Xia X (2017d) Self-organizing map for characterizing heterogeneous nucleotide and amino acid sequence motifs. Computation 5(4):43
Yates JR (2004a) Mass spectral analysis in proteomics. Annu Rev Biophys Biomol Struct 33:297–316
Yates JR (2004b) Mass spectrometry as an emerging tool for systems biology. BioTechniques 36(6):917–919
Yoon JH, De S, Srikantan S, Abdelmohsen K, Grammatikakis I, Kim J, Kim KM, Noh JH, White EJ, Martindale JL et al (2014) PAR-CLIP analysis uncovers AUF1 impact on target RNA fate and genome integrity. Nat Commun 5:5248
Zhang L, Zhou W, Velculescu VE, Kern SE, Hruban RH, Hamilton SR, Vogelstein B, Kinzler KW (1997) Gene expression profiles in normal and cancer cells. Science 276(5316):1268–1272
Author information
Authors and Affiliations
Appendix: Being Colorful Is Not Enough—How Are Stop Codons Decoded?
Appendix: Being Colorful Is Not Enough—How Are Stop Codons Decoded?
In the old and not-so-good days of 1965, three stop codons were known to exist. At that time, almost all molecular biologists study bacteriophages (T4 and λ) and their hosts (typically E. coli ). Sometimes a phage would have a mutation from a sense codon to a stop codon, leading to a truncated protein. Such a phage can only grow in certain E. coli strains containing a suppressor mutation (which is typically a tRNA with a mutated anticodon that can base-pair with a stop codon). Such strains were then called suppressor strains. The first set of nonsense mutations were discovered and isolated by Richard Epstein and Charles Steinberg. The resulting stop codon from these nonsense mutations was named after their friend Harris Bernstein whose last name means “amber” in German. The associated suppressor was termed amber suppressor. Any phage with a nonsense mutation that can grow only in this particular set of suppressor strains was said to harbor an amber mutation. The observation that an amber stop codon could be suppressed (i.e., recognized as a sense codon) in suppressor strains but not in other strains suggests some ambiguity in the meaning of the stop codon, i.e., the stop codon is interpreted differently between the suppressor strains and non-suppressor strains.
Phages with another set of nonsense mutations can grow only in a different set of E. coli strains, and the stop codon resulting from these nonsense mutations was named ochre. The associated suppressor is termed ochre suppressor. The third stop codon was named opal. In short, there are three disjoint sets of suppressor strains corresponding to three stop codons, but biologists did not know which stop codon is associated with amber, ochre, or opal.
Martin Weigert and Alan Garen (1965) studied one particular amino acid site in the alkaline phosphatase gene in E. coli . This site is occupied by tryptophan coded by UGG. An amber mutation (i.e., a nonsense mutation that can be suppressed by an amber suppressor) occurred at this site. In several revertants (in which the amber mutation had reversed to a sense codon), the original amino acid site was found to be occupied by seven different amino acids, Glu, Lys, Leu, Gln, Ser, Trp, and Tyr (Fig. 1.5). Can we now match the amber codon to one of the stop codons?
Because mutation is rare, we may assume that the revertants differ from the amber mutant by a single nucleotide. These candidate codons are shown in Fig. 1.6 for each of the three stop codons. UGA is the least likely candidate because it cannot generate a codon for Glu, Lys, Gln, and Tyr by a single nucleotide change. UAA can also be excluded because (1) it cannot generate a Trp UGG codon by a single nucleotide change, and (2) it would need to have a double mutation from the original UGG codon. In contrast, UAG can mutate to one of the codons for all seven amino acids, and it can result from the original UGG through a single nucleotide replacement. So the amber codon is UAG. (I have made things easier because Weigert and Garen were not so sure about the sense codons which were only partially decoded in 1965).
This example illustrates several decision-making principles that are useful to us. The first is the KISS principle (“Keep it simple, stupid”). That is, to use simplifying assumptions to make decision-making easier. For example, if we assume that double mutation (i.e., mutations at two sites of the codon) is so rare as to be negligible, then we can immediately reject UAA and UGA because both require double mutations to generate codons for the seven amino acids. In terms of model selection, we state that only the UAG model fits the data given the model condition (the assumption of no double mutation).
The second is the parsimony principle. We have three alternative hypotheses corresponding to the three stop codons. The UAG hypothesis needs a minimum of seven single point mutations to generate the seven revertants each with a different amino acid. It also needs a point mutation to connect the original UGG codon. The UGA hypothesis needs at least two point mutations to change to a Glu, Lys, Gln, or Try codon, and one mutation to each of the other three amino acids. This comes to a minim number of 11 point mutations, plus a point mutation to connect the original UGG codon. The UAA hypothesis needs to have two point mutations to a UGG revertant as well as two point mutations from the original UGG codon, leading to a minimum of ten point mutations. The UAG hypothesis is therefore the most parsimonious and is chosen over the other two alternatives.
The third is the likelihood principle, but we will not go there without first learning substitution models which is covered in a later chapter.
Rights and permissions
Copyright information
© 2018 Springer Science+Business Media LLC
About this chapter
Cite this chapter
Xia, X. (2018). String Mathematics, BLAST, and FASTA. In: Bioinformatics and the Cell. Springer, Cham. https://doi.org/10.1007/978-3-319-90684-3_1
Download citation
DOI: https://doi.org/10.1007/978-3-319-90684-3_1
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-90682-9
Online ISBN: 978-3-319-90684-3
eBook Packages: Biomedical and Life SciencesBiomedical and Life Sciences (R0)