1 Introduction

Computational approaches are first line tools for the assignment of function to protein sequences identified from genome sequence data. Widely available tools such as BLAST (1) allow searches of this type to be widely deployed by non-specialists in conjunction with comprehensive sequence databases such as EMBL (2) or UniProt (3). This allows for biological insights, such as how proteins are translocated across membranes, to be extended from knowledge gained in experimentally tractable systems such as the yeast Saccharomyces cerevisiae or bacterium Escherichia coli.

Underpinning the sequence database search is the problem of sequence comparison: to find similar sequences in the database one needs to compare every sequence in the database to the sequence at hand. Sequence comparison, in turn, implies the sequence alignment problem. In order to compare two sequences in a consistent and a non-biased way one first needs to best align the two sequences. The best alignment of two sequences throughout their entire length is achieved with the Needleman-Wunsch algorithm (4) (global alignment), while the best alignment of two sequences involving only portions of the two sequences is achieved with the Smith-Waterman algorithm (5) (local alignment).
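
As a minimal sketch of the local alignment idea (not the implementation used by any database search tool), the following Python function computes the Smith-Waterman score of two sequences. The match, mismatch and gap values are illustrative assumptions; practical protein searches use a substitution matrix and more elaborate gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b with a linear gap penalty."""
    prev = [0] * (len(b) + 1)      # previous row of the scoring matrix
    best = 0
    for ca in a:
        curr = [0]                 # first column is always zero
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # The floor at zero lets an alignment start anywhere (local).
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


print(smith_waterman_score("TTACGTT", "GGACGGG"))  # shared segment "ACG"
```

Removing the floor at zero, initializing the first row and column with cumulative gap penalties, and reading the score from the final cell would turn the same recurrence into the global (Needleman-Wunsch) form.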

The biological complexity associated with proteins is high and databases therefore contain a wide variety of sequences in terms of their properties, total length, etc. The Smith-Waterman algorithm, which is the appropriate means for such a database search, is relatively computationally expensive. For this reason routine searches of large sequence databases (such as GenBank at NCBI) typically rely on BLAST, a heuristic algorithm that approximates the results of the Smith-Waterman algorithm. In contrast to the Smith-Waterman algorithm BLAST does not guarantee to find the best match; however in most practical situations BLAST is accurate enough while it provides significant computational savings relative to the Smith-Waterman algorithm.

Like a database search with the Smith-Waterman algorithm, BLAST is inherently a single sequence search: a given query sequence is compared to each of the sequences in the database with the aim of finding similar sequences. In practice one may have a family of related sequences at hand, and would like to identify additional homologous sequences within a given database. A typical example of this is searching a newly sequenced genome for a member of a known family of proteins. While BLAST can potentially be used to find the new member, its effectiveness is limited by the degree of pairwise sequence similarity between a single given query sequence and the target sought. In many biological scenarios, such as searches across large phylogenetic distances or searches involving organisms from cryptic environments where evolutionary pressure is high, this can be a critical limitation.

Here we demonstrate the use of hidden Markov models to search the E. histolytica genome for proteins from the Tom40 family. Tom40 is the channel through which imported proteins cross the mitochondrial outer membrane (6–8). Tom40 is predicted to be a β-barrel protein (9, 10), and thereby likely to be derived from an ancestral protein of bacterial origins. Tom40 has been found in a vast range of eukaryotes, leading to the suggestion that it was a fundamental component of the original protein import system installed in protomitochondria (11). The amoeba E. histolytica has a highly reduced compartment called a mitosome, an organelle whose relationship to mitochondria has been the subject of some controversy (12, 13). For some time, E. histolytica had been considered an “amitochondriate” organism, in large part because BLAST-based searches do not yield homologues of typical mitochondrial proteins like Tom40 from the genome sequence data of E. histolytica.

Figure 16.1 shows a portion of the multiple sequence alignment of known Tom40 proteins. The alignment shows islands of conservation among the non-conserved regions and insertions. This is typical in a family of divergent but homologous sequences: due to evolutionary pressures, mutations in functionally critical regions are not tolerated very well, while mutations in regions not essential for function may abound given sufficient evolutionary distance. A set of homologous sequences, such as the one shown in Fig. 16.1, provides significantly more information about inherent features of the protein family than any single sequence from the family taken in isolation. For example, the multiple sequence alignment shows which amino acid positions are relatively conserved and the positions where insertions and deletions are more frequent. Leveraging this information can be critical for detecting evolutionarily distant members of the family. This is because the alignment of a protein family conveys which amino acid positions should match for a highly significant hit (e.g. position 190 in Fig. 16.1, where G is highly conserved), and also which positions are probably not significant, that is, where sequence variations or gaps in the alignment are evident. For conserved positions, one can estimate both the degree and the nature of conservation from a multiple sequence alignment: for example, at position 190 in Fig. 16.1, the amino acid G would be highly significant, increasing the likelihood that the sequence is a true member of the family. However, one can also see that a true member of the family can tolerate S, F, or I in this position.

Fig. 16.1.

A portion of the multiple sequence alignment of Tom40 sequences used in this work (23 known Tom40 sequences were used in the alignment). The segment shows residue positions 160–318.

The information inherent in the family of sequences can be captured by the so-called “profile” methods that convert a multiple sequence alignment into a position-specific scoring matrix, which in turn allows a more sensitive database search (14). A more advanced method for capturing features inherent in a family of sequences is based on hidden Markov models (15). Hidden Markov models (HMMs) are general statistical models for certain types of pattern recognition problems, widely used in speech recognition for example. The HMM variants used to capture the inherent properties of a family of protein sequences are called profile HMMs. A profile HMM is “trained” on an aligned set of sequences, and subsequently any sequence from the database can be scored against the HMM to give the probability of it belonging to the family.
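
The profile idea can be sketched for a single alignment column. The Python example below is a simplified illustration rather than the method of reference (14): it assumes a uniform background frequency for the 20 amino acids and an add-one pseudocount, and the column itself is hypothetical (G in 20 of 23 sequences, with one S, one F and one I).

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_log_odds(column, pseudocount=1.0):
    """Log-odds score (in bits) of each amino acid for one alignment column."""
    background = 1.0 / len(AMINO_ACIDS)   # assumption: uniform background
    counts = Counter(res for res in column if res != "-")   # ignore gaps
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {
        aa: math.log2(((counts[aa] + pseudocount) / total) / background)
        for aa in AMINO_ACIDS
    }


# Hypothetical column, not taken from the actual Tom40 alignment.
column = "G" * 20 + "SFI"
scores = column_log_odds(column)
for aa in "GSFIW":
    print(aa, round(scores[aa], 2))
```

In this sketch the conserved residue scores strongly positive, the residues seen once score close to zero, and residues never observed in the column score lowest. One such score vector per alignment column forms the scoring matrix.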

Figure 16.2 shows a simplified example of a hidden Markov model illustrating how an HMM is set up and trained on a family of sequences (in this case hypothetical DNA sequences are used for simplicity). Given an arbitrary sequence, the trained HMM can give the probability that the sequence belongs to the family. The sequence of states shown in Fig. 16.2b is a Markov chain in mathematical terminology because the probability of the next state depends only on the present state and does not depend on the past states. The Markov model is “hidden” because the states are not observed directly; only the residues that the states generate are observed (see the caption of Fig. 16.2).

Fig. 16.2.

A simplified explanation of how hidden Markov models are set up. For simplicity a “family” of DNA sequences is used, where each sequence has either 4 or 5 residues. The best alignment of the sequences is shown in the top left corner. Panel (a): From the alignment the states of the HMM are constructed, which involve four main states and one insert state that models the residue insertion at position three. Panel (b): The probabilities for state transitions are set from the multiple sequence alignment. The probability for an insertion of a residue at position 3, modeled by an insert state, is set to 0.4. This is deduced from the multiple sequence alignment, where a residue is inserted in two of the five sequences (2/5 = 0.4). Therefore the probability of direct transition from the main state 2 to the main state 3 is 1 − 0.4 = 0.6. Panel (c): The residue probabilities are initialized. There are four residues (A, T, G and C) and the initial probability for each is set to 1/4 = 0.25. Panel (d): The residue probabilities are trained based on the multiple sequence alignment and the HMM topology. The residue probabilities are set based on the residue counts in each position of the sequence alignment. For example, in position 1 the residue A occurs three times out of five; therefore the probability for A is 3/5 = 0.60. The model in panel (d) can be used to predict the probability of any 4 or 5 letter DNA sequence. For example, the sequence ‘AGTA’ would have the probability of 0.6 (residue A in state 1) × 0.4 (residue G in state 2) × 0.6 (direct transition to state 3) × 0.2 (residue T in state 3) × 1.0 (residue A in state 4) = 0.0288 (see Note 3). The sequence ‘CCTTA’ would have the probability of 0.0096. Therefore the sequence ‘AGTA’ is a better match to this model than the sequence ‘CCTTA’.
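
The counting procedure described in the caption can be sketched in a few lines of Python. The five-sequence alignment below is hypothetical: the alignment in the figure is not reproduced in the text, so these sequences were chosen only to be consistent with the probabilities quoted in the caption. The sketch also assumes a single insert column and, following Note 3, uses no pseudocounts.

```python
from collections import Counter

# Hypothetical alignment ('-' marks a gap), chosen to match the caption.
ALIGNMENT = ["AG-GA", "ACTGA", "AC-CA", "CGAGA", "CC-TA"]


def frequencies(residues):
    """Relative frequency of each residue in a list."""
    return {res: n / len(residues) for res, n in Counter(residues).items()}


def train(alignment):
    """Build the toy model: main-state emissions plus one insert state."""
    n = len(alignment)
    columns = list(zip(*alignment))
    # A column with gaps in more than half of the sequences is an insert.
    insert_idx = [i for i, col in enumerate(columns) if col.count("-") > n / 2]
    assert len(insert_idx) == 1, "sketch assumes exactly one insert column"
    i = insert_idx[0]
    inserted = [r for r in columns[i] if r != "-"]
    main = [frequencies([r for r in col if r != "-"])
            for j, col in enumerate(columns) if j != i]
    return {"main": main, "insert_after": i,
            "p_insert": len(inserted) / n, "insert": frequencies(inserted)}


def probability(seq, model):
    """Probability of a sequence with or without the single insertion."""
    main, k = model["main"], model["insert_after"]
    if len(seq) == len(main):            # direct transition, no insertion
        p, states = 1 - model["p_insert"], main
    elif len(seq) == len(main) + 1:      # path through the insert state
        p, states = model["p_insert"], main[:k] + [model["insert"]] + main[k:]
    else:
        return 0.0
    for res, emissions in zip(seq, states):
        p *= emissions.get(res, 0.0)     # unseen residue gives zero
    return p


model = train(ALIGNMENT)
for seq in ("AGTA", "CCTTA"):
    print(seq, round(probability(seq, model), 4))
```

With this hypothetical alignment the two example sequences score 0.0288 and 0.0096, so ‘AGTA’ is the better match.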

The example shown in Fig. 16.2 is grossly simplified, and profile HMMs useful in practice are significantly more complex. However freely available software for the application of HMMs can be used to shield the user from many details and one does not need to master the theory behind HMMs in full detail to use them effectively. In the example presented here we consider the family of 23 known Tom40 protein sequences, and ask the question: Is there a Tom40 in the E. histolytica genome (EhTom40)? A BLAST search of the E. histolytica genome with any individual sequence from the Tom40 training set failed to return a reasonable Tom40 candidate. Here we describe step-by-step the protocol used to build the Tom40 HMM, and demonstrate the ability of the resulting model to delineate the EhTom40 candidate protein import channel in E. histolytica.

2 Materials

2.1 Setup and Notation

  1.

    The computer system. The HMM searches described here were performed on a computer running Red Hat Linux 5. The Red Hat 5 installation was the default one, with specific bioinformatics programs installed in addition, as described below.

  2.

    Conventions. Program names, file and directory (folder) names are written in single quotes. The computer terminal outputs are written in Courier font. In Unix, folders are commonly called directories; throughout the text the term “folders” will be used. In the main text, folder names will be appended with a forward slash (but not always in the computer terminal output, which is copied verbatim). For each set of commands executed on the computer screen it is assumed that the starting folder is ‘workspace/’.

  3.

    Folder for the project. For the purpose of the search examples a folder was created named ‘workspace/’, with the absolute path ‘/home/workspace/’ (Note: in all examples below, replace this with your own path). Four additional sub-folders were created in the folder ‘workspace/’:

    $cd workspace

    $mkdir clustalw hmmer search ehist

    The purpose of these folders is as follows:

    • ‘workspace/’ – overall workspace for this project

    • ‘clustalw/’ – installation folder for the multiple sequence alignment program Clustal-w

    • ‘hmmer/’ – installation folder for the HMMER software package

    • ‘search/’ – to contain Tom40 sequences, HMMs and search outputs

    • ‘ehist/’ – folder for the E. histolytica predicted hypothetical proteins

2.2 Additional Bioinformatics Programs

  1.

    Clustal-w installation (the program for multiple sequence alignment). The program Clustal-w was installed in the folder ‘/home/workspace/clustalw/’. Clustal-w was downloaded from the FTP site ‘ftp://ftp.ebi.ac.uk/pub/software/clustalw2’ as the file ‘clustalw-2.0.9-linux-i386-libcppstatic.tar.gz’. This file was placed in the folder ‘clustalw/’, and unpacked as follows:

    $ cd /home/workspace/clustalw
    $ ls
    clustalw-2.0.9-linux-i386-libcppstatic.tar.gz
    $ tar xvfz clustalw-2.0.9-linux-i386-libcppstatic.tar.gz

    The last command created the folder ‘clustalw-2.0.9-linux-i386-libcppstatic/’ which contained the executable ‘clustalw2’. For convenience this file was moved into the folder ‘workspace/clustalw’, and the original ‘clustalw-2.0.9-linux-i386-libcppstatic/’ folder was removed:

    $ mv clustalw-2.0.9-linux-i386-libcppstatic/* .
    $ rm -rf clustalw-2.0.9-linux-i386-libcppstatic
    $ ls
    clustalw2 clustalw_help

  2.

    HMMER installation (Sean Eddy’s program for hidden Markov model searches (15)). The software package HMMER was downloaded from ‘http://hmmer.janelia.org/’ as the file ‘hmmer-2.3.2.tar.gz’. This file was placed in the folder ‘/home/workspace/hmmer/’. The installation:

    $ cd workspace/hmmer
    $ ls
    hmmer-2.3.2.tar.gz
    $ tar xvfz hmmer-2.3.2.tar.gz
    $ ls -CF
    hmmer-2.3.2/ hmmer-2.3.2.tar.gz
    $ cd hmmer-2.3.2
    $ ./configure --prefix=/home/workspace/hmmer
    [---output deleted---]
    $ make
    [---output deleted---]
    $ make install

    With this several HMMER components were installed in ‘/home/workspace/hmmer/’:

    $ cd /home/workspace/hmmer
    $ ls -CF
    bin/ hmmer-2.3.2/ hmmer-2.3.2.tar.gz man/

    The HMMER executables were installed in ‘bin/’ (note that HMMER consists of several programs):

    $ ls -CF bin
    hmmalign*  hmmcalibrate*  hmmemit*   hmmindex*  hmmsearch*
    hmmbuild*  hmmconvert*    hmmfetch*  hmmpfam*

2.3 Downloading the E. histolytica Predicted Proteins

The E. histolytica conceptual proteome was downloaded from ‘ftp.tigr.org’ as follows:

    $ cd /home/workspace/ehist
    $ ftp ftp.tigr.org
    Connected to www.tigr.org.
    220 JCVI FTP Server
    Name: anonymous
    331 Anonymous login ok, send your complete email address as your password.
    Password:**********
    [---output deleted---]
    ftp> cd pub/data/Eukaryotic_Projects/e_histolytica/annotation_dbs
    ftp> get EHA1.pep
    ftp> quit
    $ ls
    EHA1.pep

3 Methods

All commands described in this section are performed in the folder ‘/home/workspace/search/’.

3.1 Preparation of Sequences for Hidden Markov Model Search

  1.

    Preparation of Tom40 model sequences. Known Tom40 protein sequences were collected into a single file in preparation for building the Tom40 HMM. The sequence file was named ‘Tom40.fas’, and contained sequences in the FASTA format. In FASTA format each sequence starts with a comment line, which has ‘>’ as its first character and extends to the end of the line. The actual sequence starts on the line following the comment. The sequence is given in one-letter code and continues until the next comment line (indicated by another ‘>’ as the first character of a line) or the end of the file is reached. Our Tom40 training set contained 23 sequences of known Tom40 proteins from different organisms. A snippet of this file, showing only the first three sequences, is given below (see Note 1):

    >C.intestinalis Tom40
    MGNAHAASWGWSSSTPAETAATPPPVEAPPPVVPVEPLPPSSPVDATPVHSKTATNSVGT
    FEEIHKPCKDIALQPFEGLRFIVNKGLSSHFQAQHTVHLNNEGSSYRFGSTYVGTKQPSP
    TEAYPVMIGEMSNEGNLQAQFIHQVTSRFKAKCIAQTLGSKLQSVQVGGDVVFNDSTLSV
    VCADPDLLNGTGMLIVHYLQAITPKLSIGSELLYQRGAARQQAIASIAGRYKTENWQAAG
    TIAAGGMHASFYRKANENVQVGVELEASLKNKESVTTFAYQMDLPKMNLLFKGMLTSEWT
    IGSALEKRLQPLPITLNLTGTYNIKKDKVAVGIGAVLG
    >A.thaliana Tom40
    MADLLPPLTAAQVDAKTKVDEKVDYSNLPSPVPYEELHREALMSLKSDNFEGLRFDFTRA
    LNQKFSLSHSVMMGPTEVPAQSPETTIKIPTAHYEFGANYYDPKLLLIGRVMTDGRLNAR
    LKADLTDKLVVKANALITNEEHMSQAMFNFDYMGSDYRAQLQLGQSALIGATYIQSVTNH
    LSLGGEIFWAGVPRKSGIGYAARYETDKMVASGQVASTGAVVMNYVQKISDKVSLATDFM
    YNYFSRDVTASVGYDYMLRQARVRGKIDSNGVASALLEERLSMGLNFLLSAELDHKKKDY
    KFGFGLTVG
    >O.sativa Tom40
    MGSAASAAAPPPPPTAQPHMAAPPYGAGLAGILPPKPDGEEEGKKKEVEKVDYLNLPCPV
    PFEEIQREALMSLKPELFEGLRFDFTKGLNQKFSLSHSVFMGSLEVPSQSTETIKVPTSH
    YEFGANFIDPKLILVGRVMTDGRLNARVKCDLTDDLTLKINAQLTHEPHYSQGMFNFDYK
    GTDYRAQFQIGNNAFYGANYIQSVTPNLSMGTEIFWLGHQRKSGIGFASRYNSDKMVGTL
    QVASTGIVALSYVQKVSEKVSLASDFMYNHMSRDVTSSFGYDYMLRQCRLRGKFDSNGVV
    AAYLEERLNMGVNFLLSAEIDHSKKNYKFGFGMTVGE
    [---other entries deleted---]

  2.

    Building the Tom40 multiple sequence alignment. Building the hidden Markov model requires the sequences to be aligned in the regions of similarity. The sequence alignment can be achieved with different programs, such as Clustal-w and T-COFFEE. In this example we use Clustal-w. The input to the program Clustal-w is the set of unaligned sequences (in this case the ‘Tom40.fas’ FASTA file), and the output is a multiple sequence alignment:

    $ cd /home/workspace/search
    $ ls
    Tom40.fas
    $ ../clustalw/clustalw2 -outfile=Tom40.gcg -output=gcg -infile=Tom40.fas
    CLUSTAL 2.0.9 Multiple Sequence Alignments
    Sequence format is Pearson
    Sequence 1: C.intestinalis  338 aa
    Sequence 2: A.thaliana      309 aa
    Sequence 3: O.sativa        337 aa
    [---output deleted---]
    $ ls
    Tom40.dnd Tom40.fas Tom40.gcg

The above command executed the program ‘clustalw2’ (installed in Section 2.2, step 1 above) in a non-interactive mode. In the command line we have specified that ‘Tom40.fas’ is the input file, ‘Tom40.gcg’ will be the output file with multiple sequence alignment (to be created), and that the output file should be in the GCG format. The above command has produced the alignment file ‘Tom40.gcg’ and also the dendrogram file ‘Tom40.dnd’. The latter will not be used and can be deleted:

    $ rm Tom40.dnd
    $ ls
    Tom40.fas Tom40.gcg
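
Before aligning, it can be useful to confirm that the FASTA file parses as described in step 1 (one comment line per sequence, residues on the following lines). The Python function below is a minimal sketch of such a check, not part of the original protocol; the two short records are made-up placeholders, and in practice the function would be given the open ‘Tom40.fas’ file instead.

```python
def read_fasta(lines):
    """Return a list of (comment, sequence) pairs from FASTA-formatted lines."""
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):          # a comment line starts a new record
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)           # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up placeholder records; replace with open("Tom40.fas") in practice.
example = [
    ">seqA demo",
    "MGNAHAAS",
    "WGWSS",
    ">seqB demo",
    "MADLLPPL",
]
records = read_fasta(example)
for name, seq in records:
    print(name, len(seq), "aa")
```

For the training set used here, such a check should list 23 records, with lengths matching those reported by Clustal-w.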

3.2 Building the Tom40 Hidden Markov Model

All commands described in this section are performed in the folder ‘/home/workspace/search/’.

  1.

    Building the hidden Markov model. Building the Tom40 hidden Markov model requires two steps: creating the raw HMM and calibrating the HMM. The first step used the multiple sequence alignment in the input file (‘Tom40.gcg’) and produced a raw hidden Markov model file (‘Tom40g.hmm’):

    $ ../hmmer/bin/hmmbuild -n Tom40g Tom40g.hmm Tom40.gcg
    hmmbuild - build a hidden Markov model from an alignment
    HMMER 2.3.2 (Oct 2003)
    Copyright (C) 1992-2003 HHMI/Washington University School of Medicine
    Freely distributed under the GNU General Public License (GPL)
    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    Alignment file:                 Tom40.gcg
    File format:                    MSF
    Search algorithm configuration: Multiple domain (hmmls)
    Model construction strategy:    MAP (gapmax hint: 0.50)
    Null model used:                (default)
    Prior used:                     (default)
    Sequence weighting method:      G/S/C tree weights
    New HMM file:                   Tom40g.hmm
    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    Alignment:           #1
    Number of sequences: 23
    Number of columns:   478

    Determining effective sequence number ... done. [14]
    Weighting sequences heuristically     ... done.
    Constructing model architecture       ... done.
    Converting counts to probabilities    ... done.
    Setting model name, etc.              ... done. [Tom40g]

    Constructed a profile HMM (length 398)
    Average score:  500.08 bits
    Minimum score:  282.32 bits
    Maximum score:  608.28 bits
    Std. deviation:  99.39 bits

    Finalizing model configuration        ... done.
    Saving model to file                  ... done.
    //
    $ ls
    Tom40.fas Tom40.gcg Tom40g.hmm

    In the above command the argument ‘-n Tom40g’ specified that this hidden Markov model will be called ‘Tom40g’ (this information is recorded internally in the model). The two arguments Tom40g.hmm and Tom40.gcg are the hidden Markov model file (to be produced) and the input multiple sequence alignment (see Note 2). This command takes a few seconds to execute on a modern computer.

  2.

    Calibration of the hidden Markov model. The next step is to calibrate the hidden Markov model ‘Tom40g.hmm’. This step is important to optimize the sensitivity of the hidden Markov model search. The empirical calibration is performed by fitting a distribution to the scores obtained from a Monte Carlo simulation. The calibration is performed with the HMMER program ‘hmmcalibrate’:

    $ ../hmmer/bin/hmmcalibrate --num 10000 Tom40g.hmm
    hmmcalibrate -- calibrate HMM search statistics
    HMMER 2.3.2 (Oct 2003)
    Copyright (C) 1992-2003 HHMI/Washington University School of Medicine
    Freely distributed under the GNU General Public License (GPL)
    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    HMM file:                 Tom40g.hmm
    Length distribution mean: 325
    Length distribution s.d.: 200
    Number of samples:        10000
    random seed:              1215572532
    histogram(s) saved to:    [not saved]
    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    HMM    : Tom40g
    mu     : -194.942474
    lambda : 0.136655
    max    : -146.145996
    //

    The argument ‘--num 10000’ specifies the number of samples used in the simulation (the default in HMMER version 2.3.2 is 5000). This step is computationally expensive, and calibration of a single hidden Markov model may require a few minutes on a modern computer.
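
To illustrate what the fitted parameters mean, the Python sketch below assumes that ‘mu’ and ‘lambda’ are the location and scale parameters of an extreme value (Gumbel) distribution fitted to the scores of random sequences, and that an E-value is the tail probability multiplied by the number of sequences searched. This is our reading of how HMMER 2 uses the calibration, not code taken from HMMER. The scores are illustrative, and the output is not expected to reproduce the E-values printed by ‘hmmsearch’ exactly.

```python
import math

MU = -194.942474       # 'mu' from the hmmcalibrate output above
LAMBDA = 0.136655      # 'lambda' from the hmmcalibrate output above


def evd_pvalue(score, mu=MU, lam=LAMBDA):
    """P(S >= score) for a Gumbel distribution with location mu, scale lam."""
    # -expm1(-y) equals 1 - exp(-y) but stays accurate for very small y.
    return -math.expm1(-math.exp(-lam * (score - mu)))


def evalue(score, n_sequences, mu=MU, lam=LAMBDA):
    """Expected number of random sequences scoring at least `score`."""
    return n_sequences * evd_pvalue(score, mu, lam)


# Illustrative scores only; 9772 is the size of the database searched below.
for score in (-150.0, -100.0, -50.0):
    print(score, evalue(score, n_sequences=9772))
```

Because ‘mu’ is strongly negative for this model, a hit can be statistically significant even though its bit score is itself negative: what matters is how far the score lies above ‘mu’.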

3.3 Running the Tom40 Hidden Markov Model Search

  1.

    Tom40 hidden Markov model search of E. histolytica predicted proteins. The HMM search for a single hidden Markov model is straightforward, performed with the program ‘hmmsearch’:

    $../hmmer/bin/hmmsearch -E 0.1 Tom40g.hmm ../ehist/EHA1.pep > Tom40g.OUT

    The command ‘hmmsearch’ can take several arguments, including the name of the hidden Markov model file (‘Tom40g.hmm’), and the name of the sequence database for the search (‘EHA1.pep’ in this case). The optional argument ‘-E 0.1’ specifies the E-value cutoff for reporting hits (the E-value is interpreted in the same way as in BLAST searches). The lower the E-value, the more significant the hit; hits with E-values of 1 or larger are typically not significant. In the command above the output of the program ‘hmmsearch’ has been collected in the file ‘Tom40g.OUT’. The E. histolytica conceptual proteome contains 9772 sequences. Running the Tom40 hidden Markov model search against this database on a Xeon 3.2 GHz CPU required one minute (see Note 4).

    Inspection of the output file ‘Tom40g.OUT’ shows one hit, with an E-value of 0.0039:

    Sequence   Description                                   Score   E-value  N
    --------   -----------                                   -----   -------  --
    38.m00236  hypothetical protein 38.t00034 AAFB01000158   -74.0   0.0039   1

    Parsed for domains:
    Sequence   Domain  seq-f  seq-t     hmm-f  hmm-t      score  E-value
    --------   ------  -----  -----     -----  -----      -----  -------
    38.m00236  1/1        22    304 ..      1    398 []   -74.0   0.0039

    This is the best Tom40 candidate in E. histolytica based on the training set of Tom40 sequences used in this study. The E-value shows that the similarity of the sequence ‘38.m00236’ to the Tom40 model is well in the grey zone (a closely related sequence would have an E-value of <10^−100), yet the observed level of similarity is not very likely to occur by chance. A visual inspection of the sequence ‘38.m00236’ and a comparison with the training sequences showed some similarity typical of Tom40 throughout the entire sequence length, with some shortening relative to typical Tom40 sequences. Protein shortening is a general feature seen in many parasites, as a means of overall reduction in genome size, and makes pairwise (e.g. BLAST) sequence searches even less likely to succeed in scenarios such as identifying protein transport components in organisms like E. histolytica. Based on the clues provided by HMM searches such as the one documented here, the protein import machinery in the mitosomes of E. histolytica is being fully evaluated (16).
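
The coordinates in the “Parsed for domains” line make the shortening concrete. The Python sketch below parses that single line, assuming whitespace-separated fields in the order shown in the output above (the field layout is inferred from this one example, not from a format specification), and compares the aligned stretch of the sequence with the span of the model.

```python
def parse_domain_line(line):
    """Split one 'Parsed for domains' line into named fields (assumed layout)."""
    f = line.split()
    return {
        "sequence": f[0],
        "domain": f[1],
        "seq_from": int(f[2]),
        "seq_to": int(f[3]),
        "hmm_from": int(f[5]),     # f[4] is the sequence end-point marker
        "hmm_to": int(f[6]),       # f[7] is the model end-point marker
        "score": float(f[8]),
        "evalue": float(f[9]),
    }


line = "38.m00236 1/1 22 304 .. 1 398 [] -74.0 0.0039"
hit = parse_domain_line(line)

aligned_residues = hit["seq_to"] - hit["seq_from"] + 1
model_positions = hit["hmm_to"] - hit["hmm_from"] + 1
print(aligned_residues, "residues aligned to", model_positions, "model positions")
```

Here 283 residues of the candidate are aligned across all 398 positions of the model, consistent with a protein that is shorter than typical Tom40 sequences.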

4 Notes

  1.

    The full set of Tom40 sequences used in this work is available from the authors on request.

  2.

    By default, the program ‘hmmbuild’ builds a model optimized for local comparison with respect to the sequence and global comparison with respect to the HMM. This is akin to the Needleman-Wunsch type sequence alignment, rather than the Smith-Waterman type sequence alignment. To build a model which is local with respect to the sequence and with respect to the HMM, use the -f switch (i.e. ‘hmmbuild -f’).

  3.

    An immediately apparent limitation of the model shown in Fig. 16.2 is that any residue in state 4 other than ‘A’ will generate zero probability for the entire sequence. In practice residue probabilities are not set to zero even for residues that do not feature in a given position, but to a small background value deduced from statistical reasoning.

  4.

    The example chosen for this “Methods” paper demonstrates the use of HMM searches against a single genome dataset as such searches can be easily performed on modern computers in a short time period. It is also possible to interrogate far larger databases with HMMs, for example the entire UniProt database. Searches on very large databases can take days to complete on desktop workstations. As the rate of released genome sequence data continues to increase rapidly and exceeds the rate of advances in computational power, routine searches of the largest databases increasingly require access to specialized supercomputing facilities.