Conserved and variable correlated mutations in the plant MADS protein network
Plant MADS domain proteins are involved in a variety of developmental processes for which their ability to form various interactions is a key requisite. However, not much is known about the structure of these proteins or their complexes, whereas such knowledge would be valuable for a better understanding of their function. Here, we analyze those proteins and the complexes they form using a correlated mutation approach in combination with available structural, bioinformatics and experimental data.
Correlated mutations are affected by several types of noise, which is difficult to disentangle from the real signal. In our analysis of the MADS domain proteins, we apply for the first time a correlated mutation analysis to a family of interacting proteins. This provides a unique way to investigate the amount of signal that is present in correlated mutations because it allows direct comparison of mutations in various family members and assessing their conservation. We show that correlated mutations in general are conserved within the various family members, and if not, the variability at the respective positions is less in the proteins in which the correlated mutation does not occur. Also, intermolecular correlated mutation signals for interacting pairs of proteins display clear overlap with other bioinformatics data, which is not the case for non-interacting protein pairs, an observation which validates the intermolecular correlated mutations. Having validated the correlated mutation results, we apply them to infer the structural organization of the MADS domain proteins.
Our analysis enables understanding of the structural organization of the MADS domain proteins, including support for predicted helices based on correlated mutation patterns, and evidence for a specific interaction site in those proteins.
Keywords: Protein Pair, Correlate Mutation, Residue Pair, Motif Pair, Sequence Entropy
New mutations continually arise and are the source of genetic diversity. They provide the material on which selection acts; in large, sexual populations, beneficial mutations will reach fixation, and most deleterious mutations will be lost. However, in the case of deleterious mutations, a compensatory mutation may occur that renders the two mutations neutral or beneficial as a pair and causes them to be preserved by selection. In protein-coding sequences, coevolution of residues can occur as compensation of changes in e.g. volume or charge, or because of the simultaneous involvement of residues in e.g. ligand binding. This implies that residues which show such correlated mutations are expected to be located close to each other in the 3D structure of a protein. An early observation of this kind was obtained in a set of virus sequences, where the positions in the sequence that showed an identical pattern of variation were in most cases close together in the 3D structure. Several studies have reported similar observations and have made use of such information, e.g. to engineer artificial domains, to predict interhelical contacts in transmembrane proteins, to analyze functional dependencies observed within HIV genes, to predict functionally important residues, or to distinguish between correct and incorrect models for the 3D structure of proteins.
A number of methods have been developed to search for correlated mutations, and their results are mostly validated by comparing with distances between residues in crystal structures. A distinction can be made between pairwise correlation methods (which might be based on substitution matrix scores or related physicochemical characteristics) [7, 8] and information-theory based methods [9, 10, 11]. The former seem to outperform the latter when using enrichment of residue pairs at short distances as a criterion [12, 13]. Although several correlated mutation measurements yield reasonable accuracy for intramolecular contact map prediction, the accuracy level drops in intermolecular contact prediction.
On a higher level, similarity between phylogenetic trees is related to protein interactions in large sets of interacting families [15, 16, 17, 18, 19, 20, 21, 22]. However, it has been heavily debated whether this signal is due to true coevolution, i.e. compensatory mutations between residues in the binding partners. A number of factors affecting sets of proteins, such as similar expression patterns or functioning in a given biochemical pathway, can generate similarity in evolutionary rates. Families with similar evolutionary rates in different organisms will present similar trees, without the need for co-adaptation between the corresponding proteins. Although this confounding effect takes place at the level of phylogeny, residue-level correlated mutations also contain noise caused by evolutionary processes related to common ancestry, such as changes in codon usage or amino acid frequencies [25, 26]. Hence, misleading signal can arise from phylogenetic correlations between homologous sequences and from correlation due to factors other than spatial proximity. This highlights the need to distinguish between observed "covariation" and true "coevolution": the latter is what we would like to infer, but it has to be inferred from observed signals that contain noise.
Plant MADS domain transcription factors (TFs) are involved in regulation of a variety of developmental processes such as floral transition and flower development [27, 28]. They "do it together" in the sense that they are engaged in protein interactions and form protein complexes that are required for binding DNA. An analysis of the interaction capacity of all members of the family in Arabidopsis revealed the ability to form 110 different dimers among 27 members of the subfamily of MIKC-type (or type II) MADS domain proteins. In addition to the MADS (M) domain, these TFs have an I, K and C-domain [31, 32].
A few structures are available for dimers of MADS domains (followed by a domain with some homology to the I domain) [33, 34, 35, 36, 37, 38], but structural information for the other domains is lacking. The structures show that two MADS domains extensively contact each other, but mutagenesis data indicate that other parts of the MIKC proteins also contact each other. In particular, the I-domain is involved in determining interaction specificity [39, 40] and the K-domain is important for dimerization [41, 42, 43, 44, 45]. A few computational studies previously analyzed plant MADS domain protein sequences in order to find functionally important regions, albeit without explicit reference to their role in interaction specificity [46, 47, 48]. Other computational studies focused on the evolution of the interaction network via duplications or on simulating models for gene- and/or protein-interactions [50, 51, 52]. Recently, we developed a method aimed at predicting interaction sites using experimental interaction data and applied it to the MADS domain protein family, followed by experimental testing of sites governing interaction specificity.
Here, we present a novel approach to analyzing correlated mutations and testing their validity. We analyze correlated mutations in a family of interacting proteins. This provides a convenient way to compare correlated mutations between those proteins and assess whether correlated mutations are 'conserved' between them. Secondly, it allows comparison of correlated mutations observed between pairs of interacting proteins with those observed between pairs of non-interacting proteins, where the latter provide a unique background-model for assessment of significance of the observed intermolecular correlated mutations. Hence, our results contribute to the interpretation of correlated evolution signals.
We integrate our results with available structural, bioinformatics and experimental data for the plant MADS domain proteins and in this way we obtain clues about the structural organization of these proteins and their complexes.
We will first discuss sequence retrieval, followed by correlated mutation analysis and validation of the results using various types of independent data. Next, conservation of correlated mutations between homologous positions in various proteins will be analyzed, which provides a novel way to assess the amount of information correlated mutations contain. Finally, our results will be applied in prediction of protein interactions and scrutinized to obtain structural insight into the MADS proteins.
Combining the sequences with existing interaction data allowed in total 34 different pairs of interacting Arabidopsis proteins to be analyzed, with a minimum of 30 ortholog pair sequences (Additional File 2). As background model, 34 pairs of non-interacting MADS domain proteins were used for which a minimum of 30 ortholog pairs were available. Because of the way we deal with co-orthologs (see Methods), there are cases of MADS domain proteins that pass the threshold of 30 sequences only in the intermolecular analysis and not in the intramolecular analysis.
Validation of correlated mutation analysis
Correlated mutations were obtained for intra- and inter-molecular sequence alignments using CAPS (see Methods). Additional Files 3 and 4 contain lists of these results. To validate the observed correlated mutation pairs, we compared them with available structural data (a crystal structure is available for the human MADS domain), previously predicted interaction motifs and Single Nucleotide Polymorphisms (SNPs).
Validation: structure data
For the intermolecular correlated mutation analysis, the analysis of interacting protein pairs using time correction (see Methods) shows an enrichment in residues within 15 Å, compared to all residue pairs (Figure 2B). Such enrichment is not found for interacting protein pairs analyzed without time correction, nor for non-interacting pairs analyzed either with or without time correction (Figure 2B). Hence, these two background models strongly support the significance of the distance enrichment for the resulting residue pairs in the correlated mutation analysis of the interacting MADS domain proteins. Note that the correlated mutation analysis of non-interacting pairs results in a strikingly lower percentage of pairs of residues with small distance (Figure 2), an observation for which we lack a clear interpretation.
The enrichment of residues which are in contact (within 15 Å) across the interface is reasonably strong (55% of the correlated mutation residue pairs are in contact vs. 39% for all residue pairs), but less so than what is seen for the intramolecular correlated mutation analysis. This is in line with what has been observed previously for intermolecular correlated mutation analysis (see introduction). One reason could be that the correlated mutation analysis will inherently focus on residues which are not conserved (because otherwise there will be no coevolution effect). For a large part, residues at the interface will be conserved, meaning that many residue pairs will not show up in the correlated mutation analysis. Another factor is the assumption (inherent to intermolecular correlated mutation analysis) that orthologs will have similar interaction partners, a hypothesis for which evidence exists but that has also been challenged. The clear difference between the interacting and non-interacting protein pairs does however strongly argue for the importance of the correlated residue pairs that we recover. The results presented here were obtained with a cutoff of 0.4 for the correlation coefficient, but they are qualitatively similar for higher cutoffs (only the number of reported pairs is lower). Because enrichment of residue pairs at small distances was only observed for the analysis with time correction, in the remainder of this work we use results from that analysis only. To further analyze the significance of the observed short distance enrichment for the intermolecular correlated mutations, a resampling analysis was performed. This is described in detail in Additional File 5; it clearly confirmed the significance of our results.
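The distance enrichment discussed above comes down to comparing two fractions: the share of correlated residue pairs within 15 Å and the share of all residue pairs within 15 Å. A minimal sketch of that comparison is given here; the function name, the toy coordinates and the use of a single representative coordinate per residue are assumptions made for illustration, not details taken from the analysis.

```python
import math

def fraction_within(pairs, coords_a, coords_b, cutoff=15.0):
    """Fraction of (i, j) residue pairs separated by at most `cutoff` Å.

    coords_a and coords_b map residue numbers in the two chains to one
    (x, y, z) coordinate per residue (which atom to use is an assumption).
    """
    if not pairs:
        return 0.0
    close = sum(
        1 for i, j in pairs
        if math.dist(coords_a[i], coords_b[j]) <= cutoff
    )
    return close / len(pairs)

# Toy coordinates (hypothetical, not taken from any structure).
chain_a = {1: (0.0, 0.0, 0.0), 2: (3.8, 0.0, 0.0), 3: (7.6, 0.0, 0.0)}
chain_b = {1: (0.0, 10.0, 0.0), 2: (0.0, 20.0, 0.0)}

# Background: every intermolecular residue pair.
all_pairs = [(i, j) for i in chain_a for j in chain_b]
# Hypothetical correlated mutation pairs.
correlated_pairs = [(1, 1), (2, 1), (3, 2)]

background = fraction_within(all_pairs, chain_a, chain_b)
observed = fraction_within(correlated_pairs, chain_a, chain_b)
print(f"observed {observed:.2f} vs. background {background:.2f}")
```

An observed fraction above the background fraction, as in the 55% vs. 39% reported above, is what is meant by enrichment at short distances.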
Validation: comparison with predicted interaction motifs
For the intermolecular correlated mutation results, a comparison was made with motif pairs which were previously predicted to determine MADS interaction specificity [53, 54]. The rationale behind this comparison is that both motifs and correlated mutations should contain information about interaction residues. Overall, there are large differences between different interacting protein pairs with respect to the number of correlated mutation positions and motifs that coincide. The lowest coincidence was found for the AGL12-AGL16 interaction, for which only 10% of the residues involved in correlated mutation were overlapped by predicted interaction motifs. In contrast, three interacting protein pairs (ANR1-SOC1, AGL21-FUL, and SOC1-SVP) showed over 70% of their correlated mutation positions overlapped by predicted interaction motifs. However, there was a clear difference between the results for the interacting pairs and non-interacting pairs. For the interacting pairs, 55% of the motif positions were overlapped by at least one correlated mutation position, and 39% of the correlated mutation positions were covered by a motif, whereas for the non-interacting pairs, 42% of the motif positions were overlapped by at least one correlated mutation position, and 32% of the correlated mutation positions were covered by a motif. Comparison with randomly generated position pairs (see Methods) showed that the F-score (harmonic mean of coverage of correlated mutation positions and of predicted interaction motifs, 0.46 for the interacting pairs and 0.37 for the non-interacting pairs) was significantly different from random for the interacting protein pairs (p < 0.001), but not for the non-interacting protein pairs (p ~ 0.5).
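As a small worked example of the F-score mentioned above, the harmonic mean of the two coverage values can be computed directly. The function name is an illustrative choice, and the inputs are the rounded percentages quoted in the text.

```python
def f_score(coverage_a, coverage_b):
    """Harmonic mean of two coverage fractions (0 if both are zero)."""
    if coverage_a + coverage_b == 0:
        return 0.0
    return 2 * coverage_a * coverage_b / (coverage_a + coverage_b)

# Coverage of motif positions and of correlated mutation positions,
# taken from the rounded percentages quoted in the text.
interacting = f_score(0.55, 0.39)
non_interacting = f_score(0.42, 0.32)
print(f"{interacting:.3f} {non_interacting:.3f}")
```

With these rounded inputs the interacting pairs give about 0.46, matching the reported value, and the non-interacting pairs give about 0.36; the small difference from the reported 0.37 presumably reflects the use of unrounded coverages in the original calculation.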
Validation: comparison with SNP data
Finally, we compared the intermolecular correlated mutation positions with available Arabidopsis SNP data. For the interacting pairs, we found 207 non-synonymous SNPs without overlap with a correlated mutation position, and 19 with overlap with a correlated mutation position. For the non-interacting pairs, these values are 581 and 74, respectively. This means that the fraction of non-synonymous SNPs covering a correlated mutation site is smaller for the interacting pairs (8.4%) than for the non-interacting pairs (11.3%). Of course at longer evolutionary distances one would expect a correlated mutation position to be variable (otherwise it would not be detected as a correlated mutation position), but if these sites are functional (i.e. in our context, important for the interaction) then at short evolutionary distances it is reasonable to expect that they are conserved, and the fact that they are more conserved for the interacting compared to the non-interacting protein pairs is additional validation of our results. These results are reinforced by the fact that for the synonymous SNPs, no such difference between interacting and non-interacting pairs is observed (both display an overlap of ~10% between synonymous SNPs and correlated mutations).
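The two percentages above follow directly from the SNP counts. A short check, with an illustrative function name:

```python
def overlap_fraction(with_overlap, without_overlap):
    """Fraction of SNPs that coincide with a correlated mutation position."""
    return with_overlap / (with_overlap + without_overlap)

# Non-synonymous SNP counts quoted in the text.
interacting = overlap_fraction(19, 207)       # 19 of 226 SNPs
non_interacting = overlap_fraction(74, 581)   # 74 of 655 SNPs
print(f"{100 * interacting:.1f}% vs. {100 * non_interacting:.1f}%")
```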
Validation: general trends
Overall, the comparisons of correlated mutation positions with structural data, interaction motifs and SNPs show the same trend: correlated mutations from interacting pairs are enriched in signal compared to those from non-interacting pairs. In addition, the intramolecular correlated mutations show clear distance enrichment. Hence, all observed trends, although sometimes weak, are consistent and point towards biological significance of the observed signals.
Conserved correlated mutations
An intriguing question is whether positions with correlated mutations in various protein subfamily members are conserved for being correlated or not, because this would give further insight into the mechanism behind correlated mutations. Note that the use of the term "conservation" here is somewhat different from its most common use to describe sequence conservation, but was chosen because it best describes the phenomenon of observing a feature (correlated mutation in this case) in multiple instances of a sequence alignment (such use is not unprecedented, compare for example with the use of "structure conservation"). To answer this question for the MADS proteins, we investigated for all intramolecular correlated mutation pairs in a given protein whether they were detected in other MADS proteins as well, in which case they were called "conserved" in these other proteins. We first analyzed whether there is more conservation of correlated mutations for pairs of proteins with higher sequence identity, but this was not the case. Overall, 63% of the correlated mutation pairs are conserved in at least one other MADS protein, and 37% are not (conserved intramolecular correlated mutations are listed in Additional File 7). For the non-conserved cases, there are two possibilities: either a correlated mutation is not conserved because the residues themselves at these positions are conserved, i.e. not varying, in other MADS domain proteins (which would support their functional importance) or there is variation at the positions in other MADS domain proteins but it is not correlated. To distinguish between these two possibilities, sequence entropy was calculated for each column in the multiple sequence alignments (see Methods). Next, homologous positions in various MADS domain protein alignments were divided into two groups, one with correlated mutation occurring at that position, and one without. Sequence entropy was compared between those groups. 
This showed that correlated mutation positions which were conserved in at least one other protein had on average a higher sequence entropy (2.2 +/- 0.5) than the homologous positions where the correlated mutations were not conserved (1.9 +/- 0.2). Indeed, in 74% of the cases conserved correlated mutation positions had a higher entropy than the homologous positions where no correlated mutation was detected. This means that no correlated mutation was observed in those homologous positions because they were less variable. Correlated mutation positions that were not conserved in any other protein did not show such difference in sequence entropy. Hence, for correlated mutations that are not conserved at all, the homologous positions in other proteins are as variable as the position where the correlated mutation occurs, but in these other proteins no compensatory correlated mutation occurs. These results fit within the framework of correlated mutations occurring when a second mutation compensates for an earlier deleterious one and indicate that this is most likely the case for correlated mutations which are conserved in at least one other protein. For those correlated mutations that are not conserved at all this interpretation is less likely because these positions show as much variation in other proteins as in the protein where the correlated mutation occurs.
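The per-column sequence entropy used in this comparison can be sketched as follows. The exact definition is given in the Methods and is not reproduced here, so this sketch assumes Shannon entropy with a base-2 logarithm and assumes that gap characters are ignored; both choices, and the function names, are illustrative.

```python
import math
from collections import Counter

def column_entropy(column, gap="-"):
    """Shannon entropy (in bits) of one alignment column, gaps ignored."""
    residues = [r for r in column if r != gap]
    if not residues:
        return 0.0
    n = len(residues)
    counts = Counter(residues)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

def alignment_entropies(alignment):
    """Entropy for every column of an alignment given as equal-length strings."""
    return [column_entropy(col) for col in zip(*alignment)]

# Toy alignment (hypothetical sequences).
alignment = ["MKV", "MRV", "MKL", "MRI"]
entropies = alignment_entropies(alignment)
print(entropies)
```

A fully conserved column has entropy 0, and the value grows as more residue types occur at similar frequencies, which is why lower entropy at homologous positions indicates less variability.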
Analysis of MADS domain protein and complex structure
Based on the analyses described above we conclude that the correlated mutation analysis results clearly contain biological signal. We now describe application of these results in order to obtain insight into the structural organization of MADS domain proteins and their complexes. In particular, we focus on the K-domain, because structure information is already available for the MADS and I domain (see above), and the C-terminal domain is predicted to be unstructured.
Intramolecular organization of K-domain helices
Next, we analyzed whether correlations were observed between helices, in order to infer their orientation with respect to each other. Because only a few intramolecular correlated mutation positions occur between predicted K-domain helices (15 pairs of positions, in 3 different proteins: AP1, SEP1 and SEP3; these predicted connections are listed in Additional File 9), our results suggest that these helices do not directly contact each other intramolecularly in most MADS domain proteins. This is in line with suggestions in the literature that these helices would be involved in intermolecular contact [43, 44]. This suggestion is reinforced by the fact that we do observe intramolecular correlated mutations between the K-domain helices and the MADS/I domain: 115 pairs of positions in 8 different proteins (Additional File 10). These predicted connections mainly involve the first K-domain helix, which is indeed expected to contact the MADS/I domain as it is directly connected via the primary sequence. Of these pairs, only 10 show conservation, which is quite low compared to the overall conservation for correlated mutation pairs (63%, see above); however, one reason might be that the I domain is more variable and less well alignable than the MADS or K-domain. These cases of conserved correlated mutations are shown in Additional File 10. Two examples of such conserved predicted contacts are those between Val36 and Ser58, respectively, and two residues in the first predicted K-domain helix of SEP3, and between the same positions in AP1. An interesting aspect here is that Val36 and Ser58 are located close to each other (~9 Å) in a structure model of SEP3 based on the available crystal structure of the MADS domain, and the residues in the K-domain helix which show correlated mutation with these two residues have a sequential distance of 6 residues, corresponding with almost two turns of a helix, which corresponds to ~3 Å.
Taking into account that contacts will be made via side chains, which bridge several Å, these distances show a nice match (Figure 5C).
Analysis of intermolecular interactions
In a recent analysis of MADS interaction specificity, we focused in particular on one part of the I domain where we found a 'motif hotspot': experimental investigation with yeast-two-hybrid validated the importance of this region, and using available structural information we hypothesized that there would be an interaction between this region and a complementary region in the K-domain. As the motif in this region was specifically validated for the SOC1 protein, we analyzed correlated mutation pairs for SOC1 with interaction partners where the position in SOC1 overlapped with this 'hotspot' region. Several of the complementary correlated mutation pairs fall specifically in the first predicted helix in the K-region, providing additional validation for our original hypothesis (Figure 6B).
Our analysis of correlated mutations in the MADS domain protein family provides a unique way to investigate the amount of signal that such mutations leave in protein sequences. We studied correlated mutations in various family members in terms of their conservation, and were able to compare correlated mutations between interacting pairs of proteins and non-interacting pairs of proteins. The intramolecular correlated mutation results show a clear enrichment of residue pairs located close to each other in the MADS domain. There are some variations between proteins in the number of correlated mutation pairs and the percentage located close to each other. We did not observe a clear correlation between the number of sequences available for each protein and the number of correlated mutation pairs or the short distance enrichment. We also tested whether the number of predicted correlated mutation positions or the distance enrichment depended on quality measures of the alignments that were used (e.g. fraction of gaps in the alignment) but found no such correlation.
The majority of the intramolecular correlated mutations were observed in at least two MADS proteins, i.e. they showed conservation. We found that when such conserved correlated mutations were not observed in other MADS proteins, this is mostly because these positions are more conserved and not because of uncorrelated variability in these other proteins. This analysis gives additional support to the interpretation of correlated mutations as "one mutation followed by a compensatory mutation". Such support is important because of the need to infer "coevolution" based on observed "covariation", a process in which noise can be present, as discussed in the Introduction.
A possible confounding factor for intermolecular correlated mutation analysis is that we cannot be sure that the predicted orthologs in all the various species that we analyze do indeed interact. To get some further insight into this issue, we assembled a set of interacting MADS domain proteins from various species from literature [30, 61, 62, 63, 64, 65, 66, 67]. Using sequence identity with Arabidopsis proteins as criterion, orthology relationships were predicted, and next we assessed whether the interaction would have been correctly predicted based on the Arabidopsis interaction data. This was the case in over 60% of the interactions (data not shown). A random prediction would have a much lower success rate because there are many more non-interacting than interacting pairs of Arabidopsis MADS domain proteins. Still, this number clearly illustrates a problem with which all intermolecular correlated mutation approaches have to deal, i.e. that many interactions will be missed and/or incorrectly assigned. Indeed, validation by for example structure information shows that the fraction of residue pairs in close contact is lower for the intermolecular correlated mutations than for the intramolecular correlated mutations.
Our approach is unique in using a set of interacting protein pairs and a set of related non-interacting protein pairs as a reference. As the latter would be expected not to have correlations with each other, they serve as negative controls. Using these, we found i) that the overrepresentation of intermolecular residues at short distances is higher for interacting protein pairs than for non-interacting pairs; ii) that there is more consistency between results from different interacting pairs than between results from different non-interacting pairs; iii) that there is a better overlap between correlated mutation results from interacting protein pairs and our previously predicted interaction motifs than between correlated mutation results from non-interacting protein pairs and those motifs; and iv) that correlated mutation positions from interacting protein pairs have less overlap with non-synonymous SNPs than those from non-interacting pairs. Although some trends are weak on their own, they are all consistent.
Our results here are complementary to our previous analysis of sequence determinants of MADS protein interaction specificity. In particular, that analysis focused on using sequences from Arabidopsis MADS domain proteins in order to find motifs that are responsible for interaction specificity. In our current study, we use the large amount of sequence data that is available, in order to find correlated mutation pairs. There is no reason why these pairs should specifically contain information about interaction specificity, but rather one would expect that they contain information about interaction sites in general. As such, the predicted interaction motifs would be expected to form a subset of the correlated mutation sites, and in line with that, indeed the coverage of predicted interaction motifs by correlated mutation positions is higher than the coverage of correlated mutation positions by predicted interaction motifs. An important point is also that correlated mutation positions by definition are sites which are not conserved evolutionarily, whereas the motif positions are relatively conserved; this again limits the possible amount of overlap between these two analyses. Still, the fact that we do find significant overlap indicates that a combination of these two approaches might be particularly powerful.
Our results provide understanding of structural properties of the important plant MADS proteins. In particular, our correlated mutation analysis confirms predicted helices in the K-domain, and supports a specific organization of these helices in the MADS dimers. Also, we obtain further support for an interaction region in the I domain. Hence, in addition to obtaining general insight into coevolution signals at the protein level, we also demonstrate the use of these signals to test specific hypotheses about structural properties of proteins.
A set of type II MADS proteins was obtained as follows (Figure 1). First, Interpro was used to obtain UniProtKB identifiers of sequences in various species that contained both a MADS domain and a K-domain (PFAM domains PF00319 and PF01486, respectively); these sequences were retrieved from UniProt. Secondly, the NCBI web_blast.pl script was used with in turn each Arabidopsis type II sequence as query, searching the NR database with blastp. Hmmsearch was used to select sequences with both a MADS domain and a K-domain. Thirdly, the genome sequences of rice, poplar, grape vine, Physcomitrella patens, maize (http://www.maizesequence.org), medicago (http://www.medicago.org/genome), papaya and sorghum were scanned using hmmsearch to obtain sequences with both a MADS domain and a K-domain.
Next, orthologs were assigned to the various Arabidopsis sequences. We used a "best hit" criterion, based on the value of the sequence identity (calculated using gaps as non-identical residues) after separately aligning each of the obtained sequences with each of the Arabidopsis sequences, using Muscle. For the sequences obtained from the eight genomes (where we are relatively sure that all relevant sequences are obtained) this criterion was applied bidirectionally, whereas for the other sequences it was only required that the respective Arabidopsis sequence was their best hit (and not that they were also the best hit of that Arabidopsis sequence). We also tested the use of a bidirectional best-hit criterion for these other sequences, and found that it did not improve results. Note that a recent study suggested that it is beneficial to include both orthologs and paralogs in the multiple sequence alignment used as input for correlated mutation analysis. Hence, a more restrictive bidirectional best-hit approach would not necessarily be expected to give better results.
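The best-hit assignment described above can be sketched as follows. This is an illustration under stated assumptions rather than the procedure as implemented: the function names and toy identifiers are hypothetical, the identity is taken over all alignment columns with any gapped column counted as a mismatch, and ties are broken by input order.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity of two aligned sequences; gapped columns count as mismatches."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    # Columns that are gaps in both sequences carry no information.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if not (a == "-" and b == "-")]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)

def best_hits(identity, bidirectional=False):
    """Assign candidate sequences to Arabidopsis sequences by best hit.

    identity maps (candidate, arabidopsis) to percent identity.
    Returns {candidate: arabidopsis}.
    """
    forward = {}   # best Arabidopsis sequence for each candidate
    reverse = {}   # best candidate for each Arabidopsis sequence
    for (cand, ath), value in identity.items():
        if cand not in forward or value > identity[(cand, forward[cand])]:
            forward[cand] = ath
        if ath not in reverse or value > identity[(reverse[ath], ath)]:
            reverse[ath] = cand
    if not bidirectional:
        return forward
    return {c: a for c, a in forward.items() if reverse[a] == c}

# Toy identities for two sequences from one species (hypothetical values).
identity = {
    ("os1", "AP1"): 60.0, ("os1", "FUL"): 50.0,
    ("os2", "AP1"): 55.0, ("os2", "FUL"): 45.0,
}
one_way = best_hits(identity)
two_way = best_hits(identity, bidirectional=True)
print(one_way, two_way)
```

In this toy case the one-directional criterion keeps both sequences as best hits of AP1, while the bidirectional criterion keeps only one of them, which illustrates why the latter is the more restrictive choice.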
Subsequently, in each species separately, blastclust with sequence identity cutoff of 95% was used for each group of sequences which simultaneously were "best hits" for a given Arabidopsis sequence (the cutoff of 95% was based on the observation that this keeps the Arabidopsis MADS proteins apart). A representative for each cluster was chosen randomly, except that preference was given to Interpro-based sequences compared to blast-based sequences and sequences from the genomes were preferred over both Interpro-based sequences and blast-based sequences. In addition, at least 25% sequence identity was required between a sequence and the Arabidopsis sequence that was its best hit.
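The choice of a cluster representative described above amounts to a priority ordering over sequence sources with a random pick among equally preferred members. A minimal sketch, in which the source labels, the function name and the data layout are assumptions made for illustration:

```python
import random

# Lower number means higher preference: genome > Interpro > blast.
SOURCE_PRIORITY = {"genome": 0, "interpro": 1, "blast": 2}

def pick_representative(cluster, rng=random):
    """Pick one sequence id from a cluster of (sequence_id, source) tuples.

    The most preferred source present in the cluster wins; among several
    sequences from that source the choice is random.
    """
    best = min(SOURCE_PRIORITY[source] for _, source in cluster)
    candidates = [seq_id for seq_id, source in cluster
                  if SOURCE_PRIORITY[source] == best]
    return rng.choice(candidates)

# Toy cluster (hypothetical identifiers).
cluster = [("seqA", "blast"), ("seqB", "interpro"), ("seqC", "blast")]
representative = pick_representative(cluster)
print(representative)
```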
Our choice to detect orthologs using blast hits is a pragmatic one. A more elaborate and time-consuming approach would be to make use of phylogenetic trees, which however have their own degree of uncertainty. We tested how different the results would be upon application of phylogenetic relationships from previously published phylogenetic trees for the MADS domain proteins AP1 and FUL. When comparing with structure data, resulting correlated mutations for these cases did not contain more residue pairs at lower distances than what was obtained when using blast (data not shown). Hence we do not further discuss these results.
To analyze intramolecular correlated mutations, the only step to take next was to align the sequences of each Arabidopsis MADS domain protein with all its associated sequences, for which Muscle was used. The alignments were used for the correlated mutation analysis if at least 30 sequences were present. For intermolecular correlated mutation analysis, interaction data from De Folter et al. were used. We combined for each pair of interacting Arabidopsis sequences their predicted orthologs within each species. If in one species multiple sequences were best hits with one of the two interacting sequences, we combined each of those with the best hits in that species of its interaction partner. For example, if Arabidopsis protein X and Arabidopsis protein Y interact and both have two best hits in a given species, in that species there are 2 * 2 = 4 combinations.
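The within-species combination of best hits described above is a Cartesian product per species. A minimal sketch, with hypothetical species names and sequence identifiers:

```python
from itertools import product

def ortholog_pairs(hits_x, hits_y):
    """Combine the best hits of two interacting proteins within each species.

    hits_x and hits_y map a species name to the list of sequences that are
    best hits of Arabidopsis protein X and Y, respectively. Species that
    lack a hit for either protein contribute no pairs.
    """
    pairs = []
    for species in sorted(set(hits_x) & set(hits_y)):
        pairs.extend(product(hits_x[species], hits_y[species]))
    return pairs

# Toy input (hypothetical identifiers).
hits_x = {"rice": ["x1", "x2"], "poplar": ["x3"]}
hits_y = {"rice": ["y1", "y2"], "maize": ["y3"]}

pairs = ortholog_pairs(hits_x, hits_y)
print(len(pairs), pairs)
```

Here rice contributes 2 * 2 = 4 pairs, while poplar and maize contribute none because a hit is available for only one of the two partners.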
After alignment, the resulting sets of interaction pairs (each consisting of one original Arabidopsis interaction pair and the ortholog pairs obtained for various other species) were used as input for the correlated mutation analysis if at least 30 pairs were present. As a background model, non-interacting pairs with at least 30 associated sequence pairs were used as input.
Note that the cutoff on the number of sequences (30) that we use is somewhat arbitrary, but such a cutoff is clearly needed because the smaller the number of sequences, the less reliable the correlated mutation results are.
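The pairing of orthologs and the size cutoff described above can be illustrated with a minimal sketch. The data layout (a dictionary from species name to a list of best-hit identifiers) and the function names `ortholog_pairs` and `enough_pairs` are hypothetical and chosen only for illustration; this is not the pipeline actually used.

```python
from itertools import product

MIN_PAIRS = 30  # cutoff on the number of sequence pairs stated in the text


def ortholog_pairs(best_hits_x, best_hits_y):
    """Combine per-species best hits of two interacting Arabidopsis proteins.

    best_hits_x / best_hits_y map a species name to the list of sequences
    whose best hit is protein X / protein Y (hypothetical data layout).
    Species without hits for both partners contribute no pairs.
    """
    pairs = []
    for species in sorted(set(best_hits_x) & set(best_hits_y)):
        # Every best hit of X is paired with every best hit of Y.
        for seq_x, seq_y in product(best_hits_x[species], best_hits_y[species]):
            pairs.append((species, seq_x, seq_y))
    return pairs


def enough_pairs(pairs, minimum=MIN_PAIRS):
    """Return True if the set is large enough to be analyzed."""
    return len(pairs) >= minimum


hits_x = {"rice": ["x1", "x2"], "poplar": ["x3"]}
hits_y = {"rice": ["y1", "y2"], "grape": ["y3"]}
pairs = ortholog_pairs(hits_x, hits_y)
print(len(pairs))  # 4 combinations, all from rice
```

In this toy input only rice has best hits for both partners, so the 2 * 2 = 4 combinations of the example in the text are produced; the set would be rejected by the 30-pair cutoff.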
Correlated Mutation analysis with CAPS
CAPS compares the correlated variance of the evolutionary rates at two sites in a multiple sequence alignment by comparing the transition probabilities between each pair of amino acids at the two sites, using the BLOSUM substitution matrix. Because sequences that diverged longer ago are more likely to fix mutations at two sites by chance, BLOSUM values are normalized by the time of divergence between sequences using Poisson corrected amino acid distances; we performed analysis both with and without this time correction. The coevolution between two sites is then estimated as the correlation in the pairwise amino acid variability, relative to the mean variability per site. Correlated mutation pairs are grouped based on their connectivity to each other; only those "correlated groups" were analyzed.
To determine significance of these correlations, re-sampling can be performed. However, because this is computationally expensive (keeping in mind that we perform correlated mutation analysis for various MADS domain proteins and various pairs of MADS domain proteins), we chose to use a cutoff on the value of the correlation coefficient, which we set to 0.4, in agreement with previous correlated mutation analyses. This is a conservative threshold as it is slightly above the lowest correlation coefficient values found to be significant in an earlier application of CAPS. We performed resampling afterwards for a number of MADS protein pairs to analyze the significance of the results obtained when comparing the correlated mutations with available structural data (see below, randomization trials). We also tested a previously described approach to remove spurious phylogenetic correlation by using subalignments where specific clades are removed. This approach was implemented by using small subunit ribosomal RNA sequences obtained from http://gobase.bcm.umontreal.ca/searches/gene.php to obtain distances between species and using Clustalw to build a tree. As the results of this analysis did not improve compared to the results without this correction, we only present the latter results. This is in line with an analysis that showed that tree-aware correlated mutation methods did not outperform tree-ignorant methods.
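The idea of correlating pairwise variability at two alignment columns can be sketched as follows. This is a deliberately simplified stand-in, not the CAPS implementation: the function `toy_score` replaces the BLOSUM lookup with an identity score (an assumption made to keep the example self-contained), no time correction is applied, and all function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def toy_score(a, b):
    # Placeholder for a BLOSUM lookup (assumption): 1 if identical, else 0.
    return 1.0 if a == b else 0.0


def site_variability(column, score=toy_score):
    """Squared deviation from the site mean for every pair of sequences."""
    thetas = [score(a, b) for a, b in combinations(column, 2)]
    mean = sum(thetas) / len(thetas)
    return [(t - mean) ** 2 for t in thetas]


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input has no variance."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def site_correlation(col_a, col_b):
    """Correlation of pairwise variability between two alignment columns."""
    return pearson(site_variability(col_a), site_variability(col_b))


# Two columns that change in the same sequences correlate perfectly;
# a fully conserved column carries no signal.
r_covarying = site_correlation("AAAGG", "LLLVV")
r_conserved = site_correlation("AAAGG", "KKKKK")
```

Both columns are read over the same ordered set of sequences, so entry i of one column and entry i of the other belong to the same sequence (or, for intermolecular analysis, the same sequence pair).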
Comparison with protein structure data and predicted interaction motifs
Although no structure for plant MADS domain proteins is available, a couple of structures of human MADS domains are available. Of these, 1tqe, 1n6j, 1egw and 3kov are crystal structures of MEF2-type MADS domains, which are most related to plant MIKC (type II) MADS domains. Because 1egw, the structure of human MEF2A, has the best resolution of these structures, we chose this structure for comparison of the correlated mutation analysis results with structure data. The structure of human MEF2B, 1n6j, has the second-best resolution and we also used this structure for comparison. Results obtained with this structure are almost indistinguishable from those obtained with 1egw, so we only report results for the latter.
Correlated mutation pairs were compared with protein structure data as follows. For all intra- and intermolecular pairs of residues in the PDB structure 1egw, the shortest heavy-atom distance was obtained. Mapping of the Arabidopsis sequence to the structure was obtained via Muscle alignment. For this, residues 2-69 of the structure were used. For residues 2-59, which constitute the MADS domain, there is high overall sequence similarity with the plant MADS domain, and for residues 60-69 there is also reasonable sequence similarity with the first part of the plant I domain. For the various proteins, this similarity (number of conservative substitutions) is at least 7 out of 10. However, the sequence identity with the plant I domain is lower than for the MADS domain, meaning that the results of comparison with this part of the structure could be noisier.
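The shortest heavy-atom distance between two residues can be computed as in the sketch below. The residue representation (a list of element and coordinate tuples) and the function name `min_heavy_atom_distance` are assumptions for illustration; parsing of the PDB file is left out.

```python
from math import dist


def min_heavy_atom_distance(res_a, res_b):
    """Shortest distance between non-hydrogen atoms of two residues.

    Each residue is a list of (element, (x, y, z)) tuples with coordinates
    in Angstrom (hypothetical layout); hydrogens are ignored.
    """
    heavy_a = [xyz for element, xyz in res_a if element != "H"]
    heavy_b = [xyz for element, xyz in res_b if element != "H"]
    return min(dist(p, q) for p in heavy_a for q in heavy_b)


res_1 = [("N", (0.0, 0.0, 0.0)), ("H", (0.0, 0.0, 2.5))]
res_2 = [("C", (0.0, 0.0, 3.0)), ("O", (0.0, 4.0, 0.0))]
print(min_heavy_atom_distance(res_1, res_2))  # 3.0 (the hydrogen is skipped)
```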
In addition, correlated mutation pairs were compared with previously predicted interaction motifs. Because these interaction motifs are grouped into pairs of complementary motifs, correlated mutation positions were compared both to individual motifs and to pairs of complementary motifs.
To predict coiled coils in the K-domain, a method which compares sequences with sequences of known coiled-coil proteins was used, which is available via http://npsa-pbil.ibcp.fr/cgi-bin/npsa_automat.pl?page=npsa_lupas.html. Default settings were used (scoring matrix 2-MTIDK, no upweighting of positions a and d), and a window length of 14, minimum coil probability of 0.5 and minimum length of 4 residues were applied to predict coiled coil helices based on the predicted coil probabilities. Helical wheel representations were generated with http://rzlab.ucr.edu/scripts/wheel/wheel.cgi
Modelling of the structure of the MADS domain of SEP3 was performed using Modeller. Out of 1000 generated models, the 10 best based on the objective score were used for docking. Modelling of a K-domain helix was performed in CNS. Dihedral angle restraints were defined for backbone angles phi, -65° ± 20° and psi -40° ± 20°, respectively, and hydrogen bond restraints were defined between each O(i)-N(i+4) pair (lower and upper bound 2.3 and 3.5 Å, respectively) and O(i)-HN(i+4) pair (lower and upper bound 1.7 and 2.5 Å, respectively). The anneal.inp CNS-script was used, which applies a high-temperature torsion-angle dynamics phase followed by a torsion angle dynamics cooling phase and a second cartesian dynamics cooling phase. Ten structures were calculated, and the lowest energy structure was used. Protein structure figures were prepared using Molscript and Raster3D.
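Turning per-residue coil probabilities into predicted helices with the stated thresholds can be sketched as below. The function name `coil_segments` is hypothetical, and treating a probability exactly equal to the threshold as "coil" is an assumption; the per-residue probabilities themselves would come from the prediction server.

```python
def coil_segments(probabilities, min_prob=0.5, min_length=4):
    """Return (start, end) index pairs of runs with probability >= min_prob.

    Indices are 0-based and inclusive; runs shorter than min_length
    residues are discarded.
    """
    segments = []
    start = None
    for i, p in enumerate(probabilities):
        if p >= min_prob:
            if start is None:
                start = i  # a new run begins here
        elif start is not None:
            if i - start >= min_length:
                segments.append((start, i - 1))
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(probabilities) - start >= min_length:
        segments.append((start, len(probabilities) - 1))
    return segments


probs = [0.1, 0.6, 0.7, 0.9, 0.8, 0.2, 0.6, 0.6, 0.1, 0.9, 0.9, 0.9, 0.9]
print(coil_segments(probs))  # [(1, 4), (9, 12)]
```

The run at indices 6-7 is dropped because it is only two residues long.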
Correlated mutations and sequence entropy
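Sequence entropy $S_k$ at alignment position k was calculated as

$$S_k = -\sum_j P_{jk} \log P_{jk}$$

where $P_{jk}$ is the frequency of amino acid j at position k.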
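As a minimal sketch, assuming that sequence entropy here means the Shannon entropy of the amino acid frequencies in one alignment column, the calculation could look as follows. The use of the natural logarithm and the exclusion of gap characters are assumptions, and the function name `column_entropy` is hypothetical.

```python
from collections import Counter
from math import log


def column_entropy(column):
    """Shannon entropy of the amino acid frequencies in one alignment column.

    Gap characters ('-') are ignored (assumption); the natural logarithm
    is used (assumption).
    """
    residues = [r for r in column if r != "-"]
    counts = Counter(residues)
    n = len(residues)
    return -sum((c / n) * log(c / n) for c in counts.values())


conserved = column_entropy("AAAA")  # a fully conserved column has zero entropy
variable = column_entropy("AAGG")   # two equally frequent residues give log(2)
```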
Randomization trials
Here we describe the various random trials that were performed in order to test for statistical significance. To assess the statistical significance of the intramolecular distance enrichment, 1000 random subsets of residue pairs were generated (with the size of the subset equal to the number of correlated mutation residue pairs). For these, the fraction of residue pairs within 15 Å of each other was calculated.
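The random-subset test for distance enrichment can be sketched as follows. The dictionary of pair distances, the function names, and the definition of the empirical p-value (the share of random subsets that reach at least the observed fraction) are assumptions for illustration.

```python
import random


def fraction_within(pairs, distances, cutoff=15.0):
    """Fraction of residue pairs whose distance is at most cutoff (Angstrom)."""
    return sum(distances[p] <= cutoff for p in pairs) / len(pairs)


def empirical_p_value(observed_pairs, distances, n_trials=1000,
                      cutoff=15.0, seed=0):
    """Compare observed pairs with random subsets of the same size.

    distances maps every residue pair (i, j) to its distance
    (hypothetical layout). Returns the observed fraction and the share
    of random subsets that score at least as high.
    """
    rng = random.Random(seed)
    all_pairs = list(distances)
    observed = fraction_within(observed_pairs, distances, cutoff)
    hits = 0
    for _ in range(n_trials):
        subset = rng.sample(all_pairs, len(observed_pairs))
        if fraction_within(subset, distances, cutoff) >= observed:
            hits += 1
    return observed, hits / n_trials


# Toy "structure": residues on a line, 5 Angstrom apart.
toy_distances = {(i, j): 5.0 * (j - i)
                 for i in range(10) for j in range(i + 1, 10)}
observed, p_value = empirical_p_value([(0, 1), (1, 2), (2, 3)], toy_distances)
```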
To assess the significance of observed intermolecular short distance enrichment for correlated mutation positions, we applied a randomization procedure where the original pairs of sequences that formed an input set for CAPS were randomly shuffled. This was repeated 1000 times.
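One possible reading of this shuffling step is that the partner sequences are permuted, so that each sequence of the first protein is combined with a randomly chosen sequence of the second. The sketch below implements that reading only; the function name `shuffle_partners` is hypothetical and the exact procedure used may differ.

```python
import random


def shuffle_partners(pairs, rng):
    """Randomly reassign partners within a set of sequence pairs.

    The first members keep their order; the second members are permuted.
    """
    left = [a for a, _ in pairs]
    right = [b for _, b in pairs]
    rng.shuffle(right)
    return list(zip(left, right))


original = [("x_ath", "y_ath"), ("x_rice", "y_rice"), ("x_grape", "y_grape")]
rng = random.Random(0)
# 1000 shuffled input sets, each of which would be analyzed in turn.
shuffled_sets = [shuffle_partners(original, rng) for _ in range(1000)]
```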
To assess the significance of the observed overlap between correlated mutation residues and predicted interaction motifs, random 'correlated mutation' pairs were generated by replacing each position in an observed correlated mutation position pair with a randomly generated sequence position. In doing so, we took into account that a position could occur in several correlated mutation pairs; such a position was replaced by the same random position in all its correlated mutation position pairs.
Finally, to assess the statistical significance of the observed preferred sequence-distance for correlated mutation positions within helices in the K-domain, we analyzed whether similar preferred sequence-distances occurred within randomly generated stretches of the sequence. The number and length distribution of these stretches were similar to those of the predicted K-domain coils, but their position within the sequence was randomized.
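The consistent replacement of positions can be sketched with a lookup table that remembers the random position drawn for each observed position. The single shared position numbering, the 1-based positions, and the function name `randomize_pairs` are assumptions; the sketch also does not prevent two different observed positions from receiving the same random position, which the text does not specify.

```python
import random


def randomize_pairs(pairs, sequence_length, rng):
    """Replace each position by a random one, consistently across pairs."""
    mapping = {}

    def remap(position):
        # Draw a random position once and reuse it for every later occurrence.
        if position not in mapping:
            mapping[position] = rng.randrange(1, sequence_length + 1)
        return mapping[position]

    return [(remap(a), remap(b)) for a, b in pairs]


observed_pairs = [(3, 10), (3, 25), (10, 40)]
random_pairs = randomize_pairs(observed_pairs, 100, random.Random(1))
```

Position 3 occurs in two observed pairs and therefore maps to the same random position in both randomized pairs, which preserves the connectivity of the original set.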
This work was supported by the BioRange programme (SP 2.3.1) of the Netherlands Bioinformatics Centre (NBIC), which is supported through the Netherlands Genomics Initiative (NGI), and by the Netherlands Organization for Scientific Research (NWO, NWO-VENI Grant 863.08.027 to ADJvD). We also thank Richard Immink for helpful discussions.
- 5. Kuipers RKP, Joosten HJ, Verwiel E, Paans S, Akerboom J, van der Oost J, Leferink NGH, van Berkel WJH, Vriend G, Schaap PJ: Correlated mutation analyses on super-family alignments reveal functionally important residues. Proteins-Structure Function and Bioinformatics. 2009, 76 (3): 608-616. 10.1002/prot.22374.
- 10. Buslje CM, Santos J, Delfino JM, Nielsen M: Correction for phylogeny, small number of observations and data redundancy improves the identification of coevolving amino acid pairs using mutual information. Bioinformatics. 2009, 25 (9): 1125-1131. 10.1093/bioinformatics/btp135.
- 23. Hakes L, Lovell SC, Oliver SG, Robertson DL: Specificity in protein interactions and its relationship with sequence diversity and coevolution. Proceedings of the National Academy of Sciences of the United States of America. 2007, 104 (19): 7999-8004. 10.1073/pnas.0609962104.
- 27. Angenent G, de Folter S, Nougalli I, Immink R: Protein complexes make the flower. Comparative Biochemistry and Physiology a-Molecular & Integrative Physiology. 2006, 143 (4): S167-S167.
- 30. de Folter S, Immink RGH, Kieffer M, Parenicova L, Henz SR, Weigel D, Busscher M, Kooiker M, Colombo L, Kater MM, et al: Comprehensive interaction map of the Arabidopsis MADS box transcription factors. Plant Cell. 2005, 17 (5): 1424-1433. 10.1105/tpc.105.031831.
- 31. Parenicova L, de Folter S, Kieffer M, Horner DS, Favalli C, Busscher J, Cook HE, Ingram RM, Kater MM, Davies B, et al: Molecular and phylogenetic analyses of the complete MADS-box transcription factor family in Arabidopsis: New openings to the MADS world. Plant Cell. 2003, 15 (7): 1538-1551. 10.1105/tpc.011544.
- 36. Huang K, Louis JM, Donaldson L, Lim FL, Sharrocks AD, Clore GM: Solution structure of the MEF2A-DNA complex: structural basis for the modulation of DNA bending and specificity by MADS-box transcription factors. Embo J. 2000, 19 (11): 2615-2628. 10.1093/emboj/19.11.2615.
- 39. Krizek BA, Meyerowitz EM: Mapping the protein regions responsible for the functional specificities of the Arabidopsis MADS domain organ-identity proteins. Proceedings of the National Academy of Sciences of the United States of America. 1996, 93 (9): 4063-4070. 10.1073/pnas.93.9.4063.
- 40. Riechmann JL, Krizek BA, Meyerowitz EM: Dimerization specificity of Arabidopsis MADS domain homeotic proteins APETALA1, APETALA3, PISTILLATA, and AGAMOUS. Proceedings of the National Academy of Sciences of the United States of America. 1996, 93 (10): 4793-4798. 10.1073/pnas.93.10.4793.
- 46. Martinez-Castilla LP, Alvarez-Buylla ER: Adaptive evolution in the Arabidopsis MADS-box gene family inferred from its complete resolved phylogeny. Proceedings of the National Academy of Sciences of the United States of America. 2003, 100 (23): 13407-13412. 10.1073/pnas.1835864100.
- 47. Nam J, Kaufmann K, Theissen G, Nei M: A simple method for predicting the functional differentiation of duplicate genes and its application to MIKC-type MADS-box genes. Nucleic Acids Research. 2005, 33 (2): 10.1093/nar/gki978.
- 48. Hernandez-Hernandez T, Martinez-Castilla LP, Alvarez-Buylla ER: Functional diversification of B MADS-Box homeotic regulators of flower development: Adaptive evolution in protein-protein interaction domains after major gene duplication events. Molecular Biology and Evolution. 2007, 24 (2): 465-481. 10.1093/molbev/msl182.
- 50. Lenser T, Theissen G, Dittrich P: Developmental Robustness by Obligate Interaction of Class B Floral Homeotic Genes and Proteins. Plos Computational Biology. 2009, 5 (1): 10.1371/journal.pcbi.1000264.
- 51. Espinosa-Soto C, Padilla-Longoria P, Alvarez-Buylla ER: A gene regulatory network model for cell-fate determination during Arabidopsis thaliana flower development that is robust and recovers experimental gene expression profiles. Plant Cell. 2004, 16 (11): 2923-2939. 10.1105/tpc.104.021725.
- 54. van Dijk ADJ, Morabito G, Fiers M, Van Ham RCHJ, Angenent GC, Immink RGH: Sequence motifs in MADS transcription factors responsible for specificity and diversification of protein-protein interaction. Plos Computational Biology.
- 60. Immink RGH, Tonaco IAN, de Folter S, Shchennikova A, van Dijk ADJ, Busscher-Lange J, Borst JW, Angenent GC: SEPALLATA3: the 'glue' for MADS box transcription factor complex formation. Genome Biology. 2009, 10 (2): 10.1186/gb-2009-10-2-r24.
- 63. Fornara F, Parenicova L, Falasca G, Pelucchi N, Masiero S, Ciannamea S, Lopez-Dee Z, Altamura MM, Colombo L, Kater MM: Functional characterization of OsMADS18, a member of the AP1/SQUA subfamily of MADS box genes. Plant Physiology. 2004, 135 (4): 2207-2219. 10.1104/pp.104.045039.
- 64. Kane NA, Danyluk J, Tardif G, Ouellet F, Laliberte JF, Limin AE, Fowler DB, Sarhan F: TaVRT-2, a member of the StMADS-11 clade of flowering repressors, is regulated by vernalization and photoperiod in wheat. Plant Physiology. 2005, 138 (4): 2354-2363. 10.1104/pp.105.061762.
- 66. Shitsukawa N, Tahira C, Kassai KI, Hirabayashi C, Shimizu T, Takumi S, Mochida K, Kawaura K, Ogihara Y, Murai K: Genetic and epigenetic alteration among three homoeologous genes of a class E MADS box gene in hexaploid wheat. Plant Cell. 2007, 19 (6): 1723-1737. 10.1105/tpc.107.051813.
- 73. Velasco R, Zharkikh A, Troggio M, Cartwright DA, Cestaro A, Pruss D, Pindo M, FitzGerald LM, Vezzulli S, Reid J, et al: A High Quality Draft Consensus Sequence of the Genome of a Heterozygous Grapevine Variety. PLoS ONE. 2007, 2 (12): e1326. 10.1371/journal.pone.0001326.
- 79. Shan HY, Zhan N, Liu CJ, Xu GX, Zhang J, Chen ZD, Kong HZ: Patterns of gene duplication and functional diversification during the evolution of the AP1/SQUA subfamily of plant MADS-box genes. Molecular Phylogenetics and Evolution. 2007, 44 (1): 26-41. 10.1016/j.ympev.2007.02.016.
- 82. Thompson JD, Higgins DG, Gibson TJ: Clustal-W - Improving the Sensitivity of Progressive Multiple Sequence Alignment through Sequence Weighting, Position-Specific Gap Penalties and Weight Matrix Choice. Nucleic Acids Research. 1994, 22 (22): 4673-4680. 10.1093/nar/22.22.4673.
- 87. Brunger AT, Adams PD, Clore GM, DeLano WL, Gros P, Grosse-Kunstleve RW, Jiang JS, Kuszewski J, Nilges M, Pannu NS, et al: Crystallography & NMR system: A new software suite for macromolecular structure determination. Acta Crystallographica Section D-Biological Crystallography. 1998, 54: 905-921. 10.1107/S0907444998003254.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.