The relationship between proteome size, structural disorder and organism complexity
Sequencing the genomes of the first few eukaryotes created the impression that gene number shows no correlation with organism complexity, often referred to as the G-value paradox. Several attempts have previously been made to resolve this paradox, citing multifunctionality of proteins, alternative splicing, microRNAs or non-coding DNA. As intrinsic protein disorder has been linked with complex responses to environmental stimuli and communication between cells, an additional possibility is that structural disorder may effectively increase the complexity of species.
We revisited the G-value paradox by analyzing many new proteomes of organisms whose complexity, measured as the number of distinct cell types, is known. We found that complexity and proteome size, measured as the total number of amino acids, correlate significantly and follow a power function relationship. We systematically analyzed numerous other features in relation to complexity in several organisms and tissues and found that: the fraction of protein structural disorder increases significantly between prokaryotes and eukaryotes but does not increase further over the course of evolution; the number of predicted binding sites in disordered regions in a proteome increases with complexity; and the fraction of protein disorder, predicted binding sites, alternative splicing and protein-protein interactions all increase with the complexity of human tissues.
We conclude that complexity is a multi-parametric trait, determined by interaction potential, alternative splicing capacity, tissue-specific protein disorder and, above all, proteome size. The G-value paradox is only apparent when plants are grouped with metazoans, as they have a different relationship between complexity and proteome size.
Keywords: Alternative Splice, Structural Disorder, Alternative Splice Event, Biological Complexity, Protein Disorder
PIC: proteome information content
SCOP: Structural Classification of Proteins.
Biological complexity is a feature that increases during evolution, distinguishing us from more primitive forms of life. Although it has no straightforward definition, it is generally accepted that it can be measured by the number of different cell types in an organism, ranging from 1 (bacteria) to about 200 (humans) [1, 2, 3, 4]. As complexity is apparently related to the amount of information an organism needs to function properly, and such information is contained in its genes, it was generally expected that the number of genes correlates with biological complexity. This expectation was called into doubt when the first eukaryotic genomes were sequenced, and the discrepancy is referred to as the G-value paradox. There have been numerous attempts to resolve the paradox, citing multifunctionality of proteins, microRNAs, non-protein-coding DNA or alternative splicing. In this paper we set out to revisit this problem, as the genomes of many more eukaryotes have been sequenced and new information has accumulated about their alternative splicing. In addition, we have paid special attention to the roles intrinsically disordered proteins (IDPs) might play in this respect in these organisms.
Intrinsically disordered proteins exist and function without a well-defined three-dimensional structure, typically carrying out signaling and regulatory functions [10, 11]. These functions are linked with complex responses to environmental stimuli and communication between cells, which raises the question of whether structural disorder can be linked to the complexity of species. This view is underscored by structural disorder being critical in protein-protein interactions (PPIs) [12, 13, 14], in the assembly of large protein complexes , and multiple activities of proteins . Compounded by the observation that the level of disorder is much higher in eukaryotes than prokaryotes , it is often implied that structural disorder increases with complexity [18, 19].
Here we carried out a systematic analysis of the possible correlation between proteome size, structural disorder, binding capacity and the complexity of 76 organisms ranging from bacteria to human, which cover the full complexity range of 1 to 200. We used the number of amino acids instead of the number of genes, as average protein length tends to vary a great deal among different organisms (from 282 to 814 in our collection). However, we found no overall, proteome-wide correlation between protein structural disorder and organism complexity, apart from the clear increase in disorder from prokaryotes to eukaryotes, with very large variations between different bacteria and also single-celled eukaryotes - for example, protozoa. When we looked at only those proteins that comprised domains associated with evolutionary expansion , we found that such proteins were significantly more disordered than the rest of the proteomes.
We analyzed another structural disorder-related feature, namely binary interactions in interactomes [12, 13, 14], and predicted interaction capacity of proteins in their disordered regions . We found that the total number of predicted binding sites correlated with the complexity of the organism but the average number of binding sites per protein did not. We extended these studies to human tissues, which are also thought to have different complexities. Analogously to the wide range of organisms appearing in this paper, we determined the complexity of human tissues as the number of different cell types they are composed of. We found significant differences in structural disorder and a clear-cut correlation with complexity of the tissue. We also analyzed protein binding sites and PPIs in the different human tissues and found a significant correlation between the two, following a power law distribution. The relationship was close to a quadratic one, signifying the prevalence of promiscuous, rather than one-on-one, protein binding. Alternative splicing also proved to be more prevalent in tissues that are regarded to be more complex and ranked similarly to other aspects, that is, disorder and protein binding.
Our overall conclusion is that complexity is a multi-parametric trait affected by interaction potential, alternative splicing capacity, tissue-specific protein disorder and, above all, proteome size, thus severely limiting the scope of the G-value paradox.
(Dis)solving the G-value paradox
We distinguished plants from other eukaryotes in Figure 1 because they clearly differ from other evolutionary clades in their relationship between biological complexity and proteome size: they have relatively large proteomes for a smaller number of cell types. While there is a weak but significant overall relationship between biological complexity and proteome size when all eukaryotic organisms are considered in Figure 1 (assuming a Gaussian distribution, R² = 0.1333, P-value = 0.0072), the correlation increases dramatically if we leave out plants (R² = 0.6326, P-value < 0.0001).
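A power function relationship of the kind described above can be estimated in several ways. The following is a minimal sketch that fits y = a · x^b by ordinary least squares on log-transformed values; this is one common approach and not necessarily the procedure used for Figure 1. The function name and the toy numbers are illustrative assumptions, not data from this study.

```python
import math


def fit_power_law(x_values, y_values):
    """Fit y = a * x**b by least squares on log-transformed data.

    Returns (a, b, r_squared); r_squared is computed in log-log space.
    All inputs must be strictly positive.
    """
    if len(x_values) != len(y_values) or len(x_values) < 3:
        raise ValueError("need at least three paired observations")
    log_x = [math.log(x) for x in x_values]
    log_y = [math.log(y) for y in y_values]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((lx - mean_x) ** 2 for lx in log_x)
    syy = sum((ly - mean_y) ** 2 for ly in log_y)
    sxy = sum((lx - mean_x) * (ly - mean_y) for lx, ly in zip(log_x, log_y))
    slope = sxy / sxx                      # exponent b
    intercept = mean_y - slope * mean_x    # log(a)
    r_squared = (sxy * sxy) / (sxx * syy)
    return math.exp(intercept), slope, r_squared


if __name__ == "__main__":
    # Invented toy values (proteome size in residues, number of cell types).
    sizes = [2.0e6, 5.0e6, 9.0e6, 1.2e7]
    cell_types = [3, 20, 60, 110]
    a, b, r2 = fit_power_law(sizes, cell_types)
    print(f"a = {a:.3g}, b = {b:.3f}, R^2 (log-log) = {r2:.3f}")
```

Note that an R² obtained in log-log space is not directly comparable to one computed on untransformed values.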
We also calculated the relationship for the alternative splice complement of those organisms for which there are data about their splice variants (Figure 1), taking into account all variants. We found the same type of relationship, but a noisier one, as alternative splicing information is scarce and much less reliable than the proteome of the main isoforms (R² = 0.262, P-value = 0.0357). Both human and fugu (Tru, Takifugu rubripes) have strikingly large alternative proteomes, although these are based only on RNA sequence information, which does not necessarily mean that viable proteins are produced from them.
We repeated the calculation using gene numbers in place of proteome sizes (Additional file 1) and found a weak but significant correlation between complexity and gene number only when we treated plants as a separate group; strictly speaking, therefore, the G-value paradox still holds up, but only when we consider plants together with other eukaryotes.
To make sure the results are not confounded by phylogenetic pseudoreplication (that is, bias in statistics due to lack of statistical independence among the closely related species we used in our studies), we grouped the 76 species into 6 large phylogenetic groups and used one-way analysis of variance (ANOVA) to see if they have significantly different values for the PIC proteome size (Additional file 2). We found, in accordance with the species-level studies, that (i) plants have larger proteomes than metazoan groups, (ii) there is no significant difference in proteome size between fungi and protozoa, and (iii) all the other phylogenetic groups are significantly different from one another, increasing with complexity.
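The group comparison above rests on the one-way ANOVA F statistic, the ratio of between-group to within-group variance. A minimal sketch of that statistic is given below, assuming each phylogenetic group is represented simply as a list of proteome sizes; the function name and the toy values are illustrative. Converting F to a P-value requires the F distribution and is left out here.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two groups and more values than groups")
    means = [sum(g) / len(g) for g in groups]
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the individual values around their own group mean.
    ss_within = sum(sum((v - m) ** 2 for v in g) for g, m in zip(groups, means))
    df_between = k - 1
    df_within = n_total - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


if __name__ == "__main__":
    # Invented toy values standing in for proteome sizes of three groups.
    toy_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
    print(one_way_anova_f(toy_groups))
```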
Overall, there is a correlation between the two measures only if we consider all species (Spearman correlation coefficient r = 0.6545, P-value < 0.0001). For eukaryotes only, there is no significant correlation between biological complexity and proteome disorder (r = 0.1284, P-value = 0.3594; there is also no correlation for eukaryotes even when we exclude plants, r = 0.2449, P-value = 0.1329), whether we consider individual species or combine them into the same large evolutionary clades as in Additional file 2: protozoa, fungi, plants, protostomes and deuterostomes (Figure 2b). Within bacteria, the overall disorder is usually low, in the range of 2 to 10%, with some exceptions, such as Mycobacterium tuberculosis, Myxococcus xanthus and Streptomyces coelicolor. The high disorder of M. tuberculosis may be a result of its high GC content and preference towards the amino acids Ala, Gly, Pro, Arg and Trp, and/or the result of the lifestyle of the bacterium. The disorder of protozoa and fungi shows large variations between 9% and 30%, spanning a range that covers disorder in all higher eukaryotes. In conclusion, these data show a significant difference only between the intrinsic disorder of prokaryotes (complexity = 1) and that of eukaryotes, but no correlation with the complexity of large eukaryotic groups (for eukaryotes, complexity ranges from 3 to 169).
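The Spearman coefficient used here is the Pearson correlation of the ranks of the two variables, with tied values sharing their average rank. The sketch below shows that calculation in plain Python; the function names are illustrative, and the P-values quoted in the text would need an additional significance test that is not shown.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(xs) != len(ys) or len(xs) < 3:
        raise ValueError("need at least three paired observations")
    return pearson(average_ranks(xs), average_ranks(ys))
```

Because only ranks enter the calculation, the coefficient is insensitive to the wide numerical spread of complexity values (1 for bacteria, up to 169 here).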
Next, we examined disorder in those groups of proteins where the evolutionary expansion of Structural Classification of Proteins (SCOP) domains correlated  with complexity (Pearson correlation coefficient > 0.8) and compared them to those that did not (correlation coefficient < 0.2), and also to the total sets of proteins that contained at least one SCOP domain of any kind. Average structural disorder was calculated for these three groups of proteins in every selected prokaryotic and eukaryotic proteome (Figure 2c; Additional file 4).
For eukaryotes, the average disorder of proteins with expanding type of SCOP domains (Figure 2c, red circles) is always higher than for proteins with domains of the non-expanding type (green squares). For prokaryotes, the number of expanding domains is too low to make a statistically meaningful comparison. The disorder of proteins that contain any-type SCOP domains was usually between these two values (Figure 2c, blue triangles). Because in eukaryotic genomes protein families with expanding SCOP domains are mostly involved in regulation or extracellular processes , the expansion of these domains is likely to have contributed to the emergence of new cell types and complex intercellular communication.
Regarding interspecies differences, we found again that apart from a significant jump in disorder between prokaryotes and eukaryotes, neither the expanding nor the non-expanding SCOP domain-containing proteins showed a significant correlation between complexity and disorder among the eukaryotic species. However, for the set of the any-type SCOP domain-containing proteins (blue triangles in Figure 2c) we did find a significant correlation between complexity and protein disorder (Spearman correlation r = 0.6597, P < 0.0001) among the eukaryotic species. This correlation may be caused by a lengthening of structural domain-containing proteins during evolution (Additional file 4) in concert with increased overall disorder, perhaps due to longer linker regions that might also increasingly serve as new disordered binding sites (see next section).
Intrinsically disordered proteins often function via specific binding to other proteins, DNA or RNA. The underlying process of coupled folding and binding has several advantages, such as the ability to bind multiple partners or the uncoupling of specificity from binding strength, allowing weak but specific interactions. These features may also be directly linked with complexity, because complexity may intuitively be related to the number of interactions proteins make, that is, the size of the interactome. This issue can be addressed in two different ways.
Because of these uncertainties, we used predictions to get an unbiased estimate of the binding capacity of the different proteomes. We used ANCHOR, a method developed for the prediction of binding regions embedded in disordered regions. We carried out ANCHOR predictions for our selected species, and expressed the results as the mean and total number of disordered binding sites (Additional file 3).
Alternative splicing increases proteome size and the complexity of the transcriptome without an increase in genome size by generating multiple mRNA products from a single gene. By conservative estimations, at least 40 to 60% of human and other mammalian genes undergo alternative splicing [28, 29, 30], with more recent studies putting the number even higher [31, 32]. Less complex organisms have less alternative splicing, fission yeast having only a handful of known cases . To put these inferences on quantitative ground, we studied the extent of alternative splicing among animals with distinct complexity using the ASAP II database .
Complexity of tissues/organs
Alternative splicing, structural disorder and PPIs also tend to vary considerably among different tissues of the same organism, presumably reflecting the complexity of the tissue . Comparing tissues instead of different organisms also has the advantage of more coherent data, not so dependent on the number of studies an organism has been subjected to.
When we plot the two attributes (disorder and interacting partners) against each other (Additional file 5a) and fit a power function onto the values, we find that testis, ovary and liver show the highest deviations from the fitted curve: liver has more interactions than expected, which is explicable by the high number of enzymes it has, most of them known to be entirely globular. Testis and ovary on the other hand had more disorder than expected from the curve, probably due to the appearance of new, 100% disordered protein families, and also to the decrease in the number of PPIs taking place among germline-specific proteins .
Figure 6b shows the total number of binding sites predicted by ANCHOR and also the total number of PPIs recorded in STRING as a function of human tissue complexity. The fitted lines were calculated by linear regression; because the scale on the y-axis is logarithmic, they appear curved. Correlations between tissue complexity and the total number of PPIs, and between complexity and the total number of predicted binding sites, are both significant (r = 0.6978, P-value = 0.0038 and r = 0.7839, P-value = 0.0005, respectively). Plotting the predicted binding sites against the observed PPIs in STRING (Additional file 5b) results in a close to quadratic relationship (y = 74.46 × x^0.519), with a Spearman correlation value of r = 0.9412.
As there is a close link between alternative splicing and structural disorder , it is also of interest to see how this relationship plays out in tissues of different complexity. The median disorder of human proteins specifically expressed in different tissues (as annotated in Swissprot) is shown in Figure 6c as a function of the ratio of the alternatively spliced proteins for each tissue. The disorder of the splice variants is also shown for each tissue. Although both relations are noisy (r = 0.5912, P-value = 0.0159 for correlation between the alternative splicing ratio and disorder of the main isoforms; r = 0.6294, P-value = 0.009 between alternative splicing ratio and disorder of the alternative isoforms), they both implicitly reflect a positive correlation between complexity and both alternative splicing and structural disorder in the different tissues.
As both organismal complexity and structural disorder increase significantly from prokaryotes to eukaryotes, it is reasonable to assume that the two are related across a wider scope of evolution, especially if complexity (measured by the number of distinct cell types an organism has) cannot be related even to such a basic measure as the number of protein-coding genes of an organism, as reflected in the G-value paradox.
However, a simple comparison of proteome sizes measured in amino acids made it clear that if we treat plants as a separate group, complexity does correlate with the information content of the proteome (Figure 1) and less closely but still significantly even with gene number (Additional file 1). This finding weakens the G-value paradox, which holds up only when we include plants, which diverge considerably from the general trend between proteome size and complexity for metazoans.
On the other hand, there is no correlation between complexity and disorder beyond the already known significant increase between prokaryotes and eukaryotes, despite the critical roles disorder plays in PPIs [12, 13, 14], in the assembly of large protein complexes, and in the multiple activities of proteins. To test this, we correlated structural disorder with complexity in 76 species in a statistically rigorous manner. Whether we look at the disorder of individual proteomes (Figure 2a) or large evolutionary groups (clades; Figure 2b), it becomes clear that apart from the significant increase between bacteria and eukaryotes, there is no further systematic increase in disorder in the latter. This probably follows from the wide roles protein disorder plays even in primitive organisms, reflecting their lifestyle and adaptation to environmental factors just as much as functional density and other evolution-related features.
While not on a proteomic level, disorder does increase over evolution in structural domain-containing proteins and correlates significantly with complexity (Figure 2c, blue triangles). This is somewhat paradoxical as structural domains are globular, but can be explained by the presence of inter-domain linker regions, which are usually disordered, and perhaps by other functional aspects associated with disorder, such as signaling and protein-binding capability. A further link between complexity and disorder is apparent in the higher disorder of proteins containing an expanding-type domain (Figure 2c, red circles) when compared to proteins containing any-type or non-expanding domains.
Of other protein features, significant correlation could be observed with predicted binding regions of proteins and also the observed number of alternative splice variants (Figures 4 and 5). Both these features suggest that organism complexity increases with increasing functional complexity of gene products, because they enable one gene to bind more partners and/or translate into several protein products with potentially different functions.
We also investigated a potential link between structural disorder and complexity in different human tissues. With regard to ranking tissues, the BRENDA compilation of cell types was our major source of information [36, 37]. Using their extensive listings of cell types found in different tissues, we could establish a three-way relationship between tissue complexity, PPIs and disorder (Figure 6a; Additional file 5a).
Similarly, we could also establish a three-way relationship between tissue complexity, the total number of recorded PPIs and the total number of predicted disordered binding sites (Figure 6b; Additional file 5b). As the latter follows closely a quadratic relationship (Additional file 5b) with a high Spearman correlation (r = 0.9412), this also means that twice as many binding sites result in about four times more PPIs; that is, protein binding tends to be promiscuous rather than one-on-one, at least for disordered binding sites that ANCHOR can predict.
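The "twice as many binding sites, about four times more PPIs" statement can be checked against the fitted curve reported for Additional file 5b. The short derivation below assumes that y denotes the number of predicted binding sites and x the number of recorded PPIs, which is how the plot is described but is not stated explicitly as an equation in the text.

```latex
y = 74.46\,x^{0.519}
\quad\Longrightarrow\quad
x = \left(\frac{y}{74.46}\right)^{1/0.519}
  \approx \left(\frac{y}{74.46}\right)^{1.93},
\qquad
\frac{x(2y)}{x(y)} = 2^{1/0.519} \approx 3.8 .
```

Under this reading, doubling the number of binding sites multiplies the expected number of PPIs by roughly 3.8, which is the near-quadratic behaviour referred to above.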
We found that the G-value paradox, at least within the scope of the organisms we studied here, could not withstand scrutiny and remained valid only when we grouped metazoans and plants together, as the gene number, and especially the total number of amino acids in all proteins, tends to increase with the complexity of the organism. It has also become clear that complexity is a multi-parametric trait that has many components at the protein level. We conclude that proteome size, structural disorder, alternative splicing and protein binding capacity all contribute to it, albeit to various extents, providing a finely tuned network that enables an organism to carry out its functions properly on all levels of its complexity.
Materials and methods
Out of almost 2,000 species with sequenced genomes (approximately 150 eukaryotes; from the Genome OnLine Database), we selected 76 for this study for which the measure of complexity is reported in the literature [1, 2, 3, 4]. Among these, 23 bacteria were selected to cover the full range of gene numbers, from 450 to 8,000; 53 eukaryotes were selected as described previously, complemented with fully sequenced plant and other eukaryotic species to try to cover the full range of complexity, with gene numbers ranging from 3,800 to 92,000. The species and their number of different cell types (that is, complexity) are listed in Additional file 3. Sequences of prokaryotes were downloaded from the Expasy Proteomics Server. The proteomes of eukaryotes were downloaded from Ensembl and the National Center for Biotechnology Information. If several splicing variants were present, we selected the longest transcript, except for the alternative splicing studies, where we considered them all.
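The isoform selection and the proteome size measure described above can be sketched as follows. The input format, an iterable of (gene identifier, protein sequence) pairs, and the function names are assumptions made for illustration; the original analysis was written in Perl and its data structures are not described.

```python
def longest_isoforms(records):
    """Keep the longest protein sequence for each gene.

    `records` is an iterable of (gene_id, protein_sequence) pairs
    (a hypothetical input format). On ties, the first isoform seen is kept.
    """
    best = {}
    for gene_id, sequence in records:
        if gene_id not in best or len(sequence) > len(best[gene_id]):
            best[gene_id] = sequence
    return best


def proteome_size(records):
    """Total number of amino acids over the longest isoform of every gene."""
    return sum(len(seq) for seq in longest_isoforms(records).values())


if __name__ == "__main__":
    # Invented toy records: two isoforms of gene g1 and one of gene g2.
    toy = [("g1", "MKT"), ("g1", "MKTAYI"), ("g2", "MA")]
    print(longest_isoforms(toy))
    print(proteome_size(toy))
```

For the alternative splicing analyses, where all variants are considered, the total would instead be summed over every record.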
Prediction of structural disorder
Structural disorder of all proteins in a proteome was predicted with the IUPred algorithm , available at . A residue was classified as locally disordered if its score was above the threshold of 0.5, and disorder of a protein was taken as the percentage of such residues in the protein. The average disorder of whole proteomes was calculated as the mean of the percentage of disordered residues of the proteins.
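The two definitions above translate into a short calculation once per-residue scores are available. The sketch below assumes the predictor output has already been parsed into one list of scores per protein; it does not call IUPred itself, and the function names are illustrative.

```python
def percent_disordered(scores, threshold=0.5):
    """Percentage of residues whose score is strictly above the threshold."""
    if not scores:
        raise ValueError("protein has no residue scores")
    disordered = sum(1 for s in scores if s > threshold)
    return 100.0 * disordered / len(scores)


def proteome_disorder(per_protein_scores, threshold=0.5):
    """Mean of the per-protein disorder percentages.

    Every protein counts equally, regardless of its length.
    """
    percentages = [percent_disordered(s, threshold) for s in per_protein_scores]
    if not percentages:
        raise ValueError("proteome is empty")
    return sum(percentages) / len(percentages)
```

Because the proteome value is the mean of per-protein percentages, it generally differs from the fraction of disordered residues pooled over the whole proteome: a two-residue protein that is fully disordered and a four-residue protein with one disordered residue average to 62.5%, whereas the pooled fraction is 3 of 6 residues.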
Analysis of disorder of proteins with SCOP domains
The entire SCOP domain database was downloaded from  (release 1.75 ), and domain sequences were downloaded from the ASTRAL database [45, 46]. We selected all domains belonging to superfamilies the expansion of which showed either good (Pearson correlation coefficient R ≥ 0.8) or poor (R ≤ 0.2) correlation with biological complexity, respectively, as reported in . This resulted in three sets of protein sequences: expanding (containing a SCOP domain showing good correlation, N = 19,326), non-expanding (with SCOP domains showing poor correlation, N = 32,990) and all (containing any type of SCOP domains, N = 110,800). We ran a BlastP search with sequences of SCOP domains in the three sets against studied prokaryotic and eukaryotic proteomes. Then, we predicted structural disorder of proteins with sequence identity to a SCOP domain above 50%.
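The split of superfamilies into expanding and non-expanding sets is a simple thresholding step on the reported correlation coefficients. A minimal sketch follows, assuming the coefficients are available as a mapping from superfamily identifier to Pearson R; the identifiers in the example are placeholders, not real SCOP entries.

```python
def classify_superfamilies(correlations, high=0.8, low=0.2):
    """Split superfamilies by their correlation with biological complexity.

    `correlations` maps a superfamily identifier to its Pearson R.
    Returns (expanding, non_expanding) as sorted lists; superfamilies with
    low < R < high belong to neither set.
    """
    expanding = sorted(sf for sf, r in correlations.items() if r >= high)
    non_expanding = sorted(sf for sf, r in correlations.items() if r <= low)
    return expanding, non_expanding


if __name__ == "__main__":
    # Placeholder identifiers and invented coefficients.
    toy = {"sf_A": 0.91, "sf_B": 0.15, "sf_C": 0.55, "sf_D": 0.80}
    print(classify_superfamilies(toy))
```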
Protein-protein interactions in different proteomes
We extracted data on binary PPIs in four prokaryotic (Mycoplasma genitalium, Neisseria meningitidis, Escherichia coli, Streptomyces coelicolor) and 12 eukaryotic proteomes (Dictyostelium discoideum, Schizosaccharomyces pombe, Saccharomyces cerevisiae, Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster, Ciona intestinalis, Danio rerio, Xenopus tropicalis, Gallus gallus, Mus musculus and Homo sapiens) from the STRING database [47, 48].
Prediction of binding regions of proteins
Protein binding regions in disordered proteins/segments were predicted by the ANCHOR algorithm . A binding site was predicted if there were at least three consecutive amino acids with a score above 0.5. Two adjacent binding sites were accepted as independent if they were separated by at least three residues with scores below 0.5. From the pattern of protein binding regions various measures were calculated, such as the average percentage of binding site residues in the proteome, the average number of binding sites per protein, and the total number of binding sites in the whole proteome.
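The site-calling rules above leave some room for interpretation, so the following sketch states its reading explicitly: runs of at least three consecutive residues scoring above 0.5 are taken as candidate sites, and two neighbouring candidates are merged into one site when fewer than three below-threshold residues lie between them. This order of operations is an assumption, and the function names are illustrative; the input is a list of per-residue scores that has already been parsed from the predictor output.

```python
def find_binding_sites(scores, threshold=0.5, min_length=3, min_gap=3):
    """Return binding sites as (start, end) pairs, zero-based and inclusive."""
    # Step 1: collect runs of consecutive residues scoring above the threshold.
    runs = []
    start = None
    for i, score in enumerate(scores):
        if score > threshold:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(scores) - 1))

    # Step 2: keep only runs that are long enough to count as a site.
    candidates = [(a, b) for a, b in runs if b - a + 1 >= min_length]

    # Step 3: merge neighbours that are not separated by enough
    # below-threshold residues to be considered independent.
    sites = []
    for a, b in candidates:
        if sites:
            previous_end = sites[-1][1]
            gap = sum(1 for s in scores[previous_end + 1:a] if s <= threshold)
            if gap < min_gap:
                sites[-1] = (sites[-1][0], b)
            else:
                sites.append((a, b))
        else:
            sites.append((a, b))
    return sites


def binding_site_summary(proteome_scores):
    """Return (total sites, mean sites per protein) for a list of proteins."""
    counts = [len(find_binding_sites(s)) for s in proteome_scores]
    return sum(counts), sum(counts) / len(counts)
```

The total and the per-protein mean are kept separate because, as reported above, only the total number of predicted binding sites correlates with organism complexity.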
Determining the complexity of human tissues
The complexity of human tissues was determined by extracting cell type information for each tissue compiled by BRENDA  for the OBO Foundry , a coordinated effort to develop open biological and biomedical ontologies.
Alternative splicing in different species and human tissues
We extracted alternative splicing information from the ASAP II database [9, 49], which contains alternative splicing data for 15 animal species. For our studies we used splicing information for Caenorhabditis elegans, Anopheles gambiae, Drosophila melanogaster, Danio rerio, Xenopus tropicalis, Gallus gallus, Bos taurus, Rattus norvegicus, Mus musculus, and Homo sapiens. Various measures were calculated for each proteome, such as the percentage of alternatively spliced multi-exon genes and the average number of alternative splice events per protein. For human tissues we also extracted the available alternative splicing information from the Swissprot subset of Uniprot . In addition, we extracted tissue-specificity information from the comment lines of the annotation starting with "CC" in Uniprot, wherever this information was available.
Statistical analysis and programming
To calculate standard deviation values of intrinsic disorder and protein binding regions, random sampling was used. We selected random subsets of 200 to 500 members, depending on the proteome size of the original dataset, and calculated the median and/or mean of disorder or number of binding regions. We repeated the selection 500 to 1,000 times, and the standard deviation of the mean was calculated. The significance of differences between selected groups was assessed by the nonparametric Mann-Whitney test. For correlation analysis we used a nonparametric (Spearman) test and chose two-tailed P-values, except in Figure 1, where we assumed a Gaussian distribution for proteome size and calculated the Pearson correlation coefficients. All programs were written in Perl. The IUPred and ANCHOR software were obtained from the authors and were compiled and executed locally.
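The resampling procedure can be sketched as below. The original programs were written in Perl; this Python version is only an illustration, and it assumes that each subset is drawn without replacement, which the text does not specify. The function name and the `seed` parameter are illustrative additions.

```python
import random
import statistics


def subsample_sd_of_mean(values, subset_size, repeats, seed=None):
    """Standard deviation of the means of repeated random subsets.

    Each subset is drawn without replacement (an assumption); `seed`
    makes the result reproducible.
    """
    if subset_size > len(values):
        raise ValueError("subset_size exceeds the number of values")
    if repeats < 2:
        raise ValueError("need at least two repeats to estimate a deviation")
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.sample(values, subset_size))
        for _ in range(repeats)
    ]
    return statistics.stdev(means)


if __name__ == "__main__":
    # Invented toy data standing in for per-protein disorder percentages.
    toy_rng = random.Random(0)
    toy_disorder = [toy_rng.uniform(0.0, 100.0) for _ in range(5000)]
    print(subsample_sd_of_mean(toy_disorder, subset_size=300, repeats=500, seed=1))
```

Replacing `statistics.fmean` with `statistics.median` would give the corresponding estimate for the median.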
This research was supported by grants NK71582 and PD76286 from the Hungarian Scientific Research Fund (OTKA), a Korean-Hungarian Joint Laboratory grant from Korea Research Council of Fundamental Science and Technology (KRCF), and both an FP7 Marie Curie Initial Training Network grant (no. 264257, IDPbyNMR) and an FP7 Infrastructures grant (no. 261863, BioNMR) from the European Commission.
- 18.Romero PR, Zaidi S, Fang YY, Uversky VN, Radivojac P, Oldfield CJ, Cortese MS, Sickmeier M, LeGall T, Obradovic Z, Dunker AK: Alternative splicing in concert with protein intrinsic disorder enables increased functional diversity in multicellular organisms. Proc Natl Acad Sci USA. 2006, 103: 8390-8395. 10.1073/pnas.0507916103.
- 36.Smith B, Ashburner M, Rosse C, Bard J, Bug W, Ceusters W, Goldberg LJ, Eilbeck K, Ireland A, Mungall CJ, OBI Consortium, Leontis N, Rocca-Serra P, Ruttenberg A, Sansone SA, Scheuermann RH, Shah N, Whetzel PL, Lewis S: The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nat Biotechnol. 2007, 25: 1251-1255. 10.1038/nbt1346.
- 37.Sohngen C, Chang A, Schomburg D: Development of a classification scheme for disease-related enzyme information. BMC Bioinformatics. 12: 329.
- 38.Genomes Online Database. [http://www.genomesonline.org/cgi-bin/GOLD]
- 39.ExPASy Bioinformatics Resource Portal: HAMAP. [http://www.expasy.ch/sprot/hamap/]
- 40.Ensembl. [http://www.ensembl.org/pub/release-50/]
- 41.National Center for Biotechnology Information. [http://www.ncbi.nlm.nih.gov/]
- 42.Prediction of Intrinsically Unstructured Proteins. [http://iupred.enzim.hu/]
- 43.Structural Classification of Proteins. [http://scop.mrc-lmb.cam.ac.uk/scop/]
- 45.The ASTRAL Compendium for Sequence and Structure Analysis. [http://astral.berkeley.edu]
- 47.Szklarczyk D, Franceschini A, Kuhn M, Simonovic M, Roth A, Minguez P, Doerks T, Stark M, Muller J, Bork P, Jensen LJ, von Mering C: The STRING database in 2011: functional interaction networks of proteins, globally integrated and scored. Nucleic Acids Res. 2011, 39: D561-568. 10.1093/nar/gkq973.
- 48.Search Tool for the Retrieval of Interacting Genes/Proteins (STRING). [http://string-db.org/]
- 49.ASAP II. [http://bioinformatics.ucla.edu/ASAP2]
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.