1 Introduction

1.1 Background of the Present Studies on the Origin of Life

At present, many researchers studying the origin of life believe that life originated from an RNA world formed by RNA self-replication, a hypothesis that Gilbert proposed about 25 years ago (Gilbert, 1986). The RNA world hypothesis is widely accepted for two main reasons. First, it has generally been considered that the acquisition of genetic information, or genes, must have preceded the formation of proteins with catalytic functions, because proteins composed of 20 kinds of amino acids are too complex to be produced without the support of genetic information. Second, it is assumed that genes must first have been selected from a pool of RNA accumulated through RNA self-replication. ScienceDaily (Feb. 24, 2010) quoted Blumenthal as saying, “it now appears that the first catalytic macromolecules could have been RNA molecules, since they are somewhat simpler and RNA molecules were likely to exist early in the formation of the first life forms, and are capable of catalyzing chemical reactions without proteins being present.”

However, the RNA world hypothesis has many weak points (Ikehara, 2005, 2009). (1) It is very difficult to synthesize nucleotides, the building blocks of RNA, prebiotically, since nucleotides are rather complex organic compounds composed of a sugar, a nucleobase, and a phosphate (Fig. 1). In addition, nucleotides are unstable under the conditions of the primitive earth; in fact, nucleotides have not been detected in meteorites. (2) RNA could not have been self-replicated, because of the contradiction between RNA catalysts, which require stable tertiary structures, and RNA templates, which must be unstructured. (3) Even if RNA had been self-replicated, it would have been quite difficult for the self-replicated RNA to carry genetic information, because genes would never be formed by random polymerization of nucleotides. Genes are organized as a linear arrangement of triplet base sequences, or codons, encoding the corresponding amino acids, so RNA randomly polymerized in the RNA world cannot statistically be expected to encode a functional water-soluble globular protein. (4) It is also quite difficult to explain the evolutionary processes of the fundamental life system, composed of genes, the genetic code, and proteins, according to the RNA world hypothesis. The formation of the first genetic code cannot be explained by the hypothesis, since the capability of RNA for self-replication has no bearing on the genetic code or on the triplet codon sequences used for protein synthesis (Ikehara, 2009).

Figure 1. Chemical structures of the amino acids glycine ([G]) and valine ([V]) and of the nucleotides UMP and GMP, which are the simplest and the most complex compounds among the four [GADV]-amino acids and the four ribonucleotides, respectively. Numbers in parentheses indicate the numbers of atoms in the amino acids and nucleotides.

1.2 Properties of [GADV]-Amino Acids

It is well known that amino acids, especially glycine [G], alanine [A], aspartic acid [D], and valine [V], which are encoded by GNC codons, are simple organic compounds (α-amino acids) having a hydrogen atom, a positively charged amino group, a negatively charged carboxyl group, and a side chain on the α-carbon atom (Fig. 1), and that all four are easily synthesized in Miller’s discharge experiments (Higgs, 2009; van der Gulik, 2009). Here, square brackets are used to distinguish the one-letter symbols of amino acids from those of nucleobases. In addition, sufficient amounts of [GADV]-amino acids could have accumulated on the primitive earth, since these amino acids remain stable for a long time under prebiotic heat conditions (Vallentyne, 1964). Moreover, the peptide bond, which is formed between the amino group of one amino acid and the carboxyl group of another, has a planar character. This favors folding of the polypeptide chain into regular secondary structures and subsequently into tertiary structures, a prerequisite for forming a catalytic center on the surface of the tertiary structure.

To solve the difficult problem of the origin of life from a new point of view, it is necessary to elucidate the origin of the genetic code and to introduce one or two new concepts. In addition to our original GNC-SNS primitive genetic code hypothesis (Ikehara et al., 2002), where N denotes any of the four nucleobases (G, C, A, and U (T)) and S denotes guanine (G) or cytosine (C), we introduced two new concepts: the protein 0th-order structure, composed of roughly equal amounts of [GADV]-amino acids, and the pseudo-replication of [GADV]-proteins. As a result, we have proposed the GADV hypothesis on the origin of life (Ikehara, 2005, 2009). In this chapter, I describe and discuss the [GADV]-protein world hypothesis, or GADV hypothesis.

2 The Origin of the Genetic Code

2.1 Genetic Code in the Fundamental Life System

The genetic code occupies a core position in the fundamental life system, relating genetic function to catalytic function (Fig. 2a). Therefore, the genetic code is not merely a correspondence between a triplet base sequence and an amino acid; it is also a key to understanding the formation of the life system, which is composed of genetic function, the genetic code, and catalytic function, and hence the origin of life (Fig. 2b).

Figure 2. The role of genetic code in the fundamental life system. (a) Genetic code connects genetic function with catalytic function. (b) Genetic code correlates triplet codons in genes to amino acids (AA) in proteins.

It is considered that the establishment of the most primitive genetic code triggered the formation of the “chicken–egg relationship” between genes and proteins. However, it seems to me that previous research on the origin of life has ignored the important role, and the origin, of the genetic code.

2.2 GC-NSF(a) Hypothesis on Formation of Entirely New Genes Under the Present Life System

Our work began, by chance, with a study of the formation of entirely new genes, that is, the first ancestor genes of gene families consisting of homologous genes, independently of the origins of the genetic code and of life (Fig. 3). From analyses of microbial genes and proteins obtained from the GenomeNet database, we found that entirely new genes could be produced from nonstop frames on the antisense strands of GC-rich, rather than AT-rich, microbial genes (GC-NSF(a)) (Ikehara et al., 1996) (Fig. 4).

Figure 3. Large ellipsoids indicate gene/protein families. Open and gray circles and black dots represent the first ancestor genes/proteins of the families, ancestral genes/proteins, and progeny genes/proteins originated from the first ancestor genes/proteins, respectively.

Figure 4. GC-NSF(a) hypothesis on the formation of entirely new genes, suggesting that entirely new genes were created from base sequences on GC-rich nonstop frames on the antisense strands (GC-NSF(a)). Thick black lines and dotted lines show sense sequences encoding functional proteins and their antisense sequences, respectively.
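The notion of a nonstop frame on the antisense strand can be made concrete with a short sketch. The code below is only an illustration under stated assumptions: the helper names (`antisense`, `nonstop_frames`) and the toy sequence are hypothetical and are not taken from the original analyses, and a plain DNA alphabet (A, C, G, T) with the three standard stop codons is assumed.

```python
# Illustrative sketch only: helper names and the toy sequence are hypothetical.
STOP_CODONS = {"TAA", "TAG", "TGA"}          # standard stop codons (DNA form)
COMPLEMENT = str.maketrans("ACGT", "TGCA")   # Watson-Crick base pairing

def antisense(sense: str) -> str:
    """Return the antisense strand (reverse complement), read 5'->3'."""
    return sense.upper().translate(COMPLEMENT)[::-1]

def nonstop_frames(strand: str) -> list[int]:
    """Return the reading-frame offsets (0, 1, 2) that contain no stop codon."""
    frames = []
    for offset in range(3):
        # Walk the strand in complete triplets starting at this offset.
        codons = (strand[i:i + 3] for i in range(offset, len(strand) - 2, 3))
        if not any(codon in STOP_CODONS for codon in codons):
            frames.append(offset)
    return frames

sense = "ATGGCCGACGGCGTCGCC"   # toy GC-rich sequence, not from any genome
anti = antisense(sense)
print(anti)
print(nonstop_frames(anti))
```

A frame reported by `nonstop_frames` is merely free of stop codons; whether the encoded polypeptide could fold is a separate question, addressed in the text by the six conditions.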

This conclusion was based mainly on the fact that hypothetical proteins encoded by GC-NSF(a)s satisfy well the six conditions (hydropathy; α-helix, β-sheet, and turn/coil formabilities; and acidic and basic amino acid compositions) for folding of polypeptide chains into water-soluble globular structures (Ikehara et al., 1996).

Each of the six conditions was defined as the average value plus or minus the standard deviation of the corresponding quantity. These were calculated from the amino acid structural indices described in Stryer’s textbook (Berg et al., 2002) and the amino acid compositions of extant proteins encoded by seven microbial genomes with different GC contents (Table 1) (Ikehara et al., 2002). We are convinced that the six conditions can be used to judge the folding ability of a polypeptide chain because the values for most proteins stay at nearly constant levels regardless of GC content, even though varying GC content causes wide variation in the contents of about half of the 20 natural amino acids. A sufficiently small probability of stop codon appearance also favors the production of nonstop frames on GC-NSF(a)s at a high probability (Ikehara et al., 2002).
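The effect of GC content on stop codon appearance can be estimated with a simple calculation. The sketch below is a simplified model, not the analysis used in the cited work: it assumes that bases occur independently, that A and T are equally frequent, and that G and C are equally frequent. The function names are hypothetical.

```python
# Simplified model: independent bases, freq(A) = freq(T), freq(G) = freq(C).
def stop_codon_probability(gc: float) -> float:
    """Chance that a random triplet is TAA, TAG, or TGA at GC fraction `gc`."""
    at_base = (1.0 - gc) / 2.0   # frequency of A (and of T)
    gc_base = gc / 2.0           # frequency of G (and of C)
    taa = at_base * at_base * at_base
    tag = at_base * at_base * gc_base
    tga = at_base * gc_base * at_base
    return taa + tag + tga

def nonstop_probability(gc: float, codons: int) -> float:
    """Chance that a frame of `codons` random triplets has no stop codon."""
    return (1.0 - stop_codon_probability(gc)) ** codons

for gc in (0.3, 0.5, 0.7):
    print(gc, stop_codon_probability(gc), nonstop_probability(gc, 100))
```

Under these assumptions all three stop codons are AT-rich, so the per-codon stop probability falls as GC content rises and long nonstop frames become correspondingly more likely.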

Table 1. Mean values and their standard deviations (SD) of hydropathy, secondary structure formabilities (α-helix, β-sheet, and turn/coil formabilities), and acidic and basic amino acid contents, which were obtained by analyses of extant proteins from seven microorganisms.

2.3 GNC-SNS Primitive Genetic Code Hypothesis

Why do proteins encoded by GC-NSF(a)s, but not by AT-NSF(a)s, satisfy the six conditions well? One reason is that the base sequences on both strands of GC-rich genes (GC content around 60–70%) are rather symmetrical and quite similar to SNS repeating sequences. This result suggests that SNS repeat sequences might hold a strong potential to function as genes. We have previously confirmed by computer analysis that proteins composed of only the 10 amino acids encoded by the 16 SNS codons satisfy the six conditions for folding polypeptide chains into water-soluble globular structures (Ikehara et al., 2002).

Further, we looked for a minimum set of amino acids able to form water-soluble globular proteins under four conditions (hydropathy and the capabilities of forming three secondary structures: α-helix, β-sheet, and turn/coil), since negative and positive charges could be compensated by divalent positive metal ions and divalent negative ions, respectively. It was found that [GADV]-proteins encoded by the GNC code satisfied the four conditions when they contained roughly equal amounts of the [GADV]-amino acids (Fig. 5a) (Ikehara et al., 2002). In contrast, every other set of four amino acids encoded by four codons in a row or column of the universal genetic code table failed at least one of the four conditions, except for the GNG code, a slightly modified form of the GNC code (Ikehara et al., 2002). The results of this search indicate that random synthesis in a pool of the four [GADV]-amino acids could produce water-soluble globular proteins basically comparable to contemporary proteins in their ability to form secondary and tertiary structures.
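The codon sets named above (SNS, GNC, and GNG) can be enumerated directly from the standard genetic code. The following sketch assumes the standard code table written in UCAG order; the names `codons_for` and `amino_acids_for` are hypothetical helpers introduced only for illustration.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code in UCAG order (UUU, UUC, UUA, UUG, UCU, ...);
# '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Degenerate symbols used in the text: N = any base, S = G or C.
DEGENERATE = {"N": "UCAG", "S": "GC", "G": "G", "C": "C", "A": "A", "U": "U"}

def codons_for(pattern: str) -> list[str]:
    """Expand a three-letter pattern such as 'SNS' into concrete codons."""
    return ["".join(c) for c in product(*(DEGENERATE[s] for s in pattern))]

def amino_acids_for(pattern: str) -> set[str]:
    """One-letter symbols of the amino acids encoded by a codon pattern."""
    return {CODON_TABLE[c] for c in codons_for(pattern)}

for pattern in ("SNS", "GNC", "GNG"):
    encoded = "".join(sorted(amino_acids_for(pattern)))
    print(pattern, len(codons_for(pattern)), "codons ->", encoded)
```

The enumeration confirms the counts quoted in the text: 16 SNS codons encoding 10 amino acids with no stop codon, and four GNC codons encoding exactly Gly, Ala, Asp, and Val. It says nothing about folding, which depends on the structural indices of the encoded amino acids.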

Figure 5. (a) Amino acid compositions of hypothetical [GADV]-proteins satisfying the four conditions for folding of a polypeptide chain into water-soluble globular structure. (b) The origin and evolutionary process of the genetic code (GC).

Table 2 indicates that the [GADV]-amino acids have the characteristics needed to fold polypeptide chains composed of them into water-soluble globular proteins. That is, the four properties are divided among the four [GADV]-amino acids: Gly, Ala, and Val are turn/coil-, α-helix-, and β-sheet-forming amino acids, respectively, and Asp and Val are hydrophilic and hydrophobic amino acids, respectively (Table 2). Therefore, it is considered that these amino acids were selected for primitive protein synthesis, on the basis of their distinctive characters, from a pool of amino acids that accumulated easily on the primitive earth. We thus concluded that the genetic code originated from the GNC code and evolved into the universal genetic code through the SNS code (Fig. 5b).

Table 2. Structural indices (hydropathy and α-helix, β-sheet, and turn/coil propensities) of [GADV]-amino acids used to calculate the formability of water-soluble globular proteins. The indices were obtained from Stryer’s textbook “Biochemistry” (Berg et al., 2002).

2-Aminobutyric acid (2-ABA), which has an ethyl group as its side chain, could have accumulated on the primitive earth as the [GADV]-amino acids did, but it was not adopted as a natural amino acid. Because 2-ABA lacks a branched chain or a bulky group such as a benzene ring on the β-carbon atom, it is an α-helix-forming amino acid like Ala, Leu, Met, and others (Berg et al., 2002). Ala, rather than 2-ABA, was therefore used for primitive protein synthesis, since Ala, with its methyl group, has a simpler structure than 2-ABA.

2.4 Recent Studies on the Origin of the Genetic Code

Recently, van der Gulik et al. (2009) reported that the early functional peptides could have been short (3–8 amino acids long) and made of [GADV]-amino acids, because these amino acids are abundantly produced in many prebiotic synthesis experiments and are observed in meteorites, and because the negative charge of Asp can be neutralized by metal ions. Furthermore, they conjectured that the abundance of prebiotic [GADV]-oligopeptides is tightly connected with the appearance of the genetic code.

In parallel, Higgs and Pudritz (2009) reached a similar conclusion, suggesting that the genetic code originated from [GADV]-amino acids, based on the ranking of amino acid amounts synthesized in Miller’s atmospheric discharge experiments and in realistic simulations of hydrothermal vents, and of those detected in meteorites.

Di Giulio (2008) has also reported a similar conclusion, based on an extension of the coevolution theory that attributes a crucial role to the first amino acids. He suggested that the first amino acids to evolve along the biosynthetic pathways are predominantly the [GADV]-amino acids and Glu, which are codified by codons of the type GNN, and found this observation to be statistically significant. The close biosynthetic relationships between the sibling amino acids Ala-Ser, Ser-Gly, Asp-Glu, and Ala-Val are therefore not randomly distributed in the genetic code table.

Biro (2009) has also reached a similar conclusion on the origin of the genetic code, based on his Proteomic Code theory, which determines how individual amino acids interact with each other during folding and in specific protein–protein interactions. He described in his book that the GNC-SNS primitive genetic code hypothesis, which we proposed, is similar to the Proteomic Code and that the two concepts agree with each other.

As described above, van der Gulik, Higgs, Di Giulio, and Biro have reached conclusions similar to the GNC-SNS primitive genetic code hypothesis, suggesting that the first proteins must have been composed of only the four [GADV]-amino acids, although they did not discuss the origin of life. The fact that the same conclusion was obtained by several independent investigations indicates that the genetic code probably originated from the first genetic code, GNC, encoding the four [GADV]-amino acids. This conclusion supports the GADV hypothesis on the origin of life.

Ser is synthesized as easily as Asp and Val among the [GADV]-amino acids in Miller-type discharge experiments using a primitive atmosphere, as shown in Table 1 of Higgs and Pudritz (2009). However, it is supposed that Ser was not used for protein synthesis through the genetic code from the beginning of the emergence of life, and that its use was delayed, because Ser is much more unstable than Asp and Val under the heat conditions of the primitive earth owing to the hydroxyl group on its side chain. Therefore, a sufficient amount of Ser for primitive protein synthesis could not have accumulated on the primitive earth (Vallentyne, 1964).

Similarly, nucleotides, which have hydroxyl groups on the ribose ring, would not have accumulated on the primitive earth because of their instability under the conditions there. For this reason too, the RNA world could not have been realized on the primitive earth, since sufficient amounts of RNA, which are necessary for formation of the RNA world, could not have been synthesized prebiotically. We therefore conclude that it is impossible to explain the emergence of life according to the RNA world hypothesis.

Some amino acids used for protein synthesis in extant organisms have actually been detected in meteorites. The kinds and ratios of amino acids detected in meteorites are rather similar to those observed in Miller’s atmospheric discharge experiments and in realistic simulations of hydrothermal vents (Higgs and Pudritz, 2009). This indicates that life was not brought directly to the earth by meteorites and/or comets, and that the amino acids that led to the emergence of life were produced in outer space as well as on the primitive earth.

3 [GADV]-Protein World Hypothesis

3.1 New Concepts Leading to the GADV Hypothesis

It is generally considered that the amino acid sequence, the primary structure of a protein encoded by one-dimensional genetic information on DNA or RNA, is always the most important factor when folding of a polypeptide chain into a three-dimensional structure is discussed (Berg et al., 2002). As a result, many researchers have not recognized that amino acid composition is also important for folding of a polypeptide chain; the composition is regarded simply as something obtained after hydrolysis of a polypeptide chain or protein. However, a particular amino acid composition containing roughly equal amounts of the [GADV]-amino acids, which satisfies the four conditions (hydropathy and α-helix, β-sheet, and turn formabilities) derived from the amino acid compositions of extant proteins and the structural indices of the corresponding amino acids, is important for the effective production of functional proteins through random processes before formation of the first genetic information. The reasons are as follows. Since the structure formability of one protein is the same as that of any other protein randomly assembled with the same amino acid composition, every polypeptide chain randomly synthesized from amino acids in that composition could fold into a water-soluble globular structure similar to those of presently existing proteins. The polypeptide chains would, however, fold into different structures, not the same one, because the proteins share an amino acid composition but differ from each other in amino acid sequence. We have named such a specific amino acid composition, favorable for formation of water-soluble globular protein structures, the protein 0th-order structure (Ikehara, 2009).

An amino acid composition of roughly equal amounts of the [GADV]-amino acids is one such protein 0th-order structure. This indicates that various kinds of water-soluble globular [GADV]-proteins could be created by random polymerization of [GADV]-amino acids even in the absence of any genetic function, that is, before creation of the first gene. This is because the individual [GADV]-amino acids are functional units for protein structure formation, and a composition of roughly equal amounts of them satisfies the four conditions for formation of water-soluble globular proteins (Ikehara et al., 2002). The second new concept, which we call “pseudo-replication of [GADV]-proteins,” arose from consideration of the first concept, the protein 0th-order structure (Ikehara, 2009).
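The distinction between composition and sequence can be illustrated numerically. The sketch below draws random chains from an equimolar pool of the four [GADV]-amino acids; it is a toy sampling exercise under the assumption of independent, equally likely residues, not a model of prebiotic chemistry or of folding. The function names, chain length, and seed are arbitrary choices made for illustration.

```python
import random
from collections import Counter

GADV = "GADV"   # one-letter symbols for Gly, Ala, Asp, Val

def random_gadv_chain(length: int, rng: random.Random) -> str:
    """Draw each residue independently from an equimolar [GADV] pool."""
    return "".join(rng.choice(GADV) for _ in range(length))

def composition(chain: str) -> dict[str, float]:
    """Fraction of each [GADV]-amino acid in the chain."""
    counts = Counter(chain)
    return {aa: counts[aa] / len(chain) for aa in GADV}

rng = random.Random(0)   # fixed seed so the run is repeatable
chains = [random_gadv_chain(100, rng) for _ in range(5)]
for chain in chains:
    fractions = {aa: round(f, 2) for aa, f in composition(chain).items()}
    print(chain[:20] + "...", fractions)
```

Each chain has a different sequence, yet all have compositions close to one quarter of each amino acid, which is the sense in which randomly assembled chains share a 0th-order structure while differing in primary structure.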

We arrived at the GADV hypothesis on the origin of life on the basis of the above two new concepts, “[GADV]-protein 0th-order structure” and “pseudo-replication of [GADV]-proteins.” The GADV hypothesis suggests that life originated from the [GADV]-protein world, which was formed by pseudo-replication of [GADV]-proteins in a pool composed of roughly equal amounts of [GADV]-amino acids (Ikehara, 2009). Another important point for solving the riddle of the origin of life is that as few as four kinds of amino acids were enough to form water-soluble globular proteins, which is a necessary premise for producing functional enzymes. Such proteins could have exhibited catalytic activities on their surfaces that were weak but significantly higher than the level available without any protein enzymes on the primitive earth. The catalytic functions of these proteins would have opened a new gate for the production of functional proteins in the [GADV]-protein world before the appearance of genetic information. Oligonucleotide and RNA synthesis carried out in the [GADV]-protein world led to the formation of the next stage, the RNA-[GADV]-protein world, in which the first life was created on the primitive earth.

3.2 Possible Steps to the Emergence of Life

Possible evolutionary processes leading to the emergence of life are described below (Ikehara, 2002, 2005; Oba et al., 2005). Sufficient amounts of [GADV]-amino acids were synthesized and accumulated on the primitive earth. Subsequently, [GADV]-proteins were produced, for example, by repeated heat-drying of [GADV]-amino acids in tide pools on the primitive earth (Fig. 6a). [GADV]-proteins were further accumulated by pseudo-replication to form the [GADV]-protein world (Figs. 6b and 7a). Nucleotides and oligonucleotides, as well as [GADV]-amino acids, could be synthesized through the catalytic activities of [GADV]-proteins in the protein world (Fig. 6b). In these cases too, places such as tide pools, where chemical compounds could be concentrated by heat-drying, were necessary for the syntheses. In addition, divalent metal cations would have been used to compensate the negative charges on [GADV]-proteins and to promote the association of [GADV]-proteins with nucleotides for oligonucleotide and RNA synthesis. Judging from the facts that [GADV]-amino acids are encoded by the GNC primeval genetic code and that nucleotides are suited to base pair formation and to transmission of genetic information from parents to progeny, it is considered that the accumulation of GNC-containing oligonucleotides triggered establishment of the first genetic code, the GNC primeval genetic code, through stereospecific interactions between the four [GADV]-amino acids and the corresponding GNC-containing oligonucleotides. The GNC-containing oligonucleotides, or proto-tRNAs, might have been charged with the corresponding [GADV]-amino acids, and the amino acids were then delivered for peptide bond formation by a [GADV]-protein or a proto-rRNA ribozyme. The proto-tRNAs could have played the roles of both extant tRNA and mRNA by binding to each other through base pair formation between two complementary proto-tRNAs.

[GADV]-protein synthesis with proto-tRNA-[GADV]-amino acid complexes, being more efficient than direct synthesis from individual [GADV]-amino acids, could have helped to establish the GNC primeval genetic code (Fig. 6c). In parallel, the translation system could have evolved through peptide bond formation by complexes of proto-rRNA molecules and [GADV]-proteins, which formed primitive small and large ribosomal subunits. Next, GNC-repeating sequences were produced by random phosphodiester bond formation among GNC codons or anticodons in the complexes of GNC-containing oligonucleotides and [GADV]-amino acids. Thus, it is assumed that the first single-stranded (GNC)n gene was created when one (GNC)n sequence encoding a [GADV]-protein with a required function was selected from a pool of diverse (GNC)n polynucleotides, leading to the emergence of the first life. At this point, the [GADV]-protein world ended its role and was transformed into the RNA-protein world, which was formed by RNA replication with [GADV]-protein enzymes (Fig. 7b).
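How a (GNC)n sequence maps onto a [GADV]-protein, and why complementary GNC-containing oligonucleotides pair so naturally, can be shown with a small sketch. The four codon assignments follow the standard genetic code; the helper names and the toy (GNC)4 sequence are hypothetical and serve only as an illustration.

```python
# The four GNC codons and the amino acids they encode in the standard code.
GNC_CODE = {"GGC": "G", "GCC": "A", "GAC": "D", "GUC": "V"}
COMPLEMENT = str.maketrans("ACGU", "UGCA")   # RNA base pairing

def translate_gnc(rna: str) -> str:
    """Translate a (GNC)n RNA; raises KeyError on any non-GNC triplet."""
    codons = [rna[i:i + 3] for i in range(0, len(rna), 3)]
    return "".join(GNC_CODE[codon] for codon in codons)

def antisense(rna: str) -> str:
    """Return the complementary strand, read 5'->3'."""
    return rna.translate(COMPLEMENT)[::-1]

gene = "GUCGCCGACGGC"   # toy (GNC)4 sequence, chosen only for illustration
print(translate_gnc(gene))
print(antisense(gene))
print(translate_gnc(antisense(gene)))
```

Because G pairs with C, the strand complementary to a GNC triplet is again a GNC triplet, so the antisense strand of any (GNC)n sequence is itself a (GNC)n sequence and can be read as a [GADV]-peptide.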

Figure 6. Possible steps from prebiotic synthesis (a) to the emergence of life (c). Roles of [GADV]-P ([GADV]-peptides and/or [GADV]-proteins) world (b) in the emergence of life.

Figure 7. The modern fundamental life system (c), which was composed of DNA, RNA, and protein, evolved from [GADV]-protein world (a), which was formed by pseudo-replication of [GADV]-proteins, through RNA-protein world (b), which was established by RNA replication with proteinous enzymes.

In the course of the evolution of the life system, genetic functions on RNA strands were transferred to base sequences on DNA strands, which were produced by DNA replication (Fig. 7c). The GADV hypothesis can even explain a reasonable process by which the “chicken and egg relationship” between genes and proteins was formed on the primitive earth. That is, the relationship could have been established by going upstream along the genetic flow observed in extant organisms, from the lower end (pseudo-replication of [GADV]-proteins) to the upper end (creation of double-stranded DNA genes) (Ikehara, 2009).

In the RNA world hypothesis, it would be impossible even to imagine a reasonable and concrete strategy for creation of the first gene, because genetic function with distinctive base compositions at the three codon positions would never be produced by random RNA synthesis joining nucleotides one by one. In contrast, in the GADV hypothesis, establishment of the most primitive code, the GNC primeval genetic code, gave a clue to the creation of the first genetic function and led to the proposal of the GADV hypothesis on the origin of life (Fig. 6).

4 Summary and Conclusions

The RNA world hypothesis was proposed only to solve the “chicken–egg dilemma” between genes and proteins, without consideration of the formation processes of the fundamental life system and especially of the origin of the genetic code, which should be the most important point when the origin of life is considered. Therefore, in the RNA world hypothesis, it would be impossible even to imagine a reasonable and concrete strategy for creation of the first gene, that is, of genetic function with distinctive base compositions at the three codon positions. Genetic sequences would never be produced by RNA synthesis joining nucleotides one by one, because genetic sequences are not simple nucleotide sequences but codon sequences. I believe that the formation of the “chicken–egg relationship” between genes and proteins is explained more rationally by the GADV hypothesis than by the RNA world hypothesis, as described above. What I would like to emphasize here is that the RNA world, formed by RNA self-replication, never existed on the evolutionary path from the era of chemical evolution to the emergence of life.