Introduction

Coronaviruses often cause colds and upper respiratory tract infections in humans. The disease caused by a new coronavirus, named COVID-19 by the World Health Organization (WHO) on February 11, 2020 (World Health Organization 2019, 2020), has spread rapidly over a short time. COVID-19 cases have now been reported in many parts of the world (Harapan et al. 2020; Hussin and Byrareddy, 2020). According to the latest data, as of May 28th, 2021, more than 156 million people had been infected by SARS-CoV-2, and more than 3.2 million of them had died. The global incidence of COVID-19 has recently grown dramatically, and more and more of the population is at risk because there is no specific treatment for COVID-19 (Serafin et al. 2020).

A good knowledge of COVID-19 infection prevention is of utmost importance (Na et al. 2020). Patients infected by SARS-CoV-2 show clinical manifestations of pneumonia (Xiaoping et al. 2020; Asadi-Pooya and Simani 2019; Chen et al. 2020) that are similar to the symptoms caused by SARS-CoV (Huang et al. 2020; Peiris et al. 2004). Previous experience with SARS is therefore a valuable reference, and many scientists have compared the current understanding of COVID-19 with the SARS outbreak of 2003 (Gorbalenya et al. 2020; Wei-Jie et al. 2020; Tort et al. 2020), especially with respect to clinical methods for dealing with COVID-19 (Mahmoud et al. 2020) and genetic diversity and evolution (Li et al. 2020a). However, previous studies have neglected the extent to which the relative synonymous codon usage (RSCU) values of SARS-CoV-2 parallel those of its hosts. How to evaluate the distance between the RSCU values of the virus and those of the host therefore becomes a key issue.

SARS-CoV-2 is an enveloped virus with a positive-sense single-stranded RNA genome of about 29 kb (Shuo et al. 2016). Like other coronaviruses, its genome contains a long gene encoding the ORF1ab polyprotein. The ORF1ab polyprotein of SARS-CoV-2 is split into 16 putative proteins based on its alignment with the human SARS-CoV polyprotein (Srinivasan et al. 2020). Since the translated ORF1ab polyprotein regions are very long, the sequence similarity is sufficient to classify them as homologs (Muhamad et al. 2020). Many factors may affect the RSCU in viruses, such as mutation pressure, gene length and natural selection. In particular, hosts can affect the RSCU of viruses by influencing their adaptation and immune escape, so the RSCU of ORF1ab may be useful for understanding their molecular evolution (He et al. 2020). Almost no previous study has examined the RSCU of ORF1ab in coronaviruses.

In this study, to test the hypothesis that RSCU varies between different coronaviruses, we calculated the RSCU values of ORF1ab in 30 SARS-CoV-2 strains and 20 SARS-CoV strains. We then performed a comprehensive analysis of the differences between their RSCU values and of their evolutionary characteristics.

Material and methods

To study these coronaviruses, complete genomes of SARS-CoV and SARS-CoV-2 were obtained from the NCBI (http://www.ncbi.nlm.nih.gov) database. The genomes, comprising 23 SARS-CoV and 31 SARS-CoV-2 genomes, were downloaded on September 10, 2020. Sequences with correct start and stop codons and a length that is a multiple of three bases were considered effective sequences. Detailed information on the effective gene sequences, including their NCBI accession numbers and classification, is shown in Table 1. The ORF1ab gene sequences were then extracted from these initial sequences to calculate their relative synonymous codon usage.
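The filtering rule for effective sequences can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `is_effective` is hypothetical, the start codon is assumed to be ATG (AUG), and the stop codons are assumed to be those of the standard genetic code. The study's actual filtering script is not described in the text.

```python
# Minimal sketch of the "effective sequence" filter described above.
# Assumptions (not stated in the text): start codon is ATG/AUG and the
# stop codons are the three of the standard genetic code.
START_CODON = "ATG"
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_effective(seq: str) -> bool:
    """Return True if seq has a start codon, a stop codon, and a length
    that is a multiple of three bases."""
    s = seq.upper().replace("U", "T")  # accept RNA or DNA alphabets
    if len(s) < 6 or len(s) % 3 != 0:
        return False
    return s[:3] == START_CODON and s[-3:] in STOP_CODONS


if __name__ == "__main__":
    print(is_effective("ATGAAATAA"))  # start, one codon, stop
    print(is_effective("ATGAATAA"))   # length is not a multiple of three
```

The sketch deliberately checks only the three criteria named in the text; it does not look for internal stop codons or ambiguous bases.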

Table 1 Basic sequence information of SARS-CoV and SARS-CoV-2 genes

The RSCU values of the ORF1ab genes of all coronaviruses were calculated to determine their codon usage patterns, and the quantitative difference between SARS-CoV and SARS-CoV-2 was calculated for every codon. The RSCU values were calculated as follows:

$$RSCU = \frac{n_{i} \times g_{ij}}{\sum\limits_{j = 1}^{n_{i}} g_{ij}}.$$
(1)

Here $g_{ij}$ is the observed number of the jth codon for the ith amino acid, which has $n_{i}$ synonymous codons. Codons with RSCU values greater than 1.0 are usually regarded as abundant codons, whereas those with RSCU values less than 1.0 are defined as less-abundant codons. Based on the RSCU values, the distance between coronaviruses and their hosts can be calculated directly. The quantity D(A,B) was established to evaluate the potential role of the overall codon usage pattern of the host in shaping the overall codon usage pattern of viruses.
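As an illustration of Eq. 1, the following Python sketch computes RSCU values for a single coding sequence. It assumes the standard genetic code and in-frame input; the names `CODON_TABLE` and `rscu` are hypothetical and are not taken from the study. Codon families that do not occur in the sequence are given an RSCU of 0 here, which is a convention chosen for the sketch.

```python
from collections import Counter
from itertools import product

# Standard genetic code in TCAG order ('*' marks stop codons).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(seq: str) -> dict:
    """RSCU per codon: n_i * g_ij / sum_j g_ij within each synonymous family."""
    s = seq.upper().replace("U", "T")
    usable = len(s) - len(s) % 3  # ignore a trailing partial codon
    counts = Counter(s[i:i + 3] for i in range(0, usable, 3))

    # Group codons by the amino acid (or stop signal) they encode.
    families = {}
    for codon, aa in CODON_TABLE.items():
        families.setdefault(aa, []).append(codon)

    values = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        for c in codons:
            # Sketch convention: unused families get RSCU 0.
            values[c] = len(codons) * counts[c] / total if total else 0.0
    return values


if __name__ == "__main__":
    toy = "ATG" + "TTT" + "TTT" + "TTC" + "TAA"  # toy sequence, not real data
    result = rscu(toy)
    print(result["TTT"], result["TTC"], result["TAA"])
```

In the toy sequence, phenylalanine is encoded twice by TTT and once by TTC, so TTT gets 2 × 2/3 and TTC gets 2 × 1/3. A sequence that uses only TAA as its stop codon gives TAA an RSCU of 3, because the stop family has three members.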

$$R(A,B) = \frac{\sum\nolimits_{i = 1}^{59} a_{i} \times b_{i}}{\sqrt{\sum\nolimits_{i = 1}^{59} a_{i}^{2} \times \sum\nolimits_{i = 1}^{59} b_{i}^{2}}},$$
(2)
$$D(A,B) = \frac{{1 - R(A,B)}}{2},$$
(3)

where R(A,B), the cosine of the angle between the two RSCU vectors, measures the similarity between the coronavirus and the human host in terms of RSCU. Here $a_{i}$ is the RSCU value of a specific codon among the 59 synonymous codons of the coronavirus, and $b_{i}$ is the RSCU value of the same codon in the human host (Siddiq et al. 2017).
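Equations 2 and 3 translate directly into code. The sketch below is a hedged illustration: the function names `similarity_r` and `distance_d` are hypothetical, and the input vectors are short toy lists rather than the 59-element RSCU vectors used in the study.

```python
import math


def similarity_r(a, b):
    """R(A,B): cosine similarity of two equally long RSCU vectors (Eq. 2)."""
    if len(a) != len(b):
        raise ValueError("RSCU vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return dot / norm


def distance_d(a, b):
    """D(A,B) = (1 - R(A,B)) / 2 (Eq. 3)."""
    return (1.0 - similarity_r(a, b)) / 2.0


if __name__ == "__main__":
    virus = [1.2, 0.8, 1.5, 0.5]  # toy values, not data from the study
    host = [1.0, 1.0, 1.3, 0.7]
    print(distance_d(virus, host))
```

Two proportional vectors give R = 1 and D = 0, so a small D means the viral codon usage pattern points in nearly the same direction as that of the host. For non-negative inputs such as RSCU values, R cannot be negative, so D computed this way stays at or below 0.5.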

D(A,B) represents the potential effect of the overall codon usage of the host on that of the coronavirus, and this value ranges from zero to 1.0. Further, based on the RSCU values, the evolutionary distances among all 50 coronaviruses were calculated, excluding stop codons to avoid their confounding influence. The Euclidean distances among all observations of RSCU values were used to analyze the divergence among the coronaviruses.
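The pairwise Euclidean distances between RSCU vectors can be sketched as follows. The function name `pairwise_distances` and the toy vectors are assumptions made for illustration; the text does not say which software was used for this step or for building the tree from the distances. `math.dist` requires Python 3.8 or later.

```python
import math


def pairwise_distances(vectors):
    """Symmetric matrix of Euclidean distances between RSCU vectors."""
    n = len(vectors)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(vectors[i], vectors[j])
            matrix[i][j] = d
            matrix[j][i] = d  # distance is symmetric
    return matrix


if __name__ == "__main__":
    # Toy RSCU vectors for three hypothetical strains, not data from the study.
    strains = [[1.2, 0.8, 1.0], [1.1, 0.9, 1.0], [0.4, 1.6, 1.0]]
    for row in pairwise_distances(strains):
        print([round(x, 3) for x in row])
```

A matrix of this kind is the usual input to distance-based clustering; strains with similar codon usage end up with small mutual distances.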

Results and discussion

As an important parameter, the RSCU value, which is the ratio of the observed frequency of a codon to its expected usage frequency, is commonly used to evaluate synonymous codon bias (Qi et al. 2020; Prajna et al. 2018). A codon with an RSCU value greater than 1.0 is regarded as showing a positive codon usage pattern. Conversely, a codon with a low RSCU value (less than 0.5) can be regarded as a less-abundant codon. The overall RSCU values of the ORF1ab gene in 20 SARS-CoV and 30 SARS-CoV-2 genomes were calculated, and the results are shown in Fig. 1. G- and C-ended codons are clearly used less than A- and U-ended codons in both kinds of coronavirus.

Fig. 1

A Overall RSCU for ORF1ab of 20 SARS-CoV genomes, and B overall RSCU for ORF1ab of 30 SARS-CoV-2 genomes. Red bars denote codons with high RSCU values (more than 1.5), such as GUG, UCA, CCA, ACA, UAA, AGA, AGG and GGA, which can be regarded as abundant codons; blue bars denote the less-abundant codons. All 50 coronavirus ORF1ab genes use UAA as the stop codon, so the RSCU value of UAA is 3 (color figure online)

The RSCU values shown in Fig. 1 reflect the overall characteristics of codon usage in the 50 coronavirus ORF1ab genes. The RSCU values of each codon are shown separately in Fig. 1A and Fig. 1B. There are significant differences in the usage pattern of all codons except AUG and UGG. Among the three stop codons, all sequences of both SARS-CoV and SARS-CoV-2 use UAA, so the RSCU values of UAG and UGA are all zero. The RSCU values of some codons, such as UUU, UUA, AUA, GUU, CCU, CAA, AAA, GAA, UGU, AGA and GGU, are greater in SARS-CoV-2 than in SARS-CoV. Another recent study showed that the RSCU pattern of SARS-CoV-2 resembles that of humans to some extent, while bat-SL-CoVZC45 has a synonymous codon usage pattern similar to that of its host, the bat. The distance between SARS-CoV-2 and other animals is greater than its distance to humans (Ji et al. 2020).

Interestingly, Fig. 1 shows that the frequencies of U-ending and A-ending codons tend to be higher in SARS-CoV-2 than in SARS-CoV. The average usage frequency of codons grouped by their ending base is shown in Fig. 2, which clearly shows the composition of the ORF1ab gene sequences of both SARS-CoV-2 and SARS-CoV. G- and C-ended codons are used less frequently than A- and U-ended codons in both SARS-CoV-2 and SARS-CoV. More interestingly, G- and C-ended codons are used less in SARS-CoV-2 than in SARS-CoV (p < 0.001). These results show that codon usage bias is stronger in SARS-CoV-2.

Fig. 2

Third-base usage frequency for ORF1ab of SARS-CoV-2 and SARS-CoV
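The third-base usage frequency summarized in Fig. 2 can be approximated with a short sketch. The function name `third_base_frequency` is hypothetical, and the sketch counts every codon in the sequence, including the start and stop codons; the text does not state whether the study excluded any codons from this count.

```python
from collections import Counter


def third_base_frequency(seq: str) -> dict:
    """Fraction of codons ending in A, U, G and C for an in-frame sequence."""
    s = seq.upper().replace("T", "U")  # report in the RNA alphabet
    usable = len(s) - len(s) % 3       # ignore a trailing partial codon
    thirds = [s[i + 2] for i in range(0, usable, 3)]
    if not thirds:
        raise ValueError("sequence contains no complete codon")
    counts = Counter(thirds)
    return {base: counts[base] / len(thirds) for base in "AUGC"}


if __name__ == "__main__":
    toy = "ATG" + "TTT" + "TTC" + "TAA"  # toy sequence, not real data
    print(third_base_frequency(toy))
```

Averaging these fractions over all strains of one virus would give per-virus values comparable in spirit to those plotted in Fig. 2.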

When other coronavirus genes, such as the M and S coding sequences, are considered, the G- and C-ended codons also have higher usage compared to the A- and C-ended codons (see Table 2). For the same gene, G and C were preferred as the third codon base in SARS-CoV-2.

Table 2 Mean composition of codon-ending bases in coronavirus

Codons link nucleic acids and proteins, so RSCU values can sometimes be used to describe the evolution of genomes by calculating the distances between them (Xiaoyue et al. 2019). A heat map of the RSCU values of the 59 codons for each coronavirus is shown in Fig. 3A. The differences in RSCU values between the coronaviruses were calculated, and the result is shown in Fig. 3B. These results show that SARS-CoV-2 tends to use more UUA, AUA, GUU, CAA, AAA, GAA, UGU, AGU and AGA. This is consistent with the conclusion drawn from Fig. 2. Fig. 3B shows that, compared with SARS-CoV, SARS-CoV-2 uses U- and A-ended codons more than G- and C-ended ones.

Fig. 3

A Heat map of the RSCU values of each of the 50 coronavirus ORF1ab genes, and B the difference in mean RSCU values between SARS-CoV and SARS-CoV-2

To facilitate comparison of the evolutionary characteristics of SARS-CoV-2 and to better understand its evolutionary divergence, we selected SARS-CoV as a contrast. The evolutionary distances, that is, the values of D(A,B), which denote the RSCU distance between the viruses and their hosts, are shown for SARS-CoV-2 (= 0.0334) and SARS-CoV (= 0.0215) in Fig. 4A. Here, the mean value is used to evaluate the evolutionary distance, while the standard deviation is used to evaluate the degree of evolutionary divergence. Fig. 4A shows that the standard deviation for SARS-CoV (= 2.7611e−5) is smaller than that for SARS-CoV-2 (= 3.5499e−5), revealing broader evolutionary divergence among the SARS-CoV-2 genomes. A small evolutionary distance may lead to a low rate of variation, and a low rate of variation may in turn lead to a low degree of evolutionary divergence. We speculate that if SARS-CoV-2 continues to exist in human hosts, its degree of evolutionary divergence will become larger than it is now. RSCU analysis is very important for exploring the evolution of viruses at the molecular level. Differences in RSCU values among genomes can be used to describe their evolutionary characteristics (Paraskevis et al. 2020). The evolutionary characteristics of the 50 coronavirus genomes, represented by their phylogenetic tree, show subtle differences, and the result is shown in Fig. 4B. The maximum intraspecies difference is about 0.03, while the interspecific difference is about 1.8, which indicates that there is no close genetic relationship between SARS-CoV and SARS-CoV-2.

Fig. 4

A Distance between overall codon usage pattern of coronavirus and human, and B phylogenetic tree of coronavirus genomes

MERS-CoV (Jiang et al. 2020; Rokni et al. 2020), another important coronavirus that emerged in 2012, has recently been compared with SARS-CoV-2 by many researchers (Singh et al. 2020). The present study does not consider MERS-CoV, however, because it has already existed in humans for about 8 years and has not spread all over the world. Unlike MERS-CoV, SARS-CoV and SARS-CoV-2 are both controllable viruses in certain countries. Our results show that coronavirus strains of the same classification are more similar to each other, indicating that they share a similar codon usage pattern. On the other hand, the small D(A,B) values also reflect greater adaptation of the SARS strains to their hosts, or a longer period of existence in their hosts. In terms of evolutionary distance, the differences between strains of SARS-CoV are smaller, while the differences between SARS-CoV-2 strains are larger (Fig. 4B).

The novel SARS-CoV-2 lies behind the serious ongoing outbreak of COVID-19 (Li et al. 2020b). The genome of SARS-CoV-2 has a long ORF1ab gene that encodes the polyprotein. Research on SARS-CoV-2 from the perspective of virology and clinical strategies is growing (Lai et al. 2020; Koyama et al. 2020), and recent studies have examined its mechanisms as well as the content, the adaptation to human hosts and the evolutionary pressures of SARS-CoV-2 (Dittadi et al. 2020; Dilucca et al. 2020). However, no bioinformatics method has been used to explore the ORF1ab gene in coronaviruses.

Another recent genetic analysis of SARS-CoV-2 used eighty-six complete or near-complete genomes (Phan 2020) and found evidence of the genetic diversity and rapid evolution of SARS-CoV-2. In the present study, the ORF1ab gene, which is commonly used for nucleic acid testing of SARS-CoV-2 (Mathuria et al. 2020), is used to explore the diversity and evolution of SARS-CoV-2 through its RSCU values.

Although many methods have been used to explore the gene evolution of SARS-CoV-2 (Shi et al. 2020; Yin 2020; Bartolini et al. 2020), the samples are complex, for instance comprising large numbers and many kinds of sequences (Yoshimoto 2020; Devaux et al. 2020; Pfefferle et al. 2020; Yadav et al. 2020) from many areas, so a comprehensive result is hard to obtain. In the present study, the scope of the research objects was defined, and all suitable samples of SARS-CoV-2 and SARS-CoV were downloaded and considered.

Conclusions

COVID-19 has now spread all over the world and is present in most countries. Exploring the codon usage pattern of the virus is useful for understanding genetic characteristics and geographical differences. Coronaviruses have drawn attention as they have recently caused more and more human disease. The RSCU value is broadly significant for exploring the evolutionary characteristics of coronaviruses. Determining the differences between them is critical for understanding the molecular mechanism of transmission. Information obtained from the RSCU analysis of coronavirus ORF1ab will provide some insight into this question and will be helpful for investigating recombination. In this paper, ORF1ab genes from samples of SARS-CoV-2 and SARS-CoV were collected and analyzed. The results show a significant difference between SARS-CoV and SARS-CoV-2 in the RSCU values of ORF1ab. Most coronaviruses tend to use A and U as the third codon base, and this tendency is more pronounced in SARS-CoV-2. Most importantly, the differences between strains of SARS-CoV-2 are larger than those between strains of SARS-CoV, probably because of the longer period SARS-CoV-2 has existed in humans. The unique RSCU features of ORF1ab in SARS-CoV-2 reveal that it has no close genetic relationship with SARS-CoV. The new information obtained from the present analysis is significant for the effective control of SARS-CoV-2-induced pneumonia worldwide.