1 Introduction

Novel and re-emerging viruses continue to surface and wreak havoc on human health worldwide. Some of these viruses spread rapidly across the globe and cause high morbidity and mortality. For example, the Severe Acute Respiratory Syndrome (SARS) coronavirus caused a global pandemic in 2003, which resulted in approximately 916 deaths and affected around 30 countries [1]. The most recent outbreak of Ebola Virus Disease (EVD), the largest in the history of the disease, started in December 2013 (a decade after the SARS epidemic) and continued until April 2015 in southern Guinea, Liberia, Nigeria and Sierra Leone. Reports on EVD indicated a total of 15,052 laboratory-confirmed cases and 11,169 deaths [2]. Hence, the prompt and unambiguous detection of pathogenic viruses is critically important for actively controlling and preventing outbreaks of viral diseases.

Next Generation Sequencing (NGS) technologies provide researchers with unprecedented opportunities to develop new methodologies for viral detection, because a plethora of viral genomic sequences from NGS-based studies are available in the public domain for unrestricted access. However, researchers have opined that, given the abundance of NGS data, analyzing such data is the most challenging aspect of genomics-based viral detection [3]. This opens up a remarkable opportunity for researchers in the bioinformatics and Genomic Signal Processing (GSP) [4, 5] fields. GSP is an emerging branch of bioinformatics that applies Digital Signal Processing (DSP) techniques to genomic data analysis and uses the resulting biological facts to develop system-based applications [5].

The traditional methods used to identify the origin of genome sequences are pairwise and multiple sequence alignment. However, sequence alignment is fraught with difficulties for genome-wide comparative analysis of viruses, because different virus sequences diverge at a high rate due to gene mutation, horizontal gene transfer, and gene duplication, insertion and deletion [8]. Likewise, there is currently no universal oligonucleotide present in all viruses that could be used for homology searches against public databases to detect viruses [3].

To address the problems of alignment methods, several alignment-free methods have been developed for viral detection using genomic sequences. These include k-mer methods such as G-C content, dinucleotide composition profile and frequency chaos game representation [9, 10, 11, 12, 26]. Another, more recently developed category of alignment-free methods is the genome space based methods [13, 14]. The Natural Vector (NV) representation and its variants are representative examples of genome space alignment-free methods [13, 15, 16]. However, the accuracy of some of the k-mer and NV methods still leaves room for improvement [15, 16, 26].

In this study, we developed GSP-based features named Z-Curve Genomic Cepstral Coefficients (ZCGCC) as an alignment-free method that can be applied to the classification of pathogenic viruses. To evaluate the developed features, we extracted the genomic sequences of twenty six pathogenic viral strains from the Virus Pathogen Database and Analysis Resource (ViPR) corpus [5, 6]. The twenty six viral strains belong to four pathogenic viral species (namely Enterovirus, Dengue, Hepatitis C and Ebola), which are currently attracting global attention because they cause deadly diseases [5]. Different configurations of the naïve Bayes classifier were trained and validated with the ZCGCC. The naïve Bayes classifier was selected for this study because of its attractive properties, which have been widely exploited for accurate classification of genomic sequences [7].

2 Materials and Methods

2.1 Dataset

Genomic sequences of twenty six viral strains were extracted from the Virus Pathogen Database and Analysis Resource (ViPR) corpus [6] for this study. The extracted strains belong to four pathogenic viral species, namely Ebolavirus, Dengue virus, Hepatitis C and Enterovirus D68, which have been largely responsible for epidemic disease outbreaks. The available strains of each of these species were selected for this study to achieve a more elaborate and robust classification than the study in [5]. The distribution of the extracted data is imbalanced across classes, a challenge that is addressed with the random oversampling strategy in this study. Furthermore, sequence length varies widely even among samples that belong to the same viral strain. For example, the sequence length for the Ebola Zaire strain varies from 22 to 19,897 nucleotides, while that of EnterovirusH varies from 20 to 7,374. These huge differences in sequence length within the same viral strain illustrate why alignment-based and some existing alignment-free methods cannot offer accurate viral detection [17], and they provide the rationale for investigating a DSP technique in the current study. In total, 1,948 samples of viral strains were extracted. Since each viral strain represents a class in the dataset, our experimental dataset contains twenty six different classes.

2.2 Z-Curve Genomic Cepstral Coefficients

Deoxyribonucleic Acid (DNA) is a biomolecule that stores the digital information constituting the genetic blueprint of living organisms [9]. Each nucleotide in a DNA sequence is one of Adenine (A), Cytosine (C), Guanine (G) and Thymine (T). DNA sequence analysis using DSP methods requires mapping the nucleotides to appropriate numbers before any other computational operation can be performed. The choice of representative numbers affects how well the properties of the nucleotides are reflected for the detection of valuable biological characteristics [18]. The Z-Curve genomic mapping method is selected in this study because of its reported strengths over competing methods [19, 20, 27, 28]. The steps for computing the proposed ZCGCC are represented in the block diagram shown in Fig. 1, and the computation procedures are presented subsequently.

Fig. 1. Functional block diagram of the Z-Curve Genomic Cepstral Coefficients (ZCGCC).

Step 1:

The first block in Fig. 1 involves the computation of the Z-curve from the input nucleotide sequence. The Z-curve is a three-dimensional space curve that constitutes a unique numerical representation of a given DNA sequence [19]. A vital advantage of the Z-curve representation over other nucleotide numerical representation methods is its reproducibility property: once the coordinates of the Z-curve are defined, the corresponding nucleotide sequence can be uniquely reconstructed [20]. Given a nucleotide sequence that is read from the 5’ to the 3’ end, with N bases inspected from the first base to the nth base, the cumulative numbers of occurrences of the bases A, C, G and T are denoted by \( A_{n} \), \( C_{n} \), \( G_{n} \) and \( T_{n} \) respectively. For points \( Q_{i} ,\,\,\,\forall \,\,i\, = \,0,\;1,\;2,\; \ldots ,\;n - 1 \) in a 3-D coordinate system, the line that successively connects the nodes \( Q_{0}(x_{0}, y_{0}, z_{0}) \), \( Q_{1}(x_{1}, y_{1}, z_{1}) \), \( Q_{2}(x_{2}, y_{2}, z_{2}) \), …, \( Q_{n}(x_{n}, y_{n}, z_{n}) \) is the Z-Curve of the nucleotide sequence being examined. These nodes are mathematically represented as [20, 28]:

$$ \left\{ {\begin{array}{*{20}c} {x[n]\, = \,2(A_{n} \, + \,G_{n} )\, - \,n} & {\forall \,n\, = \,0,\,1,\,2,\, \ldots ,\,N\, - \,1} \\ {y[n]\, = \,2(A_{n} \, + \,C_{n} )\, - \,n\,} & {} \\ {z[n]\, = \,2(A_{n} \, + \,T_{n} )\, - \,n} & {} \\ \end{array} } \right. $$
(1)

where \( A_{0} = C_{0} = G_{0} = T_{0} = 0 \) and \( x_{0} = y_{0} = z_{0} = 0 \).

In order to derive biological meaning from Eq. (1), it is rewritten using the identity \( A_{n} \, + \,C_{n} \, + \,G_{n} \, + \,T_{n} \, = \,n \) to obtain:

$$ \left\{ {\begin{array}{*{20}c} {x[n]\, = \,(A_{n} + G_{n} )\, - \,(C_{n} + T_{n} ) \equiv R_{n} - Y_{n} } & {\forall \,n = 0,\,1,\,2,\, \ldots ,\,N - 1} \\ {y[n]\, = \,(A_{n} + C_{n} )\, - \,(G_{n} + T_{n} )\, \equiv M_{n} - K_{n} } & {} \\ {z[n]\, = \,(A_{n} + T_{n} )\, - \,(C_{n} + G_{n} ) \equiv W_{n} - S_{n} } & {} \\ \end{array} } \right. $$
(2)

where \( R_{n} \), \( Y_{n} \), \( M_{n} \), \( K_{n} \), \( W_{n} \) and \( S_{n} \) are the distributions of the purine, pyrimidine, amino, keto, weak hydrogen bond and strong hydrogen bond bases respectively [21]. The variables x[n], y[n] and z[n] in Eq. (2), which are also illustrated as the outputs of the first block in Fig. 1, are the three independent components of the Z-Curve, each with a distinct biological meaning. Component x[n] represents the distribution of the purine/pyrimidine bases (i.e. A or G versus C or T) from the first to the nth input nucleotide, and it possesses the following attributes:

$$ x[n] = \left\{ {\begin{array}{*{20}c} {Positive\,\,\,\,if\,\,R_{n} \, > \,Y_{n} } \\ {Negative\,\,\,\,if\,\,\,R_{n} \, < \,Y_{n} } \\ {Zero\,\,\,\,\,\,\,\,\,\,\,\,\,if\,\,R_{n} \, = \,Y_{n} } \\ \end{array} } \right. $$
(3)

The second component of the Z-Curve, y[n], is the distribution of the amino/keto bases (i.e. A or C versus G or T) from the first to the nth input nucleotide, and it possesses the following attributes:

$$ y[n] = \left\{ {\begin{array}{*{20}c} {Positive\,\,\,\,if\,\,M_{n} \, > \,K_{n} } \\ {Negative\,\,\,\,if\,\,\,M_{n} \, < \,K_{n} } \\ {Zero\,\,\,\,\,\,\,\,\,\,\,\,\,if\,\,M_{n} \, = \,K_{n} } \\ \end{array} } \right. $$
(4)

The third component of the Z-Curve, z[n], is the distribution of the weak/strong hydrogen bond bases (i.e. A or T versus C or G) from the first to the nth input nucleotide, with the following characteristics:

$$ z[n] = \left\{ {\begin{array}{*{20}c} {Positive\,\,\,\,if\,\,W_{n} \, > \,S_{n} } \\ {Negative\,\,\,\,if\,\,\,W_{n} \, < \,S_{n} } \\ {Zero\,\,\,\,\,\,\,\,\,\,\,\,\,if\,\,W_{n} \, = \,S_{n} } \\ \end{array} } \right. $$
(5)
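As an illustration of Step 1, the following is a minimal Python sketch of Eq. (2), not the implementation used in this study. The function name `z_curve` is hypothetical, and two choices are assumptions: symbols other than A, C, G and T are skipped, and the returned lists start at the first base, so the initial point \( x_{0} = y_{0} = z_{0} = 0 \) is omitted.

```python
def z_curve(seq):
    """Return the x, y, z Z-curve components of Eq. (2) for a DNA string."""
    x, y, z = [], [], []
    a = c = g = t = 0  # cumulative counts A_n, C_n, G_n, T_n
    for base in seq.upper():
        if base == "A":
            a += 1
        elif base == "C":
            c += 1
        elif base == "G":
            g += 1
        elif base == "T":
            t += 1
        else:
            continue  # assumption: ambiguous symbols (e.g. N) are skipped
        x.append((a + g) - (c + t))  # purine minus pyrimidine, R_n - Y_n
        y.append((a + c) - (g + t))  # amino minus keto, M_n - K_n
        z.append((a + t) - (c + g))  # weak minus strong H-bond, W_n - S_n
    return x, y, z


x, y, z = z_curve("ATGCGA")
print(x)  # [1, 0, 1, 0, 1, 2]
print(y)  # [1, 0, -1, 0, -1, 0]
print(z)  # [1, 2, 1, 0, -1, 0]
```

The signs of the printed values follow Eqs. (3) to (5): for example, x ends at 2 because the toy sequence contains four purines and two pyrimidines.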

Step 2:

The three Z-Curve components computed in the first step, which are streams of digital signals obtained from the input nucleotides, are passed to the second block in Fig. 1. At this stage, the Discrete Fourier Transform (DFT) is applied to each digital signal individually as follows:

$$ \left\{ {\begin{array}{*{20}c} {X[k] = \sum\limits_{n = 0}^{N - 1} {x[n]e^{{ - j\frac{2\pi kn}{N}}} } } & {\forall \,k = 0,1,2, \ldots ,N - 1} \\ {Y[k] = \sum\limits_{n = 0}^{N - 1} {y[n]e^{{ - j\frac{2\pi kn}{N}}} } } & {} \\ {Z[k] = \sum\limits_{n = 0}^{N - 1} {z[n]e^{{ - j\frac{2\pi kn}{N}}} } } & {} \\ \end{array} } \right. $$
(6)

where X[k], Y[k] and Z[k] are the spectra of the digital signals. The power spectrum, which is a quadratic combination of these spectra, was computed for some selected pathogenic viral sequences in this study, and the outputs are presented in Sect. 3.1.
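The following is a minimal sketch of Step 2 in pure Python, evaluating the DFT of Eq. (6) directly. The text does not spell out the quadratic combination, so the form \( P[k] = |X[k]|^{2} + |Y[k]|^{2} + |Z[k]|^{2} \) used here is an assumption, as are the function names `dft` and `power_spectrum`. The direct sum costs O(N²) operations; an FFT routine would normally replace it for long genomes.

```python
import cmath
import math


def dft(signal):
    """Direct evaluation of the DFT sum for one Z-curve component."""
    n_pts = len(signal)
    return [
        sum(signal[n] * cmath.exp(-2j * math.pi * k * n / n_pts)
            for n in range(n_pts))
        for k in range(n_pts)
    ]


def power_spectrum(x, y, z):
    """Assumed quadratic combination: |X[k]|^2 + |Y[k]|^2 + |Z[k]|^2."""
    spec_x, spec_y, spec_z = dft(x), dft(y), dft(z)
    return [
        abs(a) ** 2 + abs(b) ** 2 + abs(c) ** 2
        for a, b, c in zip(spec_x, spec_y, spec_z)
    ]


# Toy components of equal length (illustrative values only).
p = power_spectrum([1, 0, 1, 0], [0, 0, 0, 0], [0, 0, 0, 0])
print([round(v, 6) for v in p])
```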

Step 3:

Each of the nucleotide spectra computed in the previous step contains peaks that represent the dominant frequency components in the input nucleotide signals. The smooth curve that connects the peaks of a spectrum is referred to as the spectral envelope. The spectral envelope carries the identity of the input nucleotide sequences, as is the case in other DSP applications such as speech and mechanical fault diagnosis [22, 23]. The separation of the spectral envelope from the spectral details of the spectrum is referred to as cepstral analysis. The required procedures for cepstral analysis are represented by the third, fourth and fifth blocks in Fig. 1 and are expressed mathematically as follows:

$$ \left\{ \begin{array}{l} c_{x} [n] = \sum\limits_{k = 0}^{N - 1} \log (X[k])\, e^{j\frac{2\pi kn}{N}} \\ c_{y} [n] = \sum\limits_{k = 0}^{N - 1} \log (Y[k])\, e^{j\frac{2\pi kn}{N}} \\ c_{z} [n] = \sum\limits_{k = 0}^{N - 1} \log (Z[k])\, e^{j\frac{2\pi kn}{N}} \end{array} \right. $$
(7)

Using Euler’s formula, Eq. (7) becomes:

$$ \left\{ \begin{array}{l} c_{x} [n] = \sum\limits_{k = 0}^{N - 1} \log (X[k])\cos \left( \frac{2\pi kn}{N} \right) + j\sum\limits_{k = 0}^{N - 1} \log (X[k])\sin \left( \frac{2\pi kn}{N} \right) \\[2ex] c_{y} [n] = \sum\limits_{k = 0}^{N - 1} \log (Y[k])\cos \left( \frac{2\pi kn}{N} \right) + j\sum\limits_{k = 0}^{N - 1} \log (Y[k])\sin \left( \frac{2\pi kn}{N} \right) \\[2ex] c_{z} [n] = \underbrace{\sum\limits_{k = 0}^{N - 1} \log (Z[k])\cos \left( \frac{2\pi kn}{N} \right)}_{\text{real cepstrum}} + j\underbrace{\sum\limits_{k = 0}^{N - 1} \log (Z[k])\sin \left( \frac{2\pi kn}{N} \right)}_{\text{imaginary cepstrum}} \end{array} \right. $$
(8)

where \( c_{x} [n] \), \( c_{y} [n] \) and \( c_{z} [n] \) are the complex Z-Curve cepstra of the x[n], y[n] and z[n] components of the Z-Curve for the input nucleotides respectively. The complex cepstrum is a combination of the real and imaginary cepstrum as shown in Eq. (8). The real cepstrum is derived from the log magnitude spectrum of each signal, while the imaginary cepstrum carries the phase components. The spectral envelope and spectral details are captured in the real cepstrum. It should be noted that the word “cepstrum” was coined by reversing the first syllable of “spectrum”. Likewise, in the cepstrum domain, quefrency takes the place of frequency and lifter is used in place of filter [22]. The spectral envelope corresponds to the low quefrency components, while the spectral details correspond to the high quefrency components. Authors in other DSP application domains have reported that the first 15 or 20 coefficients of a cepstrum adequately represent the spectral envelope [24]. As depicted with the fifth block of Fig. 1, the first 15 or 20 coefficients (spectral envelope) of the real cepstrum are liftered using the window:

$$ w[n] = \left\{ \begin{array}{ll} 1, & 0 \le n \le L - 1 \\ 0, & \text{elsewhere} \end{array} \right. $$
(9)

where \( L \) is the cut-off length of the liftering window, which can be either 15 or 20 as stated earlier. The liftering window in Eq. (9) is multiplied with each of the real cepstra of Eq. (8) to obtain:

$$ \left\{ {\begin{array}{*{20}c} \begin{aligned} c_{lx} [n] = w[n]\,.\,c_{x} [n] \hfill \\ \hfill \\ \end{aligned} \\ \begin{aligned} c_{ly} [n] = w[n]\,.\,c_{y} [n] \hfill \\ \hfill \\ \end{aligned} \\ {c_{lz} [n] = w[n]\,.\,c_{z} [n]} \\ \end{array} } \right. $$
(10)

where \( c_{lx} [n] \), \( c_{ly} [n] \) and \( c_{lz} [n] \) are the low quefrency coefficients of \( c_{x} [n] \), \( c_{y} [n] \) and \( c_{z} [n] \) respectively.
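Step 3 can be sketched in pure Python as follows. This is a sketch under stated assumptions rather than the procedure used in this study: the real cepstrum is computed from the log magnitude of the DFT, the conventional 1/N inverse-DFT scaling is applied, and a small constant `eps` guards against the logarithm of a zero spectral value. The names `real_cepstrum`, `lifter` and `eps` are hypothetical.

```python
import cmath
import math


def dft(signal):
    """Direct O(N^2) DFT of a real-valued signal."""
    n_pts = len(signal)
    return [
        sum(signal[n] * cmath.exp(-2j * math.pi * k * n / n_pts)
            for n in range(n_pts))
        for k in range(n_pts)
    ]


def real_cepstrum(signal, eps=1e-12):
    """Inverse DFT of the log magnitude spectrum, keeping the real part."""
    n_pts = len(signal)
    log_mag = [math.log(abs(v) + eps) for v in dft(signal)]
    return [
        sum(log_mag[k] * cmath.exp(2j * math.pi * k * n / n_pts)
            for k in range(n_pts)).real / n_pts  # assumed 1/N scaling
        for n in range(n_pts)
    ]


def lifter(cepstrum, cutoff):
    """Keep the first `cutoff` (low quefrency) coefficients."""
    # Equivalent to multiplying by a rectangular window and dropping the zeros.
    return cepstrum[:cutoff]


signal = [1, 0, 1, 0, 1, 2, 3, 1, 0, -1, -2, -1, 0, 1, 2, 3, 2, 1, 0, 1]
low_quefrency = lifter(real_cepstrum(signal), 15)
print(len(low_quefrency))  # 15
```

Slicing is used instead of an explicit multiplication by the window because the zeroed high quefrency coefficients carry no information for the feature vector.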

Step 4:

In the final step, depicted with the last block of Fig. 1, the low quefrency cepstral coefficients obtained from Step 3 are concatenated to obtain the Z-Curve Genomic Cepstral Coefficients (ZCGCC). The ZCGCC is a compact genomic feature vector that represents the distribution of the dominant components of the purine, pyrimidine, amino, keto, weak hydrogen bond and strong hydrogen bond bases in the input nucleotide sequences. The ZCGCC feature vector is therefore an alignment-free identity of the input nucleotide sequences, and it is either 45 or 60 elements long depending on whether L in Eq. (9) is 15 or 20 respectively. The naïve Bayes classifier is used hereafter in this study to determine the discriminatory potency of the ZCGCC when it is applied to extract features from the pathogenic viral dataset.
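The concatenation in Step 4 can be sketched as below. The helper names `pad_or_trim` and `zcgcc_vector` are hypothetical. Zero-padding of inputs that yield fewer than L coefficients is an assumption introduced here so that every sample produces a vector of the same length; the text does not state how such short sequences were handled.

```python
def pad_or_trim(coeffs, cutoff):
    """Force a coefficient list to exactly `cutoff` entries."""
    trimmed = list(coeffs[:cutoff])
    # Assumption: missing coefficients are filled with zeros.
    return trimmed + [0.0] * (cutoff - len(trimmed))


def zcgcc_vector(c_lx, c_ly, c_lz, cutoff=20):
    """Concatenate the x, y and z low quefrency coefficients."""
    return (pad_or_trim(c_lx, cutoff)
            + pad_or_trim(c_ly, cutoff)
            + pad_or_trim(c_lz, cutoff))


# Placeholder cepstra (illustrative values, not real data).
cx, cy, cz = [0.1] * 25, [0.2] * 25, [0.3] * 25
print(len(zcgcc_vector(cx, cy, cz, cutoff=15)))  # 45
print(len(zcgcc_vector(cx, cy, cz, cutoff=20)))  # 60
```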

2.3 Experiments

In this study, three experiments were carried out on a PC with a 2.50 GHz Intel Core i5 CPU and 6.00 GB of RAM, running the 64-bit Windows 8 operating system. In all the experiments, the forty five and sixty element ZCGCC were utilized and their performances were compared using appropriate metrics. In the first experiment, the naïve Bayes classifier was trained with the ZCGCC extracted from the imbalanced dataset. In the second experiment, random oversampling was applied to obtain a balanced dataset. The random oversampling strategy adds instances to the minority classes in a random manner [25]. Since the highest number of instances for any class in the dataset is 100 (Table 1), we increased the number of instances of every minority class (instances < 100) to 100 to obtain the balanced dataset. The ZCGCC feature vectors extracted from the balanced dataset were then used to train the naïve Bayes classifier. In the third experiment, the variant of ZCGCC that gave the best result in the second experiment was compared, using the balanced dataset, with two other alignment-free methods in the literature, namely Electron Ion Interaction Pseudopotential – Genomic Cepstral Coefficient (EIIP-GCC) [5] and Frequency Chaos Game Representation (FCGR) [26].
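A minimal sketch of random oversampling is given below for illustration. It assumes that instances are added by duplicating randomly chosen existing samples of each minority class (sampling with replacement), which is one common reading of the strategy; the function name `random_oversample` and its `target` and `seed` parameters are hypothetical.

```python
import random
from collections import defaultdict


def random_oversample(features, labels, target=100, seed=0):
    """Duplicate random samples until every class has at least `target` items."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for vec, lab in zip(features, labels):
        by_class[lab].append(vec)

    out_x, out_y = [], []
    for lab, vecs in by_class.items():
        # Classes that already hold `target` or more samples are left unchanged.
        n_extra = max(0, target - len(vecs))
        extra = [rng.choice(vecs) for _ in range(n_extra)]
        for vec in vecs + extra:
            out_x.append(vec)
            out_y.append(lab)
    return out_x, out_y


# Toy data: class "a" has 3 samples and class "b" has 2.
feats = [[0], [1], [2], [3], [4]]
labs = ["a", "a", "a", "b", "b"]
bal_x, bal_y = random_oversample(feats, labs, target=4)
print(bal_y.count("a"), bal_y.count("b"))  # 4 4
```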

Table 1. Experimental results of the imbalanced dataset with ZCGCC.

3 Results and Discussion

3.1 Power Spectrums of the Z-Curve Encoded Viruses

Figure 2 shows the distinct power spectrums of the different strains of Enterovirus, Hepatitis C, Dengue and Ebola viruses. Similar to the illustrations in Fig. 2, previous studies have used the Z-Curve power spectrum to graphically illustrate the mitochondrial DNA of Homo sapiens [27] and lung cancer biomarker genes [28, 29].

Fig. 2. Power spectrums of Z-Curve encoded Enterovirus, Hepatitis C, Dengue and Ebola viruses.

3.2 Classifier Training Results

The results of the first experiment, in which the imbalanced dataset was investigated, are shown in Table 1. Four different naïve Bayes kernel functions were tested, namely Gaussian, uniform, Epanechnikov and triangular [30]. The sixty element ZCGCC gave higher accuracies and lower Misclassification Errors (ME) than the forty five element ZCGCC for each of the kernel functions. The triangular function ranked best (accuracy = 91.2218%, ME = 0.0878) for the sixty element ZCGCC. A two-sample t-test was further used to investigate whether the difference between the forty five and sixty element ZCGCC is statistically significant. The null hypothesis of no difference between the means of the two sets of accuracies is rejected, p < 0.05 (p = 0.0278), as is that for the two sets of MEs, p < 0.05 (p = 0.0280). This shows that the performance of the sixty element ZCGCC is significantly better than that of the forty five element ZCGCC for the imbalanced dataset.
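For readers unfamiliar with kernel-based naïve Bayes, the sketch below shows the textbook forms of the four kernel functions and how a per-feature kernel density estimate feeds a class log-likelihood. It is illustrative only: the bandwidth value, the density floor of 1e-300 and the names `KERNELS`, `kde` and `class_log_likelihood` are assumptions, and the toolbox used for the reported experiments may select bandwidths differently.

```python
import math

# Textbook kernel functions K(u); each integrates to 1.
KERNELS = {
    "gaussian": lambda u: math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi),
    "uniform": lambda u: 0.5 if abs(u) <= 1 else 0.0,
    "epanechnikov": lambda u: 0.75 * (1 - u * u) if abs(u) <= 1 else 0.0,
    "triangular": lambda u: 1 - abs(u) if abs(u) <= 1 else 0.0,
}


def kde(value, samples, bandwidth, kernel="triangular"):
    """Kernel density estimate of one feature at `value`."""
    k = KERNELS[kernel]
    total = sum(k((value - s) / bandwidth) for s in samples)
    return total / (len(samples) * bandwidth)


def class_log_likelihood(vector, training_vectors, bandwidth=1.0,
                         kernel="triangular"):
    """Naive Bayes assumption: sum the per-feature log densities of one class."""
    total = 0.0
    for j, value in enumerate(vector):
        column = [row[j] for row in training_vectors]
        density = kde(value, column, bandwidth, kernel)
        total += math.log(max(density, 1e-300))  # floor avoids log(0)
    return total


# Toy class with two training vectors of two features each.
train = [[0.0, 1.0], [0.2, 0.8]]
print(class_log_likelihood([0.1, 0.9], train, bandwidth=0.5))
```

A classifier would add the log prior of each class to this log-likelihood and predict the class with the largest sum.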

Table 2 shows the results of the second experiment, in which the balanced dataset obtained through random oversampling was used to train the naïve Bayes classifier. The sixty element ZCGCC again gave higher accuracies and lower MEs for all the kernel functions compared to its forty five element counterpart. Similar to the first experiment, the triangular kernel function gave the best overall performance for the sixty element ZCGCC (accuracy = 93.0385%, ME = 0.0696).

Table 2. Experimental results of the balanced dataset with ZCGCC

It is also notable that the performance results of the ZCGCC for the balanced dataset in the second experiment are better than the corresponding results in the first experiment for all the kernel functions. This shows that the random oversampling method positively influenced the performance of the ZCGCC. Since the sixty element ZCGCC outperformed the forty five element ZCGCC in both the first and second experiments, we further investigated whether the improvement of the sixty element ZCGCC for the balanced dataset (second experiment) over the sixty element ZCGCC for the imbalanced dataset (first experiment) is statistically significant. The null hypothesis of no difference between the means of the two sets of accuracies is rejected, p < 0.05 (p = 0.0122), and the null hypothesis of no difference between the means of the two sets of MEs is also rejected, p < 0.05 (p = 0.0122). Thus, the performance of the sixty element ZCGCC on the balanced dataset is significantly better than on the imbalanced dataset.

Based on its overall best performance, the sixty element ZCGCC is proposed in this study as an alignment-free method for viral pathogen detection.

The third experiment was carried out to compare the proposed alignment-free method (i.e. the sixty element ZCGCC) with two other alignment-free methods in the literature, namely EIIP-GCC [5] and FCGR [26]. Table 3 shows the results of the third experiment for EIIP-GCC and FCGR using the balanced dataset. We deem it adequate to use the balanced dataset for this comparison since it produced the best result for the proposed method in the second experiment. The results of the proposed sixty element ZCGCC in Table 2 are better than those of EIIP-GCC in Table 3 for all the corresponding kernel functions. For instance, the triangular kernel function gave the highest accuracy of 93.0385% (ME = 0.0696) for the ZCGCC, whereas the accuracy obtained with the triangular kernel function for the EIIP-GCC was 84.5% (ME = 0.1550). Furthermore, the improvement in the performance of the proposed ZCGCC over EIIP-GCC is statistically significant, p < 0.05 (p = 8.82e−06).

Table 3. Experimental results of the balanced dataset with EIIP-GCC and FCGR

The performance result of the proposed ZCGCC in Table 2, which was obtained using the triangular kernel function, is also slightly better than the highest performance result of the FCGR (accuracy = 92.9231%, ME = 0.0708).

It can be inferred from the results obtained in this study that the first 20 elements of the real cepstrum are more representative of the spectral envelope of the genomic signal than the first 15. A previous study reported the development of ZCURVE_V, a gene finding application for viruses that uses DNA sequences and the Z-Curve mathematical paradigm. The authors reported that ZCURVE_V can accurately predict genes in viral genomes as short as about 1000 nucleotides [19]. In contrast, the alignment-free ZCGCC method proposed in this study detects viral genomes of both long and short lengths with accuracy that compares favorably with existing alignment-free methods in the literature.

4 Conclusion

We have reported the development of ZCGCC, an alignment-free method for virus detection, in this paper. The sixty element ZCGCC gave superior performance to the EIIP-GCC and comparable performance to FCGR. However, ZCGCC provides advantages such as low dimension, global genome analysis and low computational requirements, which make it a promising method for developing diagnostic tools for the detection of pathogenic viral diseases. Future work will include an investigation of the ZCGCC for the detection of other organisms in the prokaryotic and eukaryotic domains of life. We also hope to experiment with other machine learning methods to investigate the possibility of improved performance.