A classification model for lncRNA and mRNA based on k-mers and a convolutional neural network
Abstract
Background
Long non-coding RNA (lncRNA) is closely related to many biological activities. Since its sequence structure is similar to that of messenger RNA (mRNA), it is difficult to distinguish between the two based on sequence features alone. Therefore, it is particularly important to construct a model that can effectively distinguish lncRNA from mRNA.
Results
First, the difference in k-mer frequency distributions between lncRNA and mRNA sequences is considered in this paper, and each sequence is transformed into a k-mer frequency matrix. Moreover, for larger values of k, where many k-mer types occur, relative entropy is used to screen the k-mers. A classification model for lncRNA and mRNA sequences is then built by feeding the k-mer frequency matrices into a convolutional neural network. Finally, the optimal k-mer combination for the classification model is determined, and the model is compared with other machine learning methods on human, mouse and chicken data. The results indicate that the proposed model has the highest classification accuracy. Furthermore, the ability of the model to recognize a single sequence is verified.
Conclusion
We established a classification model for lncRNA and mRNA based on k-mers and a convolutional neural network. The model combining 1-mers, 2-mers and 3-mers achieved the highest classification accuracy: 0.9872 in humans, 0.8797 in mice and 0.9963 in chickens, outperforming random forest, logistic regression, decision tree and support vector machine classifiers.
Keywords
lncRNA; mRNA; k-mers; Relative entropy; Convolutional neural network
Abbreviations
 ANT
Adjoining Nucleotide Triplets
 AUC
Area Under Curve
 CNCI
Coding-Non-Coding Index
 CPC
Coding Potential Calculator
 DT
Decision Tree
 lncRNA
Long non-coding RNA
 LR
Logistic regression
 MLCDS
Most-Like Coding Domain Sequence
 mRNA
Messenger RNA
 RF
Random Forest
 ROC curve
Receiver operating characteristic curve
 SVM
Support Vector Machine
Background
Transcription of the genome produces messenger RNAs (mRNAs), small RNAs (miRNAs, snRNAs) and non-coding RNAs (ncRNAs) [1, 2]. LncRNA is a kind of non-coding RNA with a length exceeding 200 nucleotides [3]. Interest in long non-coding RNAs has been growing [4]. Current studies demonstrate that lncRNA sequences primarily play a role in two aspects of the organism. On one hand, they perform vital biological functions at many stages, such as transcription and the regulation of life processes. For example, lncRNAs can participate in the regulation of gene expression at three levels: epigenetic regulation, transcriptional regulation and post-transcriptional regulation [5], and some lncRNAs can bind to specific chromatin-related sites through chromatin remodelling, resulting in the expression of silenced related genes [6, 7]. On the other hand, lncRNAs have direct or indirect links with some human diseases, such as lung cancer, prostate cancer, Alzheimer's disease, Prader-Willi syndrome and agitation [4]. Therefore, the identification and cataloguing of lncRNAs will help researchers to further explore their functions at the molecular level [8].
However, thus far, only a small number of lncRNAs have been included in non-coding RNA-related databases, and of those, only a few have had their functions thoroughly studied and annotated. More fundamentally, any functional study of an lncRNA rests on the premise that the sequence can first be determined to be an lncRNA, which is a principal difficulty in biological and bioinformatics research. Since lncRNAs and mRNAs share many similarities in sequence structure, the task of identifying lncRNA sequences is all the more challenging. Accordingly, designing a model that can accurately distinguish lncRNA from mRNA sequences in the large amounts of data produced by high-throughput sequencing is an important biological research topic.
At present, research classifies coding and non-coding RNA mainly from three angles: first, discrimination by the length of the open reading frame of coding versus non-coding sequences; second, discrimination by comparing the similarity between the sequence and known protein sequences using comparative genomics methods; and third, prediction based on the conservation of RNA secondary structure. Each of these three approaches has its own merits and demerits, and it is difficult to obtain accurate classification results from any one of them alone. To solve this problem, some scholars have constructed models and software for classifying mRNAs and lncRNAs by extracting non-coding features from lncRNA sequences. For example, the Bioinformatics Center of Peking University developed an online lncRNA identification tool, CPC (Coding Potential Calculator) [9], which has been widely used in many fields, such as sequence alignment, disease research and evolutionary analysis. It extracts six features, including the ratio of the open reading frame length to the sequence length, the integrity of the open reading frame and the prediction reliability score of the open reading frame, trains a support vector machine (SVM) on these features, and thereby builds a prediction model for non-coding RNA. Sun et al. [10] proposed a method named CNCI (Coding-Non-Coding Index) based on Adjoining Nucleotide Triplets (ANT). Its framework consists of a scoring matrix and a classification model. First, the species categories of the sample data were determined, and the probability of occurrence of each pair of adjacent triplets in the coding, non-coding and intergenic regions was counted to construct three ANT probability matrices.
Then the log-ratio of the ANT probability matrices of the coding and non-coding regions was calculated to obtain the scoring matrix of the CNCI algorithm. The CNCI scoring matrix was then used to determine the Most-Like Coding Domain Sequence (MLCDS), and five different features were extracted from each MLCDS for classification. Dang [11] selected three characteristics relating to the open reading frame, three characteristics of the integrated sequence secondary structure and two characteristics of protein similarity, and examined seven combinations of the three feature types. Her lncRNA prediction model is suitable for different data sources.
The CSF prediction software proposed by Lin et al. [12] identifies lncRNAs mainly by calculating the frequency of codon substitution in the target sequence. Building on the CSF model, evolutionary information from sequence alignments was introduced to develop the PhyloCSF recognition model [13]. Wucher et al. developed the FEELnc program, an lncRNA and mRNA recognition tool based on a random forest classifier trained on features such as open reading frames [2]. In 2014, Lertampaiporn et al. developed a hybrid model based on logistic regression and random forests to distinguish short non-coding RNA sequences from lncRNA sequences. The model synthesized five features into a combined score, SCORE, which improved lncRNA classification performance [14].
To summarize, most available methods for distinguishing lncRNA from mRNA sequences are based on the biological characteristics of the sequences. However, an lncRNA sequence may contain segments that overlap the coding regions of mRNAs [2], so recognizing lncRNA sequences with existing methods is more complex than recognizing mRNA sequences. To avoid relying on biological sequence characteristics when building a classification model, Wei proposed an lncRNA and mRNA classification model based on k-mers [15]. That model used a maximum entropy algorithm to screen k-mers and a support vector machine for classification; however, it has high computational complexity and cost. In addition, conventional machine learning algorithms, such as support vector machines, logistic regression, decision trees, neural networks, Bayesian networks, genetic algorithms and hidden Markov models, require preprocessing of the raw input data, selection of sequence features based on domain-expert knowledge and fine-tuning of parameters to increase accuracy [16]. Therefore, we propose a model that effectively classifies lncRNAs and mRNAs without relying on sequencing quality or the biological structural characteristics of the sequence, while also avoiding a large number of calculations.
Since a Convolutional Neural Network (CNN) can self-learn the characteristics of a sequence through continuous training without artificial intervention and can efficiently process large amounts of data, no domain-expert knowledge or parameter fine-tuning is needed to increase accuracy [17, 18]. CNNs have been used to predict DNA-protein binding sites [19] and the specificity of DNA- and RNA-binding proteins [20]. Zhang et al. developed two methods for predicting DNA-protein binding using a high-order convolutional neural network architecture and a weakly supervised convolutional neural network architecture [21, 22]. Transcription factor prediction from ChIP-seq data [23] and CRISPR guide RNA design [24] have also been conducted effectively using CNNs. Whether CNNs can be applied equally well to the classification of lncRNAs and mRNAs is not yet known.
In this study, we introduce the convolutional neural network to establish a classification model for lncRNAs and mRNAs. The paper is organized as follows. First, the k-mer frequency information of lncRNA and mRNA sequences is statistically analysed. Second, we construct a classification model for lncRNA and mRNA sequences with a convolutional neural network, taking the k-mer frequency matrix as input. Third, we determine the optimal k-mer combination for the model, compare it with other machine learning methods and verify its ability to recognize a single sequence.
Results and discussion
Training data and testing data
We download human lncRNA and mRNA sequence data from the GENCODE database (v26). Each time, 10,000 sequences are randomly selected from each of the two sample sets, i.e., 10,000 lncRNA sequences and 10,000 mRNA sequences, of which 8000 sequences are used as training samples and the remaining 2000 as test samples. We perform 10 such random selections to assess the impact of chance in the randomly selected training data.
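The sampling scheme above can be sketched as follows; this is a minimal illustration, and the function and parameter names (`sample_split`, `n_total`, `n_train`) are ours, not from the original code.

```python
import random

def sample_split(lnc_seqs, mrna_seqs, n_total=10000, n_train=8000, seed=0):
    """Randomly draw n_total sequences per class, then split each class
    into training and test subsets (labels: lncRNA = 0, mRNA = 1)."""
    rng = random.Random(seed)
    lnc = rng.sample(lnc_seqs, n_total)
    mrna = rng.sample(mrna_seqs, n_total)
    train = [(s, 0) for s in lnc[:n_train]] + [(s, 1) for s in mrna[:n_train]]
    test = [(s, 0) for s in lnc[n_train:]] + [(s, 1) for s in mrna[n_train:]]
    rng.shuffle(train)  # mix the two classes for training
    return train, test
```

Repeating the call with 10 different seeds would give the 10 independent random selections described above.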
Determination of the k-mer parameters of the lncRNA classification model
The calculation is divided into two steps. In the first step, lncRNA sequences ranging from 250 nt to 3500 nt and mRNA sequences ranging from 200 nt to 4000 nt are selected, and the k-mer subsequences are extracted. For larger values of k, relative entropy is used to select features, and the accuracy of the model before and after screening is compared. The frequency of each k-mer subsequence in each sequence is then counted, and the frequency matrix is constructed. In the second step, the convolutional neural network is trained on the constructed frequency matrices to obtain the classification results. The results for different values of k are compared to obtain the optimal classification model parameters.
Determination of the optimal k-mer combination in the lncRNA classification model
Based on the statistical analysis, we randomly select 10,000 lncRNA sequences ranging from 250 nt to 3000 nt in length from the lncRNA dataset downloaded from the GENCODE database, along with 10,000 mRNA sequences ranging from 200 nt to 4000 nt in length.
Next, we build the lncRNA and mRNA classification model. The first layer of the convolutional neural network uses 32 convolution kernels of size 3 × 3 with the ReLU activation function, and the periphery of the k-mer frequency matrix is padded with zeros so that the matrix size is unchanged by the convolution. The second layer is also convolutional, with 64 kernels of size 3 × 3 and the ReLU activation function. The third layer is a max-pooling layer with a 2 × 2 pooling region. Before the fully connected layer, neuron connections are dropped with a probability of 0.25 to prevent overfitting. The last layer is a fully connected layer with 128 neurons, after which dropout is applied with a probability of 0.5. Finally, the SoftMax function produces the classification result. The cross-entropy loss function is used during training, and the optimizer is Adadelta.
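The architecture just described can be sketched in Keras as follows. This is a reconstruction from the description above, not the authors' original code; the input shape shown (a 7 × 12 matrix for the combined 1-, 2- and 3-mers) is one configuration used in the paper, and the ReLU on the dense layer is our assumption.

```python
from tensorflow.keras import layers, models

def build_model(input_shape=(7, 12, 1), n_classes=2):
    """CNN sketch: 32 and 64 conv kernels of 3 x 3 with ReLU, 2 x 2 max
    pooling, dropout 0.25 and 0.5, a 128-neuron dense layer, SoftMax out."""
    model = models.Sequential([
        # zero padding keeps the matrix size constant through the first layer
        layers.Conv2D(32, (3, 3), activation="relu", padding="same",
                      input_shape=input_shape),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Dropout(0.25),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(loss="categorical_crossentropy",
                  optimizer="adadelta", metrics=["accuracy"])
    return model
```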
To determine which k-mers best differentiate lncRNA and mRNA sequences and maximize model accuracy, we test k-mers with different values of k. The established lncRNA and mRNA classification model learns autonomously from each input. Finally, the classification accuracy, precision, recall and F_{1} score of the model are compared across the different k values.
Table 1 Model classification accuracy for individual k values
k value  number of k-mers  matrix form  model accuracy  precision rate (P)  recall rate (R)  F_{1} score  calculating time (s/epoch)

3  64  8 × 8  0.7508  0.81  0.79  0.79  5 
4  256  16 × 16  0.7610  0.85  0.83  0.83  20 
5  1024  32 × 32  0.7565  0.93  0.92  0.92  95 
6  4096  64 × 64  0.7748  0.87  0.85  0.84  855 
Table 2 Model classification accuracy for combinations of two k values
k value  number of k-mers  matrix form  model accuracy  precision rate (P)  recall rate (R)  F_{1} score  calculating time (s/epoch)

1 + 3  68  17 × 4  0.9280  0.94  0.94  0.94  4 
1 + 4  260  10 × 26  0.9600  0.98  0.98  0.98  32 
1 + 5  1028  4 × 257  0.4995  0.50  0.50  0.36  43 
2 + 3  80  8 × 10  0.9810  0.99  0.99  0.99  9 
2 + 4  272  16 × 17  0.7838  0.87  0.86  0.86  37 
2 + 5  1040  26 × 40  0.7672  0.91  0.90  0.90  180 
3 + 4  320  16 × 20  0.7666  0.90  0.90  0.90  47 
3 + 5  1088  32 × 34  0.7566  0.94  0.94  0.94  189 
4 + 5  1280  32 × 40  0.7532  0.95  0.94  0.94  290 
Since the time complexity is too high when k = 6, combinations involving it would require an unacceptable computation time; therefore only k = 1, 2, 3, 4 and 5 are combined in pairs. From Table 2, the recognition accuracy of the classification model is significantly improved when two k-mers are combined, especially the combination of k = 2 and k = 3, whose accuracy reaches 0.9810. The second best is the combination of k = 1 and k = 4, with an accuracy of 0.9600. This can be explained by the fact that combining k-mers reinforces the k-mer information of the sequence, allowing the model to extract more difference information through the self-learning of the convolutional neural network. However, Table 2 also shows that, although combined information can greatly improve recognition accuracy, not every combination does so. For example, the classification accuracy for k = 5 alone is 0.7565 (Table 1), but when k = 1 and k = 5 are combined, the accuracy is 0.4995; the accuracy decreases rather than increases.
Although combining k-mers can greatly improve the accuracy of the classification model, this strategy also consumes more computation time than a model with a single k-mer. The calculation time of the classification model is roughly proportional to the number of k-mers: the more k-mers, the longer the computation. There are 80 k-mers in the combination of k = 2 and k = 3, and one epoch takes 9 s, second only to the combination of k = 1 and k = 3. The longest calculation time belongs to the combination of k = 4 and k = 5, which contains 1280 k-mers and requires 290 s per epoch; running 200 epochs would take approximately 16 h to train the model. Note that the calculation time for the combination of k = 1 and k = 3 is 4 s, faster than that for k = 3 alone (5 s). This is because the 68 k-mers of the k = 1 and k = 3 combination are input as a 4 × 17 matrix with a 2 × 2 convolution layer for feature extraction, whereas the 64 3-mers are input as an 8 × 8 matrix with a 3 × 3 convolution layer. The combination of k = 1 and k = 3 is therefore faster than k = 3 alone.
Table 3 Model classification accuracy for combinations of three k values
k value  number of k-mers  matrix form  model accuracy  precision rate (P)  recall rate (R)  F_{1} score  calculating time (s/epoch)

1 + 2 + 3  84  7 × 12  0.9872  1.00  1.00  1.00  6 
2 + 3 + 4  336  12 × 28  0.9738  1.00  1.00  1.00  57 
2 + 3 + 5  1104  24 × 46  0.9798  1.00  1.00  1.00  217 
From Table 3, the combination of k = 1, k = 2 and k = 3 further improves the accuracy of the model, with recognition accuracy reaching 0.9872. More encouragingly, the calculation time is only 6 s, far less than that of the other k-mer combinations. Consequently, the k-mer combination of k = 1, 2, 3 not only achieves the best model accuracy but also yields a precision, recall and F_{1} score of 1.00, indicating an excellent classification effect. Based on these results, we determine that the optimal k-mers for constructing the lncRNA and mRNA classification model are the combination of 1-mers, 2-mers and 3-mers.
Determination of the optimal combination of screened k-mers for the lncRNA classification model
During the calculations, we find that when k is greater than 4, that is, when k is 5 or 6 or more, a considerable portion of the k-mers have a frequency of 0. The absence of most k-mer values may affect the recognition accuracy of the model and result in low classification accuracy. To verify this conjecture, we use relative entropy to filter the k-mers for k = 5 and k = 6. By sorting the information gains in descending order and retaining the k-mers that account for the top 98%, the k-mers carrying little difference information are filtered out. This method also effectively reduces the dimensions of the k-mer frequency matrix.
Table 4 K-mer calculation results after relative entropy (KL) screening
k value  number of k-mers  number of k-mers after KL screening  original model accuracy  model accuracy after KL screening  calculation time of the original model (s/epoch)  calculation time of the KL-screened model (s/epoch)

5  1024  115  0.7565  0.782  95 s  4 s 
6  4096  1045  0.7748  0.779  855 s  47 s 
4 + 5  1280  112  0.7532  0.629  290 s  4 s 
2 + 3 + 5  1104  195  0.9798  0.9761  217 s  27 s 
To compare the classification accuracy of k-mer combinations after screening, we combine the 5-mers retained by relative entropy screening with the k-mers of k = 4. The accuracy of this combination is only 0.6290; the accuracy of the model is reduced rather than improved. In addition, we combine the screened 5-mers with the k-mers of k = 2 and k = 3 and find that the accuracy improves to 0.9761, but this is still not the k-mer combination that yields the best classification.
Although the change in model accuracy after relative entropy screening is modest, Table 4 shows that, because the number of k-mers is reduced by the screening, the computation time of the model is greatly reduced. In particular, for the combination of k = 4 and k = 5, the calculation time falls from 290 s to 4 s, a more than 70-fold reduction.
Comparison of the model accuracy with four machine learning methods
Based on the previous analysis, the accuracy of the classification model is maximized when the k-mers of k = 1, k = 2 and k = 3 are combined as input to the convolutional neural network. Under 10-fold cross-validation, the training set loss of the convolutional neural network model averages 0.043, and the average classification accuracy is 0.9872; the average loss on the validation set is 0.0431, and its average accuracy is 0.9790. To assess the advantage of the convolutional neural network in classifying lncRNA and mRNA sequences over conventional machine learning algorithms, we also classify the sequences using four such algorithms: random forest, logistic regression, decision tree and support vector machine.
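For reference, the four baseline classifiers can be run on the same k-mer frequency vectors with scikit-learn. This sketch uses default hyperparameters, which the paper does not specify, so it illustrates the protocol rather than reproducing the exact reported numbers.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

def baseline_scores(X, y):
    """10-fold cross-validated accuracy for the four baseline models.
    X: n_samples x n_kmers frequency matrix; y: 0 = lncRNA, 1 = mRNA."""
    models = {
        "RF": RandomForestClassifier(),
        "LR": LogisticRegression(max_iter=1000),
        "DT": DecisionTreeClassifier(),
        "SVM": SVC(),
    }
    return {name: cross_val_score(clf, X, y, cv=10).mean()
            for name, clf in models.items()}
```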
Table 5 Comparison of the five models in human
model  model accuracy  precision rate (P)  recall rate (R)  F_{1} score

CNN  0.9872  0.9993  0.9955  0.9974 
RF  0.8820  0.8949  0.8867  0.8925 
LR  0.7020  0.7247  0.7183  0.7218 
DT  0.8030  0.7873  0.7852  0.7869 
SVM  0.7020  0.7245  0.7158  0.7179 
From Table 5, the accuracy of the convolutional neural network is 0.9872, far superior to the other algorithms. Random forest follows with a classification accuracy of 0.8820, then the decision tree with 0.8030. Logistic regression and the support vector machine share the lowest accuracy at only 0.7020.
Table 6 Comparison of the five models in mouse
model  model accuracy  precision rate (P)  recall rate (R)  F_{1} score

CNN  0.8797  0.8960  0.8590  0.8771 
RF  0.8120  0.8132  0.8130  0.8131 
LR  0.7541  0.7454  0.7700  0.7575 
DT  0.7001  0.6991  0.6977  0.6984 
SVM  0.7528  0.7564  0.7476  0.7520 
Table 7 Comparison of the five models in chicken
model  model accuracy  precision rate (P)  recall rate (R)  F_{1} score

CNN  0.9963  0.9943  0.9984  0.9963 
RF  0.9302  0.9351  0.9245  0.9298 
LR  0.8743  0.8902  0.8546  0.8720 
DT  0.8227  0.8148  0.8315  0.8230 
SVM  0.8724  0.8881  0.8538  0.8706 
Verification of the classification model in single lncRNA sequence recognition
Our results are tested using 2000 mRNA sequences and 2000 lncRNA sequences selected from the GENCODE v26 data. In addition, to verify whether the proposed classification model is suitable for identifying a single sequence, we download from the NCBI database a human lncRNA sequence that was discovered by Professor Gasri-Plotnitsky, an Israeli professor at Bar-Ilan University's Institute of Life Sciences, and his team, and published in Oncotarget in 2017 [25]. This lncRNA is called GASL1. Professor Gasri-Plotnitsky's research indicates that GASL1 expression inhibits cell cycle progression, identifying it as a novel lncRNA modulator of cell cycle progression and cell proliferation with a potential role in cancer. Moreover, liver cancer patients with low GASL1 expression may have worse survival.
Taking the GASL1 sequence as an example, we verify whether the classification model proposed in this paper can correctly classify it. The frequencies of 1-mers, 2-mers and 3-mers in the sequence are first calculated, and the three sets of k-mers are combined, giving a total of 84 k-mers. The frequencies of the 84 k-mers are then arranged into a 7 × 12 matrix and fed into the convolutional model. In the model constructed in this paper, the classification label of lncRNA is 0 and that of mRNA is 1; the model outputs the predicted category label of the sequence.
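The steps above can be sketched as follows. `combined_profile` is a hypothetical helper name of ours, and the commented prediction line assumes a trained Keras-style `model` object; the fixed A/T/C/G traversal order follows the construction described in the Methods.

```python
import itertools
import numpy as np

def combined_profile(seq, ks=(1, 2, 3)):
    """Normalized frequencies of all 1-, 2- and 3-mers (4 + 16 + 64 = 84
    values) in fixed A/T/C/G order, reshaped for the convolutional model."""
    order = "ATCG"
    freqs = []
    for k in ks:
        kmers = ["".join(p) for p in itertools.product(order, repeat=k)]
        counts = dict.fromkeys(kmers, 0)
        n = max(len(seq) - k + 1, 1)       # number of k-length windows
        for i in range(len(seq) - k + 1):
            sub = seq[i:i + k]
            if sub in counts:              # skip windows with ambiguous bases
                counts[sub] += 1
        freqs.extend(counts[km] / n for km in kmers)
    return np.array(freqs).reshape(7, 12, 1)

# Hypothetical usage with a trained model (label 0 = lncRNA, 1 = mRNA):
# pre_label = model.predict(combined_profile(seq)[np.newaxis]).argmax()
```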
The final prediction result is "pre_label is 0"; that is, the model recognizes the sequence as an lncRNA, indicating that the model classifies this sequence correctly.
Conclusions
The main purpose of this paper is to construct a model that can effectively classify lncRNA and mRNA. First, based on statistical analysis of the sample sequence lengths and k-mer frequency distributions, the lncRNA and mRNA sequences in the model training set are restricted to the ranges 250-3500 nt and 200-4000 nt, respectively, and a k-mer frequency matrix is constructed. Then, using the k-mer frequency matrix as the input to a convolutional neural network, a classification model for lncRNAs and mRNAs is established and implemented in Python. By comparing the classification accuracy for frequency matrices of different k-mer combinations, the model with 1-mers, 2-mers and 3-mers is found to have the highest accuracy, 0.9872. Comparing the established model with random forest, logistic regression, decision tree and support vector machine analyses using ROC curves shows an improved classification effect. The application of the model is then examined: the known lncRNA sequence GASL1 is correctly classified.
There remain many limitations to our research. For example, in the statistical analysis of the k-mers of lncRNA and mRNA sequences, only simple frequency analysis was used, with no in-depth statistical analysis. In addition, when extending the model to sequences of different species, the differences in k-mer information between species were not analysed in depth; only a preliminary discussion based on the calculation results is given. In the future, we will conduct a systematic analysis of k-mer information differences between species. In addition, as pointed out in [26], user-friendly and publicly accessible web servers represent the future direction for developing practically useful prediction methods and computational tools. Indeed, many practically useful web servers have significantly increased the impact of bioinformatics on medical science [27] and driven medicinal chemistry into an unprecedented revolution [28]. We shall therefore make efforts in future work to provide a web server for the prediction method presented in this paper.
Methods
Statistical analysis of k-mers of lncRNA and mRNA sequences
The human lncRNA and mRNA data were downloaded from the GENCODE database (v26), the mouse lncRNA and mRNA data from the GENCODE database (vM21), and the chicken lncRNA and mRNA data from the Ensembl database (5.0). The human data were used to build the convolutional neural network model; the mouse and chicken data were used to assess the advantage of the built model. The computing environment used in this paper was as follows: (1) operating system, Windows 10, with an Intel Core i3-2365M processor and 6 GB of memory; (2) Python 3.5 to run the CNN code.
Studies have shown that the k-mer frequency information in lncRNA and mRNA sequences can reveal the distribution of subsequences within biological sequences and thus measure the similarities and differences between sequences [29].
For a sequence of length m, the number of k-mer subsequences of length k is m − k + 1. A sequence generally consists of the four bases A, T, C and G, so there are 4^{k} possible k-mers of length k.
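As a quick illustration of the count m − k + 1 (a sketch; the helper name is ours):

```python
def kmer_windows(seq, k):
    """All overlapping k-length substrings of seq; a sequence of length m
    yields m - k + 1 of them."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# A length-5 sequence has 5 - 3 + 1 = 3 overlapping 3-mers:
# kmer_windows("ATCGA", 3) -> ["ATC", "TCG", "CGA"]
```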
From Fig. 4, the average counts of the A, C, G and T bases in the lncRNA sequences were approximately 254 nt, 217 nt, 216 nt and 240 nt, respectively, while the mRNA sequences had average A, C, G and T counts of approximately 364 nt, 420 nt, 422 nt and 343 nt. Therefore, the counts of all four 1-mers were higher in mRNA than in lncRNA. Furthermore, in both lncRNA and mRNA sequences, the counts of the C and G bases were very similar, as were the counts of the A and T bases. However, in the lncRNA sequences the C and G counts were lower than the A and T counts, while in the mRNA sequences the opposite was true.
When k = 3, the 3-mer fragments appearing in a sequence are {AAA, AAT, AAC, AAG, ATA, ATT, ATC, ATG, ..., GGA, GGT, GGC, GGG}. The frequency of each 3-mer segment in each sample sequence is counted in turn, and the mean values are calculated to estimate the frequency of occurrence of each 3-mer in the lncRNA and mRNA sequences. Histograms of the 3-mer frequency distributions in mRNA and lncRNA are plotted in Fig. 5a and b. As seen there, the frequency distribution of most 3-mer subsequences in the mRNA sequences fluctuated sharply: the frequencies of GCG, CGG, CGC, CGT, CGA, TCG and ACG were small, at only approximately 4 per sequence, whereas TGG, CAG, CTG, CCA, CCT, GCC and GGC were enriched, all with frequencies of approximately 40. In the lncRNA sequences, the frequency distribution of each 3-mer subsequence was relatively stable, with most distributed around 15 with little fluctuation. A few 3-mers had higher frequencies, such as AAG, AGA, AGG, TGG, CAG, CTG, CCA, CCT, GAA and GGA, exceeding 20, and only four 3-mers, ACG, TCG, CGA and CGT, had frequencies below 5.
By analysing the frequency distributions of 1-mers and 3-mers in lncRNA and mRNA sequences, the k-mer distributions of the two were found to have their own preferences. Therefore, we can use the k-mer frequency distribution of a sequence as the difference information between lncRNA and mRNA sequences.
The lncRNA and mRNA classification model based on the convolutional neural network
As shown in Fig. 6, the lncRNA and mRNA classification model includes an input part and a convolutional neural network part. The input section covers the extraction of k-mers from the sequence and their arrangement into a k-mer frequency matrix. The convolutional neural network consists of two kinds of layers. The first is the k-mer feature extraction layer: the input of each neuron is connected to the local receptive field of the previous layer, and local features are extracted. Once a local feature is extracted, its positional relationship to the other k-mer features is also determined. The second is the k-mer feature mapping layer: each computing layer of the network consists of multiple feature maps, each feature map is a plane, and all neurons on a plane share equal weights. The feature mapping structure uses the sigmoid function as the activation function of the convolutional network so that the feature maps have displacement invariance. In addition, since all neurons on a mapped surface share weights, the number of network parameters is reduced. Each convolutional layer in the network is followed by a computational layer for local averaging and secondary extraction of k-mer features. This two-stage feature extraction structure reduces the feature resolution.
Construction of the k-mer frequency matrix

Step 1: The k-mer frequency information for each sequence is counted. In this paper, the sequence is traversed in the order A, T, C, G, and the frequency of each k-mer is counted;

Step 2: The frequency of each k-mer in each sequence is normalized. That is, the frequencies p_{i} (i = 1, ..., n, with n = 4^{k}) of all k-mers in each sequence are obtained such that the sum of the k-mer frequencies in each sequence is 1;

Step 3: The frequencies of all k-mers in each sequence are arranged into a matrix of size A × B (with A × B = 4^{k}), the elements being laid out row-wise in k-mer order. For example, when k = 4, the constructed matrix is 16 × 16.
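Steps 1-3 can be sketched as a single function. The name `kmer_frequency_matrix` and the in-loop details are our illustration, assuming the fixed A/T/C/G traversal order described in Step 1.

```python
import itertools
import numpy as np

def kmer_frequency_matrix(seq, k, shape):
    """Count every k-mer in A/T/C/G order (Step 1), normalize so the
    frequencies sum to 1 (Step 2), then arrange them row-wise into an
    A x B matrix with A * B = 4**k (Step 3)."""
    order = "ATCG"
    kmers = ["".join(p) for p in itertools.product(order, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        if seq[i:i + k] in counts:     # ignore windows with ambiguous bases
            counts[seq[i:i + k]] += 1
    total = sum(counts.values()) or 1
    freqs = [counts[km] / total for km in kmers]   # normalized: sums to 1
    return np.array(freqs).reshape(shape)

# e.g. kmer_frequency_matrix(seq, 4, (16, 16)) gives the 16 x 16 input matrix
```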
Kmer screening based on relative entropy
As k increases, the number of distinct kmers of length k in a sequence grows exponentially. If k is large, the average frequency of each kmer becomes small, and numerous kmers have a frequency of zero. To reduce the computational complexity, we used relative entropy to screen the kmers.
R reflects the proportion of the kmer information that is retained. R is computed sequentially for λ from 1 to 4^{k}; if we require R ≥ 98%, the smallest λ satisfying this condition is denoted ϖ, and the first ϖ kmers are the filtered kmers.

Step 1: The occurrences of each kmer are counted separately over the lncRNA sample sequences and over the mRNA sample sequences, and the counts are normalized to frequencies, yielding frequency values for the 4^{k} kinds of kmers in each sample. For example, when k = 4, the total occurrences of the 256 4mers are counted in the lncRNA and the mRNA sample sequences respectively, and the frequency value of each 4mer is then obtained; in each of the two samples, the frequency values of the 256 4mers sum to 1. The results of this step are p_{lnc} and p_{m} in formula (1).

Step 2: From the frequency values of each kmer in the two samples obtained in step 1, the relative entropy, that is, the value of D(p_{lnc}, p_{m}), is calculated according to formula (1). R is then calculated from D(p_{lnc}, p_{m}) according to formula (3), and ϖ is taken as the value of λ at which R first reaches 98%. Sorting the contributions d_{λ} in descending order, the first ϖ kmers are the filtered kmers.

Step 3: The frequencies of the kmers screened in step 2 are counted separately for the lncRNA and mRNA sequences, and the kmer frequencies are then arranged into a matrix following the data input processing steps above.
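Formulas (1) and (3) are not reproduced in this section, so the sketch below assumes the usual form of the relative entropy, D(p_lnc, p_m) = Σ_λ d_λ with d_λ = p_lnc(λ)·log(p_lnc(λ)/p_m(λ)), and R as the cumulative ratio of the largest contributions d_λ to the total. Zero-frequency handling and tie-breaking are simplified; all names are ours:

```python
import numpy as np


def screen_kmers(p_lnc, p_m, target=0.98):
    """Select the kmers whose contributions d_lambda cover a fraction
    `target` (R >= 98%) of the total relative entropy D(p_lnc, p_m)."""
    p_lnc = np.asarray(p_lnc, dtype=float)
    p_m = np.asarray(p_m, dtype=float)
    # Per-kmer contribution d_lambda; kmers with zero lncRNA frequency contribute 0.
    with np.errstate(divide="ignore", invalid="ignore"):
        d = np.where(p_lnc > 0, p_lnc * np.log(p_lnc / p_m), 0.0)
    order = np.argsort(d)[::-1]            # d_lambda in descending order
    ratio = np.cumsum(d[order]) / d.sum()  # cumulative ratio R
    varpi = int(np.argmax(ratio >= target)) + 1  # first lambda with R >= target
    return order[:varpi]                   # indices of the filtered kmers


# Toy example with 4 kmers (frequencies in each sample sum to 1).
p_lnc = [0.6, 0.3, 0.1, 0.0]
p_m = [0.3, 0.25, 0.1, 0.35]
selected = screen_kmers(p_lnc, p_m)
print(selected)  # indices of the kmers kept after screening
```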
Convolution calculation of the kmer frequency matrix
In this paper, the convolution calculation is used to strengthen the important features in the kmer frequency matrix and to weaken the influence of irrelevant kmer features.
The first element of the feature map in Eq. (7) is the weighted sum of the first 3 × 3 local block of the input matrix M with the corresponding elements of the convolution kernel. Similarly, the second element is the weighted sum of the second 3 × 3 local block with the corresponding kernel elements. Finally, an output feature map of size 6 × 6 is obtained.
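This valid convolution (no padding, stride 1) can be reproduced with a small NumPy sketch; an 8 × 8 input with a 3 × 3 kernel gives the 6 × 6 feature map mentioned above. The input and kernel values here are illustrative only:

```python
import numpy as np


def conv2d_valid(matrix, kernel):
    """Slide the kernel over the matrix; each output element is the
    weighted sum of one local block with the kernel (no padding, stride 1)."""
    n, m = matrix.shape
    kh, kw = kernel.shape
    out = np.zeros((n - kh + 1, m - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(matrix[i:i + kh, j:j + kw] * kernel)
    return out


M = np.arange(64, dtype=float).reshape(8, 8)  # an 8 x 8 input matrix
K = np.ones((3, 3)) / 9.0                     # a 3 x 3 averaging kernel
feature_map = conv2d_valid(M, K)
print(feature_map.shape)  # (6, 6)
print(feature_map[0, 0])  # 9.0: the mean of the first 3 x 3 block of M
```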
Pooling calculation of the convolution kernel output matrix
The pooling method adopted in this paper is max pooling, which compresses the kmer features and reduces the amount of computation. The model has only one pooling layer: after the second convolutional layer, a sliding-window calculation is performed with a stride of 2.
Since 64 kmer matrices of size 6 × 6 are obtained after the second convolutional layer and the pooling window is 2 × 2, the numbers of rows and columns are halved, yielding 64 kmer matrices of size 3 × 3.
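The pooling step can be sketched for a single feature map (a minimal illustration with our own names; the window size and stride follow the text):

```python
import numpy as np


def max_pool(matrix, size=2, stride=2):
    """2 x 2 max pooling with stride 2: keep the largest value in each window."""
    n, m = matrix.shape
    out = np.zeros((n // stride, m // stride))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = matrix[i * stride:i * stride + size,
                            j * stride:j * stride + size]
            out[i, j] = window.max()
    return out


x = np.arange(36, dtype=float).reshape(6, 6)  # one 6 x 6 feature map
pooled = max_pool(x)
print(pooled.shape)  # (3, 3): rows and columns halved
print(pooled[0, 0])  # 7.0: max of the first window [[0, 1], [6, 7]]
```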
Fully connected neural network based on SoftMax function
To prevent overfitting and improve the generalization ability of the model, we drop connections between some neurons with a certain probability, so that in each training pass some neurons are not activated. After the pooling layer, the connections between the output neurons of the pooling layer and the neurons of the fully connected layer are dropped with a probability of 0.25. The output matrix of the pooling layer is flattened and connected to 128 neurons in the fully connected layer; the activation function is again the ReLU function.
From Eq. (10), if z_{j} is greater than the other z values, the value of the function f(z_{j}) approaches 1, and otherwise it approaches 0. Therefore, when the value of f(z_{j}) is 1, the input sequence is judged to be an lncRNA sequence, and when the value of f(z_{j}) is 0, the input sequence is judged to be an mRNA sequence.
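Assuming Eq. (10) is the standard softmax function, its behaviour can be illustrated with a minimal sketch (the logit values are illustrative):

```python
import numpy as np


def softmax(z):
    """f(z_j) = exp(z_j) / sum_i exp(z_i), shifted by max(z) for numerical stability."""
    e = np.exp(z - np.max(z))
    return e / e.sum()


# When one z_j is much larger than the others, f(z_j) approaches 1
# and the remaining outputs approach 0.
z = np.array([8.0, 0.5])  # illustrative logits for (lncRNA, mRNA)
f = softmax(z)
print(f)  # first output close to 1, second close to 0
print("lncRNA" if f[0] > f[1] else "mRNA")
```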
The Adadelta optimizer is used for gradient descent during training, and cross-entropy is used as the loss function.
Setting of the evaluation index in the classification model
The indicator generally used to evaluate the performance of a classification model is the classification accuracy, also known as the model accuracy. Commonly used evaluation indicators for the binary classification problem are precision and recall. In this paper, the lncRNA sequences form the positive class and the mRNA sequences the negative class. Each prediction of the classifier on the test data set is either correct or incorrect, giving four cases whose total numbers of occurrences are recorded as follows:
TP — the number of positive-class samples predicted as positive;
FN — the number of positive-class samples predicted as negative;
FP — the number of negative-class samples predicted as positive;
TN — the number of negative-class samples predicted as negative.
F_{1} is the harmonic mean of precision and recall, so it is high only when both the precision and recall rates are high [30].
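From the four counts, the evaluation indices follow directly (a standard sketch; the toy counts are illustrative):

```python
def classification_metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall and F1 from the four confusion counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)  # of predicted lncRNAs, how many are real
    recall = tp / (tp + fn)     # of real lncRNAs, how many are found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1


# Toy confusion counts: 90 lncRNAs found, 10 missed, 5 mRNAs misclassified.
acc, p, r, f1 = classification_metrics(tp=90, fn=10, fp=5, tn=95)
print(round(acc, 4))  # 0.925
print(round(p, 4))    # 0.9474
print(round(r, 4))    # 0.9
print(round(f1, 4))   # 0.9231
```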
Notes
Acknowledgements
The authors would like to thank Yebei Xu and Ling Yang for suggestions regarding this work.
Authors’ contributions
BD and XX designed the study. JW, YL, YS, and HH carried out analyses and wrote the program. JW and YL wrote the paper. All authors read and approved the final manuscript.
Funding
This work was jointly supported by the National Nature Science Foundation of China (61403288, 71871174), the Natural Science Foundation of Hubei Province, China (2019cfb589), and Fundamental Research Funds for the Central Universities (WUT: 2019IA005). The funders had no role in study design, data collection, analysis, decision to publish, or preparation of the manuscript.
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
References
1. Djebali S, Davis CA, Merkel A, et al. Landscape of transcription in human cells. Nature. 2012;489:101–8.
2. Wucher V, Legeai F, Hédan B, et al. FEELnc: a tool for long non-coding RNA annotation and its application to the dog transcriptome. Nucleic Acids Res. 2017;45(8):57–68.
3. Han SY, Liang YC, Li Y, et al. Long noncoding RNA identification: comparing machine learning based tools for long noncoding transcripts discrimination. Biomed Res Int. 2016;2016:1–14.
4. Li WS, Xiao XW, Su H, et al. The research progress of LncRNA. J Gannan Med Univ. 2017;37(3):433–7.
5. Caley DP, Pink RC, Truillano D. Long noncoding RNAs, chromatin and development. Sci World J. 2010;8(10):90–102.
6. Nagano T, Mitchell JA, Sanz LA, et al. The Air noncoding RNA epigenetically silences transcription by targeting G9a to chromatin. Science. 2008;322(5908):1717–20.
7. Wang X, Arai S, Song X, et al. Induced ncRNAs allosterically modify RNA-binding proteins in cis to inhibit transcription. Nature. 2008;454(7200):126–30.
8. Wapinski O, Chang HY. Corrigendum: long noncoding RNAs and human disease. Trends Cell Biol. 2011;21(6):354–61.
9. Kong L, Zhang Y, Ye ZQ, et al. CPC: assess the protein-coding potential of transcripts using sequence features and support vector machine. Nucleic Acids Res. 2007;35:345–9.
10. Sun L, Luo H, Bu D, et al. Utilizing sequence intrinsic composition to classify protein-coding and long non-coding transcripts. Nucleic Acids Res. 2013;41(17):166–73.
11. Dang HX. Multi-feature based long noncoding RNA recognition method. Xi'an: Xidian University; 2013.
12. Mariner PD, Walters RD, Espinoza CA, et al. Human Alu RNA is a modular transacting repressor of mRNA transcription during heat shock. Mol Cell. 2008;29(4):499–509.
13. Lin MF, Jungreis I, Kellis M. PhyloCSF: a comparative genomics method to distinguish protein coding and noncoding regions. Bioinformatics. 2011;27(13):275–82.
14. Lertampaiporn S, Thammarongtham C, Nukoolkit C, et al. Identification of non-coding RNAs with a new composite feature in the hybrid random forest ensemble algorithm. Nucleic Acids Res. 2014;42(11):93–104.
15. Wei M. Identification of long noncoding RNA and mRNA based on maximum entropy and kmer. Xi'an: Xidian University; 2015.
16. Qaisar A, Syed R, Azizuddin B, et al. A review of computational methods for finding noncoding RNA genes. Genes. 2016;7(12):113.
17. Li H, Wang Y, Xu X, et al. Short-term passenger flow prediction under passenger flow control using a dynamic radial basis function network. Appl Soft Comput. 2019;83:105620.
18. Chen Y, Wang L, Li F, et al. Air quality data clustering using EPLS method. Information Fusion. 2017;7(36):225–32.
19. Zeng H, Edwards MD, Liu G, et al. Convolutional neural network architectures for predicting DNA-protein binding. Bioinformatics. 2016;32(12):121–7.
20. Alipanahi B, Delong A, Weirauch MT, et al. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat Biotechnol. 2015;33(8):831–8.
21. Zhang Q, Zhu L, Huang DS. High-order convolutional neural network architecture for predicting DNA-protein binding sites. IEEE/ACM Trans Comput Biol Bioinform. 2019;16(4):1184–92.
22. Zhang Q, Zhu L, Bao WZ, et al. Weakly-supervised convolutional neural network architecture for predicting protein-DNA binding. IEEE/ACM Trans Comput Biol Bioinform. 2018:1–1. https://doi.org/10.1109/TCBB.2018.2864203.
23. Zhang Q, Zhu L, Huang DS. WSMD: weakly-supervised motif discovery in transcription factor ChIP-seq data. Sci Rep. 2017;7(1):3217.
24. Chuai GH, Ma HH, Yan JF, et al. DeepCRISPR: optimized CRISPR guide RNA design by deep learning. Genome Biol. 2018;19(1):80.
25. Gasri-Plotnitsky L, Ovadia A, Shamalov K, et al. A novel lncRNA, GASL1, inhibits cell proliferation and restricts E2F1 activity. Oncotarget. 2017;8(14):23775–86.
26. Chou KC, Shen HB. Recent advances in developing web-servers for predicting protein attributes. Nat Sci. 2009;1:63–92.
27. Chou KC. Impacts of bioinformatics to medicinal chemistry. Med Chem. 2015;11:218–34.
28. Chou KC. An unprecedented revolution in medicinal chemistry driven by the progress of biological science. Curr Top Med Chem. 2017;17:2337–58.
29. Chen X. Biological classification based on kmer frequency statistics. Changchun: Jilin University; 2011.
30. Li H. Statistical learning methods. Beijing: Peking University Press; 2012. p. 18–9.
Copyright information
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.