Probabilistic topic modeling for the analysis and classification of genomic sequences
Studies on genomic sequences for classification and taxonomic identification have a leading role in the biomedical field and in the analysis of biodiversity. These studies focus on the so-called barcode genes, which represent a well-defined region of the whole genome. Recently, alignment-free techniques have been gaining importance because they are able to overcome the drawbacks of sequence alignment techniques. In this paper a new alignment-free method for DNA sequence clustering and classification is proposed. The method is based on k-mer representation and text mining techniques.
The presented method is based on Probabilistic Topic Modeling, a statistical technique originally proposed for text documents. Probabilistic topic models are able to find in a document corpus the topics (recurrent themes) characterizing classes of documents. This technique, applied to DNA sequences representing the documents, exploits the frequency of fixed-length k-mers and builds a generative model for a training group of sequences. This generative model, obtained through the Latent Dirichlet Allocation (LDA) algorithm, is then used to classify a large set of genomic sequences.
Results and conclusions
We performed classification of over 7000 16S DNA barcode sequences taken from the Ribosomal Database Project (RDP) repository, training probabilistic topic models. The proposed method is compared to the RDP tool and to the Support Vector Machine (SVM) classification algorithm in an extensive set of trials using both complete sequences and short sequence snippets (from 400 bp to 25 bp). Our method reaches results very similar to those of the RDP classifier and SVM for complete sequences. The most interesting results are obtained when short sequence snippets are considered. In these conditions the proposed method outperforms RDP and SVM with ultra short sequences and it exhibits a smooth decrease in performance, at every taxonomic level, when the sequence length is decreased.
Keywords: Probabilistic topic model ultra short sequence classification LDA
The study of genomic sequences for classification and taxonomic purposes has a leading role both in microbial identification, with important consequences in the biomedical field, and in the classification of living species such as animals or plants, for studies about the biodiversity of different ecosystems. These kinds of analysis are carried out focusing only on a well-defined region of the genome, usually referred to as barcode genes: for example the 16S rRNA gene for bacteria, and the cytochrome c oxidase I (COI) gene for animals. The first computational approaches with these data were based on sequence alignments and sequence similarities, obtained through the evolutionary distances, with already identified genomic sequences. More recently, novel machine learning and data mining methodologies have been developed. For example, clustering algorithms, which are unsupervised techniques able to find groups of similar objects, have been applied for the identification of the taxonomic rank of bacterial isolates. The aim of this approach was to find a correlation between clusters and collections of bacteria belonging to the same taxon (taxonomic category). Clustering techniques have been used considering similarity among gene sequences expressed both in terms of classic evolutionary models [6, 7], and in terms of compression-based models [8, 9], which derive their theoretical assumptions from the information theory concept of the Universal Similarity Metric. The compression-based approaches have also been adopted for the study of phylogenetic relationships among animal species, considering the barcode COI gene [11, 12].
Recent alignment-free computational approaches consider genomic sequences as a collection of k-mers. A k-mer is a small fragment of a DNA string of size k. In the bioinformatics domain, a k-mer representation has been used in many works. For example, a deep analysis of k-mer spectra has been carried out; a vector representation of DNA sequences using k-mers has been adopted for classification tasks using Support Vector Machines (SVM) in [15, 16], and using the Neural Gas algorithm; k-mer occurrences in genomic sequences have been considered for training a Naive Bayesian classifier [19, 20]. Two of the most accurate sequence classifiers that adopt a k-mer representation are the RDP classifier and the Simrank algorithm. The RDP tool trains a Naive Bayesian classifier using as input data the frequency of occurrence of k-mers in a 16S gene dataset; the fitted probabilistic model is then able to predict the taxonomic label of an unknown (unlabeled) sequence. The Simrank tool is a search algorithm that employs a k-mer representation in order to speed up the sequence similarity searches between an unknown query sequence and a repository of tagged 16S genomic strings.
In this work we propose a new computational method for sequence classification based on k-mer representation and text mining techniques. If we consider DNA sequences as documents and the related k-mers as words, it is possible to extract the most recurrent themes, or topics, shared by the corpus of sequences. Since similar text documents about specific issues, like economy or biology, share the same topics, our thesis is that sequences sharing the same most recurring themes (topics) have strong similarities among them and belong to the same taxonomic rank. For this reason, our approach is based on the probabilistic topic modeling methodology, usually adopted for identification and classification of text documents. Probabilistic topic models, in fact, are algorithms that, given a set of text documents called a corpus, extract a group of probability distributions over the words in the documents, i.e. the topics. Our aim is then to learn a probabilistic topic model using this representation, in order to extract the most probable topics from the DNA corpus. The extracted topics will be used to classify unknown test sequences. Apart from text documents, topic models have also been adopted for the analysis of image, audio and music data. In image processing, it is assumed that similar collections of images share the same visual patterns (representing the topics). In this way topic modeling has been applied, for example, for image classification, for building image hierarchies and for linking captions and images. In order to infer musical key-profiles of classical music, music files have been considered as text documents, musical notes as words and musical key-profiles as topics.
Topic modeling has also been used for audio information retrieval, where the authors adopted Latent Dirichlet Allocation (LDA) as the topic model and considered one of the parameters of the fitted model (namely the posterior Dirichlet parameter) as a feature vector in order to perform classification by means of the SVM algorithm. In bioinformatics, topic models have been applied to genomic data in order to find the topics, representing a genetic signature, belonging to a population with a shared ancestral parent. Moreover, the authors of [31, 32] applied the probabilistic Latent Semantic Analysis (pLSA) topic model in order to predict annotations of Gene Ontology (GO) terms using only the previously available GO annotations.
We carried out experiments on a rich bacterial dataset, more than 7000 sequences, also including ultra-short sequences (length ≤ 50 bp), in order to consider the robustness of the proposed approach with respect to sequence length. Classification results were compared with the ones provided by the RDP classifier and the SVM classifier.
The rest of the paper is structured as follows: the Methods section reports the computational tools used in the paper, with a focus on the probabilistic topic model adopted and our document paradigm for DNA sequences; the Results and discussion section presents the datasets used and the classification results; finally the conclusions are drawn.
In this Section we present our computational approaches to the analysis and classification of genomic sequences. After a brief description of probabilistic topic models, we formalize our document paradigm for gene sequences, then we explain our experimental pipelines both for the training and the testing phases.
Probabilistic topic models
In a probabilistic topic model, the probability of a word in a document is expressed as a mixture over the topics:
\[ P(w_i) = \sum_{j=1}^{T} P(w_i \mid z = z_j)\, P(z = z_j) \tag{1} \]
where $P(w_i)$ is the probability of the word $w_i$ in a given document; $P(z = z_j)$ is the probability of choosing a word from topic $z_j$ for the current document; $P(w_i \mid z = z_j)$ is the probability of sampling the word $w_i$, given the topic $z_j$; $T$ is the number of topics. Given the words in a corpus of documents, which represent the observable variables, a probabilistic topic model is learned by estimating the topic distribution per document and the word distribution per topic, which represent the hidden variables. The number $T$ of topics is a model parameter and it has to be fixed a priori. There are several algorithms used to infer a probabilistic topic model. One of the earliest topic models is the Probabilistic Latent Semantic Analysis (pLSA) algorithm. In pLSA, each document is represented as a set of mixing proportions among the topics, but no generative probabilistic model is defined for these proportions. That means that it is not possible to assign a topic distribution to documents not belonging to the training set. Because of that, in our work we selected Latent Dirichlet Allocation (LDA) as the probabilistic topic model. LDA is one of the simplest algorithms to infer the topic distributions from the generative document model defined in Eq. 1 and, unlike pLSA, it provides a fitted model that is able to assign a topic distribution to test documents (i.e. documents not belonging to the corpus used to train the model) by computing their posterior probability, defined as the conditional distribution of topics given the words in the document. The generative model introduced by LDA is defined as follows. $P(w \mid z)$ is represented as a set of $T$ multinomial distributions $\phi$ over all the $W$ unique words of the joint set of documents: $P(w \mid z = z_j) = \phi^{(j)}$. $P(z)$ is represented as a set of $D$ multinomial distributions $\theta$ over the $T$ topics, where $D$ is the number of documents $d$ in the corpus: $P(z = z_j) = \theta^{(d)}$.
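Documents are then generated by first selecting a distribution over topics $\theta$ from a Dirichlet distribution. The words in the document are generated by selecting a topic $z_j$ from this distribution and then by selecting a word from this topic, using the distribution $P(w \mid z = z_j)$, which is itself drawn from another Dirichlet distribution. More formally, LDA's generative model can be summarized in the following steps:
1 For every topic $z_j$, select a word distribution $\phi^{(j)} \approx \mathrm{Dirichlet}(\beta)$.
2 For every document $d$, select a topic distribution $\theta^{(d)} \approx \mathrm{Dirichlet}(\alpha)$.
3 For every word $w_i$ in document $d$:
Select a topic $z_j \approx \mathrm{Multinomial}(\theta^{(d)})$.
Select a word $w_i$ from a multinomial probability distribution given the topic $z_j$: $p(w_i \mid z_j, \phi)$.
where $\approx$ means "is distributed as", and $\alpha$ and $\beta$ denote the hyperparameters of the two Dirichlet distributions.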
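The generative process can be made concrete with a short simulation. The following is only a minimal sketch under stated assumptions: symmetric Dirichlet priors, the hyperparameter names `alpha` and `beta`, the toy vocabulary and the function names `sample_dirichlet` and `generate_corpus` are all illustrative choices, not part of the method's actual implementation.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw one sample from a Dirichlet distribution via normalised Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(n_docs, doc_len, vocabulary, n_topics,
                    alpha=0.5, beta=0.5, seed=0):
    """Simulate documents from the LDA generative model (illustrative only)."""
    rng = random.Random(seed)
    n_words = len(vocabulary)
    # One word distribution phi(j) per topic, drawn from a Dirichlet prior.
    phi = [sample_dirichlet([beta] * n_words, rng) for _ in range(n_topics)]
    thetas, docs = [], []
    for _ in range(n_docs):
        # One topic distribution theta(d) per document, drawn from a Dirichlet prior.
        theta = sample_dirichlet([alpha] * n_topics, rng)
        words = []
        for _ in range(doc_len):
            # Pick a topic from theta, then a word from that topic's distribution.
            z = rng.choices(range(n_topics), weights=theta)[0]
            words.append(rng.choices(vocabulary, weights=phi[z])[0])
        thetas.append(theta)
        docs.append(words)
    return phi, thetas, docs


# Toy vocabulary of 2-mers (hypothetical, for illustration only).
phi, thetas, docs = generate_corpus(3, 10, ["AC", "CG", "GT", "TA"], n_topics=2)
```

Fitting a model reverses this process: the words are observed, and the distributions `phi` and `thetas` are the hidden quantities to be estimated.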
More complex topic models, like the Pachinko Allocation Model (PAM) and Hierarchical Dirichlet Processes (HDP), were not taken into account. PAM is able to find correlations between topics. In our work, however, we are not interested in inter-topic correlation because we suppose that topics, related to taxonomic ranks in our framework, are independent of each other. HDP is an extended version of LDA, since it estimates the number of topics. In this work, as explained in the Results and discussion section, we are also interested in how classification results vary depending on the number of topics. For this reason we prefer the LDA model, because it allows us to select a priori the number of topics for our experiments.
Document paradigm for gene sequences
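As a minimal sketch of the document paradigm, in which each DNA sequence is treated as a document and its k-mers as words, the snippet below builds a k-mer count matrix for a toy corpus. The use of overlapping k-mers with a sliding window of step 1, the function names and the toy sequences are assumptions made for illustration.

```python
from collections import Counter


def kmer_words(sequence, k=8):
    """Return the overlapping k-mers (the 'words') of one sequence (the 'document')."""
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_counts(sequence, k=8):
    """Bag-of-words view of a sequence: k-mer -> number of occurrences."""
    return Counter(kmer_words(sequence, k))


def document_term_matrix(sequences, k=8):
    """Build a dense count matrix (rows: sequences, columns: sorted vocabulary)."""
    counts = [kmer_counts(s, k) for s in sequences]
    vocabulary = sorted(set().union(*counts)) if counts else []
    matrix = [[c[w] for w in vocabulary] for c in counts]
    return vocabulary, matrix


# Toy sequences, not real 16S data; k=3 keeps the example readable.
corpus = ["ACGTACGT", "ACGTTTGA"]
vocab, dtm = document_term_matrix(corpus, k=3)
```

A matrix of this kind is the usual input of a topic model fitting routine: each row plays the role of a document and each column the role of a word.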
Finding topics in gene sequences
Using the document paradigm described in the "Document paradigm for gene sequences" section, we applied probabilistic topic models to a corpus of nucleotide sequences in order to extract the topics by means of the LDA algorithm. We aim at demonstrating that similar sequences share the same group of most probable topics, so that, if it is possible to assign a taxonomic label to those topics, we are able to classify the sequences with respect to their topic distributions. Moreover, using a fitted model, we can also predict the taxonomic rank of an unknown sequence, considering the label of its most probable topic. The methodologies adopted to assign a label to the topics and to find the most probable topic of a test sequence are described in the following Sections.
where T is the number of topics, N is the number of sequences in the training set and P (z|d) is the topic distribution for document d.
where is the label of sequence d i ; R is the number of sequences belonging to topic z j ; f (d i , d k ) is a function that is equal to 1 if the label of sequences d i and d k are the same, 0 otherwise.
At the end of the training phase, we then obtain a fitted probabilistic topic model and a set of topics representing the taxonomic ranks of the input DNA corpus.
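The labeling and prediction steps can be illustrated with a small sketch. It assumes that each training sequence is assigned to its most probable topic and that each topic takes the most frequent label among the sequences assigned to it; this majority vote is one plausible reading of the description, not a confirmed reproduction of the method, and the function names and toy values are hypothetical.

```python
from collections import Counter, defaultdict


def most_probable_topic(topic_distribution):
    """Index of the topic with the highest probability P(z|d)."""
    return max(range(len(topic_distribution)),
               key=lambda j: topic_distribution[j])


def label_topics(train_topic_distributions, train_labels):
    """Give each topic the most frequent label among its training sequences."""
    members = defaultdict(list)
    for dist, label in zip(train_topic_distributions, train_labels):
        members[most_probable_topic(dist)].append(label)
    return {topic: Counter(labels).most_common(1)[0][0]
            for topic, labels in members.items()}


def classify(test_topic_distribution, topic_labels):
    """Predict a label from the most probable topic (None if it has no label)."""
    return topic_labels.get(most_probable_topic(test_topic_distribution))


# Toy topic distributions and class labels, for illustration only.
train_dists = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
train_labels = ["Bacilli", "Bacilli", "Clostridia", "Clostridia"]
topic_labels = label_topics(train_dists, train_labels)
predicted = classify([0.2, 0.8], topic_labels)
```

In the toy data the last training sequence falls into a topic dominated by another class, which is the kind of case that lowers precision at test time.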
\[ \mathrm{Precision} = \frac{TP}{TP + FP} \tag{7} \]
where true positives (TP) are correctly classified test sequences, that is, test sequences whose label matches the label of their most probable topic, whereas false positives (FP) are misclassified test sequences.
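The precision score, computed as TP divided by TP plus FP, can be sketched as follows. The function name and the toy labels are illustrative assumptions; every test sequence is assumed to receive exactly one predicted label.

```python
def precision(true_labels, predicted_labels):
    """Fraction of test sequences whose predicted label equals the true label."""
    tp = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    fp = len(predicted_labels) - tp  # every other prediction is a misclassification
    return tp / (tp + fp) if (tp + fp) else 0.0


# Toy labels: three of the four predictions are correct.
score = precision(["a", "b", "c", "d"], ["a", "b", "x", "d"])
```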
Results and discussion
In this Section we present the 16S bacteria dataset used and we describe both the experimental settings and the results obtained using the probabilistic topic modeling approach for sequence classification. Our results are compared with two other algorithms used for sequence classification: the RDP classifier and the support vector machine classifier.
We evaluated our approach for gene sequence classification considering bacteria species. For classification and taxonomic studies of bacteria, only a limited part of the genome, about 1200-1400 bp, is usually considered, namely the housekeeping 16S rRNA gene. In our study we arranged a 16S dataset by downloading the gene sequences from the Ribosomal Database Project (RDP) repository, release 10.32. We chose the four richest phyla, Actinobacteria, Bacteroidetes, Firmicutes and Proteobacteria, and, in order to retain a good quality dataset, we selected the 16S sequences that satisfy the following constraints:
1 type strain, representing reference specimen;
2 size ≥ 1200 bp, considering this way full gene sequences;
3 good quality, according to the quality parameters provided by the RDP repository;
4 NCBI taxonomy, i.e. sequences are labeled with the NCBI taxonomic nomenclature.
Table 1. Main features of the 16S bacteria dataset.
The experiments proposed in this paper, aimed at validating the probabilistic topic modeling approach, represent an expansion and an in-depth analysis of our previous work. There, with a smaller dataset of 3000 sequences, we carried out a series of trials, using a tenfold cross-validation procedure, in order to test how the classification results varied with regard to the number of topics and the dataset composition. With k-mer size = 8, we obtained global results ranging from 99% precision at the phylum taxonomic level to 80% at the family level. In all cases, we noticed that the best scores were reached only when the number of topics fixed a priori was at least equal to the number of different categories of the input dataset. For example, if we want to classify our dataset at order level, we have to train a topic model with a number of topics equal to or greater than the number of orders. Of course, only in an ideal situation does the number of topics match the number of categories exactly; in fact, in our previous study we obtained better results with a larger number of topics, about two times the number of categories, corresponding to a situation in which each class covers, on average, two most probable topics. In this work, we enriched that experimental pipeline, first of all by taking into account a bigger dataset consisting of 7856 gene sequences, described in the "Dataset used" section. Moreover, in order to tune the choice of the number of topics, the probabilistic topic models were trained in a hierarchical way. That means we fitted a different topic model at each taxonomic level, for each of the four phyla. Considering the Firmicutes phylum, for instance, in order to classify at class level, we trained a model on a training set composed of all the Firmicutes sequences.
In order to classify at order level, we trained a different topic model for each of the four classes of the Firmicutes phylum (see Table 1 for the number of categories of our bacteria dataset), and so on. As a general rule, we considered, for each topic model, a number of topics equal to one time and two times the number of lower categories: if one class has four orders, for that class we trained a topic model with four and with eight topics. Once again, all the tests were carried out by means of a tenfold cross-validation procedure.
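The rule for choosing the number of topics in the hierarchical setting can be sketched as below. The taxonomy dictionary, its placeholder names and the function name `topic_counts` are hypothetical; only the "one time and two times the number of lower categories" rule comes from the text.

```python
def topic_counts(children_by_taxon):
    """Map each parent taxon to the two topic counts tried for its model:
    one time and two times the number of its lower-level categories."""
    return {taxon: (len(children), 2 * len(children))
            for taxon, children in children_by_taxon.items()}


# Hypothetical class with four orders (placeholder names, not real taxa).
taxonomy = {"class_A": ["order_1", "order_2", "order_3", "order_4"]}
counts = topic_counts(taxonomy)
```

One topic model would then be trained per parent taxon and per topic count, using only the sequences belonging to that taxon as the training corpus.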
Unlike our previous work, in this paper we also evaluate the robustness and the generalization ability of our approach with respect to sequence length. For this reason, we also tested our method with short sequences, considering sequence fragments of 400, 200, 100, 50, 40 and 25 bp. In this case we submit to the testing workflow a fragment of length f (with f = 25, 40, 50 and so on) randomly extracted from the full length sequence and we consider the output classification. A robust classifier able to correctly predict the taxonomic rank of small DNA fragments is of fundamental importance in metagenomics applications, where genetic sequences are mainly extracted from environmental species and in many cases only ultra short sequences, with size ≤ 50 bp, are available.
Classification results, in terms of precision scores (Eq. 7), were compared with two other sequence classifiers: the RDP classifier and the SVM classifier. The former consists of a naive Bayesian classifier trained on a k-mer representation of the sequences; the latter works on a vector representation of the gene sequences obtained by counting k-mer occurrences. We adopted the SVM implementation provided by the R package e1071, which offers a simple interface to the well-known LIBSVM library. SVM was run with default parameters and a Gaussian radial basis kernel.
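Extracting a random fragment of length f from a full length sequence can be sketched as follows. The uniform choice of the start position, the behaviour for sequences shorter than f and the function name are assumptions made for illustration.

```python
import random


def random_fragment(sequence, f, rng=random):
    """Return a contiguous fragment of length f starting at a random position.
    Sequences not longer than f are returned unchanged."""
    if f >= len(sequence):
        return sequence
    start = rng.randrange(len(sequence) - f + 1)
    return sequence[start:start + f]


# Toy sequence of 40 bp, not real 16S data; a seeded generator keeps runs repeatable.
full_sequence = "ACGTTGCAAGCTTAGGCTAACGTTAGCCATGCATGCAAGT"
fragment = random_fragment(full_sequence, 25, random.Random(42))
```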
In this paper we presented a novel computational approach for gene sequence classification. Using probabilistic topic models, mainly adopted in text mining applications, we developed a pipeline that, by means of the Latent Dirichlet Allocation algorithm, is able to learn a probabilistic topic model from a dataset of 16S gene sequences. Considering each genomic sequence as a document, our goal is to extract the topics, that is, recurring meaningful themes, from the training sequence dataset. On the basis of their topic distributions, our aim is to demonstrate that sequences sharing the same groups of highly probable topics belong to the same taxonomic ranks. Classification results, in terms of precision scores, have been compared with the RDP classifier, representing a state-of-the-art sequence classifier, and with the SVM general-purpose classifier. Experiments were carried out at different taxonomic levels, from phylum to family, and for different sequence sizes, from full length down to 25 bp. The results show that, for full length sequences, our approach proved very reliable, reaching precision scores very close to those of RDP and SVM, within a 10% spread, at every taxonomic level. The most interesting results were obtained considering the robustness and generalization ability of our method with regard to short sequences (from 400 bp to 25 bp): our approach demonstrated high robustness, with a smooth decrease in performance when applied to the classification of ultra short sequences. In the near future, we want to further validate our approach by considering noisy sequences, i.e. sequences that are "not good" according to the RDP repository parameters, and by taking into account sequence fragments extracted from different parts of the original sequences.
Noisy sequences are interesting because, for example, in the case of environmental species it is possible to obtain degraded sequences. The study of several fragments of the same input sequence can allow us to understand which part of the original sequence carries the most informative content.
The publication costs for this article were funded by the CNR Interomics Flagship Project "- Development of an integrated platform for the application of "omic" sciences to biomarker definition and theranostic, predictive and diagnostic profiles".
This article has been published as part of BMC Bioinformatics Volume 16 Supplement 6, 2015: Selected articles from the 10th International Meeting on Computational Intelligence Methods for Bioinformatics and Biostatistics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/16/S6.
- 5. Nei M, Kumar MD: Molecular Evolution and Phylogenetics. 2000, Oxford University Press, New York.
- 6. La Rosa M, Di Fatta G, Gaglio S, Giammanco G, Rizzo R, Urso A: Soft topographic map for clustering and classification of bacteria. Advances in Intelligent Data Analysis VII Lecture Notes in Computer Science. 2007, Springer, Berlin, Heidelberg, 4723: 332-343.
- 7. La Rosa M, Rizzo R, Urso A: Soft Topographic Maps for Clustering and Classifying Bacteria Using Housekeeping Genes. Advances in Artificial Neural Systems. 2011, 2011:
- 8. La Rosa M, Rizzo R, Urso A, Gaglio S: Comparison of genomic sequences clustering using normalized compression distance and evolutionary distance. Knowledge-Based Intelligent Information and Engineering Systems Lecture Notes in Computer Science. 2008, Springer, Berlin, Heidelberg, 5179: 740-746.
- 11. La Rosa M, Fiannaca A, Rizzo R, Urso A: A Study of Compression-Based Methods for the Analysis of Barcode Sequences. Computational Intelligence Methods for Bioinformatics and Biostatistics Lecture Notes in Computer Science. Edited by: Peterson, L.E., Masulli, F., Russo, G. 2013, Springer, Berlin, Heidelberg, 7845: 105-116. 10.1007/978-3-642-38342-7_10.
- 14. Scholkopf B, Smola AJ: Learning with Kernels. 2002, MIT Press, Cambridge.
- 18. Fiannaca A, La Rosa M, Rizzo R, Urso A: Analysis of DNA Barcode Sequences Using Neural Gas and Spectral Representation. Engineering Applications of Neural Networks Communications in Computer and Information Science. Edited by: Iliadis, L., Papadopoulos, H., Jayne, C. 2013, Springer, Berlin, Heidelberg, 384: 212-221. 10.1007/978-3-642-41016-1_23.
- 24. Steyvers M, Griffiths T: Probabilistic Topic Models. Handbook of Latent Semantic Analysis. Edited by: Landauer, T., McNamara, D.S., Dennis, S., Kintsch, W. 2007, Erlbaum, Hillsdale, NJ.
- 25. Perona P: A Bayesian Hierarchical Model for Learning Natural Scene Categories. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05). 2005, IEEE, 2: 524-531.
- 27. Blei DM, Jordan MI: Modeling annotated data. Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval - SIGIR '03. 2003, ACM Press, New York, New York, USA, 127:
- 28. Hu DJ, Saul LK: A probabilistic topic model for unsupervised learning of musical key-profiles. 10th International Society for Music Information Retrieval Conference (ISMIR 2009). 2009, 441-446.
- 36. Blei DM, Ng AY, Jordan MI: Latent Dirichlet Allocation. J Mach Learn Res. 2003, 3: 993-1022.
- 39. Grun B, Hornik K: topicmodels: An R Package for Fitting Topic Models. Journal of Statistical Software. 2011, 40 (13):
- 40. Casella G, George EI: Explaining the Gibbs Sampler. The American Statistician. 1992, 46 (3): 167-174.
- 41. Cole JR, Wang Q, Cardenas E, Fish J, Chai B, Farris RJ, Kulam-Syed-Mohideen AS, McGarrell DM, Marsh T, Garrity GM, Tiedje JM: The Ribosomal Database Project: improved alignments and new tools for rRNA analysis. Nucleic Acids Research. 2009, 37 (Database issue): D141-D145.
- 43. La Rosa M, Fiannaca A, Rizzo R, Urso A: Genomic Sequence Classification using Probabilistic Topic Modeling. Computational Intelligence Methods for Bioinformatics and Biostatistics Lecture Notes in Computer Science. 2014, Springer, Berlin, Heidelberg, 8452: 49-61. 10.1007/978-3-319-09042-9_4.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.