Exploring Genome-Wide Expression Profiles Using Machine Learning Techniques
Contemporary high-throughput –omics methods produce high-dimensional data, but the resulting wealth of information is difficult to assess using traditional statistical procedures. Machine learning methods facilitate the detection of patterns beyond the mere identification of lists of features that differ between groups.
Here, we demonstrate the utility of (1) supervised classification algorithms in class validation, and (2) unsupervised clustering in class discovery. Using data from our previous work, which described the transcriptional profiles of gingival tissue samples obtained from subjects with chronic or aggressive periodontitis, we (1) test whether the two diagnostic entities are also characterized by differences at the molecular level, and (2) search for a novel, alternative classification of periodontitis based on the tissue transcriptomes.
Using machine learning technology, we provide evidence for diagnostic imprecision in the currently accepted classification of periodontitis, and demonstrate that a novel, alternative classification based on differences in gingival tissue transcriptomes is feasible. The outlined procedures allow for the unbiased interrogation of high-dimensional datasets for characteristic underlying classes, and are applicable to a broad range of –omics data.
Key words: Periodontal disease, Aggressive periodontitis, Chronic periodontitis, Gene expression, Transcriptome, Gingiva, Classification, Machine learning
This work was supported by grants from the German Society for Periodontology (DG PARO) and the German Society for Oral and Maxillo-Facial Sciences (DGZMK) to M.K., and by grants from NIH/NIDCR (DE015649 and DE024735) and by an unrestricted gift from Colgate-Palmolive Inc. to P.N.P. The authors thank Prof. Anne-Laure Boulesteix (Munich, Germany) and Prof. Bettina Grün (Linz, Austria) for their support with the CMA and flexmix packages, respectively.