Abstract
Mass spectrometry (MS)-based metabolomics studies often require handling both identified and unidentified metabolite data. To avoid bias in data interpretation, the analysis should include all available data. A practical challenge in exploratory metabolomics analysis is therefore how to interpret changes related to unidentified peaks. In this paper, we address this challenge by predicting the class membership of unknown peaks, applying and comparing multiple supervised classifiers on selected lipidomics datasets. The classifiers employed are k-nearest neighbours (k-NN), support vector machines (SVM), partial least squares discriminant analysis (PLS-DA) and Naive Bayes, all of which are known to be effective and efficient at predicting labels for unseen data. Here, class label predictions are sought for unidentified lipid profiles obtained from high-throughput global screening in an Ultra Performance Liquid Chromatography Mass Spectrometry (UPLC™/MS) experimental setup. Our investigation reveals that the k-NN and SVM classifiers outperform both PLS-DA and Naive Bayes. The Naive Bayes classifier performs worst of all the models, which seems logical: lipids are highly co-regulated and therefore violate the Naive Bayes assumption that features are conditionally independent given the class. Label predictions on which k-NN and SVM agree can serve as a good starting point for exploring the full data, thereby facilitating exploratory studies where label information is critical for data interpretation.
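To make the idea of supervised class prediction for unidentified peaks concrete, the following is a minimal sketch in Python, not the authors' implementation. It assumes synthetic data in which each lipid is described by a short intensity profile, and it compares only two of the four classifier families named above (k-NN and a Gaussian Naive Bayes) using leave-one-out validation. The class labels "TG" and "PC", all function names, the choice of k = 3, and the data generator are hypothetical and chosen for illustration only.

```python
import math
import random
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k training profiles closest to x (Euclidean)."""
    nearest = sorted(
        (math.dist(row, x), label) for row, label in zip(train_X, train_y)
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def fit_gaussian_nb(train_X, train_y):
    """Per class: prior plus per-feature mean and variance (features treated as independent)."""
    model = {}
    for label in sorted(set(train_y)):
        rows = [row for row, y in zip(train_X, train_y) if y == label]
        columns = list(zip(*rows))
        means = [sum(col) / len(col) for col in columns]
        variances = [
            sum((v - m) ** 2 for v in col) / len(col) + 1e-9  # small floor avoids division by zero
            for col, m in zip(columns, means)
        ]
        model[label] = (len(rows) / len(train_y), means, variances)
    return model


def nb_predict(model, x):
    """Pick the class with the highest log prior plus summed per-feature log density."""
    def log_posterior(label):
        prior, means, variances = model[label]
        return math.log(prior) + sum(
            -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
            for v, m, s2 in zip(x, means, variances)
        )
    return max(model, key=log_posterior)


def loo_accuracy(X, y, predict):
    """Leave-one-out accuracy; predict(train_X, train_y, x) must return a label."""
    hits = sum(
        predict(X[:i] + X[i + 1:], y[:i] + y[i + 1:], X[i]) == y[i]
        for i in range(len(X))
    )
    return hits / len(X)


def make_profiles(rng, centre, n, n_features=4):
    """Synthetic profiles: a shared per-lipid shift makes the features correlated."""
    profiles = []
    for _ in range(n):
        shared = rng.gauss(0.0, 1.0)
        profiles.append(
            [centre + shared + rng.gauss(0.0, 0.3) for _ in range(n_features)]
        )
    return profiles


rng = random.Random(0)  # fixed seed so the synthetic data are reproducible
X = make_profiles(rng, 0.0, 20) + make_profiles(rng, 10.0, 20)
y = ["TG"] * 20 + ["PC"] * 20  # hypothetical class labels

knn_acc = loo_accuracy(X, y, knn_predict)
nb_acc = loo_accuracy(X, y, lambda tX, ty, x: nb_predict(fit_gaussian_nb(tX, ty), x))
print(f"k-NN leave-one-out accuracy: {knn_acc:.2f}")
print(f"Naive Bayes leave-one-out accuracy: {nb_acc:.2f}")

# An "unidentified" profile receives the majority label of its nearest identified neighbours.
unknown = [9.5, 10.2, 9.8, 10.1]
print("Predicted class for unknown peak:", knn_predict(X, y, unknown))
```

The two synthetic classes are deliberately well separated, so this sketch shows only the mechanics of training on identified lipids and labelling an unknown one; it says nothing about the relative performance reported in the abstract. SVM and PLS-DA are omitted to keep the example free of external dependencies.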
Acknowledgments
This project was supported by the Academy of Finland (Decision # 111338).
Cite this article
Yetukuri, L., Tikka, J., Hollmén, J. et al. Functional prediction of unidentified lipids using supervised classifiers. Metabolomics 6, 18–26 (2010). https://doi.org/10.1007/s11306-009-0179-x
DOI: https://doi.org/10.1007/s11306-009-0179-x