Wrapper- and Ensemble-Based Feature Subset Selection Methods for Biomarker Discovery in Targeted Metabolomics

  • Holger Franken
  • Rainer Lehmann
  • Hans-Ulrich Häring
  • Andreas Fritsche
  • Norbert Stefan
  • Andreas Zell
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7036)

Abstract

The discovery of markers that allow accurate classification of metabolically very similar proband groups is a challenging problem. We apply several search heuristics, combined with different classifier types, to targeted metabolomics data to identify compound subsets that classify plasma samples of insulin-sensitive and insulin-resistant subjects, both groups suffering from non-alcoholic fatty liver disease. Additionally, we integrate these methods into an ensemble and screen the selected subsets for common features. We investigate which methods appear most suitable for the task and test feature subsets for robustness and reproducibility. Furthermore, we consider the predictive potential of different compound classes. We find that classifiers fail to discriminate accurately on the full, unselected data but benefit considerably from feature subset selection. In particular, a Pareto-based multi-objective genetic algorithm detects highly discriminative subsets and outperforms widely used heuristics. When transferred to new data, feature sets assembled by the ensemble approach show greater robustness than those selected by single methods.
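
The ideas named in the abstract (wrapper-based subset search, Pareto-based comparison of subsets, and ensemble screening for common features) can be illustrated with a minimal sketch. It is not the authors' implementation. The synthetic data, the nearest-centroid classifier, the single-flip hill climber, the two Pareto objectives (accuracy and subset size), and all function names are assumptions chosen only to keep the example self-contained.

```python
import random
from collections import Counter

def make_data(n_per_class=20, n_features=8, informative=(0, 3), seed=0):
    """Synthetic two-class data (assumption): only `informative` columns differ."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
            for j in informative:
                row[j] += 2.5 * label  # class shift on informative features only
            X.append(row)
            y.append(label)
    return X, y

def loo_accuracy(X, y, subset):
    """Leave-one-out accuracy of a nearest-centroid classifier on `subset`."""
    if not subset:
        return 0.0
    cols = sorted(subset)
    correct = 0
    for i in range(len(X)):
        dist = {}
        for label in set(y):
            rows = [X[k] for k in range(len(X)) if k != i and y[k] == label]
            centroid = [sum(r[j] for r in rows) / len(rows) for j in cols]
            dist[label] = sum((X[i][j] - c) ** 2 for j, c in zip(cols, centroid))
        correct += min(dist, key=dist.get) == y[i]
    return correct / len(X)

def hill_climb(X, y, n_features, seed, max_steps=50):
    """Wrapper search: flip one feature in or out while the score improves."""
    rng = random.Random(seed)
    current = frozenset(j for j in range(n_features) if rng.random() < 0.5)
    score = loo_accuracy(X, y, current)
    for _ in range(max_steps):
        neighbours = [current ^ {j} for j in range(n_features)]
        best = max(neighbours, key=lambda s: loo_accuracy(X, y, s))
        best_score = loo_accuracy(X, y, best)
        if best_score <= score:
            break  # local optimum: no single flip improves the score
        current, score = best, best_score
    return current, score

def pareto_front(candidates):
    """Keep (subset, accuracy) pairs not dominated on accuracy and subset size."""
    def dominates(a, b):
        return (a[1] >= b[1] and len(a[0]) <= len(b[0])
                and (a[1] > b[1] or len(a[0]) < len(b[0])))
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]

def ensemble_common_features(X, y, n_features, seeds, min_votes):
    """Run several searches and keep features selected at least `min_votes` times."""
    votes = Counter()
    for seed in seeds:
        subset, _ = hill_climb(X, y, n_features, seed)
        votes.update(subset)
    return sorted(j for j, v in votes.items() if v >= min_votes), votes

if __name__ == "__main__":
    X, y = make_data()
    runs = [hill_climb(X, y, 8, seed) for seed in range(5)]
    print("Pareto-optimal runs:", pareto_front(runs))
    print("Common features:", ensemble_common_features(X, y, 8, range(5), 3)[0])
```

In this sketch the classifier's cross-validated accuracy is the search objective, which is what makes the selection a wrapper method rather than a filter. The vote threshold `min_votes` is a hypothetical stand-in for whatever criterion an ensemble uses to decide that a feature is common to several selected subsets.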

Keywords

Feature Subset, Hill Climbing, Multi Objective Genetic Algorithm, Feature Subset Selection, Average AUCs
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Holger Franken (1)
  • Rainer Lehmann (2, 3)
  • Hans-Ulrich Häring (2, 3)
  • Andreas Fritsche (2, 3)
  • Norbert Stefan (2, 3)
  • Andreas Zell (1)

  1. Center for Bioinformatics (ZBIT), University of Tübingen, Tübingen, Germany
  2. Division of Clinical Chemistry and Pathobiochemistry (Central Laboratory), University Hospital Tübingen, Tübingen, Germany
  3. Paul-Langerhans-Institute Tübingen, Member of the German Centre for Diabetes Research (DZD), Eberhard Karls University Tübingen, Tübingen, Germany
