Better Interpretable Models for Proteomics Data Analysis Using Rule-Based Mining

  • Conference paper
Towards Integrative Machine Learning and Knowledge Extraction

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10344)

Abstract

Recent advances in -omics technology have yielded large data-sets in many areas of biology, such as mass spectrometry based proteomics. However, analyzing these data remains challenging, mainly because of their very high dimensionality and high noise content. One of the main objectives of the analysis is to identify relevant patterns (or features) that can be used to classify new samples as healthy or diseased. A method is therefore required that finds easily interpretable models in such data.

To achieve this goal, we have adapted the disjunctive association rule mining algorithm TitanicOR to identify emerging patterns in our mass spectrometry proteomics data-sets. A comparison with five state-of-the-art methods shows that our method has advantages over them in identifying the inter-dependency between features and in the TP-rate and precision of the selected features. We further demonstrate the applicability of our algorithm on one previously published clinical data-set.
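To make the notion of a disjunctive emerging pattern concrete, the following is a minimal sketch and not the TitanicOR algorithm or the authors' implementation. It assumes each sample has already been reduced to a set of discretized peak labels; the labels (`peak_A` and so on), the toy samples, and the function names are all hypothetical. A disjunctive pattern is taken to match a sample if at least one of its items is present, and the growth rate is the ratio of the pattern's support in the target class to its support in the background class.

```python
from typing import FrozenSet, List

Sample = FrozenSet[str]  # one sample = set of discretized peak labels (assumed encoding)


def disjunctive_support(pattern: FrozenSet[str], samples: List[Sample]) -> float:
    """Fraction of samples containing at least one item of the pattern (logical OR)."""
    if not samples:
        return 0.0
    hits = sum(1 for sample in samples if pattern & sample)
    return hits / len(samples)


def growth_rate(pattern: FrozenSet[str],
                background: List[Sample],
                target: List[Sample]) -> float:
    """Support in the target class divided by support in the background class."""
    support_background = disjunctive_support(pattern, background)
    support_target = disjunctive_support(pattern, target)
    if support_background == 0.0:
        # Pattern never occurs in the background class.
        return float("inf") if support_target > 0.0 else 0.0
    return support_target / support_background


# Toy data, invented purely for illustration.
healthy = [
    frozenset({"peak_A"}),
    frozenset({"peak_C"}),
    frozenset({"peak_C", "peak_D"}),
    frozenset(),
]
diseased = [
    frozenset({"peak_A"}),
    frozenset({"peak_B"}),
    frozenset({"peak_A", "peak_B"}),
    frozenset({"peak_C"}),
]

pattern = frozenset({"peak_A", "peak_B"})  # read as "peak_A OR peak_B"

print(disjunctive_support(pattern, healthy))    # support among healthy samples
print(disjunctive_support(pattern, diseased))   # support among diseased samples
print(growth_rate(pattern, healthy, diseased))  # how strongly the pattern "emerges"
```

A pattern with a large growth rate is frequent in one class and rare in the other, which is what makes it usable as an interpretable classification rule. How candidate patterns are enumerated efficiently is the part this sketch leaves out.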

Corresponding author

Correspondence to Fahrnaz Jayrannejad.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Jayrannejad, F., Conrad, T.O.F. (2017). Better Interpretable Models for Proteomics Data Analysis Using Rule-Based Mining. In: Holzinger, A., Goebel, R., Ferri, M., Palade, V. (eds) Towards Integrative Machine Learning and Knowledge Extraction. Lecture Notes in Computer Science, vol 10344. Springer, Cham. https://doi.org/10.1007/978-3-319-69775-8_4

  • DOI: https://doi.org/10.1007/978-3-319-69775-8_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-69774-1

  • Online ISBN: 978-3-319-69775-8

  • eBook Packages: Computer Science, Computer Science (R0)
