Abstract
Recent advances in -omics technology have yielded large data sets in many areas of biology, such as mass spectrometry-based proteomics. However, analyzing these data remains challenging, mainly because of their very high dimensionality and high noise content. One of the main objectives of the analysis is to identify relevant patterns (or features) that can be used to classify new samples as healthy or diseased. A method is therefore required that derives easily interpretable models from such data.
To achieve this goal, we adapted the disjunctive association rule mining algorithm TitanicOR to identify emerging patterns in our mass spectrometry proteomics data sets. A comparison with five state-of-the-art methods shows that our method outperforms them in identifying the inter-dependencies between features and in the TP-rate and precision of the selected features. We further demonstrate the applicability of our algorithm on a previously published clinical data set.
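To make the notion of an emerging pattern concrete, the following minimal sketch computes the support of a disjunctive pattern in two classes and the resulting growth rate, that is, the ratio of the two supports. It is not the TitanicOR algorithm and not the authors' implementation. The function names, the peak labels and the toy samples are hypothetical, and it assumes that spectra have already been discretized into sets of binary features and that a disjunctive pattern matches a sample when at least one of its features is present.

```python
from typing import FrozenSet, List

Sample = FrozenSet[str]  # one sample = the set of binary features present in it


def disjunctive_support(pattern: FrozenSet[str], samples: List[Sample]) -> float:
    """Fraction of samples containing at least one feature of the pattern."""
    if not samples:
        return 0.0
    hits = sum(1 for sample in samples if pattern & sample)
    return hits / len(samples)


def growth_rate(pattern: FrozenSet[str],
                background: List[Sample],
                target: List[Sample]) -> float:
    """Support in the target class divided by support in the background class."""
    supp_bg = disjunctive_support(pattern, background)
    supp_tg = disjunctive_support(pattern, target)
    if supp_bg == 0.0:
        # Pattern never occurs in the background class.
        return float("inf") if supp_tg > 0.0 else 0.0
    return supp_tg / supp_bg


# Hypothetical toy data: four samples per class, features are made-up peak labels.
healthy = [
    frozenset({"peak_a"}),
    frozenset({"peak_c"}),
    frozenset({"peak_c", "peak_d"}),
    frozenset(),
]
diseased = [
    frozenset({"peak_a"}),
    frozenset({"peak_b"}),
    frozenset({"peak_a", "peak_b"}),
    frozenset({"peak_c"}),
]

pattern = frozenset({"peak_a", "peak_b"})  # read as "peak_a OR peak_b"
print(disjunctive_support(pattern, healthy))   # 0.25
print(disjunctive_support(pattern, diseased))  # 0.75
print(growth_rate(pattern, healthy, diseased))  # 3.0
```

In this toy setting, a pattern whose growth rate exceeds a chosen threshold would count as emerging for the diseased class. The threshold and the search over candidate patterns are left out of the sketch.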
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Jayrannejad, F., Conrad, T.O.F. (2017). Better Interpretable Models for Proteomics Data Analysis Using Rule-Based Mining. In: Holzinger, A., Goebel, R., Ferri, M., Palade, V. (eds) Towards Integrative Machine Learning and Knowledge Extraction. Lecture Notes in Computer Science, vol 10344. Springer, Cham. https://doi.org/10.1007/978-3-319-69775-8_4
DOI: https://doi.org/10.1007/978-3-319-69775-8_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-69774-1
Online ISBN: 978-3-319-69775-8
eBook Packages: Computer Science (R0)