Abstract
Many biomedical experiments produce large data sets in the form of binary matrices, with features labeling the columns and individuals (samples) associated to the rows. An important case is when the rows are also labeled into two groups, namely the positive (or healthy) and the negative (or diseased) samples. The Logical Analysis of Data (LAD) is a procedure aimed at identifying relevant features and building boolean formulas (rules) which can be used to classify new samples as positive or negative. These rules are said to explain the data set. Each rule can be represented by a string over {0,1,-}, called a pattern. A data set can be explained by alternative sets of patterns, and many computational problems arise related to the choice of a particular set of patterns for a given instance. In this paper we study the computational complexity of these pattern problems and show that they are, in general, very hard. We give an integer programming formulation for the problem of determining if two sets of patterns are equivalent. We also prove computational complexity results which imply that there should be no simple ILP model for finding a minimal set of patterns explaining a given data set.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Alexe, G., Alexe, S., Bonates, T.O., Kogan, A.: Logical analysis of data - the vision of Peter L. Hammer. Ann. Math. Artif. Intell. 49, 265–312 (2007)
Baker, W., van den Broek, A., Camon, E., Hingamp, P., Sterk, P., Stoesser, G., Tuli, M.A.: The EMBL nucleotide sequence database. Nucleic Acid Res. 28, 19–23 (2000)
Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., Bourne, P.E.: The protein data bank. Nucleic Acid Res. 28, 235–242 (2000)
Bertolazzi, P., Felici, G., Festa, P., Lancia, G.: Logic classification and feature selection for biomedical data. Comput. Math. Appl. 55, 889–899 (2008)
Boros, E., Hammer, P., Ibaraki, T., Kogan, A., Mayoraz, E., Muchnik, I.: An implementation of logical analysis of data. IEEE Trans. Knowl. Data Eng. 12, 292–306 (2000)
Chee, M., Yang, R., Hubbell, E., Berno, A., Huang, X.C., Stern, D., Winkler, J., Lockhart, D.J., Morris, M.S., Fodor, S.P.A.: Accessing genetic information with high-density DNA arrays. Science 274, 610–614 (1996)
Chikalov, I., Lozin, V., Lozina, I., Moshkov, M., Son Nguyen, H., Skowron, A., Zielosko, B.:Logical analysis of data: theory, methodology and applications. In: Three Approaches to Data Analysis, pp. 147–192. Springer (2013)
Dash, M., Liu, H.: Feature selection for classification. Intell. Data Anal. 1, 131–156 (1997)
Felici, G., de Angelis, V., Mancinelli, G.: Feature selection for data mining. In: Felici, G., Triantaphyllou, E. (eds.) Data Mining and Knowledge Discovery Approaches Based on Rule Induction Techniques, pp. 227–252. Springer (2006)
Garey, M.R., Johnson, D.S.: Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman and Company, San Francisco (1979)
Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., Lander, E.S.: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531–537 (1999)
Guyon, I., Weston, J., Barnhill, S., Vapnik, V.: Gene selection for cancer classification using support vector machines. Mach. Learn. 46, 389–422 (2002)
Hammer, P., Bonates, T.: Logical analysis of data: from combinatorial optimization to medical applications. RUTCOR Research Report, 10-05, Rutgers University, NJ (2005)
Hebert, P.D.N., Cywinska, A., Ball, S.L., de Waard, J.R.: Biological identifications through DNA barcodes. Proc. R. Soc. Lond. B 270, 313–321 (2003)
Hu, H., Li, J., Plank, A., Wang, H., Daggard, G.: A comparative study of classification methods for microarray data analysis. In: Proceedings of the fifth Australasian Conference on Data Mining and Analystics, vol. 61, pp. 33–37, Sydney, Australia (2006)
Li, T., Zhang, C., Ogihara, M.: A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression. Bioinformatics 20, 2429–2437 (2004)
Montgomery, D., Undem, B.L.: CombiMatrix’ customizable DNA microarrays-Tutoial: In situ computer-aided synthesis of custom oligo microarrays. Genetic Eng. News 22, 32–33 (2002). Drug Discovery
Serafini, P.: Classifying negative and positive points by optimal box clustering. Discrete Appl. Math. 165(270–282), 10 (2014)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Lancia, G., Serafini, P. (2016). The Complexity of Some Pattern Problems in the Logical Analysis of Large Genomic Data Sets. In: Ortuño, F., Rojas, I. (eds) Bioinformatics and Biomedical Engineering. IWBBIO 2016. Lecture Notes in Computer Science(), vol 9656. Springer, Cham. https://doi.org/10.1007/978-3-319-31744-1_1
Download citation
DOI: https://doi.org/10.1007/978-3-319-31744-1_1
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-31743-4
Online ISBN: 978-3-319-31744-1
eBook Packages: Computer ScienceComputer Science (R0)