The Complexity of Some Pattern Problems in the Logical Analysis of Large Genomic Data Sets

Lancia, Giuseppe; Serafini, Paolo

doi:10.1007/978-3-319-31744-1_1

Giuseppe Lancia¹⁵ &
Paolo Serafini¹⁵

Part of the book series: Lecture Notes in Computer Science ((LNBI,volume 9656))

Included in the following conference series:

International Conference on Bioinformatics and Biomedical Engineering

1879 Accesses
1 Citations

Abstract

Many biomedical experiments produce large data sets in the form of binary matrices, with features labeling the columns and individuals (samples) associated to the rows. An important case is when the rows are also labeled into two groups, namely the positive (or healthy) and the negative (or diseased) samples. The Logical Analysis of Data (LAD) is a procedure aimed at identifying relevant features and building boolean formulas (rules) which can be used to classify new samples as positive or negative. These rules are said to explain the data set. Each rule can be represented by a string over {0,1,-}, called a pattern. A data set can be explained by alternative sets of patterns, and many computational problems arise related to the choice of a particular set of patterns for a given instance. In this paper we study the computational complexity of these pattern problems and show that they are, in general, very hard. We give an integer programming formulation for the problem of determining if two sets of patterns are equivalent. We also prove computational complexity results which imply that there should be no simple ILP model for finding a minimal set of patterns explaining a given data set.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 84.99; Price excludes VAT (USA)

Softcover Book: USD 109.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

Alexe, G., Alexe, S., Bonates, T.O., Kogan, A.: Logical analysis of data - the vision of Peter L. Hammer. Ann. Math. Artif. Intell. 49, 265–312 (2007)
Article MathSciNet MATH Google Scholar
Baker, W., van den Broek, A., Camon, E., Hingamp, P., Sterk, P., Stoesser, G., Tuli, M.A.: The EMBL nucleotide sequence database. Nucleic Acid Res. 28, 19–23 (2000)
Article Google Scholar
Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., Bourne, P.E.: The protein data bank. Nucleic Acid Res. 28, 235–242 (2000)
Article Google Scholar
Bertolazzi, P., Felici, G., Festa, P., Lancia, G.: Logic classification and feature selection for biomedical data. Comput. Math. Appl. 55, 889–899 (2008)
Article MathSciNet MATH Google Scholar
Boros, E., Hammer, P., Ibaraki, T., Kogan, A., Mayoraz, E., Muchnik, I.: An implementation of logical analysis of data. IEEE Trans. Knowl. Data Eng. 12, 292–306 (2000)
Article Google Scholar
Chee, M., Yang, R., Hubbell, E., Berno, A., Huang, X.C., Stern, D., Winkler, J., Lockhart, D.J., Morris, M.S., Fodor, S.P.A.: Accessing genetic information with high-density DNA arrays. Science 274, 610–614 (1996)
Article Google Scholar
Chikalov, I., Lozin, V., Lozina, I., Moshkov, M., Son Nguyen, H., Skowron, A., Zielosko, B.:Logical analysis of data: theory, methodology and applications. In: Three Approaches to Data Analysis, pp. 147–192. Springer (2013)
Google Scholar
Dash, M., Liu, H.: Feature selection for classification. Intell. Data Anal. 1, 131–156 (1997)
Article Google Scholar
Felici, G., de Angelis, V., Mancinelli, G.: Feature selection for data mining. In: Felici, G., Triantaphyllou, E. (eds.) Data Mining and Knowledge Discovery Approaches Based on Rule Induction Techniques, pp. 227–252. Springer (2006)
Google Scholar
Garey, M.R., Johnson, D.S.: Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman and Company, San Francisco (1979)
MATH Google Scholar
Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., Lander, E.S.: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531–537 (1999)
Article Google Scholar
Guyon, I., Weston, J., Barnhill, S., Vapnik, V.: Gene selection for cancer classification using support vector machines. Mach. Learn. 46, 389–422 (2002)
Article MATH Google Scholar
Hammer, P., Bonates, T.: Logical analysis of data: from combinatorial optimization to medical applications. RUTCOR Research Report, 10-05, Rutgers University, NJ (2005)
Google Scholar
Hebert, P.D.N., Cywinska, A., Ball, S.L., de Waard, J.R.: Biological identifications through DNA barcodes. Proc. R. Soc. Lond. B 270, 313–321 (2003)
Article Google Scholar
Hu, H., Li, J., Plank, A., Wang, H., Daggard, G.: A comparative study of classification methods for microarray data analysis. In: Proceedings of the fifth Australasian Conference on Data Mining and Analystics, vol. 61, pp. 33–37, Sydney, Australia (2006)
Google Scholar
Li, T., Zhang, C., Ogihara, M.: A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression. Bioinformatics 20, 2429–2437 (2004)
Article Google Scholar
Montgomery, D., Undem, B.L.: CombiMatrix’ customizable DNA microarrays-Tutoial: In situ computer-aided synthesis of custom oligo microarrays. Genetic Eng. News 22, 32–33 (2002). Drug Discovery
Google Scholar
Serafini, P.: Classifying negative and positive points by optimal box clustering. Discrete Appl. Math. 165(270–282), 10 (2014)
MathSciNet MATH Google Scholar

Download references

Author information

Authors and Affiliations

Department of Mathematics and Computer Science, University of Udine, Udine, Italy
Giuseppe Lancia & Paolo Serafini

Authors

Giuseppe Lancia
View author publications
You can also search for this author in PubMed Google Scholar
Paolo Serafini
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Giuseppe Lancia .

Editor information

Editors and Affiliations

Universidad de Granada, Granada, Spain
Francisco Ortuño
Universidad de Granada, Granada, Spain
Ignacio Rojas

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Lancia, G., Serafini, P. (2016). The Complexity of Some Pattern Problems in the Logical Analysis of Large Genomic Data Sets. In: Ortuño, F., Rojas, I. (eds) Bioinformatics and Biomedical Engineering. IWBBIO 2016. Lecture Notes in Computer Science(), vol 9656. Springer, Cham. https://doi.org/10.1007/978-3-319-31744-1_1

Download citation

DOI: https://doi.org/10.1007/978-3-319-31744-1_1
Published: 25 March 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-31743-4
Online ISBN: 978-3-319-31744-1
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics