Discovering Networks of Interdependent Features in High-Dimensional Problems

Dramiński, Michał; Da̧browski, Michał J.; Diamanti, Klev; Koronacki, Jacek; Komorowski, Jan

doi:10.1007/978-3-319-26989-4_12

Part of the book series: Studies in Big Data (SBD, volume 16)

Abstract

The availability of very large data sets in Life Sciences, provided earlier by technological breakthroughs such as microarrays and more recently by various forms of sequencing, has created both challenges in analyzing these data and new opportunities. A promising, yet underdeveloped approach to Big Data, not limited to Life Sciences, is the use of feature selection and classification to discover interdependent features. Traditionally, classifiers have been developed for the best quality of supervised classification. In our experience, more often than not, rather than obtaining the best possible supervised classifier, the Life Scientist needs to know which features contribute best to classifying observations (objects, samples) into distinct classes and what the interdependencies are between the features that describe the observations. Our underlying hypothesis is that the interdependent features and rule networks not only reflect some syntactical properties of the data and classifiers but may also convey meaningful clues about true interactions in the modeled biological system. In this chapter we further develop our methods of Monte Carlo Feature Selection and Interdependency Discovery (MCFS and MCFS-ID, respectively), which are particularly well suited for high-dimensional problems, i.e., those where each observation is described by very many features, often many more features than the number of observations. Such problems are abundant in Life Science applications. Specifically, we define Inter-Dependency Graphs (termed, somewhat confusingly, ID Graphs), which are directed graphs of interactions between features extracted by aggregating information from the classification trees constructed by the MCFS algorithm. We then proceed with modeling interactions on a finer level with rule networks. We discuss some of the properties of the ID graphs and make a first attempt at validating our hypothesis on a large gene expression data set for CD4\(^{+}\) T-cells.
MCFS-ID and ROSETTA, including the Ciruvis approach, offer a new methodology for analyzing Big Data: from feature selection, through identification of feature interdependencies, to classification with rules according to decision classes, to construction of rule networks. Our preliminary results confirm that MCFS-ID is applicable to the identification of interacting features that are functionally relevant, while rule networks offer a complementary picture with finer resolution of the interdependencies at the level of feature-value pairs.
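
As a rough illustration of how a directed graph of feature interactions might be aggregated from a collection of classification trees, the following Python sketch counts how often one feature splits directly above another. It is only a sketch under stated assumptions, not the MCFS-ID implementation: the nested-dictionary tree encoding, the function names `count_edges` and `build_id_graph`, the gene names, and the use of plain co-occurrence counts as edge weights are all hypothetical choices made for this example; the abstract does not specify how edges are weighted.

```python
from collections import Counter

# Hypothetical tree encoding (an assumption for this sketch): every internal
# node is a dict holding its splitting feature and a list of child nodes;
# a leaf is represented by None.

def count_edges(node, edges):
    """Add 1 to the edge (parent feature -> child feature) for each split pair."""
    if node is None:
        return
    for child in node["children"]:
        if child is not None:
            edges[(node["feature"], child["feature"])] += 1
        count_edges(child, edges)

def build_id_graph(trees):
    """Aggregate directed parent -> child feature pairs over many trees."""
    edges = Counter()
    for tree in trees:
        count_edges(tree, edges)
    return edges

def leaf_split(feature):
    """Helper: an internal node whose two children are both leaves."""
    return {"feature": feature, "children": [None, None]}

# Three toy trees, standing in for trees grown on random feature subsets.
trees = [
    {"feature": "geneA", "children": [leaf_split("geneB"), leaf_split("geneC")]},
    {"feature": "geneA", "children": [leaf_split("geneB"), None]},
    {"feature": "geneB", "children": [None, leaf_split("geneA")]},
]

id_graph = build_id_graph(trees)
for (src, dst), weight in id_graph.most_common():
    print(f"{src} -> {dst}: {weight}")
```

Because the edges are directed, `geneA -> geneB` and `geneB -> geneA` are accumulated separately. A more faithful weighting would presumably take split quality and node size into account, but that goes beyond what the abstract states.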

We thank the reviewer for providing valuable and detailed comments.

Author information

Authors and Affiliations

Institute of Computer Science, Polish Acad. Sci, Ordona 21, Warsaw, Poland
Michał Dramiński, Michał J. Da̧browski & Jacek Koronacki
Department of Cell and Molecular Biology, Uppsala University, Box 596, Uppsala, Sweden
Klev Diamanti
Department of Cell and Molecular Biology, Uppsala University and Institute of Computer Science, Polish Acad. Sci, Uppsala, Sweden
Jan Komorowski

Corresponding author

Correspondence to Jan Komorowski.

Editor information

Editors and Affiliations

University of Ottawa, Ottawa, Ontario, Canada
Nathalie Japkowicz
Institute of Computing Sciences, Poznań University of Technology, Poznań, Poland
Jerzy Stefanowski

About this chapter

Cite this chapter

Dramiński, M., Da̧browski, M.J., Diamanti, K., Koronacki, J., Komorowski, J. (2016). Discovering Networks of Interdependent Features in High-Dimensional Problems. In: Japkowicz, N., Stefanowski, J. (eds) Big Data Analysis: New Algorithms for a New Society. Studies in Big Data, vol 16. Springer, Cham. https://doi.org/10.1007/978-3-319-26989-4_12

DOI: https://doi.org/10.1007/978-3-319-26989-4_12
Published: 17 December 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-26987-0
Online ISBN: 978-3-319-26989-4
eBook Packages: Engineering, Engineering (R0)
