Exploiting causality in gene network reconstruction based on graph embedding

Abstract

Gene network reconstruction is a bioinformatics task that aims at modelling the complex regulatory activities that may occur among genes. This task is typically solved by means of link prediction methods that analyze gene expression data. However, the reconstructed networks often contain a large number of false positive edges, which actually result from indirect regulation activities caused by common cause and common effect phenomena; in other words, the adopted inductive methods do not take possible causality phenomena into account. The issue is further accentuated by the inherently high level of noise in gene expression data. Existing methods for identifying a transitive reduction of a network, or for removing (possibly) redundant edges, are limited in the network structures they support or in the nature/length of the indirect regulation they can handle, and often require additional pre-processing steps to handle specific peculiarities of the networks (e.g., cycles). Moreover, they cannot consider possible community structures and possible similar roles of the genes in the network (e.g., hub nodes), which may affect how strongly nodes tend to be connected, and to which other nodes. In this paper, we propose the method INLOCANDA, which learns an inductive predictive model for gene network reconstruction and overcomes all the mentioned limitations. In particular, INLOCANDA is able to (i) identify and exploit indirect relationships of arbitrary length to remove edges due to common cause and common effect phenomena; (ii) take into account possible community structures and possible similar roles by means of graph embedding. Experiments performed along multiple dimensions of analysis on benchmark, real networks of two organisms (E. coli and S. cerevisiae) show a higher accuracy with respect to the competitors, as well as a higher robustness to the presence of noise in the data, even when a huge amount of (possibly false positive) interactions is removed.

Availability: http://www.di.uniba.it/~gianvitopio/systems/inlocanda/
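
To make the notion of an edge explained by indirect regulation more concrete, the following is a minimal sketch of a generic heuristic. It is not the INLOCANDA procedure, whose details are not given in this abstract. The function name `find_indirect_edges`, the confidence scores, and the rule "flag a direct edge when a longer path exists whose weakest link is at least as strong" are all assumptions made for illustration. The search tolerates cycles and paths of arbitrary length.

```python
from collections import defaultdict


def find_indirect_edges(edges):
    """Return direct edges (u, v) that are also explained by a longer path.

    `edges` maps (source, target) -> confidence score (hypothetical values).
    An edge is flagged when a path of length >= 2 from source to target
    exists whose weakest link is at least as strong as the direct edge.
    Illustrative heuristic only, not the method proposed in the paper.
    """
    successors = defaultdict(list)
    for (u, v), w in edges.items():
        successors[u].append((v, w))

    flagged = set()
    for (src, dst), direct_w in edges.items():
        # Depth-first search restricted to edges at least as strong as the
        # direct edge; the direct edge itself is skipped. The visited set
        # keeps the search finite when the network contains cycles.
        stack = [src]
        visited = {src}
        while stack:
            node = stack.pop()
            for nxt, w in successors[node]:
                if w < direct_w or (node == src and nxt == dst):
                    continue
                if nxt == dst:
                    flagged.add((src, dst))
                    stack.clear()
                    break
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append(nxt)
    return flagged


# Hypothetical toy network: A regulates B, B regulates C, and a weaker
# direct A -> C edge that may only reflect the indirect A -> B -> C effect.
example_edges = {
    ("A", "B"): 0.9,
    ("B", "C"): 0.8,
    ("A", "C"): 0.5,
    ("C", "D"): 0.7,
}
print(find_indirect_edges(example_edges))
```

In this toy network only the A -> C edge is flagged, because the chain A -> B -> C is stronger at every step than the direct edge. A real method would also need to account for noise and for the roles of nodes in the network, which this sketch ignores.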

Notes

  1. A formal definition of neighborhood will be introduced later.

  2. The fact that this dataset is relatively difficult to analyze is confirmed by the maximum AUPRC obtained by the method proposed in Marbach et al. (2012) and by GENERE (Ceci et al. 2015), which are 0.09 and 0.12, respectively.

References

  1. Aha, D. W., Kibler, D., & Albert, M. K. (1991). Instance-based learning algorithms. Machine Learning, 6(1), 37–66.

  2. Aho, A. V., Garey, M. R., & Ullman, J. D. (1972). The transitive reduction of a directed graph. SIAM Journal on Computing, 1(2), 131–137.

  3. Atias, N., & Sharan, R. (2012). Comparative analysis of protein networks: Hard problems, practical solutions. Communications of the ACM, 55(5), 88–97.

  4. Babu, M. M., Luscombe, N. M., Aravind, L., Gerstein, M., & Teichmann, S. A. (2004). Structure and evolution of transcriptional regulatory networks. Current Opinion in Structural Biology, 14(3), 283–291.

  5. Belkin, M., & Niyogi, P. (2002). Laplacian eigenmaps and spectral techniques for embedding and clustering. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems 14 (pp. 585–591). Cambridge: MIT Press.

  6. Berger, M. F., & Bulyk, M. L. (2009). Universal protein-binding microarrays for the comprehensive characterization of the DNA-binding specificities of transcription factors. Nature Protocols, 4(3), 393–411.

  7. Blockeel, H., Raedt, L. D., & Ramon, J. (1998). Top-down induction of clustering trees. In J. W. Shavlik (Ed.), ICML 1998 (pp. 55–63). Burlington: Morgan Kaufmann.

  8. Böck, M., Ogishima, S., Tanaka, H., Kramer, S., & Kaderali, L. (2012). Hub-centered gene network reconstruction using automatic relevance determination. PLOS ONE, 7(5), 1–17.

  9. Bošnački, D., Odenbrett, M. R., Wijs, A., Ligtenberg, W., & Hilbers, P. (2012). Efficient reconstruction of biological networks via transitive reduction on general purpose graphics processors. BMC Bioinformatics, 13(1), 281.

  10. Bulyk, M. L. (2005). Discovering DNA regulatory elements with bacteria. Nature Biotechnology, 23(8), 942–944.

  11. Ceci, M., Pio, G., Kuzmanovski, V., & Džeroski, S. (2015). Semi-supervised multi-view learning for gene network reconstruction. PLOS ONE, 10(12), 1–27.

  12. Cohen, W. W. (1995). Fast effective rule induction. In Proceedings of the twelfth international conference on machine learning, ICML’95 (pp. 115–123). San Francisco, CA: Morgan Kaufmann Publishers Inc.

  13. de Jong, H. (2002). Modeling and simulation of genetic regulatory systems: A literature review. Journal of Computational Biology, 9(1), 67–103.

  14. Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7, 1–30.

  15. Emmert-Streib, F., Glazko, G., De Matos Simoes, R., et al. (2012). Statistical inference and reverse engineering of gene regulatory networks from observational expression data. Bioinformatics and Computational Biology, 3, 8.

  16. Gallagher, B., & Eliassi-Rad, T. (2010). Leveraging label-independent features for classification in sparsely labeled networks: An empirical study. In L. Giles, M. Smith, J. Yen, & H. Zhang (Eds.), Advances in Social Network Mining and Analysis (pp. 1–19). Berlin: Springer.

  17. Geistlinger, L., Csaba, G., Dirmeier, S., Küffner, R., & Zimmer, R. (2013). A comprehensive gene regulatory network for the diauxic shift in Saccharomyces cerevisiae. Nucleic Acids Research, 41(18), 8452–8463. https://doi.org/10.1093/nar/gkt631.

  18. Grover, A., & Leskovec, J. (2016). Node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’16 (pp. 855–864). New York, NY: ACM.

  19. Hase, T., Ghosh, S., Yamanaka, R., & Kitano, H. (2013). Harnessing diversity towards the reconstructing of large scale gene regulatory networks. PLoS Computational Biology, 9(11), e1003361.

  20. Hecker, M., Lambeck, S., Toepfer, S., Van Someren, E., & Guthke, R. (2009). Gene regulatory network inference: Data integration in dynamic models—A review. Biosystems, 96(1), 86–103.

  21. Hempel, S., Koseska, A., Nikoloski, Z., & Kurths, J. (2011). Unraveling gene regulatory networks from time-resolved gene expression data—A measures comparison study. BMC Bioinformatics, 12(1), 292.

  22. Henderson, K., Gallagher, B., Li, L., Akoglu, L., Eliassi-Rad, T., Tong, H., & Faloutsos, C. (2011). It’s who you know: Graph mining using recursive structural features. In Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’11 (pp. 663–671). New York: ACM.

  23. Hsu, H. T. (1975). An algorithm for finding a minimal equivalent graph of a digraph. Journal of the ACM, 22(1), 11–16.

  24. Ibarguren, I., Lasarguren, A., Pérez, J. M., Muguerza, J., Gurrutxaga, I., & Arbelaitz, O. (2016). Bfpart: Best-first part. Information Sciences, 367–368, 927–952.

  25. Itani, S., Ohannessian, M., Sachs, K., Nolan, G. P., & Dahleh, M. A. (2008). Structure learning in causal cyclic networks. In Proceedings of the international conference on causality: objectives and assessment, Vol. 6, COA’08 (pp. 165–176). JMLR.org.

  26. Korb, K. B., & Nicholson, A. E. (2010). Bayesian Artificial Intelligence (2nd ed.). Boca Raton, FL: CRC Press Inc.

  27. Li, J., & Xie, D. (2015). RACK1, a versatile hub in cancer. Oncogene, 34(15), 1890–1898.

  28. Lo, L., Wong, M., Lee, K., & Leung, K. (2015). Time delayed causal gene regulatory network inference with hidden common causes. PLOS ONE, 10(9), 1–47.

  29. Lü, L., & Zhou, T. (2011). Link prediction in complex networks: A survey. Physica A: Statistical Mechanics and its Applications, 390(6), 1150–1170.

  30. Marbach, D., Costello, J. C., Küffner, R., Vega, N. M., Prill, R. J., Camacho, D. M., et al. (2012). Wisdom of crowds for robust gene network inference. Nature Methods, 9, 796–804.

  31. Margolin, A., Nemenman, I., Basso, K., Wiggins, C., Stolovitzky, G., Favera, R., et al. (2006). Aracne: An algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics, 7(Suppl 1), S7.

  32. Markowetz, F., & Spang, R. (2007). Inferring cellular networks—A review. BMC Bioinformatics, 8(Suppl 6), S5.

  33. Mikolov, T., Chen, K., Corrado, G. S., & Dean, J. (2013). Efficient estimation of word representations in vector space. CoRR arXiv:1301.3781

  34. Omranian, N., Eloundou-Mbebi, J. M. O., Mueller-Roeber, B., & Nikoloski, Z. (2016). Gene regulatory network inference using fused lasso on multiple data sets. Scientific Reports, 6, 20533.

  35. Park, P. J. (2009). ChIP-seq: Advantages and challenges of a maturing technology. Nature Reviews Genetics, 10(10), 669–680.

  36. Pearl, J. (2000). Causality: Models, reasoning, and inference. New York, NY: Cambridge University Press.

  37. Penfold, C. A., & Wild, D. L. (2011). How to infer gene networks from expression profiles, revisited. Interface Focus, 1(6), 857–870.

  38. Perozzi, B., Al-Rfou, R., & Skiena, S. (2014). Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’14 (pp. 701–710). New York, NY: ACM.

  39. Pinna, A., Soranzo, N., & de la Fuente, A. (2010). From knockouts to networks: Establishing direct cause-effect relationships through graph analysis. PLoS ONE, 5(10), e12912.

  40. Pio, G., Ceci, M., Malerba, D., & D’Elia, D. (2015). ComiRNet: A web-based system for the analysis of miRNA-gene regulatory networks. BMC Bioinformatics, 16(9), S7.

  41. Pio, G., Ceci, M., Prisciandaro, F., & Malerba, D. (2017). LOCANDA: Exploiting causality in the reconstruction of gene regulatory networks. In A. Yamamoto, T. Kida, T. Uno, & T. Kuboyama (Eds.), Discovery science 2017, Lecture notes in computer science (Vol. 10558, pp. 283–297). Berlin: Springer.

  42. Roweis, S. T., & Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500), 2323–2326.

  43. Selvanathan, S. P., Graham, G. T., Erkizan, H. V., Dirksen, U., Natarajan, T. G., Dakic, A., et al. (2015). Oncogenic fusion protein EWS-FLI1 is a network hub that regulates alternative splicing. Proceedings of the National Academy of Sciences, 112(11), E1307–E1316.

  44. Tang, J., Qu, M., Wang, M., Zhang, M., Yan, J., & Mei, Q. (2015). Line: Large-scale information network embedding. In Proceedings of the 24th international conference on world wide web, WWW ’15 (pp. 1067–1077). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland.

  45. Tenenbaum, J. B., Silva, V. D., & Langford, J. C. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500), 2319–2323.

  46. Thattai, M., & van Oudenaarden, A. (2001). Intrinsic noise in gene regulatory networks. Proceedings of the National Academy of Sciences, 98(15), 8614–8619.

  47. Van den Bulcke, T., Van Leemput, K., Naudts, B., van Remortel, P., Ma, H., Verschoren, A., et al. (2006). SynTReN: A generator of synthetic gene expression data for design and analysis of structure learning algorithms. BMC Bioinformatics, 7, 43.

  48. Vilalta, R., & Drissi, Y. (2002). A perspective view and survey of meta-learning. Artificial Intelligence Review, 18(2), 77–95.

  49. Yu, D., Lim, J., Wang, X., Liang, F., & Xiao, G. (2017). Enhanced construction of gene regulatory networks using hub gene information. BMC Bioinformatics, 18(1), 186.

  50. Zitnik, M., & Zupan, B. (2015). Data imputation in epistatic MAPs by network-guided matrix completion. Journal of Computational Biology, 22(6), 595–608.

Acknowledgements

We would like to acknowledge the support of the European Commission through the Projects MAESTRA - Learning from Massive, Incompletely annotated, and Structured Data (Grant Number ICT-2013-612944) and TOREADOR - Trustworthy Model-aware Analytics Data Platform (Grant Number H2020-688797). We would also like to thank Lynn Rudd for her help in reading and correcting the manuscript.

Author information

Corresponding author

Correspondence to Gianvito Pio.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Editors: Takuya Kida, Takeaki Uno, Tetsuji Kuboyama, Akihiro Yamamoto.

Appendices

Appendix 1: Results obtained by KNN classifier

See Figs. 19, 20, 21, 22 and 23.

Fig. 19

Box plots depicting the results obtained by KNN on SynTReN E. coli datasets with 100 nodes, by varying the threshold on the weight of the original network edges

Fig. 20

Box plots depicting the results obtained by KNN on SynTReN E. coli datasets with 200 nodes, by varying the threshold on the weight of the original network edges

Fig. 21

Box plots depicting the results obtained by KNN on SynTReN Yeast datasets with 100 nodes, by varying the threshold on the weight of the original network edges

Fig. 22

Box plots depicting the results obtained by KNN on SynTReN Yeast datasets with 200 nodes, by varying the threshold on the weight of the original network edges

Fig. 23

Box plots depicting the results obtained by KNN on the DREAM5 E. coli dataset, by varying the threshold on the weight of the original network edges

Appendix 2: Results obtained by JCHAID classifier

See Figs. 24, 25, 26, 27 and 28.

Fig. 24

Box plots depicting the results obtained by JCHAID on SynTReN E. coli datasets with 100 nodes, by varying the threshold on the weight of the original network edges

Fig. 25

Box plots depicting the results obtained by JCHAID on SynTReN E. coli datasets with 200 nodes, by varying the threshold on the weight of the original network edges

Fig. 26

Box plots depicting the results obtained by JCHAID on SynTReN Yeast datasets with 100 nodes, by varying the threshold on the weight of the original network edges

Fig. 27

Box plots depicting the results obtained by JCHAID on SynTReN Yeast datasets with 200 nodes, by varying the threshold on the weight of the original network edges

Fig. 28

Box plots depicting the results obtained by JCHAID on the DREAM5 E. coli dataset, by varying the threshold on the weight of the original network edges

Appendix 3: Results obtained by JRIP classifier

See Figs. 29, 30, 31, 32 and 33.

Fig. 29

Box plots depicting the results obtained by JRIP on SynTReN E. coli datasets with 100 nodes, by varying the threshold on the weight of the original network edges

Fig. 30

Box plots depicting the results obtained by JRIP on SynTReN E. coli datasets with 200 nodes, by varying the threshold on the weight of the original network edges

Fig. 31

Box plots depicting the results obtained by JRIP on SynTReN Yeast datasets with 100 nodes, by varying the threshold on the weight of the original network edges

Fig. 32

Box plots depicting the results obtained by JRIP on SynTReN Yeast datasets with 200 nodes, by varying the threshold on the weight of the original network edges

Fig. 33

Box plots depicting the results obtained by JRIP on the DREAM5 E. coli dataset, by varying the threshold on the weight of the original network edges

About this article

Cite this article

Pio, G., Ceci, M., Prisciandaro, F. et al. Exploiting causality in gene network reconstruction based on graph embedding. Mach Learn 109, 1231–1279 (2020). https://doi.org/10.1007/s10994-019-05861-8

Keywords

  • Causality
  • Bioinformatics
  • Network reconstruction
  • Link prediction