# Penalized estimation of directed acyclic graphs from discrete data

## Abstract

Bayesian networks, with structure given by a directed acyclic graph (DAG), are a popular class of graphical models. However, learning Bayesian networks from discrete or categorical data is particularly challenging, due to the large parameter space and the difficulty in searching for a sparse structure. In this article, we develop a maximum penalized likelihood method to tackle this problem. Instead of the commonly used multinomial distribution, we model the conditional distribution of a node given its parents by multi-logit regression, in which an edge is parameterized by a set of coefficient vectors with dummy variables encoding the levels of a node. To obtain a sparse DAG, a group norm penalty is employed, and a blockwise coordinate descent algorithm is developed to maximize the penalized likelihood subject to the acyclicity constraint of a DAG. When interventional data are available, our method constructs a causal network, in which a directed edge represents a causal relation. We apply our method to various simulated and real data sets. The results show that our method is very competitive, compared to many existing methods, in DAG estimation from both interventional and high-dimensional observational data.
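As a rough illustration of the parameterization described above, the following minimal sketch shows how a single edge can be tied to a block of multi-logit coefficients and how a group norm penalty acts on that block as a whole. It is not the authors' implementation: the baseline (reference-level) dummy coding, the function names, and the toy numbers are all assumptions made for illustration, and the acyclicity constraint and the blockwise coordinate descent updates are omitted.

```python
import math

def dummy_encode(level, n_levels):
    """Encode a level in {0, ..., n_levels-1} as n_levels-1 dummy variables.
    Level 0 is treated as the baseline (an assumption of this sketch)."""
    return [1.0 if level == k else 0.0 for k in range(1, n_levels)]

def multilogit_probs(intercepts, blocks, parent_levels, parent_sizes):
    """Conditional distribution of a child node given its parents' levels.

    intercepts[l] : intercept for child level l
    blocks[p][l]  : coefficient vector linking parent p to child level l
    """
    scores = []
    for l, b0 in enumerate(intercepts):
        s = b0
        for p, level in enumerate(parent_levels):
            x = dummy_encode(level, parent_sizes[p])
            s += sum(c * v for c, v in zip(blocks[p][l], x))
        scores.append(s)
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def group_norm(block):
    """Euclidean norm over all coefficients attached to one candidate edge."""
    return math.sqrt(sum(c * c for row in block for c in row))

def penalty(blocks, lam):
    """Group norm penalty: lam times the sum of per-edge block norms."""
    return lam * sum(group_norm(b) for b in blocks)

# Toy setup (hypothetical numbers): a 3-level child with two candidate parents
# having 2 and 3 levels. The second block is entirely zero, so that edge is absent.
parent_sizes = [2, 3]
intercepts = [0.0, 0.0, 0.0]
blocks = [
    [[0.0], [3.0], [4.0]],                   # parent 0 -> child, 3 x 1 block
    [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]],    # parent 1 -> child, 3 x 2 block
]

probs = multilogit_probs(intercepts, blocks, [1, 0], parent_sizes)
print(probs, penalty(blocks, lam=0.5))
```

Because the penalty is applied to the norm of a whole block rather than to individual coefficients, shrinking a block to zero removes the corresponding edge entirely, which is what makes the estimated graph sparse.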

## Keywords

Coordinate descent, Discrete Bayesian network, Multi-logit regression, Structure learning, Group norm penalty

## Notes

### Acknowledgements

This work was supported by NSF Grant IIS-1546098 (to Q.Z.).
