Abstract
Gaussian graphical models (GGMs) express conditional dependencies among variables of Gaussian-distributed high-dimensional data. However, real-life datasets exhibit heterogeneity, which can be better captured through the use of mixtures of GGMs, where each component captures different conditional dependencies (a.k.a. context-specific dependencies) along with some common dependencies (a.k.a. shared dependencies). Methods to discover shared and context-specific graphical structures include joint and grouped graphical Lasso, and the EM algorithm with various penalized likelihood scoring functions. However, these methods detect graphical structures with high false discovery rates and do not detect the two types of dependencies (i.e., context-specific and shared) together. In this paper, we develop a method to discover shared conditional dependencies along with context-specific graphical models via a two-level hierarchical Gaussian graphical model. We assume that the graphical models corresponding to shared and context-specific dependencies are decomposable, which leads to an efficient greedy algorithm that selects edges minimizing a score based on minimum message length (MML). The MML-based score results in a lower false discovery rate, leading to more effective structure discovery. We present extensive empirical results on synthetic and real-life datasets and show that our method leads to more accurate prediction of context-specific dependencies among random variables compared to previous works. Hence, our method can be considered a state-of-the-art approach to discovering both shared and context-specific conditional dependencies from high-dimensional heterogeneous Gaussian data.
Introduction
Graphical models represent multivariate distributions by explicitly expressing conditional independencies, which is particularly suitable for the analysis of high-dimensional data [21]. Gaussian graphical models (GGMs) are widely used as a framework to model structural relationships among random variables, assuming their joint distribution is Gaussian. The pattern of non-zero entries in the inverse covariance matrix of the multivariate Gaussian distribution corresponds to the edges of the associated GGM. In the standard setting, it is assumed that all observations are generated from a single underlying multivariate Gaussian distribution. However, real-life datasets exhibit heterogeneity, which can be better modeled through the use of mixtures of GGMs, letting each component exhibit different conditional dependencies among variables (a.k.a. context-specific dependencies) along with many common dependencies (a.k.a. shared dependencies) [26, 37]. For example, recent studies on the Cancer Genome Atlas Network have found that gene expression data can be described by mixtures with a small number of components harboring different expression pathways [28]. However, typically there are far fewer samples (i.e., observations) than variables, generated from a mixture with an unknown number of components. Hence, high-dimensional heterogeneous data make the discovery of conditional dependencies challenging.
Meilă and Jordan [26] and Kumar and Koller [20] pioneered the discovery of context-specific dependencies from high-dimensional continuous data using EM-based approaches. Armstrong et al. [3] use a Bayesian approach by assigning a prior to graphical structures. However, Gao et al. and Rodriguez et al. [18, 37] emphasized that context-specific graphical structures share some edges, which are not well discovered by the above-mentioned methods. GGMs with shared structure have been modeled with hierarchical Gaussian graphical models (HGGMs) in [18], where a method is developed to discover HGGMs using hierarchical penalized likelihood and graphical Lasso. However, this method discovers only the shared structure, not context-specific dependencies. Danaher et al. [8] proposed a method to discover context-specific graphical models with a shared structure using the joint graphical Lasso with penalized maximum likelihood estimation as a scoring function. However, this method is heavily dependent on two user-defined tuning parameters; furthermore, it faces the issue of predicting many false edges. To resolve this issue, Peterson et al. [31] introduced a Bayesian approach that estimates all GGMs via an edge-specific informative prior over the common structure; however, this method does not infer the common structure accurately. Gao et al. [15] improved the discovery of HGGMs with the joint graphical Lasso and the hard EM algorithm. Ma and Michailidis [23] investigated the joint estimation problem for HGGMs using the group and joint graphical Lasso. Both Gao et al. [15] and Ma and Michailidis [23] use the graphical Lasso proposed by Friedman [14], in which the tuning parameter is not data-specific. Therefore, both methods suffer from discovering a high number of false edges. Recently, Hao et al. [19], Li et al. [22] and Maretic et al. [24] improved the graphical Lasso technique by estimating a better tuning parameter.
However, these methods only detect context-specific dependencies together with few of the shared dependencies.
In this paper, we address the above issues by proposing a novel method to learn mixtures of HGGMs based on the EM algorithm, which iterates over the following two steps. First, it clusters the data into distinct clusters. Second, it employs the forward selection algorithm [10] for structure discovery on the data in each cluster. To discover context-specific graphical models and their shared structure, our method incrementally adds the best edges maximising a scoring function in the forward selection algorithm. In both steps of our EM-style algorithm, we use minimum message length (MML) as the objective function. Our MML-based approach is an information-theoretic method enjoying (a) a low false discovery rate, (b) suitability for a small number of samples when discovering statistical dependencies (associations) among a large number of variables, and (c) scalability to large-scale problems involving thousands of variables. We present extensive empirical results on synthetic and real-life datasets and show that our method leads to more accurate prediction of context-specific dependencies among variables, compared to previous works.
Background
Let \({\mathcal {D}} = \{X_1,\ldots ,X_n\}\) be a training set consisting of n data points, where \(X_i \in \mathbb {R}^d\) and d is the number of dimensions (equivalently, the number of random variables). Let us assume the data have been generated from a probabilistic graphical model corresponding to a graph G. Parameterization of the graphical model corresponds to multivariate functions assigned to the subsets of variables in the maximal cliques^{Footnote 1} of G. The probability density function \(f({\mathcal {D}})\) corresponding to the graphical model is defined as
where \({\mathcal {C}}\) is the set of maximal cliques of G, and \(f_C({\mathcal {D}}^C)\) is a clique-specific non-negative function defined on the subset of variables appearing in a clique C. \({\mathcal {D}}^C\) denotes the data points restricted to the variables of a maximal clique C. In any distribution resulting from a graphical model, two random variables are statistically independent conditioned on the variables in a cut separating the two.
In this paper, we assume that the observed input vectors have been generated from multivariate Gaussian distributions, which means the cliques are also parameterized by Gaussian distributions. Therefore, our aim is to discover the structure of these so-called Gaussian graphical models. For computational convenience, we work with chordal graphical structures, leading to decomposable models, which we detail in the next section.
Decomposable Models
Decomposable models are a subclass of undirected graphical models that provides a usefully constrained representation in which model selection and parameter estimation can be done efficiently, making them suitable for large-scale problems.
Definition 1
A graphical model is decomposable if the associated graph G is chordal. A chordal graph is one in which every cycle of four or more vertices has a chord, i.e., an edge that is not part of the cycle but connects two vertices of the cycle [10].
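To make the definition concrete, chordality can be tested with a maximum cardinality search, which for chordal graphs produces a perfect elimination ordering. The following is only a sketch; the adjacency-dict representation and the tie-breaking rule are our own assumptions, not part of [10]:

```python
def is_chordal(adj):
    """Test chordality of an undirected graph given as {node: set_of_neighbours}.

    Maximum cardinality search: repeatedly number the vertex with the most
    already-numbered neighbours. For chordal graphs the reverse of this
    order is a perfect elimination ordering, which we then verify.
    """
    weight = {v: 0 for v in adj}
    order = []
    unnumbered = set(adj)
    while unnumbered:
        # Break ties deterministically via the string form of the node label.
        v = max(unnumbered, key=lambda u: (weight[u], str(u)))
        order.append(v)
        unnumbered.remove(v)
        for w in adj[v]:
            if w in unnumbered:
                weight[w] += 1
    pos = {v: i for i, v in enumerate(order)}
    # Perfect elimination check: for each vertex, its already-numbered
    # neighbours must all be adjacent to the latest-numbered one of them.
    for v in order:
        earlier = [w for w in adj[v] if pos[w] < pos[v]]
        if earlier:
            u = max(earlier, key=lambda w: pos[w])
            if not set(earlier) - {u} <= set(adj[u]):
                return False
    return True
```

For instance, a 4-cycle is rejected, while the same cycle with one chord is accepted.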
Let \({\mathcal {M}}\) be a decomposable model corresponding to G, and \(f_{{\mathcal {M}}}\) be the probability density function of a Gaussian distribution corresponding to \({\mathcal {M}}\). It can be shown that:
where \({\mathcal {C}}\) is the set of maximal cliques and \({\mathcal {S}}\) is the set of minimal separators corresponding to the chordal graph of the model \({\mathcal {M}}\). The importance of this result is that it relates the Gaussian distribution over all variables to those on subsets of variables, i.e., Gaussian distributions over the variables involved in maximal cliques \(f({\mathcal {D}}^C)\) or minimal separators \(f({\mathcal {D}}^S)\). This amounts to a closed-form solution for the maximum likelihood estimate (MLE) of the covariance matrix \({\hat{\varSigma }}\) of the Gaussian graphical model \(f_{{\mathcal {M}}}\), through the MLEs of the covariance matrices of the component models.
To discover the optimal decomposable graphical structures from a given training data, typically one of the following strategies is employed [10]:

Forward selection: Starting with the simplest model with no edges (i.e., \(E\ =\ \emptyset\)), edges are added incrementally, as long as the new hypothesized models are not rejected according to an appropriate test statistic.

Backward elimination: Starting with the complete graph over the V vertices, edges are deleted incrementally, as long as the new hypothesized models are not rejected according to an appropriate test statistic.
In this paper, we adopt the forward selection strategy and add edges incrementally. As we want the resulting model to be decomposable, the addition of an edge has to be done with care. [10] characterises the edges that can be added to a decomposable model while retaining its decomposability. Furthermore, it presents an efficient algorithm to enumerate all such edges in \(O(V^2)\). This is achieved by a data structure called the clique graph, which keeps track of the maximal cliques \({\mathcal {C}}\) and minimal separators \({\mathcal {S}}\). Adding an edge to the graph and updating the underlying data structures also takes \(O(V^2)\).
Theorem 1
If two decomposable models \({\mathcal {M}}\subset {\mathcal {M}}'\) differ only in one edge (a, b) (i.e., \((a,b) \in {\mathcal {M}}'\) and \((a,b) \not \in {\mathcal {M}}\)), then the maximal cliques and the minimal separators \(({\mathcal {C}},{\mathcal {S}})\) and \(({\mathcal {C}}',{\mathcal {S}}')\) in these two models differ as follows:

If \(C_a \not \subset C_{ab}\) and \(C_b \not \subset C_{ab}\), then \({\mathcal {C}}' = {\mathcal {C}} + C_{ab}\) and \({\mathcal {S}}' = {\mathcal {S}} + C_{ab} \cap C_a + C_{ab} \cap C_b - S_{ab}\)

If \(C_a \subset C_{ab}\) and \(C_b \not \subset C_{ab}\), then \({\mathcal {C}}' = {\mathcal {C}} + C_{ab} - C_a\) and \({\mathcal {S}}' = {\mathcal {S}} + C_{ab} \cap C_b - S_{ab}\)

If \(C_a \not \subset C_{ab}\) and \(C_b \subset C_{ab}\), then \({\mathcal {C}}' = {\mathcal {C}} + C_{ab} - C_b\) and \({\mathcal {S}}' = {\mathcal {S}} + C_{ab} \cap C_a - S_{ab}\)

If \(C_a \subset C_{ab}\) and \(C_b \subset C_{ab}\), then \({\mathcal {C}}' = {\mathcal {C}} + C_{ab} - C_a - C_b\) and \({\mathcal {S}}' = {\mathcal {S}} - S_{ab}\)
[10], where \(C_{ab}\) and \(S_{ab}\) are the maximal clique and minimal separator for the nodes a and b, and \(C_a\) and \(C_b\) are the maximal cliques including each of these nodes (Fig. 1). This immediately leads to the following theorem.
Theorem 2
If two decomposable models \({\mathcal {M}}\subset {\mathcal {M}}'\) differ only in one edge (a, b) (i.e., \((a,b) \in {\mathcal {M}}'\) and \((a,b) \not \in {\mathcal {M}}\)), then
Thus, the change in the determinant of the MLE estimate of the covariance \(\varSigma\) after adding an edge (a, b) depends only on the minimal separator of the two vertices \(S_{ab}\), the newly formed clique \(C_{ab}\), and the newly formed separators \(C_{ab} \cap C_a\) and \(C_{ab} \cap C_b\). This means we only have to compute the determinant terms relevant to the candidate edges that can be added to the current model, leading to faster computation.
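The clique/separator structure of these determinants can be checked numerically. The sketch below, assuming NumPy and our own variable names, builds the MLE covariance of a chain model a–b–c and verifies the determinant factorisation over maximal cliques and minimal separators that underlies the locality argument:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))   # columns: variables a, b, c
S = np.cov(X, rowvar=False)         # sample covariance

def pad_inv(idx, d=3):
    """Inverse of the covariance submatrix on `idx`, zero-padded to d x d."""
    M = np.zeros((d, d))
    M[np.ix_(idx, idx)] = np.linalg.inv(S[np.ix_(idx, idx)])
    return M

# MLE precision matrix of the chain a-b-c: cliques {a,b}, {b,c}; separator {b}.
K_hat = pad_inv([0, 1]) + pad_inv([1, 2]) - pad_inv([1])
Sigma_hat = np.linalg.inv(K_hat)

# Determinant of the fitted covariance equals the product of clique
# determinants divided by the separator determinant.
lhs = np.linalg.det(Sigma_hat)
rhs = (np.linalg.det(S[np.ix_([0, 1], [0, 1])])
       * np.linalg.det(S[np.ix_([1, 2], [1, 2])])
       / S[1, 1])
assert np.isclose(lhs, rhs)
```

Adding the edge (a, c) merges everything into the single clique {a, b, c}, so the change in the log-determinant involves only \(S_{abc}\), \(S_{ab}\), \(S_{bc}\) and \(S_b\), as the theorem states.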
Minimum Message Length
Estimating a decomposable model with maximum likelihood estimation requires many samples to accept correct hypotheses. Moreover, it relies on the existence of the maximum likelihood estimates, which may not exist if the number of samples is less than the size of the largest clique in the graph. These drawbacks can be overcome by using a scoring function based on the minimum message length (MML).
MML is an information-based criterion to find the best hypothesis for the observed data by controlling the false discovery rate and requiring far fewer samples to accept true hypotheses [40]. Let us consider a hypothesis (or model) \({\mathcal {M}}\) that offers an explanation of the observed data \({\mathcal {D}}\). Based on the fundamental rules of probability:
where \(p({\mathcal {M}})\) is the prior over hypotheses/models, \(p({\mathcal {D}}\mid {\mathcal {M}})\) is the likelihood, \(p({\mathcal {D}})\) is the prior probability of the data, and \(p({\mathcal {M}}\mid {\mathcal {D}})\) is the posterior of \({\mathcal {M}}\) given \({\mathcal {D}}\). Based on Shannon's theory of communication, the amount of information needed to explain \({\mathcal {D}}\) with \({\mathcal {M}}\) is:
where \(I(a) = -\log (p(a))\) gives the optimal code length to convey some event a whose probability is p(a). This results in an objective criterion to compare two competing models \({\mathcal {M}}_1\) and \({\mathcal {M}}_2\) given the same data \({\mathcal {D}}\):
A possible realization of this framework is the transmission of data over a communication channel between a sender and a receiver. The sender sends \({\mathcal {D}}\) as an explanation message, so that the receiver can reconstruct the original data losslessly from the message. The sender's message encodes both the model \({\mathcal {M}}\) and the data residual \(p({\mathcal {D}}\mid {\mathcal {M}})\). The receiver then reads in the model from the message, and decodes the original data from the residual.
The goal of this communication game is to minimize the length of the explanation message.
If the sender can find the best model of the data, the receiver will receive the most economical decodable explanation message; this is the basis of statistical inference under the MML principle [33]. Therefore, according to the MML principle, the best hypothesis is the one that leads to encoding the entire data set and the hypothesis in the shortest possible message.
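As a toy illustration of this two-part principle, the sketch below compares two Bernoulli hypotheses for a coin by their message lengths; the hypotheses, priors and data are invented for illustration only:

```python
import math

def message_length_bits(prior, likelihood):
    """Two-part message length I(M) + I(D|M) = -log2 p(M) - log2 p(D|M)."""
    return -math.log2(prior) - math.log2(likelihood)

def bernoulli_likelihood(p, heads, tails):
    """Likelihood of an i.i.d. coin-flip sequence under head-probability p."""
    return p ** heads * (1 - p) ** tails

# Toy data: 9 heads and 1 tail; two competing hypotheses with equal prior 1/2.
fair   = message_length_bits(0.5, bernoulli_likelihood(0.5, 9, 1))
biased = message_length_bits(0.5, bernoulli_likelihood(0.9, 9, 1))

# The biased-coin hypothesis yields the shorter explanation message,
# so MML prefers it for this data.
assert biased < fair
```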
ContChordalysis-MML
Rahman and Haffari [36] proposed an algorithm, named ContChordalysis-MML, to discover GGM structures from high-dimensional data with just a handful of samples. The algorithm builds on the concepts of decomposable models and minimum message length. Starting from the null graph,^{Footnote 2} ContChordalysis-MML incrementally adds the best edge minimizing an MML-based score. The pseudocode of ContChordalysis-MML is presented as Algorithm 1. Experimental results show that ContChordalysis-MML discovers more true edges with a lower false discovery rate and outperforms strong baselines, including methods based on penalized likelihood functions and the graphical Lasso.
Scalable ContChordalysis-MML
ContChordalysis-MML is a forward selection algorithm which adds edges to the candidate graphical structure, checks the candidature of the remaining edges to become candidate edges, and computes the scoring function of all candidate edges. However, the edge candidature checking and score computation make the forward selection strategy slow for a very large number of random variables. According to Theorem 1 of the section "Decomposable Models", the addition of an edge (a, b) affects the minimal separator \(S_{ab}\) between the two nodes a and b, and creates one new clique \(C_{ab}\) and two new separators \(C_{ab}\cap C_a\) and \(C_{ab}\cap C_b\); all other separators and cliques remain unchanged. Hence, it is not required to check the candidature and compute the scoring function of all candidate edges at every step. According to [32], the addition of an edge (a, b) to the candidate model affects the minimal separators between the following node pairs: (a) a and the neighbors of b, (b) b and the neighbors of a, and (c) the neighbors of a and the neighbors of b. Therefore, we only recompute the candidature and scoring function of the above-mentioned node pairs (i.e., edges), which takes O(V) time. This makes the forward selection algorithm more scalable than the existing best algorithm.
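The affected node pairs can be enumerated directly from the adjacency structure. A minimal sketch, where the adjacency-dict representation and function name are our own assumptions:

```python
def affected_pairs(adj, a, b):
    """Node pairs whose candidature and score must be recomputed after
    adding edge (a, b): a with the neighbours of b, b with the neighbours
    of a, and the neighbours of a paired with the neighbours of b."""
    pairs = set()
    for x in adj[b]:                 # (a) a and the neighbours of b
        if x != a:
            pairs.add(frozenset((a, x)))
    for y in adj[a]:                 # (b) b and the neighbours of a
        if y != b:
            pairs.add(frozenset((b, y)))
    for x in adj[a]:                 # (c) neighbours of a with neighbours of b
        for y in adj[b]:
            if x != y:
                pairs.add(frozenset((x, y)))
    return pairs
```

For a path 1–2–3 plus an isolated node 4, adding the edge (1, 3) only requires revisiting the pairs {1, 2} and {2, 3}, rather than all remaining candidate edges.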
We modified ContChordalysis-MML to make it more scalable based on the properties of decomposable GGMs discussed above. The modifications are twofold:
Initial step:

At the beginning, ContChordalysis-MML starts with the empty graph, where no nodes are connected with each other. Therefore, each node is treated as a single clique with no separators. Adding any edge between nodes a and b forms a new clique \(C_{ab}\) from the two cliques \(C_a\) and \(C_b\), together with a separator \(S_{ab}\). All other cliques and separators remain unchanged. Hence, the encoding bit difference for the parameters and data of the reference model \({\mathcal {M}}\) and the candidate model \({\mathcal {M}}'\) is as below:
$$\begin{aligned} I(\theta ,{\mathcal {D}}\mid {\mathcal {M}}) - I(\theta ,{\mathcal {D}}\mid {\mathcal {M}}') = I(C_{ab},{\mathcal {D}}^{C_{ab}}) - I(S_{ab},{\mathcal {D}}^{S_{ab}}). \end{aligned}$$(5)

Moreover, whenever we add an edge between two single-node cliques, Eq. 5 remains unchanged as per the above discussion. Therefore, at the initial step, we compute the encoding bit difference of the parameters \(\theta\) and data \({\mathcal {D}}\) of the reference model \({\mathcal {M}}\) and the candidate model \({\mathcal {M}}'\) using the above equation.
Edge-adding step:

In this step, after adding any edge (a, b), we only recompute the encoding bit difference of the parameters and data of the reference model \({\mathcal {M}}\) and the candidate model \({\mathcal {M}}'\) for: a and the neighbors of b; b and the neighbors of a; and the neighbors of a and the neighbors of b, using the following equation.
$$\begin{aligned} I(\theta ,{\mathcal {D}}\mid {\mathcal {M}}) - I(\theta ,{\mathcal {D}}\mid {\mathcal {M}}')= & {} I(C_{ab},{\mathcal {D}}^{C_{ab}}) + I(S_{ab},{\mathcal {D}}^{S_{ab}}) \nonumber \\& - I(C_{ab} \cap C_b,{\mathcal {D}}^{C_{ab}\cap C_b}) - I(C_{ab}\cap C_a,{\mathcal {D}}^{C_{ab}\cap C_a}). \end{aligned}$$(6)

This computation makes ContChordalysis-MML more scalable and reduces the time complexity to O(V).
Discovery of the Decomposable HGGMs
Let us assume that the data have been generated from a flat mixture of multivariate Gaussian distributions, where each component corresponds to a graphical model. Our aim is to discover the unobserved structure of undirected Gaussian graphical models based on the observed data \({\mathcal {D}}\):
where \(\gamma\) is the vector of mixing coefficients, \(\gamma _k\) is the mixing coefficient of component/cluster k, and \(g_k({\mathcal {D}}_k)\) is the context-specific distribution of component k.
Specifically, we are interested in the undirected graphical structures \({\mathcal {G}}\ =\ \{G_0, G_1,G_2, \ldots ,G_K\}\), where \(G_k\ =\ \{V,E_k\}\ \forall k \in \{1,\ldots,K\}\) is the context-specific graphical structure of component k, \(G_0\) is the shared or global graphical model, V is the set of vertices corresponding to random variables (or dimensions of the input vectors), \(E_k\ \forall k \in \{0,1,\ldots,K\}\) is the set of edges capturing shared and context-specific statistical associations between random variables, and K is the number of components in the mixture model.
Another input to the algorithm is the number of components K believed to exist in the data. The output is the shared and context-specific graphical model structures. The algorithm consists of two steps: (a) the clustering step, similar to the E-step of the hard EM algorithm, to partition the data; and (b) the structure and parameter estimation step, similar to the M-step of the EM algorithm. In the estimation step, we initially employ sContChordalysis-MML, described in section "Scalable ContChordalysis-MML". Our algorithm keeps repeating the clustering and the structure and parameter estimation steps until it converges with respect to the objective function.
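The two-step loop can be sketched as a hard-EM skeleton. This is only an illustrative stand-in: squared Euclidean distance replaces the MML-based cluster score, and the structure-learning part of the estimation step is reduced to refitting cluster means:

```python
import random

def hard_em(data, K, n_iter=20, seed=0):
    """Hard-EM skeleton: alternate (a) a clustering step that hard-assigns
    each point to its closest cluster centre and (b) an estimation step
    that refits per-cluster parameters. In the full algorithm both steps
    would be scored with MML and (b) would include structure discovery."""
    rng = random.Random(seed)
    centres = rng.sample(data, K)
    clusters = [[] for _ in range(K)]
    for _ in range(n_iter):
        # Clustering step (hard E-step).
        clusters = [[] for _ in range(K)]
        for x in data:
            k = min(range(K),
                    key=lambda j: sum((xi - ci) ** 2
                                      for xi, ci in zip(x, centres[j])))
            clusters[k].append(x)
        # Estimation step (M-step): refit each non-empty cluster.
        for k, pts in enumerate(clusters):
            if pts:
                d = len(pts[0])
                centres[k] = tuple(sum(p[i] for p in pts) / len(pts)
                                   for i in range(d))
    return centres, clusters
```

On two well-separated blobs, the loop converges to the expected partition after a few iterations.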
Our algorithm optimizes a minimum message length (MML) objective function for estimating the structure of the shared and context-specific graphical models and the clusters of the data. We call our iterative algorithm Partition-and-Graphical model discovery Iterative Algorithm based on MML, or PaGIAM, as summarised in Algorithm 3. We also present a flowchart of the PaGIAM algorithm.
The Objective Function
The MML-based objective function of our iterative algorithm PaGIAM encodes the hypothesis in message form, including the encoding of all clusters \({\mathcal {K}}\) and the associated cluster parameters \(\theta _{{\mathcal {K}}}\), and the encoding of the shared and context-specific graphical models \({\mathcal {G}}\), their parameters \(\theta _{{\mathcal {G}}}\), and the data \({\mathcal {D}}\), as below.
where \(I({\mathcal {G}},\theta _{{\mathcal {G}}},{\mathcal {D}}\mid {\mathcal {K}})\) is the MML to encode the shared and context-specific graphical models \({\mathcal {G}}\), their parameters \(\theta _{{\mathcal {G}}}\) and the data \({\mathcal {D}}\), and \(I({\mathcal {K}},\theta _{{\mathcal {K}}})\) is the MML to encode all clusters \({\mathcal {K}}\) and the associated cluster parameters \(\theta _{{\mathcal {K}}}\).
In the PaGIAM algorithm, we initially use sContChordalysis-MML to discover the graphical models of the clusters \({\mathcal {K}}\). According to Rahman and Haffari [36], to encode the graphical models \({\mathcal {G}}\), their parameters \(\theta _{{\mathcal {G}}}\) and the data \({\mathcal {D}}\), i.e., \(I({\mathcal {G}},\theta _{{\mathcal {G}}},{\mathcal {D}}\mid {\mathcal {K}})\), we require
where mE is the number of edges of a complete graph, i.e., \(mE = \frac{V(V-1)}{2}\). \({\mathcal {C}}_k\) and \({\mathcal {S}}_k\) are the maximal clique set and minimal separator set of the graph \(G_k\) of cluster k. It remains to find the MML of encoding all clusters \({\mathcal {K}}\) and the associated cluster parameters \(\theta _{{\mathcal {K}}}\). In the next two subsections, we discuss the encoding of the clusters and their parameters.
Encoding the Clusters
We now describe the encoding of the clusters, which includes their contents and the number of clusters in the mixture. First, we encode the number of clusters, for which we need \(\log {(K)}\) bits. We then encode the mixing coefficients of all clusters and the contents of the clusters by encoding the cluster indicator vector \(\mathbf {z}\). Each element \(\mathbf {z}_i\) of the cluster indicator vector is a numerical value between 1 and K indicating the cluster membership of datapoint \(X_i\) (where \(X_i \in \mathbb {R}^d\)). As each element of \(\mathbf {z}\) takes a value between 1 and K, we assume the contents of \(\mathbf {z}\) are multinomially distributed. According to [41], to encode the multinomially distributed contents and the mixing coefficients, we require \(\frac{K-1}{2} \log {\big (\frac{n}{12}+1\big )} - \log {K!} - \sum _{k=1}^{K}{\Big ((n_k+\frac{1}{2})\log {\frac{n_k+\frac{1}{2}}{n+\frac{K}{2}}}\Big )}\) bits.^{Footnote 3} Therefore, the minimum message length to encode all of the clusters is:
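This bit cost can be computed directly from the cluster sizes; a hedged sketch in natural-log units (the function name and unit choice are our assumptions):

```python
import math

def cluster_encoding_length(counts):
    """Message length (natural-log units) to encode the number of clusters,
    the mixing coefficients and the cluster indicator vector z, following
    the multinomial encoding discussed above.  `counts` holds the cluster
    sizes n_1, ..., n_K."""
    K = len(counts)
    n = sum(counts)
    length = math.log(K)                             # number of clusters
    length += (K - 1) / 2 * math.log(n / 12 + 1)     # mixing coefficients
    length -= math.lgamma(K + 1)                     # - log K!
    for n_k in counts:                               # indicator vector z
        length -= (n_k + 0.5) * math.log((n_k + 0.5) / (n + K / 2))
    return length
```

For example, the cost for two clusters of sizes 40 and 60 is a positive quantity close to the n log K upper bound for the indicator vector.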
Encoding of the Parameters
Once the clusters have been encoded, we encode the parameters of the clusters. According to [40], the MML encoding of the parameters of the multivariate Gaussian distribution corresponding to a cluster k, denoted by \(I(\theta _k)\), is:
where \(\theta _k = (\mu _k, \varSigma _k)\). In what follows, we compute various components of \(I(\theta _k)\) of Eq. (11), i.e. the prior probability, and the Fisher information matrix.
Prior probability of the parameters: Following the previous work [11], we use a flat prior for \(\mu _k\) [30] and a conjugate inverted Wishart prior for \(\varSigma _k\) [17]. Hence, the prior joint density over the parameters is:
Fisher information of the parameters: We need to evaluate the second-order partial derivatives of \({\mathcal {L}}({\mathcal {D}}_k\mid \mu _k,\varSigma _k)\) to compute the Fisher information for the parameters [40], where
where \(\mu _k = \frac{1}{n_k}\sum _{j=1}^{n_k}{x_{kj}}\) and \(\varSigma _k=\frac{1}{n_k-1}\sum _{j=1}^{n_k}{(x_{kj}-\mu _k)(x_{kj}-\mu _k)^\mathrm{{T}}}\).
Let \({\mathcal {F}}(\mu _k,\varSigma _k)\) denote the determinant of the Fisher information matrix, which is the product of \({\mathcal {F}}(\mu _k)\) and \({\mathcal {F}}(\varSigma _k)\), i.e., the determinants of the Fisher information matrices of \(\mu _k\) and \(\varSigma _k\), respectively [30]. Taking the second-order partial derivatives of \({\mathcal {L}}({\mathcal {D}}_k\mid \mu _k,\varSigma _k)\) with respect to \(\mu _k\), we get \(\nabla _{\mu _k}^2{\mathcal {L}} = -n_k\varSigma _k^{-1}\). So the determinant of the Fisher information matrix for \(\mu _k\) is \({\mathcal {F}}(\mu _k) = n_k^d|\varSigma _k^{-1}|\). To compute \({\mathcal {F}}(\varSigma _k)\), [12] derived an analytical expression using the theory of matrix derivatives based on matrix vectorization:
Hence, the determinant of the Fisher information matrix for \(\mu _k\) and \(\varSigma _k\) is
Therefore,
MML to Encode Clusters and Their Parameters
Therefore, the MML to encode the clusters and their parameters is as follows
MML Score to Discover the Decomposable HGGM
Substituting Eqs. (9) and (16) into (8), the MML score to find the best hierarchical graphical model from heterogeneous multivariate Gaussian data is as below.
However, this MML score encodes only the clusters and the context-specific graphical models of the given high-dimensional heterogeneous Gaussian data, as we use sContChordalysis-MML to discover the graphical models. sContChordalysis-MML cannot detect the global or shared graphical structures from the data. Therefore, we design a new graphical model discovery algorithm to discover the shared graphical model along with the context-specific graphical models, which we discuss in the next section "MML for Discovering the Shared and Context-Specific GGMs".
MML for Discovering the Shared and Context-Specific GGMs
The method described in section "Discovery of the Decomposable HGGMs" discovers the structure of context-specific graphical models without discovering the shared edges among them. However, in heterogeneous data, context-specific graphical structures share a significant number of edges. Discovering context-specific graphical models alone is not able to detect all the edges shared among them, so many important features remain undiscovered. Therefore, it is important to discover the shared edges along with the context-specific graphical models. Modeling the shared structure can help GGM structure discovery by pooling the statistics together. Therefore, we extend our approach to discover two-level hierarchical Gaussian graphical models^{Footnote 4} (HGGMs). We learn the model based on an MML-based score. Our approach works with chordal graphs, leading to "decomposable" probabilistic graphical models (discussed in section "Background"), which make the MML-based scoring function computationally efficient.
According to [40], minimum message length finds the best model for the observed data by comparing two competing models given the same data \({\mathcal {D}}\). To find the best structures, we encode the graph structure of the shared edges (we call it the super graph \(G_0\)) along with the context-specific GGM structures \(\{G_1,G_2,\ldots ,G_K\}\), their parameters and the data in messages, and compare their message lengths. When using sContChordalysis-MML in PaGIAM, we encode the graph topology \(G_k\), the parameters \(\theta _k\) and the data \({\mathcal {D}}_k\) of each cluster k and then combine them; the super graph topology is not encoded. To improve structure discovery from heterogeneous data, we encode the super graph topology \(G_0\), the topologies of the context-specific GGMs \(\{G_1,G_2,\ldots ,G_K\}\), their parameters \(\{\theta _1,\theta _2,\ldots ,\theta _K\}\) and the data \(\{{\mathcal {D}}_1,{\mathcal {D}}_2,\ldots ,{\mathcal {D}}_K\}\). To minimize the number of bits required, we only encode the edges of the context-specific GGMs which are not present in the super graph, i.e., \(G^*_k = G_k - G_0;\ \forall k \in \{1,2,\ldots,K\}\).
The MML to encode the twolevel HGGM and the data are as follows:
where \(G_0\) is the shared graphical structure, which can be called the super graph structure, \(G_k\) is the context-specific graph structure of the kth component, and \(G_k^*\) is the context-specific graphical structure of component k without the shared edges. Moreover, \(\theta _{G_k}\) is the set of parameters of the context-specific model of component k: \(\theta _{G_k} = \{\mu _{G_k},\varSigma _{G_k}\}\), where \(\mu _{G_k}\) and \(\varSigma _{G_k}\) are the mean vector and covariance matrix of the graphical structure of component k. According to [40], Eq. (18) reduces to
where extra \(\log {K}\) bits are added for each datapoint to select its component id. For our two-level GGM structure discovery setting, the encoding of the model in the message consists of the encoding of the topologies of the super and context-specific chordal graphs and the associated model parameters, which we elaborate in the rest of this section.
Encoding the Graph Structures
We now describe the encoding of the super and context-specific graphical structures. For this purpose, it is sufficient to send the number of nodes and the connected pairs of edges of each graphical structure. According to [2], to encode the number of nodes, we need \(\log {n}\) bits. Let us consider a super graph having \(E_0\) edges. To encode the edges of the super graph, we need \(\log {\left( {\begin{array}{c}mE\\ E_0\end{array}}\right) }\) bits, where \(mE = \frac{V(V-1)}{2}\). We encode only the component-specific edges of each context-specific graphical structure \(G_k\) to prevent multiple appearances of the same edges in different graph structures. The bits required to encode any context-specific graph structure \(G^*_k\) are \(\log {\left( {\begin{array}{c}mE\\ E_k-E_0\end{array}}\right) }\). Hence, to encode all graphical models including the shared one, we require
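These terms can be computed directly with binomial coefficients; a sketch, where log base 2 and the argument names are our own assumptions:

```python
import math

def structure_encoding_bits(V, E0, context_edge_counts):
    """Bits to encode the super graph and the component-specific edges:
    a term for the number of nodes, log C(mE, E_0) for the super graph,
    and log C(mE, E_k - E_0) for each context-specific graph G_k, where
    E_k >= E_0 is the edge count of G_k."""
    mE = V * (V - 1) // 2                     # edges of a complete graph
    bits = math.log2(V)                       # number of nodes
    bits += math.log2(math.comb(mE, E0))      # shared (super-graph) edges
    for Ek in context_edge_counts:            # component-specific edges
        bits += math.log2(math.comb(mE, Ek - E0))
    return bits
```

For instance, with V = 5 nodes, 2 shared edges and two components having 4 and 3 edges respectively, the structure cost is a small positive number of bits.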
Encoding the Parameters and Data
Once the topologies of the graphs (shared and context-specific) have been encoded, we encode the parameters of all context-specific graphical model structures of the mixture of GGMs, as well as the data, independently, and then merge them. To encode the parameters and data of each context-specific graphical structure, we encode the parameters and data of all maximal cliques and minimal separators separately and then combine them, as ContChordalysis-MML does [36]. Therefore, according to [36], to encode the parameters of a maximal clique (or minimal separator) C of a graphical model, we require
Furthermore, we require the following bits to encode the data of a maximal clique (or minimal separator) C of a context-specific graphical model k:
MMLBased Model Selection
In forward selection, a reference model \({\mathcal {M}}\) and a candidate model \({\mathcal {M}}'\) differ by an edge (a, b). According to MML, \({\mathcal {M}}'\) replaces \({\mathcal {M}}\) if encoding the message based on \({\mathcal {M}}'\) requires fewer bits than that based on \({\mathcal {M}}\), i.e., \(I({\mathcal {M}}'\mid {\mathcal {D}},G')-I({\mathcal {M}}\mid {\mathcal {D}},G)\ <\ 0\), where G and \(G'\) are the graphical structures of the reference and candidate models, respectively. Here, we present the MML-based scoring functions to compare the reference and candidate models of the super and context-specific graphical models. We compute two types of MML scoring functions:

1.
MML score when an edge is added to the super graph and to all context-specific graphs.

2.
MML score when an edge is added to one of the context-specific graphs.
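The decision rule shared by both scores can be sketched as follows. Here `best_edge` and the dictionary of precomputed score differences are illustrative names, not the paper's API; the sketch only shows that the edge with the most negative MML difference is accepted, and that nothing is accepted when no edge shortens the message:

```python
def best_edge(score_diffs):
    """Pick the candidate edge with the most negative MML difference
    I(M', D, G') - I(M, D, G); return None when no candidate edge
    yields a strictly shorter two-part message.

    score_diffs: dict mapping an edge (a, b) to its MML score difference
    (a hypothetical interface, for illustration only).
    """
    edge, diff = min(score_diffs.items(), key=lambda kv: kv[1])
    return edge if diff < 0 else None

print(best_edge({(0, 1): -3.2, (1, 2): 0.7, (0, 2): -5.1}))  # → (0, 2)
```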
MML Score When an Edge (a, b) Is to Be Added to the Super Graph
When a candidate edge is added to the super graph structure, it affects the graph structures of both the super and context-specific graphs and their parameters. Therefore, the MML difference between the candidate and reference graphical structures is as follows:
The addition of an edge to the super graph affects the covariance matrices of the affected and newly formed cliques and separators of all context-specific graphical models. Therefore, to encode the covariance matrices of the affected and newly formed separators and cliques of all context-specific graphical models, we need the following bits:
Therefore, the MML score difference between reference and candidate models is
MML Score Difference When an Edge (a, b) Is to Be Added to a Context-Specific Graph
The addition of the candidate edge (a, b) to \(G_k\) of cluster k affects only the corresponding graph structure and its parameters; all the rest remain unchanged. Therefore, the MML difference between the encoded candidate and reference graphical structures is as follows:
As the edge (a, b) is added to \(G_k\) and there is no change in the reference and candidate models of the super graph or of the other context-specific graphical structures, we encode the data and parameters of \(G_k\) as follows:
Therefore, the MML score difference between reference and candidate models is
The Forward Selection Algorithm
To discover the shared and context-specific GGMs, we modify sContChordalysis-MML. Initially, the algorithm computes the MML needed to encode the parameters and data of each edge. Then, it incrementally adds the best edges either to both the super and context-specific graphical structures or to one of the context-specific graphical model structures, based on the MML, while maintaining the chordality of the graph structures. After adding the best edge, we update the candidature and the MML needed to encode the parameters and data of the candidate edges according to the procedures described in section "Scalable ContChordalysis-MML". We call our MML-based two-level decomposable HGGM discovery algorithm tGDM (two-level decomposable Gaussian graphical model Discovery using MML). In the Gaussian graphical models step (lines 9–11) of the PaGIAM algorithm (Algorithm 3), we use the tGDM algorithm to discover the context-specific graphical models instead of sContChordalysis-MML. We call the updated PaGIAM algorithm the PaGIAM–tGDM algorithm.
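A minimal sketch of the greedy loop just described, assuming hypothetical score-difference callables `delta_super` and `delta_context` (the two MML differences of the previous subsections). The real tGDM additionally maintains chordality and incrementally updates candidate scores; this sketch also simplifies by retiring an edge once it is added anywhere:

```python
def tgdm_forward_selection(candidates, delta_super, delta_context, K):
    """Greedy two-level edge selection (illustrative sketch only).

    candidates: iterable of edges (a, b)
    delta_super(e): MML difference for adding e to the super graph
        (and hence to every context-specific graph)
    delta_context(e, k): MML difference for adding e to context k only
    """
    super_edges = set()
    context_edges = {k: set() for k in range(K)}
    candidates = list(candidates)
    improved = True
    while improved and candidates:
        # Score every remaining candidate in every role, keep the best move.
        moves = [(delta_super(e), e, None) for e in candidates]
        moves += [(delta_context(e, k), e, k)
                  for e in candidates for k in range(K)]
        diff, edge, ctx = min(moves, key=lambda m: m[0])
        if diff < 0:                      # MML strictly decreases: accept
            if ctx is None:               # shared edge: add to all graphs
                super_edges.add(edge)
                for k in range(K):
                    context_edges[k].add(edge)
            else:                         # context-specific edge
                context_edges[ctx].add(edge)
            candidates.remove(edge)
        else:                             # no edge shortens the message
            improved = False
    return super_edges, context_edges
```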
MML Score for PaGIAM–tGDM to Discover the Two-Level HGGM
Equation (17) is used in the PaGIAM algorithm, which uses sContChordalysis-MML to discover the HGGM. However, with sContChordalysis-MML, PaGIAM cannot discover the shared graphical model, whereas tGDM discovers both the shared and context-specific graphical models. Therefore, the final MML score for PaGIAM–tGDM is modified as follows:
In the PaGIAM algorithm, Eq. (17) is replaced by Eq. (29) to discover the shared and context-specific graphical models from high-dimensional heterogeneous Gaussian data. Equation (29) encodes the clusters, the two-level HGGMs (i.e., the shared and context-specific GGMs), their parameters, and the data. Therefore, the MML score of Eq. (29) makes our PaGIAM–tGDM capable of discovering both shared and context-specific GGMs.
Experiments
We compare the performance of our method, PaGIAM–tGDM, with strong baselines on synthetic data and real cancer data.
Synthetic Data
Parameters for Synthetic Data
We generate synthetic multidimensional datasets based on mixtures of Gaussians. We cover a wide range of datasets with different properties by varying the following aspects:

V: the number of variables, ranges in \(\{10, 100, 1000, 5000\ \text {and}\ 10{,}000\}\).

n: the number of samples, ranges in \(\{ 100, 1000, 10{,}000, 50{,}000\}\).

K: the number of clusters, ranges in \(\{1, 2, 3, 4, 5\}\).

\({\mathcal {C}}\): the maximal clique size for graphical structures in the mixture model. According to [4], in real-world networks every new node is born with some edge connections to existing nodes, which produces a connected graph. Therefore, the minimal clique size is at least 2. We let \({\mathcal {C}}\) vary from 2 to 6.

\(\alpha\): controls the spread of the sampled mixing coefficients in the mixture models. There are two possibilities for the mixing coefficients: (a) all clusters have equal frequencies, or (b) some clusters have unequal frequencies. Moreover, the number of samples in each cluster is discrete and multinomially distributed, and the Dirichlet distribution is the conjugate prior of the multinomial distribution. Therefore, we assume that the mixing coefficients are Dirichlet distributed, with concentration parameter \(\alpha\) ranging in \(\{\)100, 10, 1, 0.1\(\}\). When \(\alpha = 100\), all coefficients are approximately equal, whereas the frequencies (i.e., mixing coefficients) tend to differ when \(\alpha = 0.1\).

\(\delta\): controls the statistical associations between the random variables. The statistical association between two nodes ranges between \(-1\) and 1 to express the degree of association. As it approaches zero, the two variables are not statistically associated; the closer it is to either \(-1\) or 1, the stronger the association between the variables. In the experiments, we consider a parameter \(\delta\) which inversely controls the statistical association between the random variables and ranges in \(\{1,\ 5,\ 10,\ 25,\ 50,\ 100,\ 250,\ 500\}\). We refer to this parameter as the inverse correlation parameter.
To assess the performance of our methods against the baselines, we vary each of the parameters mentioned above in turn, having set the base configuration to V = 1000, n = 10,000, K = 3, \({\mathcal {C}}\) = 3, \(\alpha =100\) and \(\delta\) = 50. Moreover, we also assess performance by varying the number of samples as mentioned, while setting the base configuration to V = 10,000, K = 3, \({\mathcal {C}}\) = 3, \(\alpha =100\) and \(\delta\) = 50.
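The Dirichlet draw of the mixing coefficients described above can be sketched with the standard library via normalised Gamma draws (the standard construction of a Dirichlet sample). `sample_mixing_coefficients` is an illustrative helper, not the experiment code:

```python
import random

def sample_mixing_coefficients(K, alpha, seed=0):
    """Draw mixing coefficients gamma ~ Dir(alpha) by normalising K
    independent Gamma(alpha, 1) draws. Large alpha gives near-equal
    cluster proportions; small alpha gives skewed ones, matching the
    role of alpha in the experiments."""
    rng = random.Random(seed)
    g = [rng.gammavariate(alpha, 1.0) for _ in range(K)]
    total = sum(g)
    return [x / total for x in g]

balanced = sample_mixing_coefficients(3, 100.0)  # alpha = 100: near-uniform
skewed = sample_mixing_coefficients(3, 0.1)      # alpha = 0.1: skewed proportions
```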
Graph Structure Generation
For each experimental setup, we first generate the graph structures and then the dataset. To generate the graph structures, we maintain the real-world network properties [7]: (a) many small nodes are connected to a few hubs, known as the power-law property; (b) the average path between two nodes is short; and (c) new nodes prefer to attach to well-connected nodes over less-well-connected nodes, known as the preferential attachment property.
Barabási and Albert [4] proposed a model to generate scale-free graphs having the above-mentioned properties. We use the Barabási–Albert (BA) method to generate graph structures with the properties of real-world networks. This model lets us control the number of nodes V and the maximal clique size \({\mathcal {C}}\), which controls the edge density of the graph.
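A minimal preferential-attachment sketch in the spirit of the BA model, using only the standard library; `barabasi_albert_edges` is a hypothetical helper, and the experiments use the full Barabási–Albert construction rather than this simplification:

```python
import random

def barabasi_albert_edges(V, m, seed=0):
    """Generate edges of a preferential-attachment graph: each new node
    attaches to m existing nodes chosen proportionally to their current
    degree (illustrative sketch of the BA mechanism)."""
    rng = random.Random(seed)
    edges = [(i, m) for i in range(m)]         # seed star over the first m+1 nodes
    targets = [v for e in edges for v in e]    # node repeated once per degree
    for new in range(m + 1, V):
        chosen = set()
        while len(chosen) < m:                 # m distinct degree-weighted picks
            chosen.add(rng.choice(targets))
        for t in chosen:
            edges.append((t, new))
            targets += [t, new]
    return edges

g = barabasi_albert_edges(V=20, m=2)           # m(V - m) = 36 edges
```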
To generate the graph structures for the synthetic data, we follow these steps:

First, we generate K context-specific graphs using the Barabási–Albert (BA) method.

We then identify the super graph structure \(G_0 =\{V,E_0\}\) by \(E_0 = \bigcap _{k=1}^{K}{E_k}\).
Moreover, as we use decomposable models to discover graphical structures, we add the condition that both the generated super and context-specific graphs are chordal. If the identified super graph \(G_0\) is not chordal, we add edges to make it chordal, and we then add these new super-graph edges to all context-specific graphs. We use the candidate edge selection process of the ContChordalysis algorithm [36] to maintain the chordality of the generated graphs.
Synthetic Gaussian Data Generation
Having the graph structures, we generate the context-specific precision matrix of \(G_k\) using the following equation:
where \(\mathrm{{adj}}_{G_k}\) is the adjacency matrix of the context-specific graph \(G_k\). Finally, we generate the data using \({\mathcal {D}} = \bigcup _{k=1}^{K}{\big \{{\mathcal {D}}_k \sim {\mathcal {N}}_d(0,\varSigma _{G_k})\big \}}\), where the number of samples of cluster k is \(n_k = n \cdot \gamma _k\) with \(\gamma _k \sim {\mathcal {D}}ir(\alpha )\).
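The generation step can be sketched as follows. The paper's precision-matrix equation is displayed above, so the diagonally dominant construction below is only a common stand-in for it, with an assumed edge-weight parameter `strength`; `sample_context_data` is an illustrative helper, not the experiment code:

```python
import numpy as np

def sample_context_data(adj, n_k, strength=0.3, seed=0):
    """Build a positive-definite precision matrix whose non-zero pattern
    matches the adjacency matrix of G_k, invert it to get Sigma_{G_k},
    and draw n_k zero-mean Gaussian samples (sketch only)."""
    rng = np.random.default_rng(seed)
    V = adj.shape[0]
    # Diagonal dominance guarantees positive definiteness (assumed stand-in
    # for the paper's precision-matrix equation).
    precision = strength * adj + np.eye(V) * (strength * adj.sum(axis=1).max() + 1.0)
    sigma = np.linalg.inv(precision)           # Sigma_{G_k}
    return rng.multivariate_normal(np.zeros(V), sigma, size=n_k)

# A 3-node chain graph 0 - 1 - 2 for cluster k, with n_k = 500 samples
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
X = sample_context_data(adj, n_k=500)
```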
Real Data
We use two gene expression datasets to evaluate our methods: Breast cancer and Glioblastoma tumor data. We have verified that these real data are Gaussian distributed.
Breast Cancer
Breast cancer is a hormone-related cancer [23] and it has two major subtypes:

Estrogen receptor positive (ER+). It is estimated that around 80% of all breast cancers are ER+. The survival rate for this cancer is better than for ER−, and it responds to hormone therapy.

Estrogen receptor negative (ER−). Its survival rate is poorer, and due to the absence of the estrogen receptor it does not respond to hormone therapy.
The presence of the estrogen receptor in breast cancer plays an important role in therapeutic strategies and survival rates. We use a breast cancer dataset containing the expression of 4512 genes from an Affymetrix HU95aV2 microarray for 148 samples, chemically synthesized by [34]. These breast cancer data are a mixture of estrogen receptivity (ER+/ER−) subtypes, where each tumor sample in the dataset carries an additional classification tag based on its estrogen receptivity (ER+/ER−). Moreover, we consider [34]'s chemically discovered gene-pairs as the gold standard.
Glioblastoma Tumor
Verhaak et al. [39] studied gene expression data of glioblastoma tumor samples, with 173 samples and 8271 genes. Verhaak et al. [39] chemically synthesized a dataset containing tumor samples of four disease subtypes, but did not identify whether a gene-pair is present in a subtype or not. However, [6, 25, 27, 29] identified 10 important gene-pairs that cause the appearance of glioblastoma tumor cells. In Table 1, we report them together with their presence in each disease subtype.
In the glioblastoma tumor experiment, we investigate the performance of our methods and the baselines in predicting the above-mentioned 10 prominent gene-pairs from these large data.
Evaluation Metrics
We evaluate results using context-specific \(\text {recall}\), \(\text {precision}\) and \(\text {F-Measure}\). \(\text {Recall}\) is the fraction of correctly predicted edges with respect to the true edges. \(\text {Precision}\) is the fraction of correctly predicted edges (i.e., associations) with respect to all predicted edges. \(\text {F-Measure}\) is the harmonic mean of precision and recall, i.e., \(\text {F-Measure} = \frac{2 \times \text {Precision} \times \text {Recall}}{\text {Precision} + \text {Recall}}\). The average \(\text {F-Measure}\) is taken as the accuracy of a method.
The tGDM algorithm generates \(K+1\) graphical structures. In our synthetic data experiments, the gold-standard graph corresponding to a predicted graph is unknown, since each discovered graph can potentially be matched with each of the ground-truth clusters. Therefore, we compute the False Positive Rate (FPR),^{Footnote 5} False Negative Rate (FNR)^{Footnote 6} and error^{Footnote 7} for the best-matched predicted graph of each gold-standard graph. The predicted network G having minimal error with respect to a gold standard \(G_\mathrm{{gold}}\) is the best-matched discovered graph of that gold standard \(G_\mathrm{{gold}}\).
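The metrics of this section and of footnotes 5–7 can be computed directly from edge sets; `edge_metrics` below is an illustrative helper, where `n_possible` is the number of possible edges \(V(V-1)/2\):

```python
def edge_metrics(predicted, gold, n_possible):
    """Precision, recall, F-Measure (main text) and FPR, FNR, error
    (footnotes 5-7) from predicted and gold-standard edge sets.
    TN counts edges absent from both the prediction and the gold standard
    (predicted conditional independencies present in the gold standard)."""
    tp = len(predicted & gold)                 # predicted edges in gold standard
    fp = len(predicted - gold)                 # predicted edges not in gold standard
    fn = len(gold - predicted)                 # gold-standard edges missed
    tn = n_possible - tp - fp - fn             # correctly absent edges
    pr = tp / (tp + fp) if tp + fp else 0.0
    re = tp / (tp + fn) if tp + fn else 0.0
    fm = 2 * pr * re / (pr + re) if pr + re else 0.0
    fpr = fp / (tp + fp) if tp + fp else 0.0   # footnote 5
    fnr = fn / (tn + fn) if tn + fn else 0.0   # footnote 6
    return {"Pr": pr, "Re": re, "FM": fm,
            "FPR": fpr, "FNR": fnr, "error": fpr + fnr}  # footnote 7

# V = 4 nodes, so n_possible = 6; one correct edge, one spurious, one missed
m = edge_metrics({(1, 2), (2, 3)}, {(1, 2), (3, 4)}, n_possible=6)
```

The best-matched discovered graph for a gold standard is then the one minimising `m["error"]` over the \(K+1\) predicted graphs.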
PaGIAM–tGDM Variant Baselines for Synthetic Data Experiments
We also compare the performance of our MML-based scoring approach with two other scoring functions, AIC (Akaike Information Criterion) [1] and BIC (Bayesian Information Criterion) [38], as variants of both PaGIAM and tGDM.
The AIC and BIC variants of the PaGIAM algorithm are referred to as PaGIAA and PaGIAB, respectively, and their AIC and BIC scoring functions are as follows:
Similarly, the AIC and BIC variants of the tGDM algorithm are referred to as tGDA and tGDB, respectively, and their AIC and BIC scoring functions are as follows:
In the synthetic data experiments, we consider PaGIAM–tGDA, PaGIAM–tGDB, PaGIAA–tGDM, and PaGIAB–tGDM as the baselines of our MML-based method. We also compare with the PaGIAM–sContChordalysis-MML approach discussed in section "Discovery of the Decomposable HGGMs".
Recent Strong Baselines
In both the synthetic and real data experiments, we evaluate our method, PaGIAM–tGDM, against recent strong baselines: NewSP (New Structural Pursuit) [15] and JSEM (Joint Structural Estimation Method) [23]. In the real-data experiments, PaGIAM–tGDM and PaGIAM–sContChordalysis-MML^{Footnote 8} are used along with NewSP and JSEM. NewSP and JSEM estimate context-specific GGMs with shared edges in the framework of a Gaussian mixture model.
NewSP uses the hard EM algorithm [9] to cluster the data and the Joint Fused Graphical Lasso method proposed by [8] to estimate the context-specific GGMs. Besides, JSEM uses the Graphical Lasso [13] and the Group Lasso [5] to infer the context-specific GGM structures. Both methods use a penalized likelihood as the objective function to discover the context-specific GGMs.
Results and Discussion
We have implemented PaGIAM–tGDM and its flat-mixture, AIC and BIC variants in Python 3.7. NewSP and JSEM are developed as R packages and available on CRAN. All experiments are run on a desktop with an Intel Core i5 3.2 GHz CPU and 24 GB of RAM.
Moreover, in the experiments, we start our PaGIAM (Algorithm 3) assuming the number of components in the mixture model is one, and then keep increasing the number of components until the MML (Eq. 17) identifies the best-partitioned data and context-specific graphical models.
Synthetic Data
We compare PaGIAM–tGDM with its variants at discovering HGGMs on the synthetic data under different experimental setups.
Varying the number of variables V: In this experimental setup, we vary the number of variables (i.e., the number of graph nodes) over 10, 100, 1000, 5000 and 10,000. Table 2 reports the recall Re, precision Pr and F-Measure FM of the outputs of PaGIAM–tGDM and the other baselines. PaGIAM–tGDM outperforms all of the competitive baselines. From Table 2, the FM of all methods decreases with the increase in the number of variables V, as both Re and Pr decrease as the number of variables grows.
However, the accurate edge detection of PaGIAM–tGDM remains higher than that of its variants in this synthetic data experiment. More specifically, the Re, Pr, and FM of PaGIAM–tGDM and PaGIAM–sContChordalysis-MML are better than those of the other variants. Therefore, we can say that MML outperforms AIC and BIC as an objective function, as MML helps the PaGIAM algorithm (Algorithm 3) to partition the data more accurately. Friedman [14] reported that BIC and AIC do not work well for partitioning data and produce much wrongly clustered data. tGDM uses the complete data to discover the super graph, so the super graphs detected by PaGIAB–tGDM and PaGIAA–tGDM are similar to the super graph of PaGIAM–tGDM. However, due to the presence of wrongly clustered data in each cluster, the tGDM used inside PaGIAB and PaGIAA is not able to detect many true context-specific edges, which affects their \(\text {F-Measure}\)s.
Giraud [16] points out a limitation of BIC and AIC: they do not perform well on large high-dimensional data. PaGIAM–tGDB and PaGIAM–tGDA use BIC and AIC as the scoring functions to add edges to the candidate graphical models. Hence, tGDB and tGDA detect fewer edges than tGDM and sContChordalysis-MML.
sContChordalysis-MML is only able to detect the context-specific graphical models; many common or shared edges between the context-specific graphical models are not detected by it. This affects the performance of PaGIAM–sContChordalysis-MML, which is outperformed by PaGIAM–tGDM.
NewSP and JSEM use the Graphical Lasso (GLasso) and a penalized likelihood as their objective function to find the optimal context-specific graphical model structures only. In GLasso, the regularization parameter is not properly estimated from the data, which affects the penalized likelihood and the estimation of the context-specific graphical models along with their super graph. Moreover, both methods fail to detect the shared edges, and many true edges are not discovered by these methods, which affects the Re, Pr and FM of their outputs. Hence, NewSP and JSEM do not perform as well, statistically, as PaGIAM–tGDM.
Varying the maximal clique size \({\mathcal {C}}\): In this experiment, we vary the maximal clique size from 2 to 6. Table 3 reports the Re, Pr and FM of the outputs of our method and the other baselines. PaGIAM–tGDM outperforms all of the competitive baselines. When the maximal clique size is two, the degree of all vertices is one; all methods except PaGIAM–tGDB and PaGIAM–tGDA detect most of the true edges, and their FM is high. As the maximum clique size in the graph grows, the FM of all methods decreases: in this experiment, the maximal clique size inversely affects the FM. However, among all methods, our PaGIAM–tGDM detects more true edges than the others whatever the maximal clique size, and therefore the FM of PaGIAM–tGDM is higher.
Varying the number of data-points n: We carried out two experiments by varying the sample size, with the number of variables set to 1 K (i.e., 1000) and 10 K (i.e., 10,000). The Re, Pr, and FM results of both experiments are presented in Tables 4 and 5. From both tables, we find that PaGIAM–tGDM outperforms all other methods. As the number of samples increases, PaGIAM–tGDM accurately detects more true edges and its FM increases. Similar trends are also found for the other methods, but not as markedly as for PaGIAM–tGDM. Hence, PaGIAM–tGDM can work efficiently on multivariate Gaussian data of any size.
Varying the number of components K: Table 6 reports the performance of the methods with respect to different numbers of components (i.e., clusters) K in the mixture. As the number of clusters increases, the Re, Pr, and FM of all methods decrease: the amount of wrongly clustered data also increases, which affects the results of all methods. Again, PaGIAM–tGDM outperforms all of the competitive baselines.
Varying the spread of the mixing coefficients \(\alpha\): Table 7 presents the performance of the methods as the spread of the mixing coefficients \(\alpha\) varies. As \(\alpha\) increases, the randomness of the cluster proportions decreases and they tend to uniform, which affects the results of all methods. However, our PaGIAM–tGDM outperforms all methods, which indicates that PaGIAM–tGDM can work on any kind of heterogeneous data with different cluster proportions.
Varying the inverse correlation \(\delta\): Correlation expresses the statistical association between random variables, which strongly influences the covariance matrices. According to Table 8, an increase in the value of the inverse correlation parameter \(\delta\) inversely impacts the covariance matrices and causes the F-Measure to decrease. Our PaGIAM–tGDM can detect more than 55% of the true edges even when only a very small correlation exists between variables, whereas the other methods cannot detect even 50% of the true edges.
By clustering the data and discovering the context-specific graphical models with shared edges accurately, PaGIAM–tGDM outperforms all methods across the different experimental setups. Therefore, PaGIAM–tGDM is a statistically efficient method to predict both shared and context-specific dependencies from heterogeneous data.
Real-World Data
In the real-data experiments, we compare our PaGIAM–tGDM and its flat-mixture variant, PaGIAM–sContChordalysis-MML, with the strong baselines NewSP and JSEM at discovering the context-specific graphical models along with shared edges from the breast cancer and glioblastoma tumor data.
Breast Cancer Data
Table 9 presents the Re, Pr and FM of our method versus the baselines. We again see the same trend: PaGIAM–tGDM outperforms the other methods in terms of these performance measures.
It is known that many gene-pairs can be responsible for the appearance of cancer cells in the human body, and we are interested in the important gene-pairs detected by the methods. Following [35], we select 50 important gene-pairs that cause the appearance of cancer cells in human breast tissue. Figure 2 shows that PaGIAM–tGDM detected 22 gene-pairs present in both the ER+ and ER− subtypes, whereas the strongest baseline, JSEM, detects just 15 gene-pairs in both subtypes; NewSP and PaGIAM–sContChordalysis-MML discover fewer than 15 gene-pairs. Based on this evaluation, our method PaGIAM–tGDM outperforms the existing strong baselines by discovering more true important gene-pairs.
Glioblastoma Tumor Data
We also test PaGIAM–tGDM on [39]'s glioblastoma tumor data against NewSP, JSEM, and PaGIAM–sContChordalysis-MML. Due to the unavailability of gold-standard data, we compare the appearance of the 10 gene-pairs in the glioblastoma tumor and its subtypes as discovered by the different methods. Figure 3 shows the discovery of the 10 gene-pairs in the different glioblastoma subtypes by the different methods, along with the gold standard. PaGIAM–tGDM accurately detects eight gene-pairs, including their presence in the subtypes, whereas NewSP and JSEM detect seven and six gene-pairs accurately, respectively. Based on the results of this experiment, PaGIAM–tGDM accurately detects the important gene-pairs, including their presence in the different subtypes of the glioblastoma tumor data.
Overall, the results on the synthetic and real cancer data indicate that PaGIAM–tGDM is more accurate at predicting the context-specific dependencies compared to the baselines.
Conclusion
We have proposed a statistically efficient method to discover the hierarchical Gaussian graphical model (HGGM) structure of multivariate data with a very large number of variables. We introduce PaGIAM (for the flat mixture) and tGDM (for the two-level HGGM, i.e., shared and context-specific), based on a novel MML-based criterion for the structure discovery of HGGMs. Furthermore, our PaGIAM clusters the data and detects the two-level HGGM using the concept of the EM algorithm. The tGDM algorithm is a stepwise greedy algorithm which incrementally adds the best edges minimizing the MML-based scoring functions. Both work with chordal graphs and decomposable models to make the computation of the MML-based test statistics efficient. We have presented extensive empirical results on synthetic and real-life cancer datasets, and shown that our PaGIAM–tGDM method outperforms strong baselines in terms of the accurate prediction of shared and context-specific associations from high-dimensional heterogeneous Gaussian data.
However, our PaGIAM–tGDM can be extended in several directions. In PaGIAM, we assumed that the number of components is user-defined; PaGIAM could be extended to infinite flat mixture models using, e.g., the Indian Buffet Process. Although multivariate Gaussian distributions are good approximations for many real-world phenomena, we believe there are real-life data that may be better captured by other forms of distribution. Some applications (e.g., networks of verbal autopsy data) are not Gaussian distributed, but multinomially distributed. Therefore, we are interested in extending our framework to capture a broader class of distributions governing the data.
Notes
 1.
A clique is a subset of vertices of an undirected graph such that every two distinct vertices in the clique are adjacent [42]. A maximal clique is a clique that cannot be extended by including one more adjacent vertex, that is, a clique which does not exist exclusively within the vertex set of a larger clique [42].
 2.
In graph theory, the term “null graph” refers to a graph without any edges, aka the “empty graph” [42].
 3.
In the real world, heterogeneous GGM data exhibit a relatively small number of components compared with the number of data-points. Therefore \(K \ll n\), so \(\log {K!}\) does not affect the total bits required to encode the clustering coefficients and the contents.
 4.
The graphical structure at the top level is the graphical structure with shared edges; at the lower level, all the context-specific graphical structures are placed. That is why it is called a two-level hierarchical Gaussian graphical model.
 5.
\(\mathrm{{FPR}} = \frac{{\text {FP}}}{{\text {TP}}+{\text {FP}}}\), where TP is the number of predicted edges present in the gold standard and FP is the number of predicted edges not present in the gold standard.
 6.
\(\mathrm{{FNR}} = \frac{{\text {FN}}}{{\text {TN}}+{\text {FN}}}\), where TN is the number of predicted conditional independencies present in the gold standard and FN is the number of predicted conditional independencies not present in the gold standard.
 7.
error = FNR+FPR.
 8.
Except for PaGIAM–tGDM and PaGIAM–sContChordalysis-MML, the baselines of the synthetic data experiments do not perform well. For this reason, we do not use those baselines on the real-world data.
References
 1.
Akaike H. Information theory and an extension of the maximum likelihood principle. In: Second international symposium on information theory; 1973. p. 267–281.
 2.
Allison L. Encoding General Graphs. 2017. http://www.allisons.org/ll/MML/Structured/Graph/. Accessed 1 Apr 2020.
 3.
Armstrong H, et al. Bayesian covariance matrix estimation using a mixture of decomposable graphical models. Stat Comput. 2009;19:303–16.
 4.
Barabási AL, Albert R. Statistical mechanics of complex networks. Rev Mod Phys. 2002;74(1):47–97.
 5.
Breheny P, Huang J. Penalized methods for bi-level variable selection. Stat Interface. 2009;2(3):369–80.
 6.
Brennan C, et al. The somatic genomic landscape of glioblastoma. Cell. 2013;155(2):462–77.
 7.
Clauset A, et al. Powerlaw distributions in empirical data. SIAM Rev. 2007;51:661–703.
 8.
Danaher P, et al. The Joint Graphical Lasso for inverse covariance estimation across multiple classes. J R Stat Soc. 2014;76(2):373–97.
 9.
Dempster A, et al. Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc. 1977;39(1):1–39.
 10.
Deshpande A, et al. Efficient stepwise selection in decomposable models. In: Proceedings of the seventeenth conference on uncertainty in artificial intelligence; 2001. p. 128–135.
 11.
Dowe D, et al. MML estimation of the parameters of the spherical Fisher distribution. Algorithmic Learn Theory. 1996;1160:213–27.
 12.
Dwyer P. Some applications of matrix derivatives in multivariate analysis. J Am Stat Assoc. 1967;62:607–25.
 13.
Friedman J, et al. Sparse inverse covariance estimation with the graphical lasso. Biostatistics. 2008;9:432–41.
 14.
Friedman N. The Bayesian structural EM algorithm. In: Proceedings of the Fourteenth conference on Uncertainty in artificial intelligence (UAI); 1998. p. 129–138.
 15.
Gao C, et al. Estimation of multiple networks in Gaussian mixture models. Electron J Stat. 2016;10:1133–54.
 16.
Giraud C. Introduction to highdimensional statistics. Boca Raton: Chapman and Hall/CRCs; 2014.
 17.
Gauvain JL, Lee CH. Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Trans Speech Audio Process. 1998;2(2):291–8.
 18.
Guo J, et al. Joint estimation of multiple graphical models. Biometrika. 2011;98(1):1–15.
 19.
Hao B, et al. Simultaneous clustering and estimation of heterogeneous graphical model. J Mach Learn Res. 2018;18(217):1–58.
 20.
Kumar M, Koller D. Learning a small mixture of trees. In: Advances in neural information processing systems; 2009. p. 1051–1059.
 21.
Lauritzen S. Graphical models. Oxford: Clarendon Press; 1996.
 22.
Li Z, et al. Bayesian Joint SpikeandSlab Graphical Lasso. In: Proceedings of the 36th international conference on machine learning, vol. 97; 2019. p. 3877–3885.
 23.
Ma J, Michailidis G. Joint structural estimation of multiple graphical models. J Mach Learn Res. 2016;17:1–48.
 24.
Maretic H, Frossard P. Graph Laplacian mixture model. arXiv:1810.10053. 2018.
 25.
McLendon R, et al. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature. 2008;455(7216):1061–8.
 26.
Meilă M, Jordan MI. Learning with mixtures of trees. J Mach Learn Res. 2000;1:1–48.
 27.
Mirzaa G, et al. De novo CCND2 mutations leading to stabilization of cyclin D2 cause megalecephaly–polymicrogyria–polydactyly–hydrocephalus syndrome. Nat Genet. 2014;46(5):510–4.
 28.
Mukherjee C, Rodriguez A. GPU-powered shotgun stochastic search for Dirichlet process mixtures of Gaussian graphical models. J Comput Graph Stat. 2016;25(3):762–88.
 29.
Narita Y, et al. Mutant epidermal growth factor receptor signalling downregulates p27 through activation of the phosphatidylinositol 3kinase/AKT pathway in glioblastomas. Cancer Res. 2002;62(22):6764–9.
 30.
Oliver J, et al. Unsupervised learning using MML. In: Proceedings of the 13th international conference machine learning; 1996. p. 364–372.
 31.
Peterson C, et al. Bayesian inference of multiple gaussian graphical models. J Am Stat Assoc. 2015;110(509):159–74.
 32.
Petitjean F, Webb G. Scaling loglinear analysis to datasets with thousands of variables. In: SIAM international conference on data mining; 2015. p. 469–477.
 33.
Petitjean F, et al. A statistically efficient and scalable method for loglinear analysis of highdimensional data. In: Proceedings of IEEE international conference on data mining (ICDM); 2014. p. 110–119.
 34.
Pittman J, et al. Integrated modeling of clinical and gene expression information for personalized prediction of disease outcomes. Proc Natl Acad Sci USA. 2004;101:8431–6.
 35.
Pujana MA, et al. Network modeling links breast cancer susceptibility and centrosome dysfunction. Nat Genet. 2007;39:1338–49.
 36.
Rahman M, Haffari G. A statistically efficient and scalable method for exploratory analysis of highdimensional data. SN Comput Sci. 2020;1(2):1–17.
 37.
Rodriguez A, et al. Sparse covariance estimation in heterogeneous samples. Electron J Stat. 2011;5:981–1014.
 38.
Schwarz G. Estimating the dimension of a model. Ann Stat. 1978;6:461–4.
 39.
Verhaak R, et al. Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR and NF1. Cancer Cell. 2010;17(1):98–110.
 40.
Wallace C, Boulton D. An information measure for classification. Comput J. 1968;11:185–94.
 41.
Wallace C, Dowe D. MML clustering of multistate, Poisson, von Mises circular and Gaussian distributions. J Stat Comput. 2000;10:173–83.
 42.
West DB. Introduction to graph theory. London: Pearson; 2001.
Acknowledgements
We are thankful to Monash University for the financial support of this research. We are also thankful to Dr. Francois Petitjean for his valuable advice on the development of the two-level HGGM.
Funding
This study was not funded by any external funding source.
Author information
Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
About this article
Cite this article
Rahman, M.S., Nicholson, A.E. & Haffari, G. Inferring Two-Level Hierarchical Gaussian Graphical Models to Discover Shared and Context-Specific Conditional Dependencies from High-Dimensional Heterogeneous Data. SN COMPUT. SCI. 1, 218 (2020). https://doi.org/10.1007/s42979-020-00224-w
Received:
Accepted:
Published:
Keywords
 Context-specific dependencies
 Minimum message length
 Hierarchical Gaussian graphical models