Inferring Two-Level Hierarchical Gaussian Graphical Models to Discover Shared and Context-Specific Conditional Dependencies from High-Dimensional Heterogeneous Data

Abstract

Gaussian graphical models (GGM) express conditional dependencies among variables of Gaussian-distributed high-dimensional data. However, real-life datasets exhibit heterogeneity, which can be better captured through the use of mixtures of GGMs, where each component captures different conditional dependencies, a.k.a. context-specific dependencies, along with some common dependencies, a.k.a. shared dependencies. Methods to discover shared and context-specific graphical structures include joint and grouped graphical Lasso, and the EM algorithm with various penalized likelihood scoring functions. However, these methods detect graphical structures with high false discovery rates and do not detect the two types of dependencies (i.e., context-specific and shared) together. In this paper, we develop a method to discover shared conditional dependencies along with context-specific graphical models via a two-level hierarchical Gaussian graphical model. We assume that the graphical models corresponding to shared and context-specific dependencies are decomposable, which leads to an efficient greedy algorithm to select edges minimizing a score based on minimum message length (MML). The MML-based score results in a lower false discovery rate, leading to more effective structure discovery. We present extensive empirical results on synthetic and real-life datasets and show that our method leads to more accurate prediction of context-specific dependencies among random variables compared to previous works. Hence, our method can be considered the state of the art for discovering both shared and context-specific conditional dependencies from high-dimensional heterogeneous Gaussian data.

Introduction

Graphical models represent multivariate distributions by explicitly expressing conditional independencies, which is particularly suitable for the analysis of high-dimensional data [21]. Gaussian graphical models (GGM) are widely used as a framework to model structural relationships among random variables, assuming their joint distribution is Gaussian. The pattern of non-zero entries in the inverse covariance matrix of the multivariate Gaussian distribution corresponds to the edges of the corresponding GGM. In the standard setting, it is assumed that all observations are generated from a single underlying multivariate Gaussian distribution. However, real-life datasets exhibit heterogeneity, which can be better modeled through the use of mixtures of GGMs, letting each component exhibit different conditional dependencies among variables, a.k.a. context-specific dependencies, along with many common dependencies, a.k.a. shared dependencies [26, 37]. For example, recent studies on The Cancer Genome Atlas have found that gene expression data can be described by mixtures with a small number of components harboring different expression pathways [28]. However, typically there are far fewer samples (i.e., observations) than variables, and the data are generated from a mixture with an unknown number of components. Hence, high-dimensional heterogeneous data make the discovery of conditional dependencies challenging.

Meilă and Jordan [26] and Kumar and Koller [20] pioneered the discovery of context-specific dependencies from high-dimensional continuous data using EM-based approaches. Armstrong et al. [3] use a Bayesian approach by assigning a prior to graphical structures. However, Gao et al. and Rodriguez et al. [18, 37] emphasized that context-specific graphical structures share some edges, which are not well discovered by the above-mentioned methods. GGMs with shared structure have been modeled with hierarchical Gaussian graphical models (HGGM) in [18], where a method is developed to discover HGGM using hierarchical penalized likelihood and graphical Lasso. However, this method discovers only the shared structure, not context-specific dependencies. Danaher et al. [8] proposed a method to discover context-specific graphical models with a shared structure using joint graphical Lasso with penalized maximum likelihood estimation as a scoring function. However, this method is heavily dependent on two user-defined tuning parameters; furthermore, it faces the issue of predicting many false edges. To resolve this issue, Peterson et al. [31] introduced a Bayesian approach that estimates all GGMs via an edge-specific informative prior over the common structure; however, this method does not infer the common structure accurately. Gao et al. [15] improved the discovery of HGGM with the joint graphical Lasso and the hard EM algorithm. Ma and Michailidis [23] investigated the joint estimation problem for HGGM using group and joint graphical Lasso. Both Gao et al. [15] and Ma and Michailidis [23] use the graphical Lasso proposed by Friedman [14], in which the tuning parameter is not data-specific. Therefore, both methods suffer from discovering a high number of false edges. Recently, Hao et al. [19], Li et al. [22] and Maretic et al. [24] improved the graphical Lasso technique by estimating a better tuning parameter. However, these methods only detect context-specific dependencies with few of the shared dependencies.

In this paper, we address the above issues by proposing a novel method to learn mixtures of HGGMs based on the EM algorithm, which iterates over the following two steps. First, it clusters the data into distinct clusters. Second, it employs the forward selection algorithm [10] for structure discovery on the data in each cluster. To discover context-specific graphical models and their shared structure, our method incrementally adds the best edges, minimizing an MML-based scoring function within the forward selection algorithm. In both steps of our EM-style algorithm, we use minimum message length (MML) as the objective function. Our MML-based approach is an information-theoretic method enjoying (a) a low false discovery rate, (b) suitability for small sample sizes when discovering statistical dependencies (associations) among a large number of variables, and (c) scalability to large-scale problems involving thousands of variables. We present extensive empirical results on synthetic and real-life datasets and show that our method leads to more accurate prediction of context-specific dependencies among variables, compared to the previous works.

Background

Let \({\mathcal {D}} = \{X_1,\ldots ,X_n\}\) be a training set consisting of n data points where \(X_i \in \mathbb {R}^d\) and d is the number of dimensions (equivalently, the number of random variables). Let us assume the data have been generated from a probabilistic graphical model corresponding to a graph G. The parameterization of the graphical model corresponds to multivariate functions assigned to the subsets of variables in the maximal cliques of G. The probability density function \(f({\mathcal {D}})\) corresponding to the graphical model is defined as

$$\begin{aligned} f({\mathcal {D}}) \propto \prod _{C \in {\mathcal {C}}} f_C({\mathcal {D}}^C), \end{aligned}$$

where \({\mathcal {C}}\) is the set of maximal cliques of G, and \(f_C({\mathcal {D}}^C)\) is a clique-specific non-negative function defined on the subset of variables appearing in a clique C. \({\mathcal {D}}^C\) denotes the data points restricted to the variables of the maximal clique C. In any distribution resulting from a graphical model, two random variables are statistically independent conditioned on the variables of a cut separating them.

In this paper, we assume that the observed input vectors have been generated from multivariate Gaussian distributions, which means the cliques are also parameterized by Gaussian distributions. Therefore, our aim is to discover the structure of these so-called Gaussian graphical models. For computational convenience, we work with chordal graphical structures, leading to decomposable models which we detail in the next section.

Decomposable Models

Decomposable models are a subclass of undirected graphical models which provides a usefully constrained representation in which model selection and parameter estimation can be done efficiently, making them suitable for large-scale problems.

Definition 1

A graphical model is decomposable if the associated graph G is chordal. A chordal graph is one in which all cycles of four or more vertices have a chord, which is an edge that is not part of the cycle but connects two vertices of the cycle [10].

Let \({\mathcal {M}}\) be a decomposable model corresponding to G, and \(f_{{\mathcal {M}}}\) be the probability density function of a Gaussian distribution corresponding to \({\mathcal {M}}\). It can be shown that:

$$\begin{aligned} f_{{\mathcal {M}}}({\mathcal {D}}) = \frac{\prod _{C \in {\mathcal {C}}}{f({\mathcal {D}}^C)}}{\prod _{S \in {\mathcal {S}}}{f({\mathcal {D}}^S)}}, \end{aligned}$$
(1)

where \({\mathcal {C}}\) is the set of maximal cliques and \({\mathcal {S}}\) is the set of minimal separators corresponding to the chordal graph of the model \({\mathcal {M}}\). The importance of this result is that it relates the Gaussian distribution over all variables to those on the subsets of variables, i.e., Gaussian distributions over the variables involved in maximal cliques \(f({\mathcal {D}}^C)\) or minimal separators \(f({\mathcal {D}}^S)\). This amounts to a closed form solution for the maximum likelihood estimate (MLE) of the covariance matrix \({\hat{\varSigma }}\) of the Gaussian graphical model \(f_{{\mathcal {M}}}\), through the MLE of the covariance matrices of the component models.
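To illustrate how Eq. (1) relates the joint Gaussian density to its clique and separator marginals, the following minimal Python sketch (our own illustration, assuming NumPy and SciPy are available and that cliques and separators are given as variable-index lists) evaluates the log-density of a decomposable Gaussian model by adding clique log-densities and subtracting separator log-densities, using sub-blocks of the sample mean and covariance.

```python
import numpy as np
from scipy.stats import multivariate_normal

def decomposable_log_density(X, cliques, separators):
    """Log-density of data X under a decomposable Gaussian model (Eq. 1).

    X          : (n, d) data matrix
    cliques    : list of index lists, the maximal cliques C
    separators : list of index lists, the minimal separators S
    """
    mu = X.mean(axis=0)
    sigma = np.cov(X, rowvar=False, bias=True)        # MLE sample covariance
    logpdf = np.zeros(X.shape[0])
    for C in cliques:                                  # numerator: clique marginals
        logpdf += multivariate_normal.logpdf(X[:, C], mu[C], sigma[np.ix_(C, C)])
    for S in separators:                               # denominator: separator marginals
        logpdf -= multivariate_normal.logpdf(X[:, S], mu[S], sigma[np.ix_(S, S)])
    return logpdf.sum()

# toy example: chain 0-1-2 with cliques {0,1}, {1,2} and separator {1}
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
print(decomposable_log_density(X, cliques=[[0, 1], [1, 2]], separators=[[1]]))
```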

To discover the optimal decomposable graphical structures from a given training data, typically one of the following strategies is employed [10]:

  • Forward selection: Starting with the simplest model with no edges (i.e., \(E\ =\ \emptyset\)), edges are added incrementally, as long as the new hypothesized models are not rejected according to an appropriate test statistic.

  • Backward elimination: Starting with the complete graph over the |V| vertices, edges are deleted incrementally, as long as the new hypothesized models are not rejected according to an appropriate test statistic.

In this paper, we adopt the forward selection strategy, and add the edges incrementally. As we want the resulting model to be decomposable, the addition of an edge has to be done with care. [10] characterises the edges that can be added to a decomposable model while retaining its decomposability. Furthermore, it presents an efficient algorithm to enumerate all such edges in \(O(|V|^2)\). This is achieved by a data structure called the clique graph, which keeps track of the maximal cliques \({\mathcal {C}}\) and minimal separators \({\mathcal {S}}\). Adding an edge to the graph and updating the underlying data structures also takes \(O(|V|^2)\).

Theorem 1

If two decomposable models \({\mathcal {M}}\subset {\mathcal {M}}'\) differ only in one edge (a, b) (i.e., \((a,b) \in {\mathcal {M}}'\) and \((a,b) \not \in {\mathcal {M}}\)), then the maximal cliques and the minimal separators \(({\mathcal {C}},{\mathcal {S}})\) and \(({\mathcal {C}}',{\mathcal {S}}')\) in these two models differ as follows:

  • If \(C_a \not \subset C_{ab}\) and \(C_b \not \subset C_{ab}\), then \({\mathcal {C}}'= {\mathcal {C}} + C_{ab}\) and \({\mathcal {S}}' = {\mathcal {S}} + C_{ab} \cap C_a + C_{ab} \cap C_b - S_{ab}\)

  • If \(C_a \subset C_{ab}\) and \(C_b \not \subset C_{ab}\), then \({\mathcal {C}}'= {\mathcal {C}} + C_{ab} - C_a\) and \({\mathcal {S}}' = {\mathcal {S}} + C_{ab} \cap C_b - S_{ab}\)

  • If \(C_a \not \subset C_{ab}\) and \(C_b \subset C_{ab}\), then \({\mathcal {C}}'= {\mathcal {C}} + C_{ab} - C_b\) and \({\mathcal {S}}' = {\mathcal {S}} + C_{ab} \cap C_a - S_{ab}\)

  • If \(C_a \subset C_{ab}\) and \(C_b \subset C_{ab}\), then \({\mathcal {C}}'= {\mathcal {C}} + C_{ab} - C_a - C_b\) and \({\mathcal {S}}' = {\mathcal {S}} - S_{ab}\)

where \(C_{ab}\) and \(S_{ab}\) are the maximal clique and minimal separator for the nodes a and b, and \(C_a\) and \(C_b\) are the maximal cliques including each of these nodes (Fig. 1) [10]. This immediately leads to the following theorem.

Fig. 1: Structure of (i) the cliques \(C_a\), \(C_b\) and separator \(S_{ab}\) in the reference model; and (ii) the newly formed clique \(C_{ab}\) and separators \(C_{ab} \cap C_a\) and \(C_{ab} \cap C_b\) in the candidate model

Theorem 2

If two decomposable models \({\mathcal {M}}\subset {\mathcal {M}}'\) differ only in one edge (a, b) (i.e., \((a,b) \in {\mathcal {M}}'\) and \((a,b) \not \in {\mathcal {M}}\)), then

$$\begin{aligned} \frac{|{\hat{\varSigma }}|}{|{\hat{\varSigma }}'|} = \frac{|\varSigma ^{C_{ab}}| \cdot |\varSigma ^{S_{ab}}|}{|\varSigma ^{C_{ab} \cap C_a}|\cdot |\varSigma ^{C_{ab} \cap C_b}|} \end{aligned}$$
(2)

Thus, the change in the determinant of the MLE of the covariance \(|\varSigma |\) after adding an edge (a, b) depends only on the minimal separator of the two vertices \(S_{ab}\), the newly formed clique \(C_{ab}\), and the newly formed separators \(C_{ab} \cap C_a\) and \(C_{ab} \cap C_b\). This means we only have to compute the determinant terms relevant to the candidate edges that can be added to the current model, which speeds up the computation.
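As an illustration of Theorem 2 (our own sketch, not the authors' implementation), the local determinant term on the right-hand side of Eq. (2) can be computed from sub-blocks of the sample covariance alone; the index sets below are hypothetical.

```python
import numpy as np

def logdet(sigma, idx):
    """Log-determinant of the covariance sub-block on the variables in idx."""
    sign, val = np.linalg.slogdet(sigma[np.ix_(idx, idx)])
    return val

def edge_logdet_change(sigma, C_ab, S_ab, C_ab_cap_Ca, C_ab_cap_Cb):
    """Log of the right-hand side of Eq. (2): the only determinant term
    affected by adding the candidate edge (a, b)."""
    out = logdet(sigma, C_ab) - logdet(sigma, C_ab_cap_Ca) - logdet(sigma, C_ab_cap_Cb)
    if S_ab:                         # the separator may be empty at the initial step
        out += logdet(sigma, S_ab)
    return out

# hypothetical cliques: C_a = {0, 2}, C_b = {1, 2}, S_ab = {2}, so C_ab = {0, 1, 2}
rng = np.random.default_rng(1)
X = rng.standard_normal((500, 4))
sigma = np.cov(X, rowvar=False)
print(edge_logdet_change(sigma, C_ab=[0, 1, 2], S_ab=[2],
                         C_ab_cap_Ca=[0, 2], C_ab_cap_Cb=[1, 2]))
```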

Minimum Message Length

Estimating a decomposable model with maximum likelihood estimation requires many samples to accept correct hypotheses. Moreover, it relies on the existence of the maximum likelihood estimates, which may not exist if the number of samples is less than the size of the largest clique in the graph. These drawbacks can be overcome by using a scoring function based on the minimum message length (MML).

MML is an information-based criterion to find the best hypothesis for the observed data by controlling the false discovery rate and requiring far fewer samples to accept true hypotheses [40]. Let us consider a hypothesis (or model) \({\mathcal {M}}\) that offers an explanation of the observed data \({\mathcal {D}}\). Based on the fundamental rules of probability:

$$\begin{aligned} p({\mathcal {M}}, {\mathcal {D}}) = p({\mathcal {M}}) \times p({\mathcal {D}}|{\mathcal {M}}) = p({\mathcal {D}}) \times p({\mathcal {M}}|{\mathcal {D}}), \end{aligned}$$

where \(p({\mathcal {M}})\) is the prior over hypotheses/models, \(p({\mathcal {D}}|{\mathcal {M}})\) is the likelihood, \(p({\mathcal {D}})\) is the prior probability of the data, and \(p({\mathcal {M}}|{\mathcal {D}})\) is the posterior of \({\mathcal {M}}\) given \({\mathcal {D}}\). Based on Shannon's theory of communication, the amount of information needed to explain \({\mathcal {D}}\) with \({\mathcal {M}}\) is:

$$\begin{aligned} I({\mathcal {M}}, {\mathcal {D}}) = I({\mathcal {M}}) + I({\mathcal {D}}|{\mathcal {M}}) = I({\mathcal {D}}) + I({\mathcal {M}}|{\mathcal {D}}), \end{aligned}$$
(3)

where \(I(a) = - \log (p(a))\) gives the optimal code length to convey some event a whose probability is p(a). This results in an objective criterion to compare two competing models \({\mathcal {M}}_1\) and \({\mathcal {M}}_2\) given the same data \({\mathcal {D}}\):

$$\begin{aligned} I({\mathcal {M}}_1|{\mathcal {D}})-I({\mathcal {M}}_2|{\mathcal {D}}) = I({\mathcal {M}}_1)+I({\mathcal {D}}|{\mathcal {M}}_1)-I({\mathcal {M}}_2)-I({\mathcal {D}}|{\mathcal {M}}_2). \end{aligned}$$
(4)

A possible realization of this framework is the transmission of data over a communication channel between a sender and a receiver. The sender sends \({\mathcal {D}}\) with an explanation message, so that the receiver can reconstruct the original data losslessly from the message. The sender's message encodes both the model \({\mathcal {M}}\) and the data residual \(p({\mathcal {D}}|{\mathcal {M}})\). The receiver then reads in the model from the message, and decodes the original data from the residual.

The goal of this communication game is to minimize the length of the explanation message.

If the sender can find the best model for the data, the receiver will receive the most economical decodable explanation message; this is the basis of statistical inference based on the MML principle [33]. Therefore, according to the MML principle, the best hypothesis is the one that encodes the hypothesis and the entire data set in the shortest possible message.
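As a toy illustration of Eqs. (3) and (4) (our own example, with hypothetical numbers), two competing models are compared simply by their total two-part message lengths, and the model with the shorter message is preferred.

```python
import math

def message_length(log_prior, log_likelihood):
    """Two-part message length (in nats): I(M) + I(D|M) = -log p(M) - log p(D|M)."""
    return -log_prior - log_likelihood

# hypothetical numbers: M2 fits the data better but has a costlier description
I1 = message_length(log_prior=math.log(0.5), log_likelihood=-120.0)
I2 = message_length(log_prior=math.log(0.01), log_likelihood=-117.0)
print("prefer M1" if I1 < I2 else "prefer M2")   # the shorter message wins
```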

[Algorithm 1]
[Algorithm 2]

ContChordalysis-MML

Rahman and Haffari [36] proposed an algorithm named ContChordalysis-MML to discover GGM structure from high-dimensional data with just a handful of samples. This algorithm is developed based on the concepts of decomposable models and minimum message length. Starting from the null graph, ContChordalysis-MML incrementally adds the best edge minimizing an MML-based score of the graphical model. The pseudocode of ContChordalysis-MML is presented as Algorithm 1. Based on the experimental results, ContChordalysis-MML discovers more true edges with a lower false discovery rate and outperforms strong baselines, including methods based on penalized likelihood functions and the graphical Lasso.

Scalable ContChordalysis-MML

ContChordalysis-MML is a forward selection algorithm which adds edges to the candidate graphical structure, checks the candidature of the remaining edges to become candidate edges, and computes the scoring function of all candidate edges. However, the edge candidature checking and score computation make the forward selection strategy slow for a very large number of random variables. According to Theorem 1 in the section "Decomposable Models", the addition of an edge (a, b) affects the minimal separator \(S_{ab}\) between the two nodes a and b, and creates one new clique \(C_{ab}\) and two new separators \(C_{ab}\cap C_a\) and \(C_{ab}\cap C_b\); all other separators and cliques remain unchanged. Hence, it is not required to check the candidature and compute the scoring function of all candidate edges at every step. According to [32], the addition of an edge (a, b) to the candidate model affects the minimal separators between the following node pairs: (a) a and the neighbors of b, (b) b and the neighbors of a, and (c) the neighbors of a and the neighbors of b. Hence, we only recompute the candidature and the scoring function of the above-mentioned node pairs (i.e., edges), which takes O(|V|) time. Therefore, this restricted candidature checking and scoring function computation make the forward selection algorithm more scalable than the best existing algorithm.
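The node pairs whose candidature and score must be recomputed after adding an edge (a, b) can be enumerated directly from the current adjacency structure. The following sketch is our own illustration, using a plain dict-of-sets adjacency representation.

```python
from itertools import product

def affected_pairs(adj, a, b):
    """Node pairs to re-examine after adding edge (a, b):
    (i) a with the neighbours of b, (ii) b with the neighbours of a,
    (iii) neighbours of a with neighbours of b."""
    na, nb = adj[a] - {b}, adj[b] - {a}
    pairs = {frozenset((a, v)) for v in nb}
    pairs |= {frozenset((b, u)) for u in na}
    pairs |= {frozenset((u, v)) for u, v in product(na, nb) if u != v}
    return sorted(tuple(sorted(p)) for p in pairs)

# small example: node 0 has neighbour 2, node 1 has neighbour 3
adj = {0: {2}, 1: {3}, 2: {0}, 3: {1}}
print(affected_pairs(adj, 0, 1))   # [(0, 3), (1, 2), (2, 3)]
```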

We modified ContChordalysis-MML to make it more scalable based on the properties of decomposable GGMs discussed above. The modifications are twofold:

Initial step ::

At the beginning, ContChordalysis-MML starts with an empty graph in which no nodes are connected with each other. Therefore, each node is treated as a singleton clique and there are no separators. Adding an edge between nodes a and b forms a new clique \(C_{ab}\) from the two cliques \(C_a\) and \(C_b\) and the separator \(S_{ab}\). Moreover, all other cliques and separators remain unchanged. Hence, the difference in the encoding length of the parameters and data between the reference model \({\mathcal {M}}\) and the candidate model \({\mathcal {M}}'\) in ContChordalysis-MML is:

$$\begin{aligned} I(\theta ,{\mathcal {D}}|{\mathcal {M}}) - I(\theta ,{\mathcal {D}}|{\mathcal {M}}') = I(C_{ab},{\mathcal {D}}^{C_{ab}}) - I(S_{ab},{\mathcal {D}}^{S_{ab}}). \end{aligned}$$
(5)

Moreover, whenever we add an edge between two singleton cliques, Eq. (5) remains unchanged, as per the above discussion. Therefore, at the initial step, we compute the encoding bit difference of the parameters \(\theta\) and data \({\mathcal {D}}\) between the reference model \({\mathcal {M}}\) and the candidate model \({\mathcal {M}}'\) using the above equation.

Edge adding step ::

In this step, after adding any edge (a, b), we only recompute the encoding bit difference of the parameters and data between the reference model \({\mathcal {M}}\) and the candidate model \({\mathcal {M}}'\) for the pairs formed by a and the neighbors of b; b and the neighbors of a; and the neighbors of a and the neighbors of b, using the following equation:

$$\begin{aligned} I(\theta ,{\mathcal {D}}|{\mathcal {M}}) - I(\theta ,{\mathcal {D}}|{\mathcal {M}}')= & {} I(C_{ab},{\mathcal {D}}^{C_{ab}}) + I(S_{ab},{\mathcal {D}}^{S_{ab}}) \nonumber \\&- I(C_{ab} \cap C_b,{\mathcal {D}}^{C_{ab}\cap C_b}) - I(C_{ab}\cap C_a,{\mathcal {D}}^{C_{ab}\cap C_a}). \end{aligned}$$
(6)

This computation makes ContChordalysis-MML more scalable and reduces the per-step time complexity to O(|V|).

We call the modified ContChordalysis-MML sContChordalysis-MML, which is presented as Algorithm 2. In this paper, we initially use the sContChordalysis-MML algorithm to discover the context-specific GGMs. Similar to ContChordalysis, we use an MML-based scoring function to discover two-level hierarchical Gaussian graphical models. Our approach is therefore twofold: (a) partitioning the multivariate Gaussian data into K clusters and (b) discovering context-specific graphical models from the data of each cluster. In the next two sections, we detail our approach to discover the decomposable HGGM.

Discovery of the Decomposable HGGMs

Let us assume that the data have been generated from a flat mixture of multivariate Gaussian distributions, where each component corresponds to a graphical model. Our aim is to discover the unobserved structure of undirected Gaussian graphical models based on the observed data \({\mathcal {D}}\):

$$\begin{aligned} g({\mathcal {D}}) = \sum _{k=1}^{K}{\gamma _k g_k({\mathcal {D}}_k)}, \end{aligned}$$
(7)

where \(\gamma = (\gamma _1,\ldots ,\gamma _K)\) is the vector of mixing coefficients, \(\gamma _k\) is the mixing coefficient of the component/cluster k and \(g_k({\mathcal {D}}_k)\) is the context-specific distribution of component k.

Specifically, we are interested in the undirected graphical structures \({\mathcal {G}}\ =\ \{G_0, G_1,G_2, \ldots ,G_K\}\) where \(G_k\ =\ \{V,E_k\}\ \forall _{k = \{1 \ldots K\}}\) is the context-specific graphical structure of component k, \(G_0\) is the shared or global graphical model, V is the set of vertices corresponding to random variables (or dimensions of the input vectors), \(E_k\ \forall _{k = \{0,1 \ldots K\}}\) is the set of edges capturing shared and context-specific statistical associations between random variables, and K is the number of components in the mixture model.

Another input to the algorithm is the number of components K believed to exist in the data. The output is then the shared and context-specific graphical model structures. The algorithm consists of two steps: (a) the clustering step, similar to the E-step of the hard EM algorithm, to partition the data, and (b) the structure and parameter estimation step, similar to the M-step of the EM algorithm. In the estimation step, we initially employ sContChordalysis-MML, described in section "Scalable ContChordalysis-MML". Our algorithm keeps repeating the clustering and the structure and parameter estimation steps until it converges with respect to the objective function.

Our algorithm optimizes a minimum message length (MML) objective function for estimating the structure of the shared and context-specific graphical models and the clusters of the data. We call our iterative algorithm Partition and Graphical model discovery Iterative Algorithm based on MML, or PaGIAM, as summarised in Algorithm 3. We also present a flowchart of the PaGIAM algorithm.

[Algorithm 3: PaGIAM]
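The outer loop of PaGIAM is a hard-EM iteration: assign each data point to the component whose current model explains it best, then re-estimate each component. The sketch below is a heavily simplified illustration of that loop (our own code, not Algorithm 3 itself): it scores points with plain Gaussian log-densities and omits the graphical structure search and the full MML bookkeeping.

```python
import numpy as np
from scipy.stats import multivariate_normal

def hard_em(X, K, n_iter=20, seed=0):
    """Simplified hard-EM loop: alternate cluster assignment and per-cluster re-estimation."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    z = rng.integers(K, size=n)                       # random initial partition
    params = []
    for _ in range(n_iter):
        params = []
        for k in range(K):                            # "M-step": refit each cluster
            Xk = X[z == k]
            if len(Xk) < 2:                           # guard against (near-)empty clusters
                Xk = X[rng.choice(n, size=2, replace=False)]
            params.append((Xk.mean(axis=0),
                           np.cov(Xk, rowvar=False) + 1e-6 * np.eye(d)))
        # "E-step" (hard): assign each point to the component that explains it best
        scores = np.column_stack([multivariate_normal.logpdf(X, mu, cov)
                                  for mu, cov in params])
        z_new = scores.argmax(axis=1)
        if np.array_equal(z, z_new):                  # partition converged
            break
        z = z_new
    return z, params

# toy usage with two well-separated Gaussian blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 3)), rng.normal(5, 1, (100, 3))])
z, _ = hard_em(X, K=2)
```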

The Objective Function

The MML-based objective function of our iterative algorithm PaGIAM encodes the hypothesis as a message comprising the encoding of all clusters \({\mathcal {K}}\) and the associated cluster parameters \(\theta _{{\mathcal {K}}}\), and the encoding of the shared and context-specific graphical models \({\mathcal {G}}\), their parameters \(\theta _{{\mathcal {G}}}\) and the data \({\mathcal {D}}\), as below:

$$\begin{aligned} MML= & {} \underbrace{I({\mathcal {K}}) + I(\theta _{{\mathcal {K}}}|{\mathcal {K}})}_{I({\mathcal {K}},\theta _{{\mathcal {K}}})} \nonumber \\&+ \underbrace{I({\mathcal {G}}|{\mathcal {K}}) + I(\theta _{{\mathcal {G}}}|{\mathcal {G}},{\mathcal {K}}) + I({\mathcal {D}}|\theta _{{\mathcal {G}}},{\mathcal {G}},{\mathcal {K}})}_{I({\mathcal {G}},\theta _{{\mathcal {G}}},{\mathcal {D}}|{\mathcal {K}})}, \end{aligned}$$
(8)

where \(I({\mathcal {G}},\theta _{{\mathcal {G}}},{\mathcal {D}}|{\mathcal {K}})\) is the MML to encode the shared and context-specific graphical models \({\mathcal {G}}\), their parameters \(\theta _{{\mathcal {G}}}\) and data \({\mathcal {D}}\) and \(I({\mathcal {K}},\theta _{{\mathcal {K}}})\) is the MML to encode all clusters \({\mathcal {K}}\) and the associated cluster parameters \(\theta _{{\mathcal {K}}}\).

In the PaGIAM algorithm, we initially use sContChordalysis-MML to discover the graphical models of the clusters \({\mathcal {K}}\). According to Rahman and Haffari [36], to encode the graphical models \({\mathcal {G}}\), their parameters \(\theta _{{\mathcal {G}}}\) and the data \({\mathcal {D}}\), i.e., \(I({\mathcal {G}},\theta _{{\mathcal {G}}},{\mathcal {D}}|{\mathcal {K}})\), we require

$$\begin{aligned}&I({\mathcal {G}},\theta _{{\mathcal {G}}},{\mathcal {D}}|{\mathcal {K}}) \nonumber \\&\quad = \sum _{k=1}^{K}\Biggl (\log {\bigg (\frac{mE-|E_k|}{mE-|E_k|-1}\bigg )} \nonumber \\&\qquad + \frac{1}{2}\sum _{C \in {\mathcal {C}}_k}\Big ( (n_k^C-1) \log {|\varSigma _{G_k}^C|} \nonumber \\&\qquad + \big (({\mathcal {D}}_k^C-\mu _{G_k}^C)\varSigma _{G_k}^{C^{-1}}({\mathcal {D}}_k^C-\mu _{G_k}^C)^\mathrm{{T}}\big ) \Big ) \nonumber \\&\qquad + \frac{1}{2}\sum _{S \in {\mathcal {S}}_k}\Big ( (n_k^S-1) \log {|\varSigma _{G_k}^S|} \nonumber \\&\qquad + \big (({\mathcal {D}}_k^S-\mu _{G_k}^S)\varSigma _{G_k}^{S^{-1}}({\mathcal {D}}_k^S-\mu _{G_k}^S)^\mathrm{{T}}\big ) \Big ) \Biggr ), \end{aligned}$$
(9)

where mE is the number of edges of a complete graph, i.e., \(mE = \frac{|V|(|V|-1)}{2}\), and \({\mathcal {C}}_k\) and \({\mathcal {S}}_k\) are the maximal clique set and minimal separator set of the graph \(G_k\) of a cluster k. We still need the MML to encode all clusters \({\mathcal {K}}\) and the associated cluster parameters \(\theta _{{\mathcal {K}}}\). In the next two subsections, we discuss the encoding of the clusters and their parameters.

Encoding the Clusters

We now describe the encoding of clusters, which includes their contents and the number of clusters in the mixture. At first, we encode the number of clusters, for which we need \(\log {(K)}\) bits. We then encode the mixing coefficients of all of the clusters and the contents of the clusters by encoding the cluster indicator vector \(\mathbf {z}\). Each element \(\mathbf {z}_i\) of the cluster indicator vector takes a value between 1 and K to indicate the cluster membership of a datapoint \(X_i\) (where \(X_i \in \mathbb {R}^d\)). As each element of the cluster indicator vector takes a value between 1 and K, we assume that the contents of the vector \(\mathbf {z}\) are multinomially distributed. According to [41], to encode the multinomially distributed contents and the clustering coefficients, we require \(\frac{K-1}{2} \log {\big (\frac{n}{12}+1\big )} - \log {K!} - \sum _{k=1}^{K}{\Big ((n_k+\frac{1}{2})\log {\frac{n_k+\frac{1}{2}}{n+\frac{K}{2}}}\Big )}\) bits. Therefore, the minimum message length to encode all of the clusters is:

$$\begin{aligned} I({\mathcal {K}}) = \, & {} \frac{K-1}{2} \log {\Big (\frac{n}{12}+1\Big )} - \log {(K-1)!}\nonumber \\&- \sum _{k=1}^{K}{\Big (\Big (n_k+\frac{1}{2}\Big )\log {\frac{n_k+\frac{1}{2}}{n+\frac{K}{2}}}\Big )}. \end{aligned}$$
(10)
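A direct transcription of Eq. (10) (our own illustration; natural logarithms are used, so lengths are in nats rather than bits) is shown below.

```python
import math

def cluster_encoding_length(cluster_sizes):
    """I(K) of Eq. (10): length to encode the number of clusters, the mixing
    coefficients and the cluster indicator vector (in nats)."""
    K = len(cluster_sizes)
    n = sum(cluster_sizes)
    length = 0.5 * (K - 1) * math.log(n / 12.0 + 1.0)
    length -= math.lgamma(K)                          # log((K-1)!) = lgamma(K)
    for nk in cluster_sizes:
        length -= (nk + 0.5) * math.log((nk + 0.5) / (n + K / 2.0))
    return length

print(cluster_encoding_length([40, 35, 25]))          # three clusters, n = 100
```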

Encoding of the Parameters

Once clusters have been encoded, we encode parameters of clusters. According to [40], the MML encoding of parameters of the multivariate Gaussian distribution corresponding to a cluster k, denoted by \(I(\theta _k)\), is:

$$\begin{aligned} I(\theta _k) = \underbrace{\bigg (-\overbrace{\log (h(\theta _k))}^\text {Prior} + \overbrace{\log {\sqrt{|{\mathcal {F}}(\theta _k)|}}}^\text {Fisher information}\bigg )}_{I(\theta _k)}, \end{aligned}$$
(11)

where \(\theta _k = (\mu _k, \varSigma _k)\). In what follows, we compute various components of \(I(\theta _k)\) of Eq. (11), i.e. the prior probability, and the Fisher information matrix.

Prior probability of the parameters: Following the previous work [11], we use a flat prior for \(\mu _k\) [30] and a conjugate inverted Wishart prior for \(\varSigma _k\) [17]. Hence, the prior joint density over the parameters is:

$$\begin{aligned} h(\theta _k) \propto |\varSigma _k|^{-\frac{d+1}{2}}. \end{aligned}$$
(12)

Fisher information of the parameters: We need to evaluate the second-order partial derivatives of \(-{\mathcal {L}}({\mathcal {D}}_k|\mu _k,\varSigma _k)\) to compute the Fisher information for the parameters [40], where

$$\begin{aligned}&{\mathcal {L}}({\mathcal {D}}_k|\mu _k,\varSigma _k) \nonumber \\&\quad = -\frac{n_k}{2}\log {2\pi }-\frac{n_k}{2}\log {|\varSigma _k|}\nonumber \\&\qquad -\frac{1}{2}\sum _{j=1}^{n_k}{(x_{kj}-\mu _k) \varSigma _k^{-1}(x_{kj}-\mu _k)^\mathrm{{T}}}, \end{aligned}$$
(13)

where \(\mu _k = \frac{1}{n_k}\sum _{j=1}^{n_k}{x_{kj}}\) and \(\varSigma _k=\frac{1}{n_k-1}\sum _{j=1}^{n_k}{(x_{kj}-\mu _k)(x_{kj}-\mu _k)^\mathrm{{T}}}\).

Let \(|{\mathcal {F}}(\mu _k,\varSigma _k)|\) represent the determinant of the Fisher information matrix which is the product of \(|{\mathcal {F}}(\mu _k)|\) and \(|{\mathcal {F}}(\varSigma _k)|\), i.e., the determinant of Fisher information matrices of \(\mu _k\) and \(\varSigma _k\), respectively [30]. Taking the second-order partial derivatives of \(-{\mathcal {L}}({\mathcal {D}}_k|\mu _k,\varSigma _k)\) with respect to \(\mu _k\), we get \(-\nabla _{\mu _k}^2{\mathcal {L}} = n_k\varSigma _k^{-1}\). So the determinant of the Fisher information matrix for \(\mu _k\) is \(|{\mathcal {F}}(\mu _k)| = n_k^d|\varSigma _k|^{-1}\). To compute \(|{\mathcal {F}}(\varSigma _k)|\), [12] derived an analytical expression using the theory of matrix derivatives based on matrix vectorization:

$$\begin{aligned} |{\mathcal {F}}(\varSigma _k)| = n_k^{\frac{d(d+1)}{2}}2^{-n_k}|\varSigma _k|^{-(d+2)}. \end{aligned}$$

Hence, the determinant of the Fisher information matrix for \(\mu _k\) and \(\varSigma _k\) is

$$\begin{aligned} |{\mathcal {F}}(\mu _k,\varSigma _k)| = n_k^{\frac{d(d+3)}{2}}2^{-n_k}|\varSigma _k|^{-(d+3)}. \end{aligned}$$
(14)

Therefore,

$$\begin{aligned} I(\theta _k) = -\frac{1}{2}\log {|\varSigma _k|} + \mathrm{{Constant}}. \end{aligned}$$
(15)

MML to Encode Clusters and Their Parameters

Therefore, MML to encode the clusters and their parameters is as follows

$$\begin{aligned} I({\mathcal {K}},\theta _{{\mathcal {K}}}) = \underbrace{\log {K} + \sum _{k=1}^{K}{\log {n_k}-\log {n}}}_{\text {Encoding the cluster}} + \underbrace{\sum _{k=1}^{K}{\frac{1}{2} \log {(|\varSigma _k|)}}}_{\text {Encoding the parameters of the clusters}}. \end{aligned}$$
(16)

MML Score to Discover the Decomposable HGGM

Substituting Eqs. (9) and (16) into (8), the MML score to find the best hierarchical graphical model from heterogeneous multivariate Gaussian data is as below.

$$\begin{aligned} MML= & {} \sum _{k=1}^{K}\Biggl (\underbrace{\log {\bigg (\frac{mE-|E_k|}{mE-|E_k|-1}\bigg )}}_{\text {Encoding the graphical structures}} \nonumber \\&+ \underbrace{\frac{1}{2}\sum _{C \in {\mathcal {C}}_k}\Big ( (n_k^C-1) \log {|\varSigma _{G_k}^C|} + \big (({\mathcal {D}}_k^C-\mu _{G_k}^C)\varSigma _{G_k}^{C^{-1}}({\mathcal {D}}_k^C-\mu _{G_k}^C)^\mathrm{{T}}\big ) \Big )}_{{\text {Encoding the parameters and data of all maximal cliques of a context-specific graphical model}} G_k} \nonumber \\&+ \underbrace{\frac{1}{2}\sum _{S \in {\mathcal {S}}_k}\Big ( (n_k^S-1) \log {|\varSigma _{G_k}^S|} + \big (({\mathcal {D}}_k^S-\mu _{G_k}^S)\varSigma _{G_k}^{S^{-1}}({\mathcal {D}}_k^S-\mu _{G_k}^S)^\mathrm{{T}}\big ) \Big )}_{{\text {Encoding the parameters and data of all minimal separators of a context-specific graphical model}} G_k} \Biggr) \nonumber \\&+ \underbrace{\frac{K-1}{2} \log {\big (\frac{n}{12}+1\big )} - \log {(K-1)!} - \sum _{k=1}^{K}{\Big ((n_k+\frac{1}{2})\log {\frac{n_k+\frac{1}{2}}{n+\frac{K}{2}}}\Big )}}_{\text {Encoding the cluster}} + \underbrace{\sum _{k=1}^{K}{\frac{1}{2} \log {(|\varSigma _k|)}}}_{\text {Encoding the parameters of the clusters}}. \end{aligned}$$
(17)

However, this MML score encodes only the clusters and the context-specific graphical models of the given high-dimensional heterogeneous Gaussian data, since we use sContChordalysis-MML to discover the graphical models. sContChordalysis-MML cannot detect the global or shared graphical structure from the data. Therefore, we design a new graphical model discovery algorithm to discover the shared graphical model along with the context-specific graphical models, which we discuss in the next section "MML for Discovering the Shared and Context-Specific GGMs".

MML for Discovering the Shared and Context-Specific GGMs

The method mentioned in section "Discovery of the Decomposable HGGMs" discovers the structure of context-specific graphical models without discovering the shared edges among them. However, in heterogeneous data, context-specific graphical structures share a significant number of edges. Discovering only context-specific graphical models is not able to detect all of the edges shared among them; hence, many important features remain undiscovered. Therefore, it is important to discover the shared edges along with the context-specific graphical models. Modeling the shared structure can help the GGM structure discovery by pooling the statistics together. Therefore, we extend our approach to discover a two-level hierarchical Gaussian graphical model (HGGM). We learn the model based on an MML-based score. Our approach works with chordal graphs, leading to "decomposable" probabilistic graphical models (discussed in section "Background"), to make the MML-based scoring function computationally efficient.

According to [40], the minimum message length finds the best model for the observed data by comparing two competing models given the same data \({\mathcal {D}}\). To find the best structures, we encode the graph structure of the shared edges (which we call the super graph \(G_0\)) along with the context-specific GGM structures \(\{G_1,G_2,\ldots ,G_K\}\), their parameters and the data in messages, and compare their message lengths. While using sContChordalysis-MML in PaGIAM, we encode the graph topology \(G_k\), the parameters \(\theta _k\) and the data \({\mathcal {D}}_k\) of each cluster k and then combine them; it does not encode the super graph topology. To improve structure discovery from the heterogeneous data, we encode the super graph \(G_0\) topology, the topologies of the context-specific GGMs \(\{G_1,G_2,\ldots ,G_K\}\), their parameters \(\{\theta _1,\theta _2,\ldots ,\theta _K\}\) and data \(\{{\mathcal {D}}_1,{\mathcal {D}}_2,\ldots ,{\mathcal {D}}_K\}\). To minimize the number of required bits of MML, we only encode the edges of the context-specific GGMs which are not present in the super graph, that is \(G^*_k = G_k - G_0; \forall _{k=\{1,2 \ldots K\}}\).

The MML to encode the two-level HGGM and the data are as follows:

$$\begin{aligned} I({\mathcal {G}},\theta _{{\mathcal {G}}},{\mathcal {D}}|{\mathcal {K}}) = \overbrace{I(G_0)+\sum _{k=1}^{K}{I(G_k^*)}}^{\text {Encoding all graph structures, }I({\mathcal {G}})} + \overbrace{\underbrace{I(\theta _{G_0})}_\text {Parameters of super graph}+ \sum _{k=1}^{K}{\bigg (\underbrace{I(\theta _{G_k})}_\text {Parameters of models} + \underbrace{I({\mathcal {D}}_k|\theta _{G_k},G_k)}_\text {Data fit to the models} \bigg )}}^{\text {Parameters of context-specific graphs with data, }I({\mathcal {D}},\theta )}, \end{aligned}$$
(18)

where \(G_0\) is the shared graphical structure, which we call the super graph structure, \(G_k\) is the context-specific graph structure of the kth component, and \(G_k^*\) is the context-specific graphical structure of component k without the shared edges. Moreover, \(\theta _{G_k} = \{\mu _{G_k},\varSigma _{G_k}\}\) is the set of parameters of the context-specific model of component k, where \(\mu _{G_k}\) and \(\varSigma _{G_k}\) are the mean vector and covariance matrix of the graphical structure of component k. According to [40], Eq. (18) reduces to

$$\begin{aligned} I(G_0)+\sum _{k=1}^{K}{I(G_k^*)} +\sum _{k=1}^{K}{\bigg [\underbrace{-\log {\frac{p(\theta _{G_k})}{\sqrt{|{\mathcal {F}}(\theta _{G_k})|}}}}_{I(\theta _{G_k})}-\underbrace{\sum _{j=1}^{n_k}{\big ({\mathcal {L}}(D_{kj}|\theta _{G_k})+\log {K}\big )}}_{I({\mathcal {D}}_k|\theta _{G_k},G_k)}\bigg ]}, \end{aligned}$$
(19)

where an extra \(\log {K}\) bits are added for each datapoint to select its component id. For our two-level GGM structure discovery setting, the encoding of the model in the message consists of the encoding of the topologies of the super and context-specific chordal graphs and the associated model parameters, which we elaborate in the rest of this section.

Encoding the Graph Structures

We now describe the encoding of the super and context-specific graphical structures. For this purpose, it is sufficient to send the number of nodes and the connected node pairs (edges) of each graphical structure. According to [2], to encode the number of nodes, we need \(\log {n}\) bits. Let us consider a super graph having \(|E_0|\) edges. Therefore, to encode the edges of the super graph, we need \(\log {\left( {\begin{array}{c}mE\\ |E_0|\end{array}}\right) }\) bits, where \(mE = \frac{|V|(|V|-1)}{2}\). We encode only the component-specific edges of each context-specific graphical structure \(G_k\) to prevent multiple appearances of the same edges in different graph structures. The bits required to encode a context-specific graph structure \(G^*_k\) are \(\log {\left( {\begin{array}{c}mE\\ |E_k-E_0|\end{array}}\right) }\). Hence, to encode all graphical models including the shared one, we require

$$\begin{aligned} I(G_0)+\sum _{k=1}^{K}{I(G_k^*)} = \log {n}+\log {\left( {\begin{array}{c}mE\\ |E_0|\end{array}}\right) }+\sum _{k=1}^{K}{\log {\left( {\begin{array}{c}mE\\ |E_k-E_0|\end{array}}\right) }}. \end{aligned}$$
(20)
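Equation (20) can be evaluated with log-binomial coefficients; the sketch below (our own illustration, again in nats via lgamma) computes the structure-encoding length from the edge counts.

```python
import math

def log_binom(n, k):
    """Log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def structure_encoding_length(num_vars, n_samples, e0, context_extra_edges):
    """Eq. (20): length (in nats) to encode the super graph and the
    context-specific graphs without their shared edges.

    e0                  : |E_0|, number of shared (super graph) edges
    context_extra_edges : list of |E_k - E_0| for k = 1..K
    """
    mE = num_vars * (num_vars - 1) // 2
    length = math.log(n_samples) + log_binom(mE, e0)
    for ek in context_extra_edges:
        length += log_binom(mE, ek)
    return length

print(structure_encoding_length(num_vars=50, n_samples=1000,
                                e0=30, context_extra_edges=[10, 12, 8]))
```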

Encoding the Parameters and Data

Once the topologies of the graphs (shared and context-specific) have been encoded, we encode the parameters of all context-specific graphical model structures of the mixture of GGMs as well as the data independently and then merge them. To encode the parameters and data of each context-specific graphical structure, we encode the parameters and data of all maximal cliques and minimal separators separately and then combine them, as ContChordalysis-MML does [36]. Therefore, according to [36], to encode the parameters of a maximal clique (or minimal separator) C of a graphical model, we require

$$\begin{aligned} \log {\frac{p(\theta _{G_k}^C)}{\sqrt{|{\mathcal {F}}(\theta _{G_k}^C)|}}} = \frac{1}{2}\log {|\varSigma _{G_k}^C|}+ \mathrm{{Constant}}. \end{aligned}$$
(21)

Furthermore, we require the following bits to encode the data of a maximal clique (or minimal separator) C of a context-specific graphical model k:

$$\begin{aligned} {\mathcal {L}}({\mathcal {D}}^C_k|\theta _{{G_k}}^C) = -\frac{1}{2}\sum _{j=1}^{n_k}{(D^C_{kj}-\mu _{G_k}^C)\varSigma _{G_k}^{C^{-1}}(D^C_{kj}-\mu _{G_k}^C)^\mathrm{{T}}}-\frac{n_k}{2}\log {|\varSigma _{G_k}^C|}. \end{aligned}$$
(22)
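Putting Eqs. (21) and (22) together, the contribution of a single maximal clique (or minimal separator) to the message is its parameter cost plus the negative log-likelihood of its data. The following sketch is our own illustrative implementation under that reading; constants are dropped, as in Eq. (21).

```python
import numpy as np

def clique_encoding_length(Xc):
    """Encoding length (up to constants) of the parameters and data of one
    maximal clique or minimal separator, given its data matrix Xc (n_k x |C|)."""
    n, d = Xc.shape
    mu = Xc.mean(axis=0)
    sigma = np.atleast_2d(np.cov(Xc, rowvar=False))
    sign, logdet = np.linalg.slogdet(sigma)
    param_len = 0.5 * logdet                          # Eq. (21), constant dropped
    resid = Xc - mu
    quad = np.einsum('ij,jk,ik->', resid, np.linalg.inv(sigma), resid)
    data_len = 0.5 * n * logdet + 0.5 * quad          # negative of Eq. (22), const dropped
    return param_len + data_len

rng = np.random.default_rng(2)
print(clique_encoding_length(rng.standard_normal((100, 3))))
```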

MML-Based Model Selection

In forward selection, a reference model \({\mathcal {M}}\) and a candidate model \({\mathcal {M}}'\) differ by an edge (a, b). According to MML, \({\mathcal {M}}'\) replaces \({\mathcal {M}}\) if encoding the message based on \({\mathcal {M}}'\) requires fewer bits than that based on \({\mathcal {M}}\), i.e., \(I({\mathcal {M}}'|{\mathcal {D}},G')-I({\mathcal {M}}|{\mathcal {D}},G)\ <\ 0\), where G and \(G'\) are the graphical structures of the reference and candidate models, respectively. Here, we present the MML-based scoring function to compare the reference and candidate models of the super and context-specific graphical models. We compute two types of MML scoring functions:

  1. MML score when an edge is added to the super graph and all context-specific graphs.

  2. MML score when an edge is added to only one of the context-specific graphs.

MML Score When an Edge (a, b) is to be Added to the Super Graph

When a candidate edge is added to the super graph structure, it affects both the super and context-specific graph structures and their parameters. Therefore, the MML difference between the candidate and reference graphical structures is as follows:

$$\begin{aligned}&I(G')-I(G)\nonumber \\&\quad = \underbrace{\log {n} + \log {\left( {\begin{array}{c}mE\\ |E_0|+1\end{array}}\right) } + \sum _{k=1}^{K}{\log {\left( {\begin{array}{c}mE\\ |E_k-E_0|+1\end{array}}\right) }}}_\text {Candidate graphical model} \nonumber \\&\qquad - \underbrace{\Bigg (\log {n} + \log {\left( {\begin{array}{c}mE\\ |E_0|\end{array}}\right) } + \sum _{k=1}^{K}{\log {\left( {\begin{array}{c}mE\\ |E_k-E_0|\end{array}}\right) }}\Bigg )}_\text {Reference graphical model} \nonumber \\&\quad = \log {\frac{mE-|E_0|}{|E_0|+1}} +\sum _{k=1}^{K}{\log {\frac{mE-|E_k-E_0|}{|E_k-E_0|+1}}}. \end{aligned}$$
(23)

The addition of an edge to the super graph affects the covariance matrices of the affected and newly formed cliques and separators of all context-specific graphical models. Therefore, we encode the covariance matrices of the affected and newly formed cliques and separators of all context-specific graphical models, which requires the following bits:

$$\begin{aligned}&I({\mathcal {D}},\theta '_{{\mathcal {G}}}) - I({\mathcal {D}},\theta _{\mathcal {G}})\nonumber \\&\quad = \sum _{k=1}^{K}{\big (I(\theta '_{G_k})-I(\theta _{G_k})+I({\mathcal {D}}_k|\theta '_{G'_k},G'_k) - I({\mathcal {D}}_k|\theta _{G_k},G_k)\big )} \nonumber \\&\quad = -\frac{1}{2} \sum _{k=1}^{K}{\bigg [ \log {\frac{|\varSigma _{G_k}^{C_{ab}}| \cdot |\varSigma _{G_k}^{S_{ab}}|}{|\varSigma _{G_k}^{C_{ab}\cap C_a}| \cdot |\varSigma _{G_k}^{C_{ab}\cap C_b}|}} \bigg ]} \nonumber \\&\qquad -\sum _{k=1}^{K}{\Bigg \{ \sum _{j=1}^{n_k}{\bigg ( {\mathcal {L}}(D^{C_{ab}}_{kj}|\theta _{G_k}^{C_{ab}})+{\mathcal {L}}(D^{S_{ab}}_{kj}|\theta _{G_k}^{S_{ab}})-{\mathcal {L}}(D^{C_{ab}\cap C_a}_{kj}|\theta _{G_k}^{C_{ab}\cap C_a}) \nonumber \\&\qquad \quad -{\mathcal {L}}(D^{C_{ab} \cap C_b}_{kj}|\theta _{G_k}^{C_{ab}\cap C_b}) + \log {K} \bigg )} \Bigg \}}. \end{aligned}$$
(24)

Therefore, the MML score difference between reference and candidate models is

$$\begin{aligned}&I({\mathcal {M}}'|{\mathcal {D}},{\mathcal {G}}')-I({\mathcal {M}}|{\mathcal {D}},{\mathcal {G}})\nonumber \\&\quad = \underbrace{I({\mathcal {G}}')-I({\mathcal {G}})}_{\text {Eq. }(23)} + \underbrace{I({\mathcal {D}},\theta ') - I({\mathcal {D}},\theta )}_{\text {Eq. }(24)}. \end{aligned}$$
(25)

MML Score Difference When an Edge (a, b) is to be Added to a Context-Specific Graph

The addition of a candidate edge (a, b) to \(G_k\) of cluster k affects only the corresponding graph structure and its parameters; all the rest remain unchanged. Therefore, the MML difference between the encoded candidate and reference graphical structures is as follows:

$$\begin{aligned}&I({\mathcal {G}}')-I({\mathcal {G}})\nonumber \\&\quad = \underbrace{\log {n} + \log {\left( {\begin{array}{c}mE\\ |E_0|\end{array}}\right) } + \sum _{j=1,\ j\ne k}^{K}{\log {\left( {\begin{array}{c}mE\\ |E_j-E_0|\end{array}}\right) }} + \log {\left( {\begin{array}{c}mE\\ |E_k-E_0|+1\end{array}}\right) }}_\text {Candidate model} \nonumber \\&\qquad - \underbrace{\Bigg (\log {n} + \log {\left( {\begin{array}{c}mE\\ |E_0|\end{array}}\right) } + \sum _{j=1,\ j\ne k}^{K}{\log {\left( {\begin{array}{c}mE\\ |E_j-E_0|\end{array}}\right) }} + \log {\left( {\begin{array}{c}mE\\ |E_k-E_0|\end{array}}\right) }\Bigg )}_\text {Reference model} \nonumber \\&\quad =\log {\frac{mE-|E_k-E_0|}{|E_k-E_0|+1}}. \end{aligned}$$
(26)

As the edge (a, b) is added to \(G_k\) and there is no change between the reference and candidate models of the super graph and of the remaining context-specific graphical structures, the encoding of the data and parameters of \(G_k\) is as follows:

$$\begin{aligned}&I({\mathcal {D}}_k,\theta '_{G_k}) - I({\mathcal {D}}_k,\theta _{G_k})\nonumber \\&\quad = I(\theta '_{G_k})-I(\theta _{G_k})+ I({\mathcal {D}}_k|\theta _{G'_k},G'_k)-I({\mathcal {D}}_k|\theta _{G_k},G_k) \nonumber \\&\quad = -\frac{1}{2}\bigg [ \log {\frac{|\varSigma _{G_k}^{C_{ab}}| \cdot |\varSigma _{G_k}^{S_{ab}}|}{|\varSigma _{G_k}^{C_{ab}\cap C_a}| \cdot |\varSigma _{G_k}^{C_{ab}\cap C_b}|}} \bigg ] - \sum _{j=1}^{n_k}{\bigg ({\mathcal {L}}({\mathcal {D}}_{kj}^{C_{ab}}|\theta _{G_k}^{C_{ab}})+{\mathcal {L}}({\mathcal {D}}_{kj}^{S_{ab}}|\theta _{G_k}^{S_{ab}})} \nonumber \\&\qquad -{\mathcal {L}}({\mathcal {D}}_{kj}^{C_{ab}\cap C_a}|\theta _{G_k}^{C_{ab}\cap C_a})-{\mathcal {L}}({\mathcal {D}}_{kj}^{C_{ab}\cap C_b}|\theta _{G_k}^{C_{ab}\cap C_b})+ \log {K} \bigg ). \end{aligned}$$
(27)

Therefore, the MML score difference between reference and candidate models is

$$\begin{aligned}&I({\mathcal {M}}'|{\mathcal {D}},{\mathcal {G}}')-I({\mathcal {M}}|{\mathcal {D}},{\mathcal {G}})\nonumber \\&\quad = \underbrace{I({\mathcal {G}}')-I({\mathcal {G}})}_{\text {Eq. }(26)} + \underbrace{I({\mathcal {D}},\theta '_{G'_k}) - I({\mathcal {D}},\theta _{G_k})}_{\text {Eq. }(27)}. \end{aligned}$$
(28)

The Forward Selection Algorithm

To discover the shared and context-specific GGMs, we modify sContChordalysis-MML. Initially, the algorithm computes the MML to encode the parameters and data of each edge. Then, it adds the best edges incrementally, either to both the super and context-specific graphical structures or to one of the context-specific graphical model structures, based on the MML score, while maintaining the chordality of the graph structures. After the addition of the best edge, we update the candidature and the MML to encode the parameters and data of the candidate edges according to the procedures mentioned in section "Scalable ContChordalysis-MML". We call our MML-based algorithm two-level Decomposable Gaussian graphical models Discovery using MML, or tGDM. In the Gaussian graphical model step (lines 9–11) of the PaGIAM algorithm (Algorithm 3), we use the tGDM algorithm to discover the context-specific graphical models instead of sContChordalysis-MML. We call the updated algorithm the PaGIAM–tGDM algorithm.

MML Score for PaGIAM–tGDM to Discover the Two-Level HGGM

Equation (17) is used for the PaGIAM algorithm, which uses sContChordalysis-MML to discover the HGGM. However, with sContChordalysis-MML, PaGIAM cannot discover the shared graphical model, whereas tGDM discovers both the shared and context-specific graphical models. Therefore, the final MML score for PaGIAM–tGDM is also modified, as follows:

$$\begin{aligned} \mathrm{{MML}} \, = \, & {} \underbrace{\log {n}+\log {\left( {\begin{array}{c}mE\\ |E_0|\end{array}}\right) }+\sum _{k=1}^{K}{\log {\left( {\begin{array}{c}mE\\ |E_k-E_0|\end{array}}\right) }}}_{\text {Encoding the graph structure}} \nonumber \\&+ \sum _{k=1}^{K}\Biggl ( \underbrace{\frac{1}{2}\sum _{C \in {\mathcal {C}}_k}\Big ( (n_k^C-1) \log {|\varSigma _{G_k}^C|} + \big (({\mathcal {D}}_k^C-\mu _{G_k}^C)\varSigma _{G_k}^{C^{-1}}({\mathcal {D}}_k^C-\mu _{G_k}^C)^\mathrm{{T}}\big ) \Big )}_{\text {Encoding the parameters and data of the maximal cliques of the graphs}} \nonumber \\&+ \underbrace{\frac{1}{2}\sum _{S \in {\mathcal {S}}_k}\Big ( (n_k^S-1) \log {|\varSigma _{G_k}^S|} + \big (({\mathcal {D}}_k^S-\mu _{G_k}^S)\varSigma _{G_k}^{S^{-1}}({\mathcal {D}}_k^S-\mu _{G_k}^S)^\mathrm{{T}}\big ) \Big )}_{\text {Encoding the parameters and data of the minimal separators of the graphs}} \Biggr ) \nonumber \\&+ \underbrace{\frac{K-1}{2} \log {\big (\frac{n}{12}+1\big )} - \log {(K-1)!} - \sum _{k=1}^{K}{\Big (\Big (n_k+\frac{1}{2}\Big )\log {\frac{n_k+\frac{1}{2}}{n+\frac{K}{2}}}\Big )}}_{\text {Encoding the cluster}} + \underbrace{\sum _{k=1}^{K}{\frac{1}{2} \log {(|\varSigma _k|)}}}_{\text {Encoding the parameters of the clusters}}. \end{aligned}$$
(29)

In the PaGIAM algorithm, Eq. (17) is replaced by Eq. (29) to discover the shared and context-specific graphical models from the high-dimensional heterogeneous Gaussian data. Equation (29) encodes the clusters, the two-level HGGM (i.e., the shared and context-specific GGMs), their parameters and the data. Therefore, the MML score of Eq. (29) makes PaGIAM–tGDM capable of discovering both shared and context-specific GGMs.

Experiments

We compare the performance of our method: PaGIAM–tGDM with strong baselines on synthetic data and real cancer data.

Synthetic Data

Parameters for Synthetic Data

We generate synthetic multi-dimensional datasets based on mixtures of Gaussians. We cover a wide range of datasets with different properties by varying the following aspects:

  • |V|: the number of variables, ranges in \(\{10, 100, 1000, 5000\ \text {and}\ 10{,}000\}\).

  • n: the number of samples, ranges in \(\{100, 1000, 10{,}000, 50{,}000\}\).

  • K: the number of clusters, ranges in \(\{1, 2, 3, 4, 5\}\).

  • \(|{\mathcal {C}}|\): the maximal clique size for graphical structures in the mixture model. According to [4], in real-world networks, every new node is born with some edge connections to existing nodes, which produces a connected graph. Therefore, the minimal clique size would be at least 2. We consider \(|{\mathcal {C}}|\) varying from 2 to 6.

  • \(\alpha\): controls the spread of the sampled mixing coefficients in the mixture models. There are two possibilities for the mixing coefficients: (a) all clusters have equal frequencies or (b) some clusters have unequal frequencies. Moreover, the numbers of samples per cluster are discrete and multinomially distributed, and the Dirichlet distribution can be used as the conjugate prior of the multinomial distribution. Therefore, we assume that the mixing coefficients are Dirichlet distributed, with concentration parameter \(\alpha\) ranging in \(\{100, 10, 1, 0.1\}\). When \(\alpha = 100\), all coefficients are approximately equal, whereas the frequencies (i.e., mixing coefficients) tend to be different when \(\alpha = 0.1\).

  • \(\delta\): controls the statistical associations between the random variables. The statistical association between two nodes ranges between \(-\) 1 and 1, expressing the degree of association. As it approaches zero, the two variables are not statistically associated; the closer it is to either 1 or \(-\) 1, the stronger the association between the variables. In the experiments, we consider the parameter \(\delta\), which inversely controls the statistical association between the random variables and ranges in \(\{1,\ 5,\ 10,\ 25,\ 50,\ 100,\ 250,\ 500\}\). We refer to this parameter as the inverse correlation parameter.

To assess the performance of our methods against the baselines, we vary each of the parameters mentioned above in turn, having set the base configuration to |V| = 1000, n = 10,000, K = 3, \(|{\mathcal {C}}|\) = 3, \(\alpha =100\) and \(\delta\) = 50. Moreover, we also assess the performance by varying the number of samples as mentioned above while setting the base configuration to |V| = 10,000, K = 3, \(|{\mathcal {C}}|\) = 3, \(\alpha =100\) and \(\delta\) = 50.

Graph Structure Generation

For each experimental setup, we first generate the graph structures and then the dataset. To generate the graph structures, we maintain the real-world network properties [7]: (a) many small nodes are connected with a few hubs, known as the power-law property, (b) the average path length between two nodes is short, and (c) new nodes prefer to attach to well-connected nodes over less-well-connected nodes, known as the preferential attachment property.

Barabási and Albert [4] proposed a model to generate scale-free graphs having the above-mentioned properties. We use the Barabási–Albert (BA) method to generate graph structures with the properties of real-world networks. This model allows us to control the number of nodes |V| and the maximal clique size \(|{\mathcal {C}}|\), which controls the edge density of the graph.

To generate the graph structures for the synthetic data, we perform the following steps:

  • First we generate K number of context-specific graphs using Barabási–Albert (BA) method.

  • We then identify the super graph structure \(G_0 =\{V,E_0\}\) by \(E_0 = \bigcap _{k=1}^{K}{E_k}\).

Moreover, as we use decomposable models to discover graphical structures, we add the condition that both the generated super and context-specific graphs are chordal. If the identified super graph \(G_0\) is not chordal, we add edges to make it chordal, and we then add these new edges of the super graph to all context-specific graphs. We use the candidate edge selection process of the ContChordalysis algorithm [36] to maintain the chordality of the generated graphs.
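A sketch of this graph-generation procedure is shown below (our own illustration using NetworkX; for simplicity it uses NetworkX's chordal completion routine instead of the candidate edge selection process of ContChordalysis [36], and it does not propagate the super graph's fill-in edges back to the context-specific graphs).

```python
import networkx as nx

def make_chordal_ba_graph(num_nodes, m, seed=0):
    """Generate a scale-free graph and, if needed, complete it to a chordal graph."""
    g = nx.barabasi_albert_graph(num_nodes, m, seed=seed)
    if not nx.is_chordal(g):
        g, _ = nx.complete_to_chordal_graph(g)        # adds fill-in edges
    return g

def shared_graph(context_graphs):
    """Super graph G_0: the edges shared by all context-specific graphs."""
    edge_sets = [{frozenset(e) for e in g.edges()} for g in context_graphs]
    shared = set.intersection(*edge_sets)
    g0 = nx.Graph()
    g0.add_nodes_from(context_graphs[0].nodes())
    g0.add_edges_from(tuple(e) for e in shared)
    return g0

graphs = [make_chordal_ba_graph(30, 2, seed=s) for s in range(3)]
g0 = shared_graph(graphs)
print(g0.number_of_edges())
```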

Synthetic Gaussian Data Generation

Having the graph structures, we generate the context-specific precision matrix of \(G_k\) using the following equation:

$$\begin{aligned} \varSigma _{G_k}^{-1}(x,y) = \left\{ \begin{array}{ll} (1/\delta ) \cdot 1/|V| \cdot \mathrm{{adj}}_{G_k}(x,y) &{} \text {if node }x\ \ne \text { node }y\text { and }\mathrm{{adj}}_{G_k}(x,y) = 1 \\ 1 \cdot \mathrm{{adj}}_{G_k}(x,y) &{} \text {if node }x\ =\text { node }y\text { and }\mathrm{{adj}}_{G_k}(x,y) = 1 \\ 0 &{} \text {if }\mathrm{{adj}}_{G_k}(x,y) \ne 1, \end{array} \right. \end{aligned}$$

where \(\mathrm{{adj}}_{G_k}\) is the adjacency matrix of a context-specific graph \(G_k\). Finally, we generate data using \({\mathcal {D}} = \bigcup _{k=1}^{K}{\big \{{\mathcal {D}}_k \sim {\mathcal {N}}_d(0,\varSigma _{G_k})\big \}}\), where the number of samples of cluster k would be \(n_k = n \cdot \gamma _k\) and \(\gamma _k \sim {\mathcal {D}}ir(\alpha )\).
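The data-generation step can then be sketched as follows (our own illustration with hypothetical parameter values): the precision matrix is built from the adjacency matrix as above, inverted to obtain the component covariance, and the per-component sample counts are drawn from Dirichlet-distributed mixing coefficients (here via a multinomial draw rather than the exact product \(n \cdot \gamma _k\)).

```python
import numpy as np
import networkx as nx

def precision_from_graph(g, delta):
    """Build a context-specific precision matrix from the graph's adjacency matrix."""
    adj = nx.to_numpy_array(g)
    d = adj.shape[0]
    prec = (1.0 / delta) * (1.0 / d) * adj            # weak off-diagonal entries on edges
    np.fill_diagonal(prec, 1.0)                       # unit diagonal
    return prec

def sample_mixture(graphs, n, delta, alpha, seed=0):
    rng = np.random.default_rng(seed)
    gamma = rng.dirichlet([alpha] * len(graphs))      # mixing coefficients
    counts = rng.multinomial(n, gamma)                # per-component sample counts
    data, labels = [], []
    for k, (g, nk) in enumerate(zip(graphs, counts)):
        cov = np.linalg.inv(precision_from_graph(g, delta))
        data.append(rng.multivariate_normal(np.zeros(len(cov)), cov, size=nk))
        labels += [k] * nk
    return np.vstack(data), np.array(labels)

graphs = [nx.barabasi_albert_graph(20, 2, seed=s) for s in range(3)]
X, z = sample_mixture(graphs, n=500, delta=50, alpha=100)
print(X.shape)                                        # (500, 20)
```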

Real Data

We use two gene expression datasets to evaluate our methods: Breast cancer and Glioblastoma tumor data. We have verified that these real data are Gaussian distributed.

Breast Cancer

Breast cancer is a hormone-related cancer [23] and it has two major subtypes:

  • Estrogen receptor positive (ER+). It is estimated that around 80% of all breast cancers are ER+. The survival rate of this subtype is better than that of ER−, and it responds to hormone therapy.

  • Estrogen receptor negative (ER−). Its survival rate is poorer; due to the absence of the estrogen receptor hormone, it does not respond to hormone therapy.

The presence of the estrogen receptor hormone in breast cancer plays an important role in therapeutic strategies and survival rates. We use a breast cancer dataset containing the expression of 4512 genes from an Affymetrix HU95aV2 microarray for 148 samples, which have been chemically synthesized by [34]. These breast cancer data are a mixture of estrogen receptivity (ER+/ER−) subtypes, where each tumor sample in the dataset has an additional classification tag based on its estrogen receptivity (ER+/ER−). Moreover, we consider the gene-pairs chemically discovered by [34] as the gold standard.

Glioblastoma Tumor

Verhaak et al. [39] studied glioblastoma tumor gene expression data with 173 samples and 8271 genes. Verhaak et al. [39] chemically synthesized a dataset containing tumor samples of four disease subtypes. They did not identify whether a gene-pair is present in a subtype or not, whereas [6, 25, 27, 29] identified 10 important gene-pairs that cause the appearance of glioblastoma tumor cells. In Table 1, we report them together with their presence in each disease subtype.

Table 1 Presence (cross marked) of genes and their pathways in each subtype of glioblastoma tumor cells

In the Glioblastoma tumor experiment, we investigate the performance of our methods and the baselines to predict the above mentioned 10 prominent gene-pairs from these large data.

Evaluation Metrics

We evaluate the results using context-specific \(\text {recall}\), \(\text {precision}\) and \(\text {FMeasure}\). \(\text {Recall}\) is the fraction of correctly predicted edges with respect to the true edges. \(\text {Precision}\) is the fraction of correctly predicted edges (i.e., associations) with respect to all predicted edges. \(\text {FMeasure}\) is the harmonic mean of precision and recall, i.e., \(\text {FMeasure} = \frac{2 \times \text {Precision} \times \text {Recall}}{\text {Precision} + \text {Recall}}\). The average \(\text {FMeasure}\) is taken as the accuracy of a method.
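For completeness, a small sketch (our own illustration) computing these edge-level metrics from predicted and gold-standard edge sets:

```python
def edge_metrics(predicted, gold):
    """Precision, recall and F-Measure over undirected edge sets."""
    pred = {frozenset(e) for e in predicted}
    true = {frozenset(e) for e in gold}
    tp = len(pred & true)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(true) if true else 0.0
    f_measure = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f_measure

# toy example: two of three predicted edges are correct
print(edge_metrics([(0, 1), (1, 2), (2, 3)], [(0, 1), (1, 2), (1, 3)]))
```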

The algorithm tGDM generates \(K+1\) graphical structures. The corresponding gold standard graph of a predicted graph is unknown in our synthetic data experiments, since potentially each discovered graph can be matched with each of the ground truth clusters. Therefore, we compute the False Positive Rate (FPR), False Negative Rate (FNR) and error for the best-matched predicted graph of a gold standard graph. The predicted network G having minimal error with respect to a gold standard \(G_\mathrm{{gold}}\) is the best-matched discovered graph for that gold standard.

PaGIAM–tGDM Variant Baselines for Synthetic Data Experiments

We also compare the performance of our MML-based scoring approach with two other scoring functions: AIC (Akaike Information Criterion) [1] and BIC (Bayesian Information Criterion) [38], used as variants of both PaGIAM and tGDM.

The AIC and BIC variants of the PaGIAM algorithm are referred to as PaGIAA and PaGIAB, respectively, and their AIC and BIC scoring functions are as follows:

$$\begin{aligned} \mathrm{{AIC}}_\mathrm{{PaGIAA}}= & {} -2{\mathcal {L}}({\mathcal {D}}|\theta )+2k\ \ \ \text {and}\\ \mathrm{{BIC}}_\mathrm{{PaGIAB}}= & {} -2{\mathcal {L}}({\mathcal {D}}|\theta )+k\log {n} \end{aligned}$$

Similarly, the AIC and BIC variants of the tGDM algorithm are referred to as tGDA and tGDB, respectively, and their AIC and BIC scoring functions are as follows:

$$\begin{aligned} \mathrm{{AIC}}_\mathrm{{tGDA}}= & {} -2{\mathcal {L}}({\mathcal {D}}|\theta )+2|E|\ \ \ \text {and}\\ \mathrm{{BIC}}_\mathrm{{tGDB}}= & {} -2{\mathcal {L}}({\mathcal {D}}|\theta )+|E|\log {n} \end{aligned}$$
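For reference, a small helper (illustrative name; the caller supplies the fitted log-likelihood, the number of free parameters k, the edge count |E| and the number of datapoints n) that evaluates these four variant scores might look as follows.

import math

def aic_bic_scores(loglik, k, n_edges, n):
    """AIC/BIC variant scores used by the PaGIAA/PaGIAB and tGDA/tGDB baselines.

    loglik  : log-likelihood L(D | theta) of the fitted model
    k       : number of free parameters (PaGIAA / PaGIAB variants)
    n_edges : number of edges |E| (tGDA / tGDB variants)
    n       : number of datapoints
    """
    return {
        "AIC_PaGIAA": -2 * loglik + 2 * k,
        "BIC_PaGIAB": -2 * loglik + k * math.log(n),
        "AIC_tGDA": -2 * loglik + 2 * n_edges,
        "BIC_tGDB": -2 * loglik + n_edges * math.log(n),
    }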

In the synthetic data experiments, we consider PaGIAM–tGDA, PaGIAM–tGDB, PaGIAA–tGDM, and PaGIAB–tGDM as the baselines of our MML-based method. We also compare against the PaGIAM–sContChordalysis-MML approach discussed in section "Discovery of the Decomposable HGGMs".

Recent Strong Baselines

In both synthetic and real data experiments, we evaluate our method, PaGIAM–tGDM, against recent strong baselines: New-SP (New-Structural-Pursuit) [15] and JSEM (Joint Structural Estimation Method) [23]. In the real-data experiments, PaGIAM–tGDM and PaGIAM–sContChordalysis-MML (Footnote 8) are used along with New-SP and JSEM. New-SP and JSEM estimate context-specific GGMs with shared edges in the framework of a Gaussian mixture model.

New-SP uses the hard EM algorithm [9] to cluster the data and the Joint Fused Graphical Lasso proposed in [8] to estimate the context-specific GGMs. Besides, JSEM uses the Graphical Lasso [13] and the Group Lasso [5] to infer the context-specific GGM structures. Both methods use a penalized likelihood as the objective function to discover the context-specific GGMs.

Results and Discussion

We have implemented PaGIAM–tGDM and its flat mixture, AIC and BIC variants in Python 3.7. New-SP and JSEM are implemented as R packages available on CRAN. All experiments are run on a desktop with an Intel Core i5 3.2 GHz CPU and 24 GB of RAM.

Moreover, in each experiment, we start our PaGIAM (Algorithm 3) assuming the mixture has a single component, and then keep increasing the number of components until the MML score (Eq. 17) identifies the best partition of the data and the corresponding context-specific graphical models.
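A sketch of this outer loop over the number of components is shown below; fit_fn and score_fn are hypothetical stand-ins for the PaGIAM fitting step (Algorithm 3 together with tGDM) and the MML criterion of Eq. 17.

def select_num_components(data, fit_fn, score_fn, max_components=20):
    """Increase K until the MML score stops improving.

    fit_fn(data, K)  -- hypothetical stand-in: fits the PaGIAM-tGDM model with K components.
    score_fn(model)  -- hypothetical stand-in: returns the MML message length (shorter is better).
    """
    best_model, best_score = None, float("inf")
    for K in range(1, max_components + 1):
        model = fit_fn(data, K)
        score = score_fn(model)
        if score < best_score:
            best_model, best_score = model, score
        else:
            break  # message length no longer decreases; keep the previous best
    return best_model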

Synthetic Data

We compare PaGIAM–tGDM with its variants and the strong baselines on synthetic data under different experimental setups.

Table 2 Performance of PaGIAM–tGDM and its variants and strong baselines on the synthetic data by varying the number of variables |V|
Table 3 Performance of PaGIAM–tGDM and its variants and strong baselines on the synthetic data by varying the size of maximal cliques \(|{\mathcal {C}}|\)

Varying the number of variables |V|: In this experimental setup, we vary the number of variables (i.e., the number of graph nodes) over 10, 100, 1000, 5000 and 10,000. Table 2 reports the recall (Re), precision (Pr) and F-Measure (FM) of PaGIAM–tGDM and the other baselines. PaGIAM–tGDM outperforms all of the competitive baselines. As Table 2 shows, the FM of all methods decreases as the number of variables |V| increases, since both Re and Pr decrease with the number of variables.

However, the edge detection accuracy of PaGIAM–tGDM remains higher than that of its variants in this synthetic data experiment. More specifically, the Re, Pr, and FM of PaGIAM–tGDM and PaGIAM–sContChordalysis-MML are better than those of the other variants. Therefore, we can say that MML outperforms AIC and BIC as an objective function, since MML helps the PaGIAM algorithm (Algorithm 3) partition the data more accurately. Friedman [14] reported that BIC and AIC do not work well for partitioning data and produce many wrongly clustered datapoints. Since tGDM uses the complete data to discover the super graph, the super graphs detected by PaGIAB–tGDM and PaGIAA–tGDM are similar to the super graph of PaGIAM–tGDM. Due to the presence of wrongly assigned data in each cluster, however, the tGDM step inside PaGIAB and PaGIAA is not able to detect many true context-specific edges, which affects their \(\text {FMeasure}\)s.

Giraud [16] points out a limitation of BIC and AIC: they do not perform well on large high-dimensional data. PaGIAM–tGDB and PaGIAM–tGDA use BIC and AIC as the scoring functions to add edges to the candidate graphical models. Hence, tGDB and tGDA detect fewer edges than tGDM and sContChordalysis-MML.

sContChordalysis-MML is only able to detect the context-specific graphical models; many common or shared edges between the context-specific graphical models are not detected. This affects the performance of PaGIAM–sContChordalysis-MML, which is therefore outperformed by PaGIAM–tGDM.

New-SP and JSEM use the Graphical Lasso (GLasso) and a penalized likelihood as their objective function to find only the optimal context-specific graphical model structures. In GLasso, the regularization parameter is not estimated properly from the data, which affects the penalized likelihood and the estimation of the context-specific graphical models along with their super graph. Moreover, neither method detects the shared edges, and many true edges are not discovered by these methods. This affects the Re, Pr and FM of their outputs. Hence, New-SP and JSEM do not perform as well as PaGIAM–tGDM.

Varying the maximal clique size \(|{\mathcal {C}}|\): In this experiment, we vary the maximal clique size from 2 to 6. Table 3 reports the Re, Pr and FM of our method and the other baselines. PaGIAM–tGDM outperforms all of the competitive baselines. When the maximal clique size is two, the degree of every vertex is one; all methods except PaGIAM–tGDB and PaGIAM–tGDA then detect most of the true edges, and their FM is high. As the maximal clique size increases, the FM of all methods decreases; that is, the maximal clique size inversely affects the FM. However, among all methods, PaGIAM–tGDM detects more true edges than the others regardless of the maximal clique size, and therefore its FM is higher.

Varying the number of datapoints n: We carried out two experiments varying the number of samples, with the number of variables set to 1 K (i.e., 1000) and 10 K (i.e., 10,000). The Re, Pr, and FM results of both experiments are presented in Tables 4 and 5. From both tables, we find that PaGIAM–tGDM outperforms all other methods. As the number of samples increases, PaGIAM–tGDM detects more true edges accurately and its FM increases. Similar trends are also found for the other methods, but they are not as good as PaGIAM–tGDM. Hence, PaGIAM–tGDM works efficiently on multivariate Gaussian data of any sample size.

Table 4 Performance of PaGIAM–tGDM and its variants and strong baselines on the synthetic data by changing the number of datapoints when |V| = 1000
Table 5 Performance of PaGIAM–tGDM and its variants and strong baselines on the synthetic data by changing the number of datapoints when |V| = 10,000

Varying the number of components K: Table 6 reports the performance of the methods for different numbers of components (i.e., clusters) K in the mixture. As the number of clusters increases, the Re, Pr, and FM of all methods decrease, because the amount of wrongly clustered data also increases, which degrades the results of all methods. Again, PaGIAM–tGDM outperforms all of the competitive baselines.

Table 6 Performance of PaGIAM–tGDM and its variants and strong baselines on the synthetic data by varying the number of components |K|

Varying the spread of the mixing coefficients \(\alpha\): Table 7 presents the performance of the methods as the spread \(\alpha\) of the sampled mixing coefficients varies. As \(\alpha\) increases, the randomness of the cluster proportions decreases and they tend toward uniform, which affects the results of all methods. However, our PaGIAM–tGDM outperforms all other methods, which indicates that PaGIAM–tGDM can work on heterogeneous data with any kind of cluster proportions.

Table 7 Performance of PaGIAM–tGDM and its variants and strong baselines on the synthetic data by varying the spread of sampled mixing coefficients \(\alpha\)

Varying the inverse correlation \(\delta\): Correlation expresses the statistical association between random variables and strongly influences the covariance matrices. According to Table 8, increasing the inverse correlation parameter \(\delta\) weakens the correlations encoded in the covariance matrices and causes the FMeasure to decrease. Our PaGIAM–tGDM detects more than 55% of the true edges even when only very small correlations exist between variables, whereas the other methods do not detect even 50% of the true edges.

Table 8 Performance of PaGIAM–tGDM and its variants and strong baselines on the synthetic data by varying the inverse correlation parameter \(\delta\)

By clustering the data and discovering the context-specific graphical models with shared edges accurately, PaGIAM–tGDM outperforms all other methods across the different experimental setups. Therefore, PaGIAM–tGDM is a statistically efficient method to predict both shared and context-specific dependencies from heterogeneous data.

Real-World Data

In the real-data experiments, we compare our PaGIAM–tGDM and its flat mixture variant, PaGIAM–sContChordalysis-MML, with the strong baselines New-SP and JSEM to discover the context-specific graphical models along with shared edges from the breast cancer and Glioblastoma tumor data.

Breast Cancer Data

Table 9 presents the Re, Pr and FM of our method versus the baselines. We again see the same trend: PaGIAM–tGDM outperforms the other methods in terms of these performance measures.

Table 9 Performance of PaGIAM–tGDM and its sContChordalysis-MML variants and strong baselines on the breast cancer data

It is known that many gene-pairs can be responsible for the appearance of cancer cells in the human body. We are interested in which important gene-pairs are detected by the methods. Following [35], we select 50 important gene-pairs that cause the appearance of cancer cells in human breast tissue. Figure 2 shows that PaGIAM–tGDM detects 22 gene-pairs present in both ER+ and ER− subtypes, whereas the strongest baseline, JSEM, detects just 15 gene-pairs in both subtypes. New-SP and PaGIAM–sContChordalysis-MML discover fewer than 15 gene-pairs. Based on this evaluation, our method PaGIAM–tGDM outperforms the existing strong baselines by discovering more true important gene-pairs.

Fig. 2 Discovery of 50 important breast cancer-causing gene-pairs present in ER+ and ER− subtypes by competing methods

Glioblastoma Tumor Data

We also test PaGIAM–tGDM on [39]'s Glioblastoma tumor data against New-SP, JSEM, and PaGIAM–sContChordalysis-MML. Due to the unavailability of gold-standard data, we compare the appearance of the 10 gene-pairs in the Glioblastoma tumor and its subtypes as discovered by the different methods. Figure 3 shows the discovery of the 10 gene-pairs in the different subtypes of Glioblastoma by each method, along with the gold standard. PaGIAM–tGDM accurately detects eight gene-pairs, including their presence in subtypes, whereas New-SP and JSEM detect seven and six gene-pairs, respectively. Based on the results of this experiment, PaGIAM–tGDM accurately detects important gene-pairs, including their presence in the different subtypes of the Glioblastoma tumor data.

Fig. 3 Discovery of 10 important gene-pairs of Glioblastoma tumor by competing methods

Overall, the results on synthetic and real cancer data indicate that PaGIAM–tGDM is more accurate in predicting the context-specific dependencies compared to the baselines.

Conclusion

We have proposed a statistically efficient method to discover the hierarchical Gaussian graphical model (HGGM) structure of multivariate data with a very large number of variables. We introduced PaGIAM (for the flat mixture) and tGDM (for the two-level HGGM, i.e., shared and context-specific structures), based on a novel MML-based criterion for structure discovery of HGGMs. PaGIAM clusters the data and detects the two-level HGGM using the EM algorithm, while tGDM is a stepwise greedy algorithm that incrementally adds the best edges minimizing the MML-based scoring function. Both work with chordal graphs and decomposable models to make the computation of the MML-based test statistics efficient. We have presented extensive empirical results on synthetic and real-life cancer datasets, and shown that our PaGIAM–tGDM method outperforms strong baselines in the accurate prediction of shared and context-specific associations from high-dimensional heterogeneous Gaussian data.

However, our PaGIAM–tGDM can be extended in several directions. In PaGIAM, we assumed that the number of components is user-defined; PaGIAM could be extended to infinite flat mixture models using, for example, the Indian Buffet Process. Although multivariate Gaussian distributions are good approximations for many real-world phenomena, there are real-life data which may be better captured by other forms of distributions. Some applications (e.g., networks of verbal autopsy data) are not Gaussian but multinomially distributed. Therefore, we are interested in extending our framework to capture a broader class of distributions governing the data.

Notes

  1. A clique is a subset of vertices of an undirected graph such that every two distinct vertices in the clique are adjacent [42]. A maximal clique is a clique that cannot be extended by including one more adjacent vertex, that is, a clique which does not exist exclusively within the vertex set of a larger clique [42].

  2. In graph theory, the term "null graph" refers to a graph without any edges, a.k.a. the "empty graph" [42].

  3. In the real world, heterogeneous GGM data exhibit a relatively small number of components compared with the number of datapoints. Therefore \(K \ll n\), and \(-\log {K!}\) does not affect the total number of bits required to encode the clustering coefficients and the contents.

  4. The graphical structure at the top level is the graphical structure with shared edges. At the lower level, all the context-specific graphical structures are placed. That is why the model is called a two-level hierarchical Gaussian graphical model.

  5. \(\mathrm{{FPR}} = \frac{{\text {FP}}}{{\text {TP}}+{\text {FP}}}\), where TP is the number of predicted edges present in the gold standard and FP is the number of predicted edges not present in the gold standard.

  6. \(\mathrm{{FNR}} = \frac{{\text {FN}}}{{\text {TN}}+{\text {FN}}}\), where TN is the number of predicted conditional independencies present in the gold standard and FN is the number of predicted conditional independencies not present in the gold standard.

  7. error = FNR + FPR.

  8. Except for PaGIAM–tGDM and PaGIAM–sContChordalysis-MML, the other baselines of the synthetic data experiments do not perform well. For this reason, we do not use these baselines for the real-world data.

References

  1. Akaike H. Information theory and an extension of the maximum likelihood principle. In: Second international symposium on information theory; 1973. p. 267–281.

  2. Allisons L. Encoding General Graphs. 2017. http://www.allisons.org/ll/MML/Structured/Graph/. Accessed 1 Apr 2020.

  3. Armstrong H, et al. Bayesian covariance matrix estimation using a mixture of decomposable graphical models. Stat Comput. 2009;19:303–16.

  4. Barabási AL, Albert R. Statistical mechanics of complex networks. Rev Mod Phys. 2002;74(1):47–97.

  5. Breheny P, Huang J. Penalized methods for bi-level variable selection. Stat Inference. 2009;2(3):369–80.

  6. Brennan C, et al. The somatic genomic landscape of glioblastoma. Cell. 2013;155(2):462–77.

  7. Clauset A, et al. Power-law distributions in empirical data. SIAM Rev. 2007;51:661–703.

  8. Danaher P, et al. The Joint Graphical Lasso for inverse covariance estimation across multiple classes. J R Stat Soc. 2014;76(2):373–97.

  9. Dempster A, et al. Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc. 1977;39(1):1–39.

  10. Deshpande A, et al. Efficient stepwise selection in decomposable models. In: Proceedings of the seventeenth conference on uncertainty in artificial intelligence; 2001. p. 128–135.

  11. Dowe D, et al. MML estimation of the parameters of the spherical Fisher distribution. Algorithmic Learn Theory. 1996;1160:213–27.

  12. Dwyer P. Some applications of matrix derivatives in multivariate analysis. J Am Stat Assoc. 1967;62:607–25.

  13. Friedman J, et al. Sparse inverse covariance estimation with the graphical lasso. Biostatistics. 2008;9:432–41.

  14. Friedman N. The Bayesian structural EM algorithm. In: Proceedings of the fourteenth conference on uncertainty in artificial intelligence (UAI); 1998. p. 129–138.

  15. Gao C, et al. Estimation of multiple networks in Gaussian mixture models. Electron J Stat. 2016;10:1133–54.

  16. Giraud C. Introduction to high-dimensional statistics. Boca Raton: Chapman and Hall/CRC; 2014.

  17. Guavain JL, Lee CH. Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Trans Speech Audio Process. 1998;2(2):291–8.

  18. Guo J, et al. Joint estimation of multiple graphical models. Biometrika. 2011;98(1):1–15.

  19. Hao B, et al. Simultaneous clustering and estimation of heterogeneous graphical models. J Mach Learn Res. 2018;18(217):1–58.

  20. Kumar M, Koller D. Learning a small mixture of trees. In: Advances in neural information processing systems; 2009. p. 1051–1059.

  21. Lauritzen S. Graphical models. Oxford: Clarendon Press; 1996.

  22. Li Z, et al. Bayesian joint spike-and-slab graphical lasso. In: Proceedings of the 36th international conference on machine learning, vol. 97; 2019. p. 3877–3885.

  23. Ma J, Michailidis G. Joint structural estimation of multiple graphical models. J Mach Learn Res. 2016;17:1–48.

  24. Maretic H, Frossard P. Graph Laplacian mixture model. arXiv:1810.10053. 2018.

  25. McLendon R, et al. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature. 2008;455(7216):1061–8.

  26. Meilă M, Jordan MI. Learning with mixtures of trees. J Mach Learn Res. 2000;1:1–48.

  27. Mirzaa G, et al. De novo CCND2 mutations leading to stabilization of cyclin D2 cause megalencephaly–polymicrogyria–polydactyly–hydrocephalus syndrome. Nat Genet. 2014;46(5):510–4.

  28. Mukherjee C, Rodriguez A. GPU-powered shotgun stochastic search for Dirichlet process mixtures of Gaussian graphical models. J Comput Graph Stat. 2016;25(3):762–88.

  29. Narita Y, et al. Mutant epidermal growth factor receptor signalling down-regulates p27 through activation of the phosphatidylinositol 3-kinase/AKT pathway in glioblastomas. Cancer Res. 2002;62(22):6764–9.

  30. Oliver J, et al. Unsupervised learning using MML. In: Proceedings of the 13th international conference on machine learning; 1996. p. 364–372.

  31. Peterson C, et al. Bayesian inference of multiple Gaussian graphical models. J Am Stat Assoc. 2015;110(509):159–74.

  32. Petitjean F, Webb G. Scaling log-linear analysis to datasets with thousands of variables. In: SIAM international conference on data mining; 2015. p. 469–477.

  33. Petitjean F, et al. A statistically efficient and scalable method for log-linear analysis of high-dimensional data. In: Proceedings of the IEEE international conference on data mining (ICDM); 2014. p. 110–119.

  34. Pittman J, et al. Integrated modeling of clinical and gene expression information for personalized prediction of disease outcomes. Proc Natl Acad Sci USA. 2004;101:8431–6.

  35. Pujana MA, et al. Network modeling links breast cancer susceptibility and centrosome dysfunction. Nat Genet. 2007;39:1338–49.

  36. Rahman M, Haffari G. A statistically efficient and scalable method for exploratory analysis of high-dimensional data. SN Comput Sci. 2020;1(2):1–17.

  37. Rodriguez A, et al. Sparse covariance estimation in heterogeneous samples. Electron J Stat. 2011;5:981–1014.

  38. Schwarz G. Estimating the dimension of a model. Ann Stat. 1978;6:461–4.

  39. Verhaak R, et al. Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR and NF1. Cancer Cell. 2010;17(1):98–110.

  40. Wallace C, Boulton D. An information measure for classification. Comput J. 1968;11:185–94.

  41. Wallace C, Dowe D. MML clustering of multi-state, Poisson, von Mises circular and Gaussian distributions. J Stat Comput. 2000;10:173–83.

  42. West DB. Introduction to graph theory. London: Pearson; 2001.


Acknowledgements

We are thankful to Monash University for the financial support towards this research. We are also thankful to Dr. Francois Petitjean for his valuable advice on the development of the two-level HGGM.

Funding

This study was not funded by any external funding source.

Author information

Corresponding author

Correspondence to Mohammad S. Rahman.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Cite this article

Rahman, M.S., Nicholson, A.E. & Haffari, G. Inferring Two-Level Hierarchical Gaussian Graphical Models to Discover Shared and Context-Specific Conditional Dependencies from High-Dimensional Heterogeneous Data. SN COMPUT. SCI. 1, 218 (2020). https://doi.org/10.1007/s42979-020-00224-w

Keywords

  • Context-specific dependencies
  • Minimum message length
  • Hierarchical Gaussian graphical models