# Logistic regression and Ising networks: prediction and estimation when violating lasso assumptions


## Abstract

The Ising model was originally developed to model magnetisation of solids in statistical physics. As a network of binary variables with the probability of becoming ‘active’ depending only on direct neighbours, the Ising model appears appropriate for many other processes. For instance, it was recently applied in psychology to model co-occurrences of mental disorders. It has been shown that the connections between the variables (nodes) in the Ising network can be estimated with a series of logistic regressions. This naturally leads to questions of how well such a model predicts new observations and how well parameters of the Ising model can be estimated using logistic regressions. Here we focus on the high-dimensional setting with more parameters than observations and consider violations of assumptions of the lasso. In particular, we determine the consequences for both prediction and estimation when the sparsity and restricted eigenvalue assumptions are not satisfied. Using the idea of connected copies (extreme multicollinearity), we explain why prediction becomes better when either the sparsity or the restricted eigenvalue assumption is not satisfied. We illustrate these results with simulations.

## Keywords

Ising model · Bernoulli graphs · Variational inference

## 1 Introduction

Logistic regression models the ratio of success versus failure for binary variables. These models are convenient and useful in many situations. To name one example, in genome-wide association scans (GWAS), a combination of alleles on single nucleotide polymorphisms (SNPs) is either present or not (Cantor et al. 2010). Recently, it became clear that logistic regression can also be used to obtain estimates of connections in a binary network (e.g. Ravikumar et al. 2010; Bühlmann and van de Geer 2011). A particular version of a binary network is the Ising model, in which the probability of a node being ‘active’ is determined only by its direct neighbours (pairwise interactions only). The Ising model originated in statistical physics and was used to model magnetisation of solids (Kindermann et al. 1980; Cipra 1987; Baxter 2007), and was investigated extensively by Besag (1974) and Cressie (1993) and recently by Marsman et al. (2017), amongst others, in a statistical modelling context. Recently, the Ising model has also been applied to modelling networks of mental disorders (Borsboom et al. 2011; van Borkulo et al. 2014). The objective in models of psychopathology is to both explain and predict certain observations such as co-occurrences of disorders (comorbidity).

Here we focus on violations of the assumptions of lasso in logistic regression with high-dimensional data (more parameters than observations, \(p>n\)). In particular, we consider the consequences for both prediction and estimation when violating the assumptions of sparsity and restricted eigenvalues (multicollinearity). For sparse models and \(p>n\), it has been shown that statistical guarantees about the underlying network and its coefficients can be obtained with certain assumptions for Gaussian data (Meinshausen and Bühlmann 2006; Bickel et al. 2009; Hastie et al. 2015), for discrete data (Loh and Wainwright 2012), and for exponential family distributions (Bühlmann and van de Geer 2011). Specifically, Ravikumar et al. (2010) show that, under strong regularity conditions, using a series of regressions for the conditional probability in the Ising model (logistic regression), the correct structure (topology) of a network can be obtained in the high-dimensional setting.

In many practical settings, it is uncertain whether the assumptions of the lasso for accurate network estimation hold. Specifically, the assumptions of sparsity and the restricted eigenvalues (Bickel et al. 2009) are in many situations untestable. We therefore investigate here how estimation and prediction in Ising networks are affected by violating the sparsity and restricted eigenvalue assumptions. The setting of logistic regression and nodewise estimation of the Ising model parameters allows us to clearly determine how and why prediction and estimation are affected. We use the idea of connected nodes in a graph that are identical in the observations (and call them connected copies) to show why prediction is better for graph structures that violate the restricted eigenvalue or sparsity assumption. These connected copies represent the idea of extreme multicollinearity. One way to view connected copies is obtaining edge weights that lead to a network with perfect correlations between nodes (variables). We therefore compare in terms of prediction and estimation different situations where we violate the restricted eigenvalue or sparsity assumption based on different data generating processes. An example of a setting where near connected copies in networks are found is in high resolution functional magnetic resonance imaging. Here time series of contiguous voxels that are connected (also physically, see e.g. Johansen-Berg et al. 2004), are near exact copies of one another. The concept of connected copies allows us to determine the consequences for prediction loss, using the fact that subsets of connected copies do not change the risk or \(\ell _{1}\) norm. We also show that prediction loss is a lower bound for estimation error (in \(\ell _{1}\)) and so by consequence, if prediction loss increases, so does estimation error.

We first provide some background in Sect. 2 on the Ising model and its relation to logistic regression. To show the consequences of violating the assumptions of multicollinearity and sparsity, we discuss these assumptions at length in Sect. 3. We also show how they provide the statistical guarantees for the lasso (e.g. Negahban et al. 2012; Bühlmann and van de Geer 2011; Ravikumar et al. 2010). Then armed with these intuitions, we give in Sect. 4 some insight into the consequences for prediction and estimation when the sparsity or restricted eigenvalue assumption is violated. We also provide some simulations to confirm our results.

## 2 Logistic regression and the Ising model

Let *G* be a graph consisting of nodes in \(V=\{1,2,\ldots ,p\}\) and edges (*s*, *t*) in \(E\subseteq V\times V\). To each node \(s\in V\), a random variable \(X_{s}\) is associated with values in \(\{0,1\}\). The probability of each configuration *x* depends on a main effect (external field) and pairwise interactions. It is sometimes referred to as the auto logistic function (Besag 1974), or a pairwise Markov random field to emphasise that the parameter and sufficient statistic space are limited to pairwise interactions (Wainwright and Jordan 2008). Each \(x_{s}\in \{0,1\}\) has, conditional on all remaining variables (nodes) \(X_{\backslash s}\), probability of success \(\pi _{s}:=\mathbb {P}(X_{s}=1\mid x_{\backslash s})\), where \(x_{\backslash s}\) contains all nodes except *s*. Let \(\xi =(m,A)\) contain all parameters, where the \(p\times p\) matrix *A* contains the pairwise interaction strengths and the *p* vector *m* contains the main effects (external field). The distribution for configuration *x* of the Ising model is then

\[ \mathbb {P}(x)=\frac{1}{Z(\xi )}\exp \Big ( \sum _{s\in V}m_{s}x_{s}+\sum _{(s,t)\in E}A_{st}x_{s}x_{t}\Big ), \tag{1} \]

where \(Z(\xi )\) is the normalising constant. The conditional probability of node *s* given all remaining nodes is

\[ \pi _{s}=\frac{\exp \big (m_{s}+\sum _{t\in V\backslash \{s\}}A_{st}x_{t}\big )}{1+\exp \big (m_{s}+\sum _{t\in V\backslash \{s\}}A_{st}x_{t}\big )}, \tag{2} \]

so that the log odds are \(\mu _{\theta }=\log (\pi _{s}/(1-\pi _{s}))=m_{s}+\sum _{t\in V\backslash \{s\}}A_{st}x_{t}\), where we collect the *p* parameters \(m_{s}\) and \((A_{st}, t\in V\backslash \{s\})\) in the vector \(\theta\). Note that the log odds \(\theta \mapsto \mu _{\theta }\) is a linear function, and so if \(x=(1,x_{\backslash s})\) then \(\mu _{\theta }=x^\mathsf{T}\theta\). The theory of generalised linear models (GLM) can therefore immediately be applied to yield consistent and efficient estimators of \(\theta\) when sufficient observations are available, i.e. \(p<n\) (Nelder and Wedderburn 1972; Demidenko 2004). To obtain an estimate of \(\theta\) when \(p>n\), we require regularisation or another method (Hastie et al. 2015; Bühlmann et al. 2013).
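The link between the conditional probability of a node and logistic regression can be illustrated with a minimal Python sketch. It computes the linear log odds \(\mu _{\theta }=x^\mathsf{T}\theta\) and the resulting \(\pi _{s}\) for a single node. The function names and the toy parameter values are illustrative assumptions and do not come from the paper.

```python
import math

def log_odds(theta, x_rest):
    """Linear predictor mu = x^T theta with x = (1, x_rest).

    theta[0] plays the role of the main effect m_s and theta[1:] the
    interactions A_st with the remaining nodes (toy values, assumed).
    """
    x = [1.0] + list(x_rest)
    return sum(t * v for t, v in zip(theta, x))

def cond_prob(theta, x_rest):
    """Conditional probability pi_s = P(X_s = 1 | x_rest) = 1 / (1 + exp(-mu))."""
    return 1.0 / (1.0 + math.exp(-log_odds(theta, x_rest)))

theta = [-1.0, 0.5, 0.0, 2.0]   # hypothetical (m_s, A_s1, A_s2, A_s3)
x_rest = [1, 1, 0]              # states of the three other nodes
mu = log_odds(theta, x_rest)    # -1.0 + 0.5 + 0.0 = -0.5
pi = cond_prob(theta, x_rest)
```

Taking \(\log (\pi /(1-\pi ))\) of the returned probability recovers `mu`, which is the sense in which each node of the Ising model is a logistic regression on its neighbours.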

### 2.1 Nodewise logistic regression

Meinshausen and Bühlmann (2006) showed that for sparse models the true neighbourhood of a graph can be obtained with high probability by performing a series of conditional regressions with Gaussian random variables. For each node \(s\in V\), the set of nodes with nonzero \(A_{st}\) is determined, culminating in a neighbourhood for each node. Combining these results leads to the complete graph, even when the number of nodes *p* is much larger than the number of observations *n*. This is called neighbourhood selection, or nodewise regression. This idea was extended to Bernoulli (Ising) graphs by Ravikumar et al. (2010), but see also van de Geer (2011, chapters 6 and 13). Nodewise regression allows us to use standard logistic regression to determine the neighbourhood for each node. This framework, of course, comes at a cost, and two strong assumptions are required. We discuss these assumptions in Sect. 3.

To estimate the coefficients, Meinshausen and Bühlmann (2006) used a sequential regression procedure for Gaussian data where each node in turn is treated as the dependent variable and the remaining ones as independent variables. By repeating this analysis for all nodes in *V*, a total of \(p-1\) neighbourhood estimates of nonzero parameters are obtained for all nodes \(s\in V\). Since each node is considered twice, the estimates are often combined by either an *and*-rule, where an edge is obtained if \(\hat{A}_{st}\ne 0\) and \(\hat{A}_{ts}\ne 0\), or an *or*-rule, where either parameter estimate can be nonzero (Meinshausen and Bühlmann 2006).
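The *and*- and *or*-rules can be sketched in a few lines of Python. The sketch assumes the nodewise estimates are stored in a square list of lists `A_hat`, where row *s* holds the coefficients from the regression with node *s* as the dependent variable. This layout and the function name are assumptions made for illustration only.

```python
def combine_edges(A_hat, rule="and"):
    """Combine asymmetric nodewise estimates into an undirected edge set.

    A_hat[s][t] is the estimate of A_st from the regression of node s on the
    rest. With rule="and" both A_hat[s][t] and A_hat[t][s] must be nonzero;
    with rule="or" one nonzero estimate suffices.
    """
    p = len(A_hat)
    edges = set()
    for s in range(p):
        for t in range(s + 1, p):
            nz_st = A_hat[s][t] != 0
            nz_ts = A_hat[t][s] != 0
            keep = (nz_st and nz_ts) if rule == "and" else (nz_st or nz_ts)
            if keep:
                edges.add((s, t))
    return edges

# Toy estimates for three nodes (made-up numbers).
A_hat = [
    [0.0, 0.8, 0.0],
    [0.6, 0.0, 0.3],
    [0.0, 0.0, 0.0],
]
and_edges = combine_edges(A_hat, rule="and")
or_edges = combine_edges(A_hat, rule="or")
```

In this toy case the edge between nodes 1 and 2 is found in only one of the two regressions, so it survives the *or*-rule but not the more conservative *and*-rule.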

For the logistic regression of node *s*, the parameters in the *p* dimensional vector \(\theta\) are \(m_{s}\) for the intercept (external field) and \((A_{st},t\in V\backslash \{s\})\), representing the connectivity parameters for node *s* based on all remaining nodes \(V\backslash \{s\}\). Let the \(n\times p\) matrix \(X_{\backslash s}=(1_{n},X_{1}, \ldots ,X_{p})\) be the matrix with the vector of 1s in \(1_{n}\) and the remaining variables without \(X_{s}\). We write \(y_{i}\) for the observation \(x_{i,s}\) of node *s*, \(x_{i}=(1,x_{i,\backslash s})\) and \(\mu _{i}:=\mu _{i,\theta _{s}}(x_{i,\backslash s})\), leaving out the subscript *s* that indexes the node and using it only whenever circumstances demand it. Let the loss function be the negative log of the conditional probability \(\pi\) in (2), known as a pseudo log-likelihood (Besag 1974)

\[ \psi (y_{i},\mu _{i}):=-\log \mathbb {P}(y_{i}\mid x_{i,\backslash s})=-y_{i}\mu _{i}+\log (1+\exp (\mu _{i})). \tag{4} \]
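The logistic (pseudo log-likelihood) loss and quantities built from it can be sketched as follows. The sketch assumes that the empirical risk is the average loss over the *n* observations and that the lasso objective adds \(\lambda ||\theta ||_{1}\) to it. Both are standard forms assumed here because the corresponding displays are not shown in this section, and the function names are hypothetical.

```python
import math

def logistic_loss(y, mu):
    """Negative log conditional probability: -y*mu + log(1 + exp(mu))."""
    # Numerically stable evaluation of log(1 + exp(mu)).
    softplus = max(mu, 0.0) + math.log1p(math.exp(-abs(mu)))
    return -y * mu + softplus

def empirical_risk(ys, mus):
    """Average logistic loss over the observations (assumed form)."""
    return sum(logistic_loss(y, m) for y, m in zip(ys, mus)) / len(ys)

def lasso_objective(ys, X, theta, lam):
    """Empirical risk plus an l1 penalty on all of theta (assumed form)."""
    mus = [sum(t * v for t, v in zip(theta, row)) for row in X]
    return empirical_risk(ys, mus) + lam * sum(abs(t) for t in theta)

# Toy data: each row is (1, x_1, x_2); values are made up.
X = [[1, 0, 1], [1, 1, 1], [1, 1, 0]]
ys = [0, 1, 1]
value_at_zero = lasso_objective(ys, X, [0.0, 0.0, 0.0], lam=0.1)
```

At \(\theta =0\) every observation has \(\mu _{i}=0\) and loss \(\log 2\), so the objective equals \(\log 2\) regardless of the data, which is a convenient sanity check.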

Once the parameters are obtained it turns out that inference on network parameters is in general difficult with \(\ell _{1}\) regularisation (Pötscher and Leeb 2009). One solution is to desparsify it by adding a projection of the residuals (van de Geer et al. 2013; Javanmard and Montanari 2014; Zhang and Zhang 2014; Waldorp 2015), which is sometimes referred to as the desparsified lasso. Another type of inference is one where clusters of nodes obtained from the lasso are interpreted instead of individual nodes (Lockhart et al. 2014).

To generate a random graph, the *igraph* package in R was used with *erdos.renyi.game* (Csardi and Nepusz 2006). To generate data (\(n=50\) observations of the \(p=100\) nodes) from the Ising model, the package *IsingSampler* was used, and to obtain estimates of the parameters the package *IsingFit* was used (by Epskamp, see van Borkulo et al. 2014) in combination with the *and* rule.

The recall (true positive rate) for this example was 0.69 and the precision (positive predictive value) was 0.42. So we see that about 30% of the true edges are missing and about 60% of the estimated edges are incorrect. This is not surprising given that we have 4950 possible edges to determine and only 50 observations. (More details on the simulation are in Sect. 4.2.)
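Recall and precision for an estimated edge set can be computed as in the following sketch, which also reproduces the count of 4950 possible edges for \(p=100\). The edge sets are made-up toy data and the function name is hypothetical.

```python
def recall_precision(true_edges, est_edges):
    """Recall = |est & true| / |true|, precision = |est & true| / |est|."""
    tp = len(true_edges & est_edges)
    recall = tp / len(true_edges) if true_edges else float("nan")
    precision = tp / len(est_edges) if est_edges else float("nan")
    return recall, precision

def n_possible_edges(p):
    """Number of undirected edges among p nodes: p(p-1)/2."""
    return p * (p - 1) // 2

# Toy example: three true edges, four estimated edges, two in common.
true_edges = {(0, 1), (1, 2), (2, 3)}
est_edges = {(0, 1), (1, 2), (0, 3), (1, 3)}
recall, precision = recall_precision(true_edges, est_edges)
```

Here one minus recall is the fraction of true edges that were missed, and one minus precision is the fraction of estimated edges that are false.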

## 3 Assumptions for prediction and estimation

To determine the consequences of violating the assumptions of the lasso in logistic regression, we discuss the assumptions for accurate prediction and estimation. Both prediction and estimation require that the solution is sparse; informally, that the number of non-zero edges in the graph is relatively small (see Assumption 1 below). For accurate estimation we also require an assumption on the covariance between the nodes in the graph. Several types of assumptions have been proposed (see van de Geer et al. 2009, for an excellent overview and additional results on obtaining the lasso solution), but here we focus on the restricted eigenvalue assumption because of its direct connection to multicollinearity.

### 3.1 Sparsity

Central to lasso estimation is the assumption that the underlying problem is low dimensional (Bühlmann and van de Geer 2011; Giraud 2014). This is the assumption of sparsity. This is essential because whenever \(p>n\) there is no unique solution to the empirical risk \(R_{n,\psi }(\mu )\) defined in (6) (Wainwright 2009). Sparsity can be defined in different ways. The most common is a restriction on the number of nonzero edges, sometimes referred to as coordinate sparsity (Giraud 2014). Let \(S_{0}\) denote the support containing the indices of the nonzero coefficients, i.e., \(S_{0}:=\{j: \theta _{j}\ne 0\}\) and its size \(s_{0}=|S_{0}|\).

### Assumption 1

(Coordinate sparsity) The size \(s_{0}\) of the set of nonzero coefficients \(S_{0}\) in \(\theta ^{*}\) is of order \(o(\sqrt{n/\log p})\).

There are other forms of sparsity, such as the fused sparsity, where the support is defined as \(\{j:\theta _{j}-\theta _{j-1}\ne 0\}\). This ensures that there are relatively few jumps in, for instance, a piecewise continuous function (see Giraud 2014, for more details). Another form of sparsity is where the \(\ell _{1}\) size of the parameter vector \(\theta\) is restricted. We use this to show that prediction (classification) in logistic regression is accurate.

### Assumption 2

(\(\ell _{1}\)-sparsity) The \(\ell _{1}\) norm of the coefficients \(\theta ^{*}\) is of order \(o(\sqrt{n/\log p})\), i.e. \(||\theta ^{*}||_{1}=o(\sqrt{n/\log p})\).
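To get a feel for the scale in Assumptions 1 and 2, the quantity \(\sqrt{n/\log p}\) can be evaluated for the dimensions used in the simulations (\(n=50\), \(p=100\)) and set against the expected number of neighbours per node, \(p_{e}(p-1)\), in a random graph with edge probability \(p_{e}\), for the two extreme values used in Sect. 4.2. This is only a rough illustration, since the \(o(\cdot )\) statement is asymptotic and gives no sharp threshold. The function names are hypothetical.

```python
import math

def sparsity_scale(n, p):
    """The reference scale sqrt(n / log p) appearing in Assumptions 1 and 2."""
    return math.sqrt(n / math.log(p))

def expected_neighbours(p, p_edge):
    """Expected number of neighbours of a node in a random graph
    where each edge is present independently with probability p_edge."""
    return p_edge * (p - 1)

scale = sparsity_scale(50, 100)                # roughly 3.3
sparse_deg = expected_neighbours(100, 0.025)   # roughly 2.5 neighbours
dense_deg = expected_neighbours(100, 0.2)      # roughly 19.8 neighbours
```

With the smallest edge probability the expected neighbourhood size is below the reference scale, while with the largest it exceeds the scale several times over.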

### 3.2 Restricted eigenvalues

Next to sparsity, the second assumption for the lasso is related to the problem that when \(p>n\) the empirical risk \(R_{n,\psi }\) is not strongly convex and hence no unique solution is available. It turns out that we need to consider a subset of lasso estimation errors \(\delta =\hat{\theta }-\theta ^{*}\) such that strong convexity holds for that subset (Negahban et al. 2012).

Because we have \(p>n\) we cannot obtain strong convexity in general, and we need to relax the assumption. This is how we get to the restricted eigenvalue assumption. Let \(\nabla _{j}\psi (y_{i},x_{i}^\mathsf{T}\theta )\) be the first derivative with respect to \(\theta _{j}\) and \(\nabla ^{2}_{jj}\psi (y_{i},x_{i}^\mathsf{T}\theta )\) the second derivative with respect to \(\theta _{j}\). Then demanding strong convexity means that if we consider the \(s_{0}\times s_{0}\) submatrix \(\nabla ^{2}_{S_{0}}\psi _{n}(\theta )\) then we need that \(\nabla ^{2}_{S_{0}}\psi _{n}(\theta )\ge \gamma I\), where *I* is the identity matrix and we used \(\psi (\theta )\) instead of \(\psi (y,\mu )\) to emphasise dependence on \(\theta\) (and \(\mu =x^\mathsf{T}\theta\)). This we can never get (see the Appendix for more details on strong convexity). But from (10) it follows that if the directions of the lasso error \(\delta =\hat{\theta }-\theta ^{*}\) follow a cone shaped region with \(||\delta _{S_{0}^{c}}||_{1}\le \alpha ||\delta _{S_{0}}||_{1}\) (see Theorem 1 in the Appendix), then within these directions strong convexity holds. We refer to this set as \(\mathbb {C}_{\alpha }=\{\delta \in \mathbb {R}^{p}:||\delta _{S_{0}^{c}}||_{1}\le \alpha ||\delta _{S_{0}}||_{1}\}\). In the directions where the cone shape holds so that \(\delta \in \mathbb {C}_{\alpha }\), the loss function is strictly larger than 0, except at \(\delta =0\), but is flat and can be 0 if \(\delta \notin \mathbb {C}_{\alpha }\) (see Negahban et al. (2012) or Hastie et al. (2015) for an excellent discussion). This assumption is called the restricted eigenvalue assumption.
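Membership of an error vector in the cone \(\mathbb {C}_{\alpha }\) is a direct comparison of two \(\ell _{1}\) norms, as the following sketch shows. The error vectors and support set are toy values and the function name is hypothetical.

```python
def in_cone(delta, support, alpha=1.0):
    """Check ||delta_{S0^c}||_1 <= alpha * ||delta_{S0}||_1.

    delta is the error vector and support is the set S0 of indices of
    nonzero true coefficients.
    """
    on_support = sum(abs(d) for j, d in enumerate(delta) if j in support)
    off_support = sum(abs(d) for j, d in enumerate(delta) if j not in support)
    return off_support <= alpha * on_support

S0 = {0, 1}
inside = in_cone([1.0, -0.5, 0.2, 0.1], S0)             # error mostly on S0
outside = in_cone([0.1, 0.0, 1.0, 1.0], S0)             # error mostly off S0
wider = in_cone([1.0, 0.0, 2.0, 0.5], S0, alpha=3.0)    # wider cone
```

The first vector concentrates its mass on the true support and lies in the cone, whereas the second spreads most of its mass over coordinates outside \(S_{0}\) and does not.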

For all nodes *s* simultaneously, the lower bound \(\gamma _{G}>0\) is sufficient and \(\alpha =1\). We emphasise the nodewise estimation of all edges in *E* using \(\psi _{s}\) and \(\delta _{s}\).

### Assumption 3

The restricted eigenvalue assumption has been investigated in the context of Gaussian data (Bickel et al. 2009; Wainwright 2009; Raskutti et al. 2010; Hastie et al. 2015, chapter 11), in the setting of the Ising model (Ravikumar et al. 2010, Lemma 3), and in generalised linear models (van de Geer 2008; Bühlmann and van de Geer 2011, chapter 6). The original restricted eigenvalue assumption as presented in Bickel et al. (2009) is slightly stronger than the compatibility assumption of van de Geer et al. (2009). See van de Geer et al. (2009) for more details on the compatibility and other assumptions to bound estimation error in the lasso. Here we use the RE assumption because of its connection to multicollinearity, discussed in Sect. 4.

*S*. The RE assumption implies that the \(s_{0}\times s_{0}\) submatrix \(\nabla ^{2}_{S_{0}}\psi _{n}(\theta )\) indexed by \(S_{0}\) has smallest eigenvalue \(>0\). This can be seen as follows. RE implies that there is a \(\delta\) such that \(\delta _{S_{0}}\ne 0\), \(\delta _{S_{0}^{c}}=0\), implying \(||\delta _{S_{0}^{c}}||_{1}\le ||\delta _{S_{0}}||_{1}\), and \(\delta ^\mathsf{T}(\nabla ^{2}\psi _{n})\delta >0\). This implies that for some \(\gamma _{G}>0\)

The bounds on prediction and estimation are important to know the circumstances for the statistical guarantees. However, in many practical situations, we cannot be certain of the assumptions of sparsity and restricted eigenvalues. These assumptions cannot be checked. And so it becomes relevant to know what the consequences for prediction and estimation are when the assumptions are not satisfied. This is what we investigate next.

## 4 Violation of sparsity and restricted eigenvalues

If we violate either the sparsity or restricted eigenvalue assumption, then we would expect that lasso estimation error becomes worse, and indeed this happens. However, this is not so clear for prediction. In fact, it turns out that prediction becomes better for non-sparse models that violate the restricted eigenvalue (RE) assumption. Our main result is that violating the RE or sparsity assumption leads to a decrease in empirical risk, and hence in loss. The RE assumption is violated by an extreme case of multicollinearity, namely where some nodes are copies of other nodes. When such copies are connected we call them connected copies. In connected copies, the coefficients are proportional to the original ones, such that we do not arbitrarily change the data generating process. One way to view connected copies is to find multiplicative constants for the edge weights that lead to a network with perfect correlations between nodes. We therefore compare prediction and estimation of different situations where we violate the RE or sparsity assumption based on different data generating processes. Proposition 1 shows that the number of connected copies co-determines the decrease in empirical risk, and hence, violating the RE assumption leads to a decrease in risk. Next, in Corollary 1, we show that violating the sparsity assumption leads to either a decrease or increase of empirical risk depending on whether the set of coefficients in the different subsets of nodes is positive or negative, respectively. We illustrate the theoretical results with some simulations in Sect. 4.2.

### 4.1 Connected copies

Suppose that two nodes *s* and *t* are identical, i.e. \(x_{i,s}=x_{i,t}\) for all observations *i*. Then the coefficients obtained with the lasso using the quadratic approximation to the logistic loss in coordinate descent will be identical, i.e. \(\hat{\theta }_{s}=\hat{\theta }_{t}\) (Hastie et al. 2015, see also the Appendix for a discussion of the coordinate descent algorithm). This can be seen from the following considerations. By (13) we have that element (*s*, *s*) of the second derivative matrix is

\[ \nabla ^{2}_{ss}\psi _{n}=\frac{1}{n}\sum _{i=1}^{n}x_{i,s}^{2}\,\pi _{i}(1-\pi _{i}), \]

which is the same as element (*t*, *t*) since \(x_{s}=x_{t}\). Similarly, for the *s*th element \(\nabla _{s}\psi _{n}\), we obtain

\[ \nabla _{s}\psi _{n}=-\frac{1}{n}\sum _{i=1}^{n}x_{i,s}(y_{i}-\pi _{i}), \]

which is the same as the *t*th element. Let \((\nabla ^{2}_{jj}\psi ^{q}_{n})^{-1}\) denote element (*j*, *j*) of the inverse of the second order derivative matrix \(\nabla ^{2}\psi ^{q}\) for step *q* in the coordinate descent algorithm. Then we obtain in the coordinate descent algorithm \((\nabla ^{2}_{ss}\psi ^{q}_{n})^{-1}\nabla _{s}\psi ^{q}_{n}\) at each step *q* for both nodes *s* and *t*, implying that the coefficients are the same. So for each node in the nodewise regressions, we obtain a Fisher matrix where column *s* is the same as column *t*. Now if both *s* and *t* are in \(S_{0}\), then the smallest eigenvalue of \(\nabla ^{2}\psi _{n,S_{0}}\) is 0, and hence, the RE assumption is violated. We will use this idea of identical nodes to explain why prediction loss becomes better when we violate the RE assumption.
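The effect of two identical columns on the second derivative matrix can be checked numerically. The sketch below assumes the usual logistic-regression form \(\frac{1}{n}\sum _{i}\pi _{i}(1-\pi _{i})x_{i}x_{i}^\mathsf{T}\) for that matrix and uses made-up data in which columns 1 and 2 are identical. The quadratic form in the direction \(e_{s}-e_{t}\) is then zero, so the matrix has a zero eigenvalue. Function names and data are illustrative assumptions.

```python
def hessian(X, pis):
    """(1/n) * sum_i pi_i (1 - pi_i) x_i x_i^T (assumed logistic form)."""
    n, p = len(X), len(X[0])
    return [
        [sum(pis[i] * (1 - pis[i]) * X[i][j] * X[i][k] for i in range(n)) / n
         for k in range(p)]
        for j in range(p)
    ]

def quad_form(H, v):
    """The quadratic form v^T H v."""
    p = len(v)
    return sum(v[j] * H[j][k] * v[k] for j in range(p) for k in range(p))

# Columns: intercept, x_s, x_t (identical to x_s), x_u. Toy values.
X = [[1, 1, 1, 0], [1, 0, 0, 1], [1, 1, 1, 1], [1, 0, 0, 0]]
pis = [0.6, 0.3, 0.8, 0.4]          # arbitrary fitted probabilities
H = hessian(X, pis)

col_s = [row[1] for row in H]
col_t = [row[2] for row in H]
flat_direction = quad_form(H, [0, 1, -1, 0])   # direction e_s - e_t
```

Because the loss is flat along \(e_{s}-e_{t}\), moving weight from one copy to the other does not change the fit, which is the mechanism behind the violation of the RE assumption.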

We call a node *t* in the subset \(L\subset V\) a connected copy of \(s\in K=V\backslash L\) if \((s,t)\in E\) and \(x_{t}=x_{s}\). This says that two directly connected nodes are identical to each other for all *n* observations. Note that the coefficient between a connected copy and its original must be positive; if the coefficient was negative, then the connected copy would also have to be the reverse of its original, which cannot be true because the variables are defined to be identical. We know from estimation that if a node is a connected copy then the lasso solution is no longer unique (Hastie et al. 2015). In fact, if *t* is a connected copy of *s*, then all solutions with \(\alpha \hat{\theta }_{s}\) and \((1-\alpha )\hat{\theta }_{t}\), with \(0\le \alpha \le 1\) and \(\hat{\theta }_{s}\), \(\hat{\theta }_{t}\) are estimates of the parameters of nodes *s* and *t*, respectively, result in identical empirical risk \(R_{n,\psi }\) as when those connected copies have been deleted. Similarly, we will have the same \(\ell _{1}\) norm as when the connected copies have been deleted. As a consequence, we cannot distinguish between the situation with or without the connected copy in \(\ell _{1}\) optimisation. We denote by \(L_{t}\) the set of all connected copies \(s\in L_{t}\) of \(t\in K\), which defines an equivalence relation on *L*, such that \(L_{t}\cap L_{s}=\varnothing\) and \(\cup L_{t}=L\). We denote the parameter vector where the connected copies in *L* have been deleted by \(\theta _{\backslash L}\) and correspondingly \(\mu _{\backslash L}=x^\mathsf{T}_{\backslash L}\theta _{\backslash L}\).

### Lemma 1

In the Ising graph \(G=(V,E)\) suppose nodes in \(L\subset V\) are connected copies of nodes in \(K=V\backslash L\). Furthermore, the nodewise lasso solutions \(\hat{\theta }\) are obtained with (7) where for each connected copy \(t\in L_{s}\) of node \(s\in K\), with \(\alpha _{t}\hat{\theta }_{t}\), we have that \(\sum _{t\in L_{s}}\alpha _{t}=1\). Then the empirical risk \(R_{n,\psi }(\hat{\mu })\) and \(\ell _{1}\) norm of \(\hat{\theta }\) are the same as when the connected copies in *L* are deleted, i.e. \(R_{n,\psi }(\hat{\mu })=R_{n,\psi }(\hat{\mu }_{\backslash L})\) and \(||\hat{\theta }||_{1}=||\hat{\theta }_{\backslash L}||_{1}\).

So we have that the non-uniqueness of the lasso in case of a connected copy, results in the exact same value for the empirical risk whether we delete it or take any one of the weighted versions such that the coefficients sum to 1. Note that we do not change the underlying process in any arbitrary way; the nodes are connected and the coefficients remain proportional to the original ones. We immediately obtain that the size |*L*| of the set of connected copies co-determines the prediction loss. We obtain this result because the coefficients of the connected copies with respect to their originals are positive.
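The statement of Lemma 1 can be checked numerically on a toy example. The sketch below takes a design with columns (intercept, \(x_{s}\), \(x_{u}\)), adds a connected copy \(x_{t}=x_{s}\), and splits the coefficient of \(x_{s}\) into shares \(\alpha\) and \(1-\alpha\). The data, coefficients and function names are made up, and the empirical risk is assumed to be the average logistic loss.

```python
import math

def empirical_risk(X, y, theta):
    """Average logistic loss -y*mu + log(1 + exp(mu)) with mu = x^T theta."""
    total = 0.0
    for xi, yi in zip(X, y):
        mu = sum(t * v for t, v in zip(theta, xi))
        total += -yi * mu + math.log1p(math.exp(mu))
    return total / len(y)

def l1_norm(theta):
    return sum(abs(t) for t in theta)

# Design without the copy: columns are (intercept, x_s, x_u).
X_reduced = [[1, 1, 0], [1, 0, 1], [1, 1, 1], [1, 0, 0], [1, 1, 0]]
y = [1, 0, 1, 0, 1]
theta_reduced = [-0.5, 1.2, 0.7]

# Add the connected copy x_t = x_s and split the weight of x_s.
alpha = 0.3
X_full = [[r[0], r[1], r[1], r[2]] for r in X_reduced]
theta_full = [-0.5, alpha * 1.2, (1 - alpha) * 1.2, 0.7]

risk_full = empirical_risk(X_full, y, theta_full)
risk_reduced = empirical_risk(X_reduced, y, theta_reduced)
```

The two risks and the two \(\ell _{1}\) norms agree up to rounding for any \(0\le \alpha \le 1\). The equality of the norms relies on both shares having the same sign, which is why the positivity of the coefficient between a copy and its original matters.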

### Proposition 1

For the Ising graph, let \(L_{1}\) and \(L_{2}\) be subsets of connected copies of nodes in \(V\backslash (L_{1}\cup L_{2})\) such that \(L_{1}\subset L_{2}\) and hence \(|L_{1}|< |L_{2}|\). Then we have for the prediction loss that the sum of coefficients in \(L_{1}^{c}\cap L_{2}\) is \(>0\), and the risk \(R_{n,\psi }(\hat{\mu }_{\backslash L_{1}})\ge R_{n,\psi }(\hat{\mu }_{\backslash L_{2}})\).

This follows from Lemma 1 directly, since there we saw that the prediction loss including connected copies is equal to the prediction error when those connected copies are deleted. This idea explains why the empirical risk decreases as a function of an increasing number of connected copies.

The same idea can be used to determine why prediction becomes better for non-sparse sets. Proposition 1 can be altered such that a similar result holds for sparsity, where we do not need the connected copies. The only requirement is that we know what the sum of the coefficients is that are in the larger set of connected nodes, because the nodes need not be connected in this case. Let \(S_{a}\) be a set of nodes with a possibly non-sparse set of nonzero edges in the sense that \(|S_{a}|>O(\sqrt{n/\log p})\). Suppose that \(S_{0}\subset S_{a}\) so that \(|S_{0}|<|S_{a}|\).

### Corollary 1

- (1) if the sum of coefficients in \(S_{0}^{c}\cap S_{a}\) is \(>0\), then \(R_{n,\psi }(\hat{\mu }_{\backslash S_{0}})\ge R_{n,\psi }(\hat{\mu }_{\backslash S_{a}})\);
- (2) if the sum of coefficients in \(S_{0}^{c}\cap S_{a}\) is \(<0\), then \(R_{n,\psi }(\hat{\mu }_{\backslash S_{0}})\le R_{n,\psi }(\hat{\mu }_{\backslash S_{a}})\).

We see that by eliminating the requirement of connectedness, we find that prediction loss decreases given that the coefficients in the remaining set of non-zero coefficients are positive.

We focus here on prediction loss because by (11) we have that the \(\ell _{1}\) estimation error is larger than prediction loss (given that the penalty parameter \(\lambda\) is of the right order), and hence if we find that prediction loss becomes higher, it follows that \(\ell _{1}\) estimation error becomes larger.

The above presented ideas of violating the sparsity assumption or restricted eigenvalue assumption are confirmed by some numerical illustrations.

### 4.2 Numerical illustration

To show the effects of non-sparse underlying representations and violation of the restricted eigenvalue assumption (multicollinearity), we performed some simulation studies. Here 0–1 data were generated by a Metropolis–Hastings algorithm, implemented in the R package *IsingSampler* (van Borkulo et al. 2014), according to a random graph (Erdős–Rényi) with \(p=100\) nodes and \(n=50\) observations. All edge coefficients were positive, so that we expect the prediction error to improve with increasing collinearity. Sparsity of the graph was varied by the probability of an edge from \(p_{e}=0.025\), which complies with the sparsity assumption, to the probability of an edge of \(p_{e}=0.2\), which does not comply with the sparsity assumption. For interpretation we defined sparsity in these simulations as \(1-p_{e}\), so that high sparsity means few non-zero edges. Multicollinearity was induced by equating two columns of the data *X* if there was an edge in the edge set of the true graph, for a proportion \(\alpha\) of the edges ranging from 0 to 0.6. This ensured that the smallest \(\alpha s_{0}\) eigenvalues of the submatrix \(\nabla ^{2}\psi _{n,S_{0}}\) are 0, thereby violating the RE assumption.
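The way multicollinearity is induced, by equating the data columns of connected nodes, can be sketched in Python as follows. This is an illustrative reading of the description above and not the authors' simulation code. The function name, the random choice of edges and the toy data are assumptions, and edges that share a node would need extra care that this sketch does not attempt.

```python
import random

def equate_columns(X, edges, alpha, seed=0):
    """Copy column s onto column t for a proportion alpha of the edges (s, t).

    Returns a modified copy of X together with the chosen edges; the input
    X is left untouched.
    """
    rng = random.Random(seed)
    k = int(alpha * len(edges))
    chosen = rng.sample(sorted(edges), k)
    X_copy = [row[:] for row in X]
    for s, t in chosen:
        for row in X_copy:
            row[t] = row[s]
    return X_copy, chosen

# Toy data with four nodes and two disjoint true edges (made-up values).
X = [[1, 0, 1, 0], [0, 1, 1, 1], [1, 1, 0, 0]]
edges = {(0, 1), (2, 3)}
X_collinear, chosen = equate_columns(X, edges, alpha=0.5)
```

With \(\alpha =0.5\) and two edges, exactly one pair of connected nodes becomes a pair of identical columns, which creates one connected copy in the sense of Sect. 4.1.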

The coefficients for the main effects in *m* and for interactions in *A* were estimated by nodewise logistic regressions, implemented in *IsingFit* (by Epskamp, see van Borkulo et al. 2014). Here the extended Bayesian information criterion (EBIC) is used to determine the optimal \(\lambda\) for each logistic regression separately (Foygel and Drton 2013). This procedure was run 100 times and the averages across these runs (and nodes) are presented. We evaluated estimation accuracy by recall (\(|\hat{S}\cap S_{0}|/ |S_{0}|\)) and precision (\(|\hat{S}\cap S_{0}|/ |\hat{S}|\)). We also used a scaled \(\ell _{1}\) norm for the estimation error \(||\delta ||_{1}/u\), where \(\delta =\hat{\theta }-\theta ^{*}\) and *u* is the maximal value obtained. Prediction was evaluated by logistic loss \(\psi\) and Bayes loss \(\mathcal {C}\). We determine loss for data \(z_{i}\) independent from data \(y_{i}\), upon which the estimate \(\hat{\theta }\) is based (predictive risk).

Figure 2b shows that recovery of parameters is accurate when sparsity is high (few non-zero edges), but recovery becomes poor when sparsity does not hold; from sparsity 0.95 and lower. This is seen in all three measures: recall, precision and the scaled \(\ell _{1}\) norm. In contrast, the 0–1 loss from (8) and the logistic loss in (4) actually become better (the loss decreases) when the data generating process is no longer sparse, as can be seen in Fig. 2a. This corresponds to Corollary 1, which shows that sparsity is not necessary for accurate prediction. We do require that the penalty parameter \(\lambda\) is of the appropriate order (i.e. \(\lambda =O(\sqrt{\log p/n})\)); here \(\lambda\) was selected by the EBIC (Foygel and Drton 2013) which ensured such a penalty. The EBIC has an additional hyperparameter \(\gamma\) to control the impact of the size of the search domain; we set \(\gamma\) to 0.25 in line with the reasonable performance obtained in Foygel and Drton (2013). Prediction loss is high at high sparsity because in the simulation there are only about 2–3 edges, which means that prediction of other nodes is extremely difficult.
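Selection of \(\lambda\) by the EBIC can be sketched as below. The sketch assumes one common form of the criterion for a nodewise logistic regression, \(-2\ell +|J|\log n+2\gamma |J|\log (p-1)\), with \(\ell\) the log-likelihood and \(|J|\) the number of nonzero coefficients. Whether this matches the exact expression used in the simulations is an assumption, and the candidate fits are made-up numbers.

```python
import math

def ebic(loglik, n_nonzero, n, p, gamma=0.25):
    """Assumed EBIC form: -2*loglik + |J| log n + 2*gamma*|J| log(p - 1)."""
    return (-2.0 * loglik
            + n_nonzero * math.log(n)
            + 2.0 * gamma * n_nonzero * math.log(p - 1))

def select_lambda(candidates, n, p, gamma=0.25):
    """Pick the lambda with the smallest EBIC.

    candidates is a list of (lambda, loglik, n_nonzero) tuples, for example
    taken along a lasso regularisation path.
    """
    return min(candidates, key=lambda c: ebic(c[1], c[2], n, p, gamma))[0]

# Hypothetical path: smaller lambda fits better but uses more coefficients.
path = [(0.05, -20.0, 10), (0.1, -24.0, 4), (0.2, -34.0, 1)]
best_lambda = select_lambda(path, n=50, p=100, gamma=0.25)
middle_score = ebic(-24.0, 4, n=50, p=100, gamma=0.25)
```

Setting `gamma=0` reduces the criterion to the ordinary BIC, and larger values penalise the size of the search domain more heavily.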

Multicollinearity increases when more identical columns of *X* are present for connected nodes. This leads to more similar behaviour of connected nodes in the Ising network and hence to better prediction.

These results demonstrate that when either the sparsity assumption or the restricted eigenvalue (multicollinearity) assumption is violated, the prediction loss decreases, making prediction better, while estimation error increases. Hence, the estimated network that predicts well will not be similar to the true underlying network. On the other hand, if the assumptions of sparsity and RE hold, then many of the edges in the Ising model are estimated correctly but because of the high-dimensional setting many true edges are also missed. And since in sparse settings fewer edges are present that determine the prediction, prediction is poorer.

## 5 Discussion

Logistic regression is an appropriate tool for prediction and estimation of parameters of the Ising model. Statistical guarantees have been given for prediction and estimation of the parameters of the Ising model using a sequence of logistic regressions whenever at least the assumptions of sparsity and restricted eigenvalues hold. Here we focused on violations of these assumptions and showed why prediction becomes better whenever sparsity or restricted eigenvalues do not hold. Intuitively, for prediction the underlying structure of the graph is irrelevant, and when nodes behave similarly, prediction becomes easier. To confirm these intuitions we showed, using connected copies, that prediction loss can decrease as a function of multicollinearity and sparsity. When multicollinearity increases or sparsity decreases, then prediction loss decreases. By consequence of the fact that prediction loss can be considered as a lower bound for estimation error (albeit not a tight bound), estimation error is seen to become worse (increase) as multicollinearity increases or sparsity decreases. Our simulations support these findings and additionally show that recovery in terms of precision and recall becomes worse when violating the assumptions of sparsity and restricted eigenvalues.

The concept of connected copies used here is of course an idealisation of reality. Connected copies can be seen as a way to compare prediction and estimation for different structures (topologies) of graphs, where a connected copy is an extreme case in which the correlation between two variables is 1. We required this idealisation to obtain the analytical results. In practice, we will not encounter \(x_{s}=x_{t}\) but \(x_{s}\approx x_{t}\). This case is much more difficult to treat analytically. In the case where \(x_{s}\approx x_{t}\) then the parameter estimates will not be equal and the result would depend on the exact differences in estimates. But if we suppose that the sign of all the coefficients is positive, say, then we would expect similar behaviour of the empirical risk based on the results of Proposition 1 and Corollary 1.

We showed here the consequences of violating the restricted eigenvalue (RE) and sparsity assumptions in the Ising model using logistic regression. A natural next step is to generalise these results to exponential family distributions. This will require additional restrictions such as the margin condition, which bridges the gap between estimation error and prediction loss. Because for logistic regression we have the linear functional \(\mu =\theta ^\mathsf{T}x\), we obtain a quadratic margin. For logistic regression, the margin condition then implies that \(||\hat{\mu }-\mu ^{*}||_{2}^{2} \ge \gamma ||\delta ||_{2}^{2}\), where \(\delta =\hat{\theta }-\theta ^{*}\), using strong convexity on \(\frac{1}{n}\sum _{i=1}^{n}x_{i}x_{i}^\mathsf{T}\). However, the margin condition does not hold in general, so additional assumptions are required (see Bühlmann and van de Geer 2011) before the current analysis of the consequences of violating RE and sparsity for estimation and prediction can be applied.
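The step from the linear predictor to the parameter error in the inequality above can be written out explicitly. This is a sketch under two assumptions that are not spelled out here: that the norm on the left-hand side is the empirical norm normalised by \(n\), and that \(\gamma >0\) bounds the quadratic form of \(\frac{1}{n}\sum _{i=1}^{n}x_{i}x_{i}^\mathsf{T}\) from below in the direction \(\delta\). With \(\hat{\mu }_{i}=\hat{\theta }^\mathsf{T}x_{i}\) and \(\mu ^{*}_{i}=\theta ^{*\mathsf{T}}x_{i}\),

\[
\frac{1}{n}\sum _{i=1}^{n}\left( \hat{\mu }_{i}-\mu ^{*}_{i}\right) ^{2}
=\frac{1}{n}\sum _{i=1}^{n}\left( x_{i}^\mathsf{T}\delta \right) ^{2}
=\delta ^\mathsf{T}\left( \frac{1}{n}\sum _{i=1}^{n}x_{i}x_{i}^\mathsf{T}\right) \delta
\ge \gamma ||\delta ||_{2}^{2}.
\]

If \(x_{s}=x_{t}\) and \(\delta\) is proportional to \(e_{s}-e_{t}\), the middle expression is zero, so no \(\gamma >0\) can satisfy the last inequality in that direction.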

## Notes

### Compliance with ethical standards

### Conflict of interest statement

On behalf of all authors, I declare that none of the authors has a conflict of interest.

## References

- Bartlett PL, Jordan MI, McAuliffe JD (2003) Large margin classifiers: convex loss, low noise, and convergence rates. In: NIPS
- Baxter RJ (2007) Exactly solved models in statistical mechanics. Courier Corporation
- Bertsimas D, Tsitsiklis J (1997) Introduction to linear optimization. Athena Scientific and Dynamic Ideas, Belmont
- Besag J (1974) Spatial interaction and the statistical analysis of lattice systems. J R Stat Soc Ser B (Methodol) 36(2):192–236
- Bickel PJ, Ritov Y, Tsybakov AB (2009) Simultaneous analysis of lasso and Dantzig selector. Ann Stat 37:1705–1732
- Borsboom D, Cramer AOJ, Schmittmann VD, Epskamp S, Waldorp LJ (2011) The small world of psychopathology. PLoS One 6(11):e27407
- Bousquet O, Boucheron S, Lugosi G (2004) Introduction to statistical learning theory. Advanced lectures on machine learning. Springer, Berlin, pp 169–207
- Boyd S, Vandenberghe L (2004a) Convex optimization. Cambridge University Press, Cambridge
- Boyd S, Vandenberghe L (2004b) Convex optimization. Cambridge University Press, Cambridge
- Brown L (1986) Fundamentals of statistical exponential families. Inst of Math Stat
- Bühlmann P, van de Geer S (2011) Statistics for high-dimensional data: methods, theory and applications. Springer, Berlin
- Bühlmann P et al (2013) Statistical significance in high-dimensional linear models. Bernoulli 19(4):1212–1242
- Cantor RM, Lange K, Sinsheimer JS (2010) Prioritizing GWAS results: a review of statistical methods and recommendations for their application. Am J Human Genet 86(1):6–22
- Cipra B (1987) An introduction to the Ising model. Am Math Mon 94(10):937–959
- Cressie N (1993) Statistics for spatial data. Wiley, Hoboken
- Csardi G, Nepusz T (2006) The igraph software package for complex network research. InterJournal Complex Systems 1695(5):1–9 http://igraph.org
- Demidenko E (2004) Mixed models: Theory and applications. Wiley, Hoboken
- Foygel R, Drton M (2013) Bayesian model choice and information criteria in sparse generalized linear models. University of Chicago, Tech. rep., Chicago
- Friedman J, Hastie T, Tibshirani R (2010) Regularization paths for generalized linear models via coordinate descent. J Stat Softw 33(1):1–22
- Giraud C (2014) Introduction to high-dimensional statistics, vol 138. CRC Press, Boca Raton
- Hastie T, Tibshirani R, Friedman J (2001) The elements of statistical learning. Springer-Verlag, New York
- Hastie T, Tibshirani R, Wainwright M (2015) Statistical learning with sparsity: the lasso and generalizations. CRC Press, Boca Raton
- Javanmard A, Montanari A (2014) Confidence intervals and hypothesis testing for high-dimensional regression. Tech. rep., arXiv:1306.317
- Johansen-Berg H, Behrens TEJ, Robson MD, Drobnjak I, Rushworth MFS, Brady JM, Smith SM, Higham DJ, Matthews PM (2004) Changes in connectivity profiles define functionally distinct regions in human medial frontal cortex. Proc Natl Acad Sci USA 101(36):13335–13340
- Kindermann R, Snell JL et al (1980) Markov random fields and their applications, vol 1. American Mathematical Society, Providence
- Kolaczyk ED (2009) Statistical analysis of network data: methods and models. Springer, New York
- Lockhart R, Taylor J, Tibshirani RJ, Tibshirani R et al (2014) A significance test for the lasso. Ann Stat 42(2):413–468
- Loh P-L, Wainwright M (2012) High-dimensional regression with noisy and missing data: provable guarantees with nonconvexity. Ann Stat 40(3):1637–1664
- Marsman M, Waldorp L, Maris G (2017) A note on large-scale logistic prediction: Using an approximate graphical model to deal with collinearity and missing data. Behaviormetrika 44(2):513–534
- Meinshausen N, Bühlmann P (2006) High-dimensional graphs and variable selection with the lasso. Ann Stat 34(3):1436–1462
- Negahban SN, Ravikumar P, Wainwright MJ, Yu B (2012) A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Stat Sci 27(4):538–557
- Nelder JA, Wedderburn RWM (1972) Generalized linear models. J R Stat Soc Ser A (Gen) 135(3):370–384
- Pötscher BM, Leeb H (2009) On the distribution of penalized maximum likelihood estimators: The lasso, SCAD, and thresholding. J Multivar Anal 100(9):2065–2082
- Raskutti G, Wainwright MJ, Yu B (2010) Restricted eigenvalue properties for correlated Gaussian designs. J Mach Learn Res 11:2241–2259
- Ravikumar P, Wainwright M, Lafferty J (2010) High-dimensional Ising model selection using \(\ell_1\)-regularized logistic regression. Ann Stat 38(3):1287–1319
- van Borkulo CD, Borsboom D, Epskamp S, Blanken TF, Boschloo L, Schoevers RA, Waldorp LJ (2014) A new method for constructing networks from binary data. Scientific Reports 4
- van de Geer S, Bühlmann P, Ritov Y (2013) On asymptotically optimal confidence regions and tests for high-dimensional models. arXiv preprint arXiv:1303.0518
- Van de Geer SA (2008) High-dimensional generalized linear models and the lasso. Ann Stat 36:614–645
- van de Geer SA, Bühlmann P et al (2009) On the conditions used to prove oracle results for the lasso. Electron J Stat 3:1360–1392
- Venkatesh S (2013) The theory of probability. Cambridge University Press, Cambridge
- Wainwright MJ (2009) Sharp thresholds for high-dimensional and noisy sparsity recovery using \(\ell_1\)-constrained quadratic programming (lasso). IEEE Trans Inform Theory 55(5):2183–2202
- Wainwright MJ, Jordan MI (2008) Graphical models, exponential families, and variational inference. Found Trends Mach Learn 1(1–2):1–305
- Waldorp L (2015) Testing for graph differences using the desparsified lasso in high-dimensional data. (submitted)
- Young G, Smith R (2005) Essentials of statistical inference. Cambridge University Press, Cambridge
- Zhang C-H, Zhang SS (2014) Confidence intervals for low dimensional parameters in high dimensional linear models. J R Stat Soc 76(1):217–242

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.