1 Introduction

The missing data (or incomplete data) problem, characterized by the absence of values for one or more variables in a dataset, is a major impediment to both theoretical and empirical research and leaves no branch of experimental science untouched. The vast literature on missing data problems in fields as diverse as computer science, geology, archeology, biology, statistics and epidemiology attests to both its extent and its pervasiveness [8, 12, 15, 32]. Simply ignoring the problem by deleting all tuples with missing values will, in most cases, significantly distort the outcome of a study, regardless of the size of the dataset [1, 6].

Existing methods of dealing with missing data, such as the Expectation Maximization (EM) algorithm and Multiple Imputation, are based on the theoretical work of Rubin [27] and Little and Rubin [28], who formulated conditions under which the damage of missingness would be minimized. However, theoretical guarantees are provided only for the subset of problems falling into the Missing At Random (MAR) category, thereby leaving the vast space of Missing Not At Random (MNAR) problems relatively unexplored.

In this paper we view missingness from a causal perspective and take the following steps to answer questions pertaining to consistent estimation of queries of interest. Given an incomplete dataset, our first step is to postulate a model based on causal assumptions about the underlying data generation process. Our second step is to determine whether the data reject the postulated model, by testing the identifiable testable implications of that model. Our third and final step, which is also the primary focus of this paper, is to determine from the postulated model whether any method exists that produces consistent estimates of the queries of interest. A negative answer confirms the presence of a theoretical impediment to estimation; in other words, a bias is inevitable.

2 Missingness Graphs

Missingness graphs, as discussed below, were first defined in [17], and we adopt the same notation. Let \(G (\mathbb {V},E) \) be the causal DAG where \(\mathbb {V}=V\cup U \cup V^* \cup \mathbb {R}\). V is the set of observable nodes; nodes in the graph correspond to variables in the data set. U is the set of unobserved nodes (also called latent variables). E is the set of edges in the DAG. We use bi-directed edges as a shorthand notation to denote the existence of a U variable that is a common parent of two variables in \(V \cup \mathbb {R}\). V is partitioned into \({V}_{o}\) and \({V_m}\) such that \(V_o \subseteq V\) is the set of variables that are observed in all records in the population and \(V_m \subseteq V\) is the set of variables that are missing in at least one record. Variable X is termed fully observed if \(X \in V_o\), partially observed if \(X \in V_m\), and substantive if \(X \in V_o \cup V_m\). Associated with every partially observed variable \(V_i \in V_m\) are two other variables \(R_{v_i}\) and \(V_i^*\), where \(V_i^*\) is a proxy variable that is actually observed, and \(R_{v_i}\) represents the status of the causal mechanism responsible for the missingness of \(V_i^*\); formally,

$$\begin{aligned} v_i^* = f(r_{v_i}, v_i) = \begin{cases} v_i & \quad \text{if } r_{v_i}=0\\ m & \quad \text{if } r_{v_i}=1 \end{cases} \end{aligned}$$
(1)

\(V^*\) is the set of all proxy variables and \(\mathbb {R}\) is the set of all causal mechanisms that are responsible for missingness. R variables may not be parents of variables in \(V \cup U\). We call this graphical representation a missingness graph (or m-graph). An example of an m-graph is given in Fig. 1. We use the following shorthand. For any variable X, let \(X'\) be shorthand for \(X=0\). For any set \(W\subseteq V_m\cup V_o\cup R\), let \(W_r\), \(W_o\) and \(W_m\) be shorthand for \(W \cap R\), \(W \cap V_o\) and \(W \cap V_m\), respectively. Let \(R_w\) be shorthand for \(R_{V_m \cap W}\), i.e., \(R_w\) is the set containing the missingness mechanisms of all partially observed variables in W. Note that \(R_w\) and \(W_r\) are not the same. \(G_{\underline{X}}\) and \(G_{\overline{X}}\) represent the graphs formed by removing from G all edges leaving and entering X, respectively.

A manifest distribution \(P(V_o,V^*,R)\) is the distribution that governs the available dataset. An underlying distribution \(P(V_o,V_m,R)\) is said to be compatible with a given manifest distribution \(P(V_o,V^*,R)\) if the latter can be obtained from the former using Eq. 1. Formally, a manifest distribution \(P_m\) is compatible with a given underlying distribution \(P_u\) if for every \(X \subseteq V_m\) and \(Y= V_m {\setminus } X\) the following equality holds:

$$\begin{aligned} P_m(R'_x, R_y, X^*,Y^*,V_o)&= P_u(R'_x, R_y, X,V_o) \end{aligned}$$

where \(R'_x\) denotes \(R_x=0\) and \(R_y\) denotes \(R_y=1\).
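
To make Eq. 1 concrete, the following is a minimal sketch of how a manifest dataset arises from an underlying one. All variable names and parameters are illustrative, and the out-of-range value m is encoded as NaN; only the manifest columns would be available to the analyst.

```python
import numpy as np
import pandas as pd

# Sketch of Eq. 1: the proxy V_i* equals the underlying V_i when R_{v_i} = 0
# and takes the out-of-range value m (here NaN) when R_{v_i} = 1.
rng = np.random.default_rng(0)
n = 8
v = rng.integers(0, 2, n)      # underlying values of a partially observed V_i
r_v = rng.integers(0, 2, n)    # missingness mechanism: R_{v_i} = 1 masks V_i
v_star = np.where(r_v == 0, v.astype(float), np.nan)  # the observed proxy V_i*

underlying = pd.DataFrame({"V": v, "R_v": r_v})          # governs P(V_o, V_m, R)
manifest = pd.DataFrame({"V_star": v_star, "R_v": r_v})  # governs P(V_o, V*, R)
print(manifest)
```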

3 Recoverability

Given a manifest distribution \(P(V^*,V_o, R)\) and an m-graph G that depicts the missingness process, a query Q is recoverable if we can compute a consistent estimate of Q as if no data were missing. Formally,

Definition 1

(Recoverability). Given an m-graph G and a target relation Q defined on the variables in V, Q is said to be recoverable in G if there exists an algorithm that produces a consistent estimate of Q for every dataset D such that P(D) is (1) compatible with G and (2) strictly positive, i.e., \(P(V_o,V^*,\mathbb {R})>0\).

For an introduction to the notion of recoverability, see [17, 20].

Fig. 1. An m-graph depicting MAR-category missingness

3.1 Recovering from MCAR and MAR Data

Examine the m-graph in Fig. 1, in which X is the treatment and Y is the outcome. Let us assume that some patients who underwent treatment are unlikely to report the outcome; hence the arrow \(X \rightarrow R_y\). Under these circumstances, can we recover P(X,Y)?

From the manifest distribution, we can compute \(P(X,Y^*,R_y)\). From the m-graph G, we see that \(Y^*\) is a collider and X is a fork. Hence, by d-separation, \(Y\bot \!\!\!\bot R_y|X\). Thus

$$\begin{aligned} P(X,Y)&=P(Y|X)\, P(X) \\&= P(Y|X,R_y=0)\, P(X) \quad (\text{using } Y \bot \!\!\!\bot R_y|X) \\&= P(Y^*|X,R_y=0)\, P(X) \quad (\text{using Eq. 1}) \end{aligned}$$

Since both factors in the estimand are estimable from the manifest distribution, P(X,Y) is recoverable.

The scenario discussed above is a typical instance of the Missing At Random (MAR) category. When data are MAR, we have \(\mathbb {R}\bot \!\!\!\bot V_m | V_o\). Therefore \(P(V)=P(V_m|V_o)P(V_o)=P(V_m|V_o, R=0)P(V_o)\); in other words, the joint distribution P(V) is recoverable given MAR data. Estimation methods applicable to MAR data are applicable to MCAR data as well because, by the weak union axiom of graphoids, Missing Completely At Random (MCAR: \((V_m, V_o) \bot \!\!\!\bot R\)) implies Missing At Random (MAR: \(V_m \bot \!\!\!\bot R|V_o\)). It therefore follows that queries (such as the joint distribution and identifiable causal effects) that are recoverable given MAR datasets are recoverable given MCAR datasets as well.
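
The derivation above translates directly into an estimation procedure. The following sketch simulates data compatible with Fig. 1 (the structural parameters are our own illustrative choices) and contrasts the recovered estimate of P(X,Y) with the biased complete-case estimate:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200_000
x = rng.binomial(1, 0.4, n)                        # treatment
y = rng.binomial(1, np.where(x == 1, 0.7, 0.3))    # outcome, caused by X
r_y = rng.binomial(1, np.where(x == 1, 0.5, 0.1))  # treated patients under-report
y_star = np.where(r_y == 0, y, -1)                 # proxy Y*; -1 encodes 'm'
df = pd.DataFrame({"X": x, "Y_star": y_star, "R_y": r_y})

def recovered_p_xy(df, xv, yv):
    """P(x, y) = P(Y* = yv | X = xv, R_y = 0) * P(X = xv), per Sec. 3.1."""
    sub = df[(df.X == xv) & (df.R_y == 0)]
    return (sub.Y_star == yv).mean() * (df.X == xv).mean()

cc = df[df.R_y == 0]                               # complete cases only
for xv in (0, 1):
    for yv in (0, 1):
        naive = ((cc.X == xv) & (cc.Y_star == yv)).mean()  # listwise deletion
        truth = ((x == xv) & (y == yv)).mean()
        print(xv, yv, round(recovered_p_xy(df, xv, yv), 3),
              round(naive, 3), round(truth, 3))
```

The recovered column tracks the ground truth, while listwise deletion does not.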

4 Recoverability Procedures for MNAR Data

Data that are neither MAR nor MCAR fall into the Missing Not At Random (MNAR) category. In this section we detail, with examples, three distinct recovery procedures.

4.1 Sequential Factorization

Consider an observational study that measured the variables X, Y, W and Z, where we wish to estimate the effect of treatment (X) on outcome (Y). The interactions between the variables and the underlying missingness process are depicted in Fig. 2. We notice that all variables are corrupted by missing values. The least bothersome missingness is that of Y, which is caused by a random process such as accidental deletion of cases; the most troubling missingness is that of W, which is caused by its own underlying value: a typical example is very rich and very poor people being reluctant to reveal their income.

Fig. 2. MNAR model in which P(Y|do(x)) is recoverable by sequential factorization

Recovering the Causal Effect of X on Y: By the backdoor criterion [19], we have two admissible sets, \(\{Z\}\) and \(\{W\}\), which yield the following estimands, respectively:

$$\begin{aligned} P(y|do(x))&= \sum _z P(y|x,z) P(z) \\&= \sum _w P(y|x,w) P(w) \end{aligned}$$

We choose the first estimand over the second because the latter contains P(W), which we know to be non-recoverable [17]. Therefore, to recover the causal effect we have to recover both P(y|x,z) and P(z).

Recovering P(z): In order to d-separate Z from \(R_z\), one needs to condition on X, and to d-separate X from \(R_x\), one needs to condition on Y. Therefore, we can write:

$$\begin{aligned} P(z)&= \sum _{x,y}P(z,x,y)\nonumber \\&=\sum _{x,y}P(z|x,y)P(x|y) P(y)\\&= \sum _{x,y}P(z|x,y,R_x=0,R_y=0,R_z=0)P(x|y,R_x=0,R_y=0) P(y|R_y=0)\nonumber \\&\text{(Using } Z \bot \!\!\!\bot (R_{z},R_x,R_y)|(X,Y)\text{, } X \bot \!\!\!\bot (R_x,R_y)|Y \text{ and } Y \bot \!\!\!\bot R_y\text{, } \text{ respectively) }\nonumber \\&= \sum _{x,y}P(z^*|x^*,y^*,R_x=0,R_y=0,R_z=0)P(x^*|y^*,R_x=0,R_y=0) P(y^*|R_y=0)\nonumber \end{aligned}$$
(2)

In the process of recovering P(z) we have in fact recovered P(x,y,z); it follows that P(y|x,z) is recoverable. Finally, the causal effect may be recovered as:

$$\begin{aligned} P(y|do(x)) = \sum _z&\frac{P(z^*|x^*,y^*,R_x=0,R_y=0,R_z=0)\,P(x^*|y^*,R_x=0,R_y=0)\, P(y^*|R_y=0)}{\sum _y P(z^*|x^*,y^*,R_x=0,R_y=0,R_z=0)\,P(x^*|y^*,R_x=0,R_y=0)\, P(y^*|R_y=0)} \\&\times \sum _{x,y}P(z^*|x^*,y^*,R_x=0,R_y=0,R_z=0)\,P(x^*|y^*,R_x=0,R_y=0)\, P(y^*|R_y=0) \end{aligned}$$
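
As a sketch of how Eq. 2 could be evaluated on data, the function below estimates P(z) from a dataframe containing only proxy columns and missingness indicators. The column conventions (X_star, Y_star, Z_star, R_x, R_y, R_z) and the assumption of discrete X and Y are ours:

```python
import itertools
import pandas as pd

def recover_p_z(df: pd.DataFrame, zv, x_vals=(0, 1), y_vals=(0, 1)) -> float:
    """Estimate P(Z = zv) via Eq. 2 using only proxies and R indicators."""
    p = 0.0
    for xv, yv in itertools.product(x_vals, y_vals):
        d_y = df[df.R_y == 0]
        p_y = (d_y.Y_star == yv).mean()               # P(y* | R_y=0)
        d_xy = d_y[(d_y.R_x == 0) & (d_y.Y_star == yv)]
        p_x_y = (d_xy.X_star == xv).mean()            # P(x* | y*, R_x=0, R_y=0)
        d_zxy = d_xy[(d_xy.R_z == 0) & (d_xy.X_star == xv)]
        p_z_xy = (d_zxy.Z_star == zv).mean()          # P(z* | x*, y*, R=0)
        p += p_z_xy * p_x_y * p_y
    return p
```

The same three factors, multiplied without the outer sum, estimate P(x,y,z); normalizing over y yields P(y|x,z), and the displayed causal-effect estimand follows by summation over z.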

Recovery Procedure: Given an m-graph with no edges between R variables, a sufficient condition for the recoverability of a query Q is that it be decomposable into sub-queries of the form P(Y|X) such that \(Y \bot \!\!\!\bot (R_x,R_y)|X\). This recovery procedure, called sequential factorization (generalized in Theorem 1 below), is sensitive to the ordering of variables in the factorization, which in turn is dictated by the graph. For instance, in Eq. 2, had we factorized P(x,y,z) as P(y|x,z) P(x|z) P(z), we would not have been permitted to insert the R terms in any factor.

Recovering in the Presence of Edges Between R Variables: A quick inspection reveals that the factorization in Eq. 2 guarantees recoverability even when an edge \(R_x \rightarrow R_z\) is added. However, adding the (reversed) edge \(R_z \rightarrow R_x\) would require conditioning on \(R_z\) and Y to d-separate X from \(R_x\). The procedure for recovering the marginal distribution P(Z) is presented below:

$$\begin{aligned} P(z)&= \sum _{x,y,r_z}P(z,x,y,r_z)\nonumber \\&=\sum _{x,y,r_z}P(z|x,y,r_z)P(x|y,r_z) P(y|r_z)P(r_z)\\&= \sum _{x,y}P(z|x,y,R_x=0,R_y=0,r_z=0)\sum _{r_z}P(x|y,R_x=0,R_y=0,r_z) P(y|R_y=0,r_z)P(r_z)\nonumber \\&\text{(Using } Z \bot \!\!\!\bot (R_{z},R_x,R_y)|(X,Y)\text{, } X \bot \!\!\!\bot (R_x,R_y)|(Y,R_z) \text{ and } Y \bot \!\!\!\bot R_y|R_z\text{, } \text{ respectively) }\nonumber \end{aligned}$$
(3)

The following definition and theorem from [18] formalize the preceding recovery procedure.

Definition 2

(General Ordered Factorization). Given a graph G and a set O of ordered \(V \cup R\) variables \(Y_1<Y_2 < \ldots < Y_k\), a general ordered factorization relative to G, denoted by f(O), is a product of conditional probabilities \(f(O)= \prod _i P(Y_i|X_i)\) where \(X_i \subseteq \{Y_{i+1}, \ldots , Y_k\}\) is a minimal set such that \(Y_i\bot \!\!\!\bot (\{Y_{i+1}, \ldots , Y_k\}{\setminus } X_i)|X_i\) holds in G.

Theorem 1

(Sequential Factorization). A sufficient condition for recoverability of a relation Q defined over substantive variables is that Q be decomposable into a general ordered factorization, or a sum of such factorizations, such that every factor \(Q_i=P(Y_i | X_i)\) satisfies, (1) \(Y_i \bot \!\!\!\bot (R_{y_i}, R_{x_i}) | X_i{\setminus } \{R_{y_i}, R_{x_i}\}\), if \(Y_i \in (V_o \cup V_m)\) and (2) \(R_z \bot \!\!\!\bot R_{X_i}|X_i \) if \(Y_i=R_z\) for any \(Z \in V_m\), \(Z \notin X_i\) and \(X_r \cap R_{X_m}=\emptyset \).

Fig. 3. MNAR models in which P(X,Y) and P(X,Y,Z) are recoverable

4.2 R-Factorization

Consider the model in Fig. 3(a), in which missingness in X is caused by Y and vice versa. This type of missingness model is called entangled because, in order to d-separate either variable from its missingness mechanism, one needs to condition on the other. Factorizing P(x,y) as P(x|y)P(y) or P(y|x)P(x) does not satisfy the sequential factorization criterion, since neither \(X \bot \!\!\!\bot (R_x,R_y)|Y\) nor \(Y \bot \!\!\!\bot (R_x,R_y)|X\) holds in the graph. This deadlock can, however, be disentangled by the following method:

$$\begin{aligned} P(X,Y)&= P(X,Y) \frac{P(R_x=0,R_y=0|X,Y)}{P(R_x=0,R_y=0|X,Y)} \\&= \frac{P(R_x=0,R_y=0) P(X,Y|R_x=0,R_y=0)}{P(R_x=0,R_y=0|X,Y)}\\&= \frac{P(R_x=0,R_y=0) P(X,Y|R_x=0,R_y=0)}{P(R_x=0|Y,R_y=0)P(R_y=0|X,R_x=0)}\\&\text{(using } R_x \bot \!\!\!\bot (R_y,X) |Y \text{ and } R_y \bot \!\!\!\bot (R_x,Y)|X\text{) }\\&=\frac{P(R_x=0,R_y=0) P(X^*,Y^*|R_x=0,R_y=0)}{P(R_x=0|Y^*,R_y=0)P(R_y=0|X^*,R_x=0)}\\ \end{aligned}$$
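
A direct translation of this derivation into an estimator might look as follows; the column conventions are the same as in the earlier sketches, and every quantity used is computable from the manifest distribution:

```python
import pandas as pd

def r_factorization_p_xy(df: pd.DataFrame, xv, yv) -> float:
    """Sketch of the entangled-model recovery of P(X = xv, Y = yv), Fig. 3(a)."""
    complete = df[(df.R_x == 0) & (df.R_y == 0)]
    p_r = len(complete) / len(df)                     # P(R_x=0, R_y=0)
    p_xy_c = ((complete.X_star == xv) & (complete.Y_star == yv)).mean()
    d_y = df[(df.R_y == 0) & (df.Y_star == yv)]
    p_rx_y = (d_y.R_x == 0).mean()                    # P(R_x=0 | y*, R_y=0)
    d_x = df[(df.R_x == 0) & (df.X_star == xv)]
    p_ry_x = (d_x.R_y == 0).mean()                    # P(R_y=0 | x*, R_x=0)
    return p_r * p_xy_c / (p_rx_y * p_ry_x)
```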

The following theorem generalizes this recovery procedure:

Theorem 2

(R-factorization). Given an m-graph G with no edges between the R variables and no latent variables as parents of R variables, a necessary and sufficient condition for recovering the joint distribution P(V) is that no variable X be a parent of its own missingness mechanism \(R_{X}\). Moreover, when recoverable, P(V) is given by

$$\begin{aligned} P(v)=\frac{P(R=0,v)}{\prod _i P(R_i=0|pa^o_{r_i},pa^m_{r_i}, R_{Pa^m_{r_i}}=0)}, \end{aligned}$$
(4)

where \(Pa^o_{r_i}\subseteq V_o\) and \(Pa^m_{r_i}\subseteq V_m\) are the parents of \(R_i\).

Interestingly, given a model in which R variables are connected by an edge, we sometimes have to use a combination of sequential factorization and R-factorization. Examine the model in Fig. 3(b). The query of interest is the joint distribution P(x,y,z), and the recovery procedure, inspired by Theorem 2, follows:

$$\begin{aligned} P(x,y,z)&= \frac{P(x,y,z,r_x=0,r_y=0,r_z=0)}{P(r_x=0|y)P(r_z=0|x,r_x=0)P(r_y=0|z,r_x=0,r_z=0)} \end{aligned}$$

In order to recover \(P(r_x=0|y)\) we rely on sequential factorization as shown below:

$$\begin{aligned} P(y,r_x)&= \sum _{x,z} P(x,y,z,r_x)\\&=\sum _{x,z}\frac{P(x,y,z,r_x,r_z=0,r_y=0)}{P(r_z=0|x,r_x=0)P(r_y=0|z,r_x,r_z=0)}\\&=\sum _{x,z}\frac{P(x|y,z,r_x=0,r_z=0,r_y=0)P(y,z,r_x,r_z=0,r_y=0)}{P(r_z=0|x,r_x=0)P(r_y=0|z,r_x,r_z=0)}\\&\text{(using } X \bot \!\!\!\bot R_x|(Y,Z,R_y,R_z)\text{, i.e., sequential factorization)} \end{aligned}$$

Recoverability of \(P(y,r_x)\) implies that \(P(r_x=0|y)\) is recoverable. Hence the joint distribution P(x,y,z) is recoverable given Fig. 3(b).

Fig. 4. (a) MNAR model in which the joint distribution is recoverable; (b) mutilated model corresponding to (a), obtained by intervening on Z

4.3 Interventional Factorization

Consider the model in Fig. 4. Let the query of interest be P(w,x,y,z). We first factorize P(w,x,y,z) in a manner similar to R-factorization:

$$\begin{aligned} P(w,x,y,z)&= \frac{P(w,x,y,z,r_x=0,r_y=0)}{P(r_x=0|r_y=0,y,z)P(r_y=0|x,z)} \end{aligned}$$

The recovery of the joint distribution depends on the recovery of \(P(r_y=0|x,z)\). We notice that

$$\begin{aligned} P(R_y|do(Z=z),X)&= P(R_y |Z=z, X) \quad \text{(using rule 2 of do-calculus)} \end{aligned}$$

The interventional distribution can be computed as given below:

$$\begin{aligned} P(x^*,y^*,w,r_x,r_y|do(z))&= \frac{P(x^*,y^*,w,r_x,r_y,z)}{P(z|w)} \nonumber \\ P(r_y,x^*,r_x |do(z))&= \sum _{w,y^*}P(x^*,y^*,w,r_x,r_y|do(z)) \end{aligned}$$
(5)

In order to recover \(P(r_y=0|x,z)\), we will recover \(P(x,r_y|do(z))\) and express it in terms of proxy variables.

$$\begin{aligned} P(x,r_y|do(z))&= P(x|r_y,r_x=0, do(z)) P(r_y|do(z))\nonumber \\&= P(x^*|r_y,r_x=0, do(z)) P(r_y|do(z)) \end{aligned}$$
(6)

Each factor in Eq. 6 can be computed from the interventional distribution derived in Eq. 5.
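
The computation in Eq. 5 is a truncated factorization and can be sketched with inverse-probability weights, as below. We assume discrete variables, that W is the sole parent of Z (as the division by P(z|w) suggests), and that W and Z are fully observed; the column names are our own convention:

```python
import pandas as pd

def interventional_weights(df: pd.DataFrame, zv) -> pd.Series:
    """Per-record weights whose sum over any configuration of
    (X*, Y*, W, R_x, R_y) estimates its probability under do(Z = zv)."""
    p_z_given_w = df.groupby("W")["Z"].transform(lambda s: (s == zv).mean())
    w = (df["Z"] == zv) / p_z_given_w   # truncated factorization: drop P(z | w)
    return w / len(df)

# e.g., the factor P(R_y = 0, X* = 1, R_x = 0 | do(Z = zv)) needed for Eq. 6:
# wts = interventional_weights(df, zv)
# p = wts[(df.R_y == 0) & (df.X_star == 1) & (df.R_x == 0)].sum()
```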

A general algorithm incorporating all three recovery procedures in a slightly more relaxed setting is discussed in [26].

Fig. 5. MNAR models in which the joint distribution is not recoverable. Variables denoted by L serve as candidates for auxiliary variables.

5 Recourses to Non-recoverability

The joint distribution is not recoverable given the m-graphs in Fig. 5 [18]. In this section we show how auxiliary variables and external data can be utilized to aid recoverability.

Auxiliary variables are variables that are ancillary to the substantive research questions but are potential correlates of the missingness mechanisms or of the partially observed variables [6]. However, as noted in [29], not all variables satisfying this criterion may be used as auxiliary variables.

Selection Criteria for Auxiliary Variables: First, an auxiliary variable should not be a collider, or a descendant of a collider, on the path from a partially observed variable to its missingness mechanism. For example, in Fig. 5(b), neither Y nor its descendants may serve as auxiliary variables when recovering P(X). Second, in the presence of an inducing path between X and \(R_x\), as shown in Fig. 5(c), the ideal auxiliary variables are the latent variables \(L_1\) and \(L_2\); conditioning on either of these will d-separate X from \(R_x\) and facilitate the recovery of P(X).

Recovery Aided by External Data: It is often the case that incorporating data from external sources can aid recovery. For example, consider a manifest distribution in which age is a partially observed variable. The distribution of age for a given population may be readily available from an external agency such as the census bureau. The question we ask is how such data can be combined with the existing missing dataset to recover a query of interest.

Consider Fig. 5(a), and suppose the query of interest is P(X,Y). P(Y|X) is recoverable by sequential factorization. If we obtain P(X) from an external source, then P(x,y) may be recovered as \(P(y|x^*,r_x=0) P(x)\). In Fig. 5(b), however, both P(Y) and P(X) are recoverable; if we can obtain either P(y|x) or P(x|y) from an external source, then P(x,y) can be recovered.
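
A sketch of the Fig. 5(a) recovery, assuming (as the estimand above suggests) that Y is fully observed; `p_x_external` is a hypothetical externally supplied table, e.g. a census age distribution:

```python
import pandas as pd

def recover_p_xy_external(df: pd.DataFrame, p_x_external: dict, xv, yv) -> float:
    """Combine the recoverable P(y | x, R_x=0) with an external marginal P(X)."""
    d = df[(df.R_x == 0) & (df.X_star == xv)]
    p_y_given_x = (d.Y == yv).mean()    # P(y | x*, R_x = 0)
    return p_y_given_x * p_x_external[xv]
```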

6 Perils of Model Blind Recovery Procedures

Model-blind algorithms attempt to handle missing data problems on the basis of the data alone, without making any assumptions about the structure of the missingness process. We unveil a fundamental limitation of model-blind algorithms by presenting two statistically indistinguishable models such that a given query is recoverable in one and non-recoverable in the other.

Fig. 6. Statistically indistinguishable graphs: (a) P(X,Y) is recoverable, (b) P(X,Y) is not recoverable, (c) P(X) is recoverable

The two graphs in Fig. 6(a) and (b) cannot be distinguished by any statistical means, since Fig. 6(a) has no testable implications [16] and Fig. 6(b) is a complete graph. However, in Fig. 6(a), \(P(X,Y)=P(X^*|Y,R_x=0)P(Y)\) is recoverable, while in Fig. 6(b) P(X,Y) is not recoverable (by Theorem 2 in [17]).

An even stronger limitation is demonstrated below. We show that no model-blind algorithm exists even in those cases where recoverability is feasible. We exemplify this claim by constructing two statistically indistinguishable models, \(G_1\) and \(G_2\), dictating different estimation procedures \(E_1\) and \(E_2\), respectively; yet Q is not recoverable in \(G_1\) by \(E_2\) or in \(G_2\) by \(E_1\).

Consider the graphs in Fig. 6(a) and (c); they are statistically indistinguishable since neither has testable implications. Let the target relation of interest be \(Q=P (X) \). In Fig. 6(a), Q may be estimated as \(P (X) = \sum _y P (X|Y,R_x=0) P (Y) \) since \(X \bot \!\!\!\bot R_x|Y\), and in Fig. 6(c), Q may be derived as \(P (X) = P (X|R_x=0) \) since \(X \bot \!\!\!\bot R_x\).
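
The following sketch makes the danger concrete: data are generated from the model of Fig. 6(a) (with parameters of our own choosing), and the estimator that is correct for Fig. 6(c) is visibly biased, even though no statistical test can tell the two graphs apart:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 500_000
y = rng.binomial(1, 0.5, n)
x = rng.binomial(1, np.where(y == 1, 0.8, 0.2))    # Y -> X, per Fig. 6(a)
r_x = rng.binomial(1, np.where(y == 1, 0.6, 0.1))  # Y -> R_x, per Fig. 6(a)
df = pd.DataFrame({"Y": y, "X_star": np.where(r_x == 0, x, -1), "R_x": r_x})

# E_1, correct for Fig. 6(a): P(X=1) = sum_y P(X=1 | y, R_x=0) P(y)
e1 = sum((df[(df.Y == yv) & (df.R_x == 0)].X_star == 1).mean() * (df.Y == yv).mean()
         for yv in (0, 1))
# E_2, correct only for Fig. 6(c): P(X=1) = P(X=1 | R_x=0)
e2 = (df[df.R_x == 0].X_star == 1).mean()
print(e1, e2, (x == 1).mean())   # e1 tracks the truth; e2 is biased here
```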

7 Related Work

Deletion-based methods such as listwise deletion, which are easy to understand as well as to implement, guarantee consistent estimates only for certain categories of missingness, such as MCAR [24]. The Maximum Likelihood (ML) method is known to yield consistent estimates under the MAR assumption; the expectation maximization algorithm and gradient-based algorithms are widely used for searching for ML estimates under incomplete data [4, 5, 10, 11]. Most work in machine learning assumes MAR and proceeds with ML or Bayesian inference. However, there are exceptions, such as recent work on collaborative filtering and recommender systems, which develops probabilistic models that explicitly incorporate the missing data mechanism [13–15].

Other methods for handling missing data can be classified into two categories: (a) inverse probability weighted methods and (b) imputation-based methods [23]. Inverse probability weighting methods analyze and assign weights to complete records based on estimated probabilities of completeness [22, 32]. Imputation-based methods substitute a reasonable guess for each missing value [1]; Multiple Imputation [12] is an imputation method that is less sensitive to a bad starting point.

Missing data is a special case of coarsened data, and data are said to be coarsened at random (CAR) if the coarsening mechanism is only a function of the observed data [9]. [21] introduced a methodology for parameter estimation from data structures in which the full data has a non-zero probability of being fully observed; this methodology was later extended to deal with censored data in which complete data on subjects are never observed [31].

The use of graphical models for handling missing data is a relatively new development. [3] used graphical models for analyzing missing information in the form of missing cases (due to sample selection bias). Attrition, a common occurrence in longitudinal studies, arises when subjects drop out of the study; [7, 25, 30] analyzed the problem of attrition using causal graphs. [27, 28] cautioned the practitioner that, contrary to popular belief (as stated in [2, 6]), not all auxiliary variables reduce bias. Both [7] and [28] associate missingness with a single variable, leaving interactions among several missingness mechanisms unexplored.

[17] employed a formal representation called Missingness Graphs to depict the missingness process, defined the notion of recoverability and derived conditions under which queries would be recoverable when datasets are categorized as Missing Not At Random (MNAR). Tests to detect misspecifications in the m-graph are discussed in [16].

8 Conclusions

This chapter presented the missing data problem from a causal perspective and provided procedures for estimating queries of interest from datasets falling into the MNAR (Missing Not At Random) category. We demonstrated how auxiliary variables and data from external sources can be used to circumvent theoretical impediments to recoverability. Finally, we showed that model-blind recovery techniques such as Multiple Imputation are prone to error and insufficient to guarantee consistent estimates.