Abstract
This paper applies graph-based causal inference procedures to the problem of recovering information from missing data. We establish conditions under which recoverability is permitted or prohibited. Where theoretical impediments to recoverability exist, we develop graph-based procedures that use auxiliary variables and external data to overcome them. We demonstrate the perils of model-blind recovery procedures, both in determining whether a query is recoverable and in choosing an estimation procedure when recoverability holds.
1 Introduction
The missing data (or incomplete data) problem, characterized by the absence of values for one or more variables in a dataset, is a major impediment to both theoretical and empirical research and leaves no branch of experimental science untouched. The vast literature on missing data problems in fields as diverse as computer science, geology, archeology, biology, statistics and epidemiology attests to its extent and pervasiveness [8, 12, 15, 32]. Simply ignoring the problem by deleting all tuples with missing values will, in most cases, significantly distort the outcome of a study, regardless of the size of the dataset [1, 6].
Existing methods of dealing with missing data, such as the Expectation-Maximization (EM) algorithm and Multiple Imputation, are based on the theoretical work of Rubin [27] and Little and Rubin [28], who formulated conditions under which the damage of missingness would be minimized. However, theoretical guarantees are provided only for the subset of problems falling into the Missing At Random (MAR) category, thereby leaving the vast space of Missing Not At Random (MNAR) problems relatively unexplored.
In this paper we view missingness from a causal perspective and take the following steps to answer questions pertaining to consistent estimation of queries of interest. Given an incomplete dataset, our first step is to postulate a model based on causal assumptions about the underlying data generation process. Our second step is to determine whether the data reject the postulated model, using the testable implications of that model. Our third and final step, which is also the primary focus of this paper, is to determine from the postulated model whether any method exists that produces consistent estimates of the queries of interest. A negative answer confirms the presence of a theoretical impediment to estimation; in other words, a bias is inevitable.
2 Missingness Graphs
Missingness graphs, as discussed below, were first defined in [17], and we adopt the same notation. Let \(G (\mathbb {V},E) \) be the causal DAG where \(\mathbb {V}=V\cup U \cup V^* \cup \mathbb {R}\). V is the set of observable nodes. Nodes in the graph correspond to variables in the data set. U is the set of unobserved nodes (also called latent variables). E is the set of edges in the DAG. We use bi-directed edges as a shorthand notation to denote the existence of a U variable as a common parent of two variables in \(V \cup \mathbb {R}\). V is partitioned into \({V}_{o}\) and \({V_m}\) such that \(V_o \subseteq V\) is the set of variables that are observed in all records in the population and \(V_m \subseteq V\) is the set of variables that are missing in at least one record. Variable X is termed fully observed if \(X \in V_o\), partially observed if \(X \in V_m\) and substantive if \(X \in V_o \cup V_m\). Associated with every partially observed variable \(V_i \in V_m\) are two other variables \(R_{v_i}\) and \(V_i^*\), where \(V_i^*\) is a proxy variable that is actually observed, and \(R_{v_i}\) represents the status of the causal mechanism responsible for the missingness of \(V_i^*\); formally,
\(V_i^* = \begin{cases} V_i & \text{if } R_{v_i}=0 \\ \text{m (missing)} & \text{if } R_{v_i}=1. \end{cases}\)
\(V^*\) is the set of all proxy variables and \(\mathbb {R}\) is the set of all causal mechanisms that are responsible for missingness. R variables may not be parents of variables in \(V \cup U\). We call this graphical representation a missingness graph (or m-graph). An example of an m-graph is given in Fig. 1. We use the following shorthand. For any variable X, let \(X'\) be a shorthand for \(X=0\). For any set \(W\subseteq V_m\cup V_o\cup R\), let \(W_r\), \(W_o\) and \(W_m\) be shorthand for \(W \cap R\), \(W \cap V_o\) and \(W \cap V_m\), respectively. Let \(R_w\) be a shorthand for \(R_{V_m \cap W}\), i.e., \(R_w\) is the set containing the missingness mechanisms of all partially observed variables in W. Note that \(R_w\) and \(W_r\) are not the same. \(G_{\underline{X}}\) and \(G_{\overline{X}}\) represent graphs formed by removing from G all edges leaving and entering X, respectively.
A manifest distribution \(P(V_o,V^*,R)\) is the distribution that governs the available dataset. An underlying distribution \(P(V_o,V_m,R)\) is said to be compatible with a given manifest distribution \(P(V_o,V^*,R)\) if the latter can be obtained from the former using Eq. 1. A manifest distribution \(P_m\) is compatible with a given underlying distribution \(P_u\) if for every \(X \subseteq V_m\) and \(Y= V_m {\setminus } X\), the following equality holds:
\(P(V_o, X^*=x, Y^*=\text{m}, R'_x, R_y) = \sum _{y} P(V_o, X=x, Y=y, R'_x, R_y) \qquad (1)\)
where \(R'_x\) denotes \(R_x=0\) and \(R_y\) denotes \(R_y=1\).
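As a concrete illustration of how a manifest distribution arises from an underlying one (a sketch with hypothetical numbers, not taken from the paper), the snippet below collapses an underlying joint over one partially observed binary variable X and its mechanism \(R_x\) into the manifest \(P(X^*, R_x)\), masking X with the token "m" whenever \(R_x=1\):

```python
# Illustration: deriving a manifest distribution P(X*, R_x) from an
# underlying distribution P(X, R_x) for one partially observed binary X.
# The proxy X* equals X when R_x = 0 and the placeholder "m" when R_x = 1.

# Underlying joint P(X, R_x) as a dictionary (hypothetical probabilities).
underlying = {
    (0, 0): 0.30,  # X=0, observed
    (1, 0): 0.40,  # X=1, observed
    (0, 1): 0.10,  # X=0, masked
    (1, 1): 0.20,  # X=1, masked
}

def manifest_from_underlying(p_underlying):
    """Collapse the underlying joint into the manifest P(X*, R_x)."""
    manifest = {}
    for (x, r_x), p in p_underlying.items():
        x_star = x if r_x == 0 else "m"   # proxy value per the m-graph definition
        manifest[(x_star, r_x)] = manifest.get((x_star, r_x), 0.0) + p
    return manifest

m = manifest_from_underlying(underlying)
# Cells with R_x = 0 survive untouched; cells with R_x = 1 merge into ("m", 1),
# which is exactly the summation over the masked values in Eq. 1.
print(m)
```

Note that the two underlying cells with \(R_x=1\) become indistinguishable in the manifest, which is precisely the information loss that recoverability analysis must contend with.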
3 Recoverability
Given a manifest distribution \(P(V^*,V_o, R)\) and an m-graph G that depicts the missingness process, query Q is recoverable if we can compute a consistent estimate of Q as if no data were missing. Formally,
Definition 1
(Recoverability). Given an m-graph G and a target relation Q defined on the variables in V, Q is said to be recoverable in G if there exists an algorithm that produces a consistent estimate of Q for every dataset D such that P(D) is (1) compatible with G and (2) strictly positive, i.e., \(P(V_o,V^*,\mathbb {R})>0\).
For an introduction to the notion of recoverability, see [17, 20].
3.1 Recovering from MCAR and MAR Data
Examine the m-graph in Fig. 1, in which X is the treatment and Y is the outcome. Let us assume that some patients who underwent treatment are not likely to report the outcome, hence the arrow \(X \rightarrow R_y\). Under these circumstances, can we recover P(X, Y)?
From the manifest distribution, we can compute \(P (X,Y^*,R_y) \). From the m-graph G, we see that \(Y^*\) is a collider and X is a fork. Hence by d-separation, \(Y\bot \!\!\!\bot R_y|X\). Thus
\(P(X,Y) = P(Y|X)P(X) = P(Y|X,R'_y)P(X) = P(Y^*|X,R'_y)P(X).\)
Since both factors in the estimand are estimable from the manifest distribution, P(X, Y) is recoverable.
The scenario discussed above is a typical instance of data that are Missing At Random (MAR), for which \(\mathbb {R}\bot \!\!\!\bot V_m | V_o\). Therefore \(P(V)=P(V_m|V_o)P(V_o)=P(V_m|V_o, R=0)P(V_o)\); in other words, the joint distribution P(V) is recoverable given MAR data. Estimation methods applicable to MAR data are applicable to MCAR data as well because, by the weak union axiom of graphoids, Missing Completely At Random (MCAR: \((V_m, V_o) \bot \!\!\!\bot R\)) implies Missing At Random (MAR: \(V_m \bot \!\!\!\bot R|V_o\)). It follows that queries (such as the joint distribution and identifiable causal effects) that are recoverable given MAR datasets are recoverable given MCAR datasets as well.
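The recovery in Fig. 1 can be sketched in a small simulation. The structure (X causes Y, and X causes the missingness of Y) follows the text; the numerical parameters are assumptions chosen purely for illustration. The naive complete-case estimate of P(Y) is biased, while the estimate licensed by \(Y\bot \!\!\!\bot R_y|X\) recovers the truth:

```python
import random

random.seed(0)
N = 100_000

# Hypothetical parameters for the m-graph of Fig. 1: X -> Y and X -> R_y
# (treated patients, X = 1, are less likely to report the outcome Y).
rows = []
for _ in range(N):
    x = 1 if random.random() < 0.5 else 0
    y = 1 if random.random() < (0.8 if x else 0.3) else 0
    r_y = 1 if random.random() < (0.6 if x else 0.1) else 0  # 1 = Y missing
    rows.append((x, y, r_y))

cc = [r for r in rows if r[2] == 0]  # complete cases (Y observed)

# Naive complete-case estimate of P(Y=1): biased, because missingness of Y
# depends on X, which also drives Y.
naive = sum(r[1] for r in cc) / len(cc)

# Recovered estimate  P(Y=1) = sum_x P(Y=1 | x, R_y=0) P(x),
# licensed by  Y _||_ R_y | X  in the m-graph.
recovered = 0.0
for xv in (0, 1):
    obs_x = [r for r in cc if r[0] == xv]
    p_y_given_x = sum(r[1] for r in obs_x) / len(obs_x)
    p_x = sum(1 for r in rows if r[0] == xv) / N
    recovered += p_y_given_x * p_x

truth = sum(r[1] for r in rows) / N  # oracle using the pre-masking values
print(naive, recovered, truth)
```

With these (assumed) parameters the naive estimate is off by roughly ten percentage points, while the recovered estimate matches the oracle up to sampling noise.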
4 Recoverability Procedures for MNAR Data
Data that are neither MAR nor MCAR fall into the Missing Not At Random (MNAR) category. In this section we detail, with examples, three distinct recovery procedures.
4.1 Sequential Factorization
Consider an observational study that measured the variables X, Y, W and Z, where we wish to estimate the effect of treatment (X) on outcome (Y). The interactions between the variables and the underlying missingness process are depicted in Fig. 2. We notice that all variables are corrupted by missing values. The least bothersome missingness is that of Y, which is caused by a random process such as accidental deletion of cases; the most troubling missingness is that of W, which is caused by its own underlying value: a typical example is the case of very rich and very poor people being reluctant to reveal their incomes.
Recovering the Causal Effect of X on Y: By the backdoor criterion [19], we have two admissible sets, \(\{Z\}\) and \(\{W\}\), which yield the following estimands, respectively:
\(P(y|do(x)) = \sum _z P(y|x,z)P(z) \quad \text{and} \quad P(y|do(x)) = \sum _w P(y|x,w)P(w).\)
We choose the first estimand over the second because the latter contains P(W), which we know to be non-recoverable [17] (see Note 1). Therefore, to recover the causal effect we have to recover both P(y|x, z) and P(z).
Recovering P(z): In order to d-separate Z from \(R_z\), one needs to condition on X, and to d-separate X from \(R_x\) one needs to condition on Y. Therefore, we can write:
\(P(z) = \sum _{x,y} P(z|x,y)P(x|y)P(y) = \sum _{x,y} P(z|x,y,R'_z,R'_x,R'_y)\,P(x|y,R'_x,R'_y)\,P(y|R'_y) \qquad (2)\)
In the process of recovering P(z) we have in fact recovered P(x, y, z); it follows that P(y|x, z) is recoverable. Finally, the causal effect may be recovered as:
\(P(y|do(x)) = \sum _z P(y|x,z)P(z).\)
Recovery Procedure: Given an m-graph with no edges between R variables, a sufficient condition for recoverability of a query Q is that it be decomposable into sub-queries of the form P(Y|X) such that \(Y \bot \!\!\!\bot (R_x,R_y)|X\). This recovery procedure, called sequential factorization (generalized in Theorem 1 below), is sensitive to the ordering of variables in the factorization, which in turn is dictated by the graph. For instance, had we factorized P(x, y, z) in Eq. 2 as P(y|x, z) P(x|z) P(z), we would not have been permitted to insert the R terms in any factor.
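Sequential factorization can be exercised on a small simulated MNAR chain. The graph below (Y causes X, X causes Z, with \(R_y\) random, \(R_x\) driven by Y, and \(R_z\) driven by X) and its parameters are assumptions for illustration, chosen so that the factorization P(z|x,y)P(x|y)P(y) satisfies the criterion factor by factor:

```python
import random

random.seed(1)
N = 200_000

# Hypothetical MNAR chain (an illustration, not the exact model of Fig. 2):
# Y -> X -> Z, with R_y random, Y -> R_x and X -> R_z. Then Y _||_ R_y,
# X _||_ (R_x, R_y) | Y and Z _||_ (R_z, R_x, R_y) | X, Y, so
# P(z) = sum_{x,y} P(z|x,y) P(x|y) P(y) is recoverable factor by factor.
rows = []
for _ in range(N):
    y = 1 if random.random() < 0.4 else 0
    x = 1 if random.random() < (0.75 if y else 0.25) else 0
    z = 1 if random.random() < (0.9 if x else 0.2) else 0
    r_y = 1 if random.random() < 0.2 else 0                  # random deletion
    r_x = 1 if random.random() < (0.5 if y else 0.1) else 0  # driven by Y
    r_z = 1 if random.random() < (0.4 if x else 0.1) else 0  # driven by X
    rows.append((y, x, z, r_y, r_x, r_z))

obs_y = [r for r in rows if r[3] == 0]                  # Y observed
obs_xy = [r for r in rows if r[3] == 0 and r[4] == 0]   # Y and X observed
obs_xyz = [r for r in obs_xy if r[5] == 0]              # all three observed

est = 0.0  # sequential-factorization estimate of P(Z = 1)
for yv in (0, 1):
    # P(y) from records with Y observed (valid since R_y is random).
    p_y = sum(1 for r in obs_y if r[0] == yv) / len(obs_y)
    for xv in (0, 1):
        # P(x|y) from records with both Y and X observed.
        sel_x = [r for r in obs_xy if r[0] == yv]
        p_x_y = sum(1 for r in sel_x if r[1] == xv) / len(sel_x)
        # P(z|x,y) from fully observed records.
        sel_z = [r for r in obs_xyz if r[0] == yv and r[1] == xv]
        p_z_xy = sum(r[2] for r in sel_z) / len(sel_z)
        est += p_z_xy * p_x_y * p_y

truth = sum(r[2] for r in rows) / N  # oracle P(Z = 1) before masking
print(est, truth)
```

Each factor is estimated only from records in which its required variables and mechanisms are observed, mirroring the insertion of R terms into the factorization.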
Recovering in the presence of edges between R variables: A quick inspection reveals that the factorization in Eq. 2 guarantees recoverability even when an edge \(R_x \rightarrow R_z\) is added. However, addition of the (reversed) edge \(R_z \rightarrow R_x\) would require conditioning on \(R_z\) and Y to d-separate X from \(R_x\). The procedure for recovering the marginal distribution P(Z) is presented below:
The following definition and theorem, from [18], formalize the preceding recovery procedure.
Definition 2
(General Ordered Factorization). Given a graph G and a set O of ordered \(V \cup R\) variables \(Y_1<Y_2 < \ldots < Y_k\), a general ordered factorization relative to G, denoted by f(O), is a product of conditional probabilities \(f(O)= \prod _i P(Y_i|X_i)\) where \(X_i \subseteq \{Y_{i+1}, \ldots , Y_k\}\) is a minimal set such that \(Y_i\bot \!\!\!\bot (\{Y_{i+1}, \ldots , Y_k\}{\setminus } X_i)|X_i\) holds in G.
Theorem 1
(Sequential Factorization). A sufficient condition for recoverability of a relation Q defined over substantive variables is that Q be decomposable into a general ordered factorization, or a sum of such factorizations, such that every factor \(Q_i=P(Y_i | X_i)\) satisfies (1) \(Y_i \bot \!\!\!\bot (R_{y_i}, R_{x_i}) | X_i{\setminus } \{R_{y_i}, R_{x_i}\}\) if \(Y_i \in (V_o \cup V_m)\), and (2) \(R_z \bot \!\!\!\bot R_{X_i}|X_i \) if \(Y_i=R_z\) for any \(Z \in V_m\) with \(Z \notin X_i\) and \(X_r \cap R_{X_m}=\emptyset \).
4.2 R-Factorization
Consider the model in Fig. 3(a), in which missingness in X is caused by Y and vice versa. This type of missingness model is called entangled because, in order to d-separate either variable from its missingness mechanism, one needs to condition on the other. Factorizing P(x, y) as P(x|y)P(y) or P(y|x)P(x) does not satisfy the sequential factorization criterion, since neither \(X \bot \!\!\!\bot (R_x,R_y)|Y\) nor \(Y \bot \!\!\!\bot (R_x,R_y)|X\) holds in the graph. This deadlock can however be disentangled by the following method:
\(P(x,y) = \dfrac{P(x,y,R'_x,R'_y)}{P(R'_x|y,R'_y)\,P(R'_y|x,R'_x)},\)
in which every term on the right-hand side is estimable from the manifest distribution.
The following theorem generalizes this recovery procedure:
Theorem 2
(R-factorization). Given an m-graph G with no edges between the R variables and no latent variables as parents of R variables, a necessary and sufficient condition for recovering the joint distribution P(V) is that no variable X be a parent of its own missingness mechanism \(R_{X}\). Moreover, when recoverable, P(V) is given by
\(P(v) = \dfrac{P(v, R=0)}{\prod _i P(R_i=0 \mid pa^o_{r_i}, pa^m_{r_i}, R_{Pa^m_{r_i}}=0)},\)
where \(Pa^o_{r_i}\subseteq V_o\) and \(Pa^m_{r_i}\subseteq V_m\) are the parents of \(R_i\).
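The entangled model of Fig. 3(a) makes a compact test case for R-factorization. In the sketch below, the structure (X causes Y, Y drives \(R_x\), X drives \(R_y\)) follows the text, while the numerical parameters are assumptions for illustration; the joint is recovered by dividing the fully observed cell probabilities by the two mechanism factors, each of which is estimable from the incomplete data:

```python
import random

random.seed(2)
N = 300_000

# Entangled model in the spirit of Fig. 3(a) (parameters are illustrative):
# X -> Y, with Y -> R_x and X -> R_y. Theorem 2 then yields
#   P(x,y) = P(x,y,R_x=0,R_y=0) / ( P(R_x=0|y,R_y=0) * P(R_y=0|x,R_x=0) ),
# and every term on the right is estimable from the incomplete data.
rows = []
for _ in range(N):
    x = 1 if random.random() < 0.6 else 0
    y = 1 if random.random() < (0.7 if x else 0.3) else 0
    r_x = 1 if random.random() < (0.4 if y else 0.1) else 0   # 1 = X missing
    r_y = 1 if random.random() < (0.3 if x else 0.05) else 0  # 1 = Y missing
    rows.append((x, y, r_x, r_y))

def frac(sel, pred):
    """Fraction of rows in sel satisfying pred."""
    return sum(1 for r in sel if pred(r)) / len(sel)

both = [r for r in rows if r[2] == 0 and r[3] == 0]  # X and Y both observed
x_obs = [r for r in rows if r[2] == 0]               # X observed
y_obs = [r for r in rows if r[3] == 0]               # Y observed

est = {}
for xv in (0, 1):
    for yv in (0, 1):
        p_joint_r0 = len([r for r in both if r[0] == xv and r[1] == yv]) / N
        p_rx0 = frac([r for r in y_obs if r[1] == yv], lambda r: r[2] == 0)
        p_ry0 = frac([r for r in x_obs if r[0] == xv], lambda r: r[3] == 0)
        est[(xv, yv)] = p_joint_r0 / (p_rx0 * p_ry0)

truth = {(xv, yv): frac(rows, lambda r, a=xv, b=yv: r[0] == a and r[1] == b)
         for xv in (0, 1) for yv in (0, 1)}
print(est, truth)
```

The complete-case cell frequencies alone would be badly biased here; the division by the two R factors undoes exactly the distortion introduced by the entangled missingness.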
Interestingly, given a model in which R variables are connected by an edge, we sometimes have to use a combination of sequential factorization and R-factorization. Examine the model in Fig. 3(b). The query of interest is the joint distribution P(x, y, z), and the recovery procedure inspired by Theorem 2 follows:
In order to recover \(P(r_x=0|y)\) we rely on sequential factorization as shown below:
Recoverability of \(P(y,r_x)\) implies that \(P(r_x=0|y)\) is recoverable. Hence the joint distribution P(x, y, z) is recoverable given Fig. 3(b).
4.3 Interventional Factorization
Consider the model in Fig. 4, and let the query of interest be P(w, x, y, z). We first factorize P(w, x, y, z) in a manner similar to R-factorization:
The recovery of the joint distribution depends on the recovery of \(P(r_y=0|x,z)\). We notice that
The interventional distribution can be computed as given below:
In order to recover \(P(r_y=0|x,z)\), we will recover \(P(x,r_y|do(z))\) and express it in terms of proxy variables.
Each factor in Eq. 6 can be computed from the interventional distribution derived in Eq. 5.
A general algorithm incorporating all these three recovery procedures in a slightly more relaxed setting is discussed in [26].
5 Recourses to Non-recoverability
The joint distribution is not recoverable given the m-graphs in Fig. 5 [18]. In this section we show how auxiliary variables and external data can be utilized to aid recoverability.
Auxiliary variables are variables that are ancillary to the substantive research questions but are potential correlates of missingness mechanisms or partially observed variables [6]. However, as noted in [29], not all variables satisfying this criterion may be used as auxiliary variables.
Selection Criteria for Auxiliary Variables: First, an auxiliary variable should not be a collider, or a descendant of a collider, on the path from a partially observed variable to its missingness mechanism. For example, in Fig. 5(b) neither Y nor its descendants may serve as auxiliary variables while recovering P(X). Second, in the presence of an inducing path between X and \(R_x\), as shown in Fig. 5(c), the ideal auxiliary variables are the latent variables \(L_1\) or \(L_2\); conditioning on either of these will d-separate X from \(R_x\) and facilitate the recovery of P(X).
Recovery Aided by External Data: It is often the case that incorporating data from external sources can aid recovery. For example, consider a manifest distribution in which age is a partially observed variable. The distribution of age for a given population may be easily available from an external agency such as the census bureau. The question we ask is how such data can be combined with the existing missing dataset to recover a query of interest.
Consider Fig. 5(a), and suppose the query of interest is P(X, Y). P(Y|X) is recoverable by sequential factorization. If from an external source we obtain P(X), then P(y, x) may be recovered as \(P(y|x^*,r_x=0) P(x)\). In Fig. 5(b), by contrast, P(Y) and P(X) are recoverable; if we can obtain either P(y|x) or P(x|y) from an external source, then P(x, y) can be recovered.
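The external-data recovery for a Fig. 5(a)-style graph can be sketched as follows. The model (X causes Y, and X masks itself via \(R_x\), like income non-reporting) and all parameters are assumptions for illustration; P(X) is treated as supplied by an external source, and the joint is assembled as \(P(y|x^*, r_x=0)P(x)\):

```python
import random

random.seed(3)
N = 100_000

# Fig. 5(a)-style self-masking (illustrative parameters): X -> Y, X -> R_x.
# P(X) is not recoverable from the data alone, but Y _||_ R_x | X holds,
# so with an external P(X) we can recover P(x, y) = P(y | x, R_x=0) P(x).
rows = []
for _ in range(N):
    x = 1 if random.random() < 0.35 else 0
    y = 1 if random.random() < (0.8 if x else 0.25) else 0
    r_x = 1 if random.random() < (0.5 if x else 0.1) else 0  # X masks itself
    rows.append((x, y, r_x))

# P(X) obtained externally, e.g. from a census bureau (here it matches
# the generating process by construction).
external_p_x = {0: 0.65, 1: 0.35}

est = {}
for xv in (0, 1):
    obs = [r for r in rows if r[2] == 0 and r[0] == xv]
    p_y1 = sum(r[1] for r in obs) / len(obs)   # P(Y=1 | x, R_x=0)
    est[(xv, 1)] = p_y1 * external_p_x[xv]
    est[(xv, 0)] = (1 - p_y1) * external_p_x[xv]

truth = {}
for xv in (0, 1):
    for yv in (0, 1):
        truth[(xv, yv)] = sum(1 for r in rows
                              if r[0] == xv and r[1] == yv) / N
print(est, truth)
```

The conditional factor is taken from the incomplete data (valid because \(Y\bot \!\!\!\bot R_x|X\)), while the marginal, which the data cannot supply, comes from the external source.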
6 Perils of Model Blind Recovery Procedures
Model-blind algorithms are algorithms that attempt to handle missing-data problems on the basis of the data alone, without making any assumptions about the structure of the missingness process. We unveil a fundamental limitation of model-blind algorithms by presenting two statistically indistinguishable models such that a given query is recoverable in one and non-recoverable in the other.
The two graphs in Fig. 6(a) and (b) cannot be distinguished by any statistical means, since Fig. 6(a) has no testable implications [16] and Fig. 6(b) is a complete graph. However, in Fig. 6(a), \(P (X,Y)=P(X^*|Y,R_x)P(Y)\) is recoverable, while in Fig. 6(b) P(X, Y) is not recoverable (by Theorem 2 in [17]).
An even stronger limitation is demonstrated below: no model-blind algorithm exists even in those cases where recoverability is feasible. We exemplify this claim by constructing two statistically indistinguishable models, \(G_1\) and \(G_2\), dictating different estimation procedures \(E_1\) and \(E_2\), respectively; yet Q is not recoverable in \(G_1\) by \(E_2\) or in \(G_2\) by \(E_1\).
Consider the graphs in Fig. 6(a) and (c); they are statistically indistinguishable since neither has testable implications. Let the target relation of interest be \(Q=P (X) \). In Fig. 6(a), Q may be estimated as \(P (X) = \sum _y P (X|Y,R_x=0) P (Y) \) since \(X \bot \!\!\!\bot R_x|Y\), whereas in Fig. 6(c), Q can be derived as \(P (X) = P (X|R_x=0) \) since \(X \bot \!\!\!\bot R_x\).
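The consequence of choosing the wrong estimator can be made concrete. In the sketch below, data are generated from a model with the structure of Fig. 6(a) (Y causes X, and Y drives \(R_x\)); the parameters are assumptions for illustration. The estimator licensed by Fig. 6(a) recovers P(X), while the estimator licensed by Fig. 6(c), though valid in that model, is biased on these data:

```python
import random

random.seed(4)
N = 100_000

# Data generated from a model with the structure of Fig. 6(a)
# (parameters are illustrative): Y -> X and Y -> R_x,
# so X _||_ R_x | Y holds but X _||_ R_x does not.
rows = []
for _ in range(N):
    y = 1 if random.random() < 0.5 else 0
    x = 1 if random.random() < (0.9 if y else 0.2) else 0
    r_x = 1 if random.random() < (0.6 if y else 0.05) else 0  # 1 = X missing
    rows.append((y, x, r_x))

obs = [r for r in rows if r[2] == 0]  # rows with X observed

# Estimator licensed by Fig. 6(a):  P(X=1) = sum_y P(X=1 | y, R_x=0) P(y).
e_a = 0.0
for yv in (0, 1):
    sel = [r for r in obs if r[0] == yv]
    p_x_given_y = sum(r[1] for r in sel) / len(sel)
    p_y = sum(1 for r in rows if r[0] == yv) / N
    e_a += p_x_given_y * p_y

# Estimator licensed only by Fig. 6(c):  P(X=1) = P(X=1 | R_x=0).
e_c = sum(r[1] for r in obs) / len(obs)

truth = sum(r[1] for r in rows) / N  # oracle before masking
print(e_a, e_c, truth)  # e_a tracks the truth; e_c is biased on these data
```

Since no statistical test on the manifest data can reveal which of the two graphs generated them, only the causal assumptions encoded in the m-graph can tell the analyst which estimator to trust.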
7 Related Work
Deletion-based methods such as listwise deletion, which are easy to understand as well as to implement, guarantee consistent estimates only for certain categories of missingness such as MCAR [24]. The Maximum Likelihood (ML) method is known to yield consistent estimates under the MAR assumption; the expectation-maximization algorithm and gradient-based algorithms are widely used to search for ML estimates under incomplete data [4, 5, 10, 11]. Most work in machine learning assumes MAR and proceeds with ML or Bayesian inference. However, there are exceptions, such as recent work on collaborative filtering and recommender systems that develops probabilistic models explicitly incorporating the missing data mechanism [13–15].
Other methods for handling missing data can be classified into two categories: (a) inverse probability weighted methods and (b) imputation-based methods [23]. Inverse probability weighting methods analyze and assign weights to complete records based on estimated probabilities of completeness [22, 32]. Imputation-based methods substitute a reasonable guess in the place of a missing value [1]; Multiple Imputation [12] is an imputation method that is less sensitive to a bad starting point.
Missing data is a special case of coarsened data, and data are said to be coarsened at random (CAR) if the coarsening mechanism is only a function of the observed data [9]. [21] introduced a methodology for parameter estimation from data structures in which the full data has a non-zero probability of being fully observed; this methodology was later extended to deal with censored data in which complete data on subjects are never observed [31].
The use of graphical models for handling missing data is a relatively new development. [3] used graphical models for analyzing missing information in the form of missing cases (due to sample selection bias). Attrition, a common occurrence in longitudinal studies, arises when subjects drop out of the study; [7, 25, 30] analysed the problem of attrition using causal graphs. [27, 28] cautioned the practitioner that, contrary to popular belief (as stated in [2, 6]), not all auxiliary variables reduce bias. Both [7, 28] associate missingness with a single variable, and interactions among several missingness mechanisms are left unexplored.
[17] employed a formal representation called Missingness Graphs to depict the missingness process, defined the notion of recoverability and derived conditions under which queries would be recoverable when datasets are categorized as Missing Not At Random (MNAR). Tests to detect misspecifications in the m-graph are discussed in [16].
8 Conclusions
This chapter presented the missing data problem from a causal perspective and provided procedures for estimating queries of interest from datasets falling into the MNAR (Missing Not At Random) category. We demonstrated how auxiliary variables and data from external sources can be used to circumvent theoretical impediments to recoverability. Finally, we showed that model-blind recovery techniques such as Multiple Imputation are prone to error and are insufficient to guarantee consistent estimates.
Notes
- 1. The presence of a non-recoverable factor in a summand does not always imply the non-recoverability of the summand. See Example 3 in [18].
References
Allison, P.D.: Missing Data Series: Quantitative Applications in the Social Sciences (2002)
Collins, L.M., Schafer, J.L., Kam, C.-M.: A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychol. Methods 6(4), 330 (2001)
Daniel, R.M., Kenward, M.G., Cousens, S.N., De Stavola, B.L.: Using causal diagrams to guide analysis in missing data problems. Stat. Methods Med. Res. 21(3), 243–256 (2012)
Darwiche, A.: Modeling and Reasoning with Bayesian Networks. Cambridge University Press, New York (2009)
Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the em algorithm. J. Roy. Stat. Soc. B. (Methodol.) 39(1), 1–38 (1977)
Enders, C.K.: Applied Missing Data Analysis. Guilford Publications, New York (2010)
Garcia, F.M.: Definition and diagnosis of problematic attrition in randomized controlled experiments. Working paper, April 2013. http://ssrn.com/abstract=2267120
Graham, J.W.: Missing Data: Analysis and Design. Statistics for Social and Behavioral Sciences. Springer, New York (2012)
Heitjan, D.F., Rubin, D.B.: Ignorability and coarse data. Ann. Stat. 19(4), 2244–2253 (1991)
Koller, D., Friedman, N.: Probabilistic Graphical Models: Principles and Techniques. Cambridge University Press, New York (2009)
Lauritzen, S.L.: The EM algorithm for graphical association models with missing data. Comput. Stat. Data Anal. 19(2), 191–201 (1995)
Little, R.J.A., Rubin, D.B.: Statistical Analysis with Missing Data. Wiley, New York (2002)
Marlin, B.M., Zemel, R.S.: Collaborative prediction and ranking with non-random missing data. In: Proceedings of the Third ACM Conference on Recommender Systems, pp. 5–12. ACM (2009)
Marlin, B.M., Zemel, R.S., Roweis, S., Slaney, M.: Collaborative filtering and the missing at random assumption. In: UAI (2007)
Marlin, B.M., Zemel, R.S., Roweis, S.T., Slaney, M.: Recommender systems: missing data and statistical model estimation. In: IJCAI (2011)
Mohan, K., Pearl, J.: On the testability of models with missing data. In: Proceedings of AISTAT (2014)
Mohan, K., Pearl, J., Tian, J.: Graphical models for inference with missing data. Adv. Neural Inf. Process. Syst. 26, 1277–1285 (2013)
Mohan, K., Pearl, J.: Graphical models for recovering probabilistic and causal queries from missing data. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 27, pp. 1520–1528 (2014)
Pearl, J.: Causality: Models, Reasoning and Inference. Cambridge University Press, New York (2009)
Pearl, J., Mohan, K.: Recoverability and testability of missing data: Introduction and summary of results. Technical report R-417, UCLA (2013). http://ftp.cs.ucla.edu/pub/stat_ser/r417.pdf
Robins, J.M., Rotnitzky, A.: Recovery of information and adjustment for dependent censoring using surrogate markers. In: Jewell, N.P., Dietz, K., Farewell, V.T. (eds.) AIDS Epidemiology, pp. 297–331. Springer, New York (1992)
Robins, J.M., Rotnitzky, A., Zhao, L.P.: Estimation of regression coefficients when some regressors are not always observed. J. Am. Stat. Assoc. 89(427), 846–866 (1994)
Rothman, K.J., Greenland, S., Lash, T.L.: Modern Epidemiology. Lippincott Williams & Wilkins, Philadelphia (2008)
Rubin, D.B.: Inference and missing data. Biometrika 63, 581–592 (1976)
Shadish, W.R.: Revisiting field experimentation: field notes for the future. Psychol. Methods 7(1), 3 (2002)
Shpitser, I., Mohan, K., Pearl, J.: Missing data as a causal and probabilistic problem. In: Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence (2015)
Thoemmes, F., Mohan, K.: Graphical representation of missing data problems. Struct. Equ. Model. Multi. J. 37(1), 1–13 (2015)
Thoemmes, F., Rose, N.: Selection of auxiliary variables in missing data problems: Not all auxiliary variables are created equal. Technical report R-002, Cornell University (2013)
Thoemmes, F., Mohan, K.: Graphical representation of missing data problems. Struct. Equ. Model. Multi. J. 22(4), 1–13 (2015)
Twisk, J., de Vente, W.: Attrition in longitudinal studies: how to deal with missing data. J. Clin. Epidemiol. 55(4), 329–337 (2002)
Van Der Laan, M.J., Robins, J.M.: Locally efficient estimation with current status data and time-dependent covariates. J. Am. Stat. Assoc. 93(442), 693–701 (1998)
Van der Laan, M.J., Robins, J.M.: Unified Methods for Censored Longitudinal Data and Causality. Springer, New York (2003)
© 2015 Springer International Publishing Switzerland
Mohan, K., Pearl, J. (2015). Missing Data from a Causal Perspective. In: Suzuki, J., Ueno, M. (eds) Advanced Methodologies for Bayesian Networks. AMBN 2015. Lecture Notes in Computer Science(), vol 9505. Springer, Cham. https://doi.org/10.1007/978-3-319-28379-1_13