Introduction to the foundations of causal discovery
Abstract
This article presents an overview of several known approaches to causal discovery. It is organized by relating the different fundamental assumptions that the methods depend on. The goal is to indicate that for a large variety of different settings the assumptions necessary and sufficient for causal discovery are now well understood.
Keywords
Causality, Graphical models, Causal discovery
1 Introduction
Like many scientific concepts, causal relations are not features that can be directly read off from the data but have to be inferred. The field of causal discovery is concerned with this inference and the assumptions that support it. We might have measures of different quantities obtained from, say, a cross-sectional study, on the amount of wine consumption (for some unit of time) and the prevalence of cardiovascular disease, and be interested in whether wine consumption is a cause of cardiovascular disease (positively or negatively), and not just whether it is correlated with it. That is, we would like to know whether the observed dependence between wine consumption and cardiovascular disease (suppose there is one) persists even if we change, say, in an experiment, the amount of wine that is consumed (see Fig. 1). The observed dependence between wine consumption and cardiovascular disease may, after all, be due to a common cause, such as socioeconomic status (SES), where those people with a higher SES consume more wine and are able to afford better health care, whereas those with a lower SES do not consume as much wine and have poorer health care.^{1} The example illustrates the common mantra that “correlation does not imply causation” and suggests that causal relations can be identified in an experimental setting, such as a randomized controlled trial where each individual in the experiment is randomly assigned to either the treatment or control group (in this case, to different levels of wine consumption) and the effect on cardiovascular disease is measured. The randomized assignment makes the wine consumption independent of its normal causes (at least in the large sample limit) and thereby destroys the “confounding” effect of SES.
Naturally, there are many concerns about such an analysis, starting from the ethical concerns of such a study, the compliance with treatment, the precise treatment levels, the representativeness of the experimental population with respect to the larger population, etc., but the general methodological reason, explicitly emphasized in Fisher’s [6] well-known work on experimental design, of why randomized controlled trials are useful for causal discovery becomes evident: randomization breaks confounding, whether due to an observed or unobserved common cause.
Causal relations are of interest because only an understanding of the underlying causal relations can support predictions about how a system will behave when it is subject to intervention. If moderate wine consumption, in fact, causes the reduction in the risk of cardiovascular disease (this article takes no stand on the truth of this claim), then a health policy that suggests moderate wine consumption can be expected to be effective in reducing cardiovascular disease (with due note to all the other concerns about implementation). But if the observed dependence is only due to some common cause, such as SES, then a policy that changes wine consumption independently of SES would have no effect on cardiovascular disease.
A purely probabilistic representation of these relations is ambiguous with respect to the underlying causal relations: if we let wine consumption be X and cardiovascular disease be Y, then, without further specification, \(P(Y \mid X)\), the conditional probability of cardiovascular disease given a particular level of wine consumption, is ambiguous with regard to whether it describes the relation in an experimental setting in which the wine consumption was determined by randomization or whether it describes observational relations, such as in the initial example of a cross-sectional study. Pearl [31] introduced the do(.)-operator as a notation to distinguish the two cases. Thus, \(P(Y \mid X)\) is the observational conditional probability describing how the probability of Y would change if one observed X (e.g., in a cross-sectional study) while \(P(Y \mid do(X))\) is the interventional conditional probability, describing the probability of Y when X has been set experimentally. Of course, not all data can be classified cleanly as observational vs. interventional, since there might well be experiments that do not fully determine the value of the intervened variable. But for the sake of this article, the distinction will suffice (see [28] and [5] for further discussion).
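The difference between the observational and the interventional quantity can be made concrete with a small simulation. The following is only a sketch under invented assumptions: all probabilities are hypothetical, SES is the only common cause, and wine consumption is given no causal effect on cardiovascular disease at all (the article takes no stand on the real effect). The names `simulate` and `risk_difference` are illustrative and do not come from any library.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical parameters, chosen only for illustration.
P_WINE_GIVEN_SES = {0: 0.2, 1: 0.8}  # P(wine = 1 | SES)
P_CVD_GIVEN_SES = {0: 0.3, 1: 0.1}   # P(CVD = 1 | SES); wine has no effect here


def simulate(randomize_wine):
    """Draw (wine, cvd) samples; if randomize_wine, wine is set by a coin flip."""
    ses = rng.integers(0, 2, size=n)
    if randomize_wine:
        # Intervention do(wine): assignment ignores SES entirely.
        wine = rng.integers(0, 2, size=n)
    else:
        # Observation: wine consumption depends on SES.
        p_wine = np.where(ses == 1, P_WINE_GIVEN_SES[1], P_WINE_GIVEN_SES[0])
        wine = (rng.random(n) < p_wine).astype(int)
    p_cvd = np.where(ses == 1, P_CVD_GIVEN_SES[1], P_CVD_GIVEN_SES[0])
    cvd = (rng.random(n) < p_cvd).astype(int)
    return wine, cvd


def risk_difference(wine, cvd):
    """Estimate P(CVD = 1 | wine = 1) - P(CVD = 1 | wine = 0) from samples."""
    return cvd[wine == 1].mean() - cvd[wine == 0].mean()


obs_diff = risk_difference(*simulate(randomize_wine=False))
int_diff = risk_difference(*simulate(randomize_wine=True))
print(f"observational risk difference:  {obs_diff:+.3f}")
print(f"interventional risk difference: {int_diff:+.3f}")
```

With these made-up numbers the observational difference works out analytically to 0.14 - 0.26 = -0.12, whereas the interventional difference is zero; the simulation should reproduce both up to sampling noise.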
In light of the general underdetermination of causal relations given any probability distribution, it is useful to represent the causal structure explicitly in terms of a directed graph. Unlike other graphical models with directed or undirected edges, which merely represent an independence structure, causal graphical models support a very strong interpretation: For a given set of variables \(\mathbf{{V}}= \{X_1,\ldots , X_n\}\), a causal graph \(G = \{\mathbf{{V}}, \mathbf{E}\}\) represents the causal relations over the set of variables \(\mathbf{{V}}\), in the sense that for any directed edge \(e= X_i \rightarrow X_j\) in \(\mathbf{E}\), \(X_i\) is a direct cause of \(X_j\) relative to variables in \(\mathbf{{V}}\). So the claim of an edge in G is that even if you randomize all other variables in \(\mathbf{{V}}\setminus \{X_i,X_j\}\), thereby breaking any causal connection between \(X_i\) and \(X_j\) through these other variables, \(X_i\) still has a causal effect on \(X_j\). Moreover, the causal graph characterizes the effect of an intervention on \(X_i\) on the remaining variables precisely in terms of the subgraph that results when all directed edges into \(X_i\) are removed from G. Thus, a causal graph not only makes claims about the causal pathways active in an observational setting, but also indicates which causal pathways are active in any experiment on a set of variables in \(\mathbf{{V}}\). Naturally, a direct cause between \(X_i\) and \(X_j\) may no longer be direct once additional variables are introduced, hence the relativity to the set \(\mathbf{{V}}\).
We use intuitive (and standard) terminology to refer to particular features of the graph: A path between two variables X and Y in G is defined as a non-repeating sequence of edges (oriented in either direction) in G where any two adjacent edges in the sequence share a common endpoint and the first edge “starts” with X and the last “ends” with Y. A directed path is a path whose edges all point in the same direction. A descendent of a vertex Z is a vertex \(W \in \mathbf{{V}}\), such that there is a directed path \(Z \rightarrow \cdots \rightarrow W\) in the graph G. Correspondingly, Z is an ancestor of W. The parents of a vertex X are the vertices in \(\mathbf{{V}}\) with a directed edge oriented into X, similarly for the children of a vertex.^{2} A collider on a path p is a vertex on p whose adjacent edges both point into the vertex, i.e., \(\rightarrow Z\leftarrow \). A non-collider on p is a vertex on p that is not a collider, i.e., it is a mediator (\(\rightarrow Z \rightarrow \)) or a common cause (\(\leftarrow Z \rightarrow \)). Note that a vertex can take on different roles with respect to different paths.
2 Basic assumptions of causal discovery
Given the representation of causal relations over a set of variables in terms of causal graphs, causal discovery can be characterized as the problem of identifying as much as possible about the causal relations of interest (ideally the whole graph G) given a dataset of measurements over the variables \(\mathbf{{V}}\). To separate the causal part from the statistical part of the inference it is—at least for an introduction—useful to think of causal discovery as the inference task from the joint distribution \(P(\mathbf{{V}})\) to the graph G, leaving the task of estimating \(P(\mathbf{{V}})\) from the finite data to the statistician.^{3} In principle, there is no a priori reason for the joint distribution \(P(\mathbf{{V}})\) to constrain the possible true generating causal structures at all. We noted earlier that correlation does not imply causation (and similarly, the converse is not true either, though that may not be as obvious initially). Yet, we do take both dependencies and independencies as indicators of causal relations (or the lack thereof). For example, it seemed perfectly reasonable above to claim that if a dependence between X and Y was detected in a randomized controlled trial where X was subject to intervention, then X is a cause of Y (again modulo the many other assumptions about successful experiment implementation). Similarly, in the observational case, the dependence between X and Y, if it was not a result of a direct cause, was explained by a common cause. Consequently, there seem to be principles we use—more or less explicitly—that connect probabilistic relations to causal relations.
Two such principles that have received wide application in the methods of causal discovery are the causal Markov and the causal faithfulness conditions. The high-level idea is that the causal Markov and faithfulness conditions together imply a correspondence between the (conditional) independences in the probability distribution and the causal connectivity relations within the graph G. Causal connectivity in a graph is defined in terms of d-separation and d-connection [30]: A path p between X and Y d-connects X and Y given a conditioning set \(\mathbf{C}\subseteq \mathbf{{V}}\setminus \{X, Y\}\) if and only if (i) all colliders on p are in \(\mathbf{C}\) or have a descendent in \(\mathbf{C}\) and (ii) no non-colliders of p are in \(\mathbf{C}\). X and Y are d-separated given \(\mathbf{C}\) if and only if there are no d-connecting paths between them given \(\mathbf{C}\). D-separation is often denoted by the single turnstile ‘\(\bot \)’.
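The definition of d-connection translates almost literally into code. The sketch below is a naive illustration, not an efficient algorithm: it enumerates every path between two vertices of a small acyclic graph and checks conditions (i) and (ii) on each. The graph encoding (a set of `(parent, child)` pairs) and the function names are assumptions made for this example, and a vertex is counted as its own descendent.

```python
def descendants(edges, v):
    """All vertices reachable from v along directed edges, including v itself."""
    seen, stack = {v}, [v]
    while stack:
        u = stack.pop()
        for a, b in edges:
            if a == u and b not in seen:
                seen.add(b)
                stack.append(b)
    return seen


def d_connected(edges, x, y, cond):
    """True if some path between x and y is d-connecting given the set cond."""
    cond = set(cond)
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    def path_is_open(path):
        # Inspect every interior vertex z together with its two path neighbours.
        for prev, z, nxt in zip(path, path[1:], path[2:]):
            is_collider = (prev, z) in edges and (nxt, z) in edges
            if is_collider:
                # (i) a collider must be in cond or have a descendent in cond
                if not (descendants(edges, z) & cond):
                    return False
            elif z in cond:
                # (ii) a non-collider must not be in cond
                return False
        return True

    def extend(path):
        last = path[-1]
        if last == y:
            return path_is_open(path)
        return any(extend(path + [nb])
                   for nb in neighbours.get(last, ()) if nb not in path)

    return extend([x])


# Common cause: Wine <- SES -> CVD
confounded = {("SES", "Wine"), ("SES", "CVD")}
print(d_connected(confounded, "Wine", "CVD", set()))
print(d_connected(confounded, "Wine", "CVD", {"SES"}))

# Collider with a child: X -> Z <- Y, Z -> W
collider = {("X", "Z"), ("Y", "Z"), ("Z", "W")}
print(d_connected(collider, "X", "Y", set()))
print(d_connected(collider, "X", "Y", {"W"}))
```

Conditioning on the common cause closes the path in the first graph, while conditioning on the collider or on its descendent opens the path in the second. Enumerating all paths is exponential in general, so this is only suitable for small illustrative graphs.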
Before proceeding, it is worth making explicit what causal Markov and causal faithfulness claim, and under what circumstances they may be false. The causal Markov condition states that every vertex X in the graph G is probabilistically independent of its non-descendents given its parents, i.e., \(X \perp\!\!\!\perp \mathrm{nondesc}(X) \mid pa(X)\). The causal Markov assumption appears to be a very fundamental assumption of our understanding of causality, since it is quite difficult to come up with situations that we consider to be causal and yet violate causal Markov. There are many ways in which a system may appear to violate causal Markov. For example, if one only considers two variables X and Y, but in fact there is an unmeasured common cause L of X and Y, i.e., \(X \leftarrow L \rightarrow Y\), then Y is a non-descendent of X but X and Y will be dependent. Naturally, this situation is quickly remedied once L is included in the model and L is conditioned on (as a parent of X). Similar cases of “model misspecification” can lead to apparent violations of the Markov condition when we have mixtures of different populations, sample selection bias, misspecified variables or variables that have been excessively coarse-grained (see [13] for more discussion). But in all these cases an appropriate specification of the underlying causal model will provide a causal system that is consistent with the Markov condition. To my knowledge, only in the case of quantum mechanics do we have systems for which we have good reasons to think they are causal and yet there does not appear to be a representation that respects the Markov condition. It is not entirely clear what to make of such cases. As Clark Glymour puts it, “[The Aspect experiments (that test the Einstein-Podolsky-Rosen predictions)] create associations that have no causal explanation consistent with the Markov assumption, and the Markov assumption must be applied [...] to obtain that conclusion. You can say that there is no causal explanation of the phenomenon, or that there is a causal explanation but it doesn’t satisfy the Markov assumption. I have no trouble with either alternative.” [10].
The situation is quite different with regard to causal faithfulness. It states the converse of the Markov condition, i.e., if a variable X is independent of Y given a conditioning set \(\mathbf{C}\) in the probability distribution \(P(\mathbf{{V}})\), then X is d-separated from Y given \(\mathbf{C}\) in the graph G. Faithfulness can be thought of as a simplicity assumption and it is relatively easy to find violations of it: there only have to be causal connections that do not exhibit a dependence. For example, if two causal pathways cancel out each other’s effects exactly, then the causally connected variables will remain independent. A practical example is a backup generator: Normally the machine is powered by electricity from the grid, but when the grid fails, a backup generator kicks in to supply the energy, thereby making the operation of the machine independent of the grid, even though of course the grid normally causes the machine to work or when it fails it causes the generator to switch on, which causes the machine to work.^{4} While such failures of faithfulness require an exact cancelation of the causal pathways, with finite data two variables may often appear independent despite the fact that they are (weakly) causally connected (see [47]).
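A cancelation of pathways is easy to construct in a linear model. The following sketch assumes a hypothetical linear Gaussian system with edges X → Z, Z → Y and X → Y, in which the direct coefficient is set to exactly the negative of the product along the indirect path. The coefficients and the helper `residual` are illustrative choices, not taken from the literature cited here.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

a, b = 1.0, 1.0   # X -> Z and Z -> Y
c = -a * b        # X -> Y, tuned so the two pathways cancel exactly

x = rng.normal(size=n)
z = a * x + rng.normal(size=n)
y = b * z + c * x + rng.normal(size=n)


def residual(target, regressor):
    """Residual of a least-squares line fit of target on regressor."""
    slope, intercept = np.polyfit(regressor, target, 1)
    return target - (slope * regressor + intercept)


# Marginal association between the cause X and its effect Y.
marginal_corr = np.corrcoef(x, y)[0, 1]
# Partial correlation of X and Y given Z (correlation of the two residuals).
partial_corr = np.corrcoef(residual(x, z), residual(y, z))[0, 1]

print(f"corr(X, Y)     = {marginal_corr:+.3f}")
print(f"corr(X, Y | Z) = {partial_corr:+.3f}")
```

In this construction X and Y are marginally uncorrelated even though X is a cause of Y twice over, so an independence test would report an independence that does not correspond to d-separation in the graph. The dependence reappears once Z is conditioned on.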
To keep the present introduction to causal discovery simple initially, we can add additional assumptions about the underlying causal structure. Two commonly used assumptions are that the causal structure is assumed to be acyclic, i.e., that there is no directed path from a vertex back to itself in G, and causal sufficiency, i.e., that there are no unmeasured common causes of any pair of variables in \(\mathbf{{V}}\). Both of these assumptions are obviously not true in many domains (e.g., biology, social sciences, etc.) and below we will see how methods have been developed that do not depend on them. For now, they help to keep the causal discovery task more tractable and easy to illustrate.^{5}
With these conditions in hand (Markov, faithfulness, acyclicity and causal sufficiency), we can now ask what one can learn about the underlying causal relations given the (estimated) joint distribution \(P(\mathbf{{V}})\) over a set of variables \(\mathbf{{V}}\). Can we learn anything about the causal relation at all without performing experiments or having information about the time order of variables?
In fact, substantial information can be learned about the underlying causal structure from an observational probability distribution \(P(\mathbf{{V}})\) given these assumptions alone. In 1990, Verma and Pearl [32] and Frydenberg [7] independently showed that any two acyclic causal structures (without unmeasured variables) that are Markov and faithful to the same distribution \(P(\mathbf{{V}})\) share the same adjacencies (the same undirected graphical skeleton) and the same unshielded colliders. An unshielded collider is a collider whose two parents are not adjacent in G. Thus, Markov and faithfulness imply an equivalence structure over directed acyclic graphs, where graphs that are in the same equivalence class have the same (conditional) independence structure, the same adjacencies and the same unshielded colliders. For three variables, the Markov equivalence classes are shown in Fig. 2. Note that the graph \(X\rightarrow Z\leftarrow Y\) is in its own equivalence class. That means that independence constraints alone are sufficient to uniquely determine the true causal structure G if it is of the form \(X\rightarrow Z\leftarrow Y\) (given the conditions stated). This is rather significant, since it implies that sometimes no time order information or experiment is necessary to uniquely determine the causal structure over a set of variables. More generally, knowing the Markov equivalence class of the true causal structure substantively reduces the underdetermination. In general, no closed form is known for how many equivalence classes there are or how many graphs there are per equivalence class, but large-scale simulations have been run [9, 11]. It is worth noting that for any number of variables N, there will always be several singleton equivalence classes (e.g., the empty graph, or those containing only unshielded colliders), but that there will also always be at least one equivalence class that contains N! graphs, namely the class containing all the graphs for which each pair of variables is connected by an edge (the set of complete graphs).
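For a handful of variables the equivalence classes can simply be enumerated. The sketch below uses the characterization just stated (same skeleton, same unshielded colliders) to group all directed acyclic graphs over three variables. The representation of a graph as a set of `(parent, child)` pairs and all function names are assumptions of this example, and the brute-force enumeration is only feasible for very small N.

```python
from itertools import combinations, product


def skeleton(edges):
    """The adjacencies of a graph, ignoring edge direction."""
    return {frozenset(e) for e in edges}


def unshielded_colliders(edges):
    """Triples a -> z <- b in which a and b are not adjacent."""
    skel = skeleton(edges)
    found = set()
    for (a, z1), (b, z2) in product(edges, repeat=2):
        if z1 == z2 and a != b and frozenset((a, b)) not in skel:
            found.add((frozenset((a, b)), z1))
    return found


def markov_equivalent(g1, g2):
    return (skeleton(g1) == skeleton(g2)
            and unshielded_colliders(g1) == unshielded_colliders(g2))


def is_acyclic(nodes, edges):
    """Repeatedly strip vertices with no incoming edge; fail if none is left."""
    remaining = set(nodes)
    while remaining:
        sources = {v for v in remaining
                   if not any(b == v and a in remaining for a, b in edges)}
        if not sources:
            return False
        remaining -= sources
    return True


def all_dags(nodes):
    """Every pair is either non-adjacent or joined in one of two directions."""
    pairs = list(combinations(nodes, 2))
    for choice in product((None, 0, 1), repeat=len(pairs)):
        edges = frozenset((a, b) if c == 0 else (b, a)
                          for (a, b), c in zip(pairs, choice) if c is not None)
        if is_acyclic(nodes, edges):
            yield edges


def equivalence_classes(nodes):
    classes = []
    for graph in all_dags(nodes):
        for cls in classes:
            if markov_equivalent(graph, cls[0]):
                cls.append(graph)
                break
        else:
            classes.append([graph])
    return classes


classes = equivalence_classes("XYZ")
sizes = sorted(len(cls) for cls in classes)
print(len(classes), "classes with sizes", sizes)
```

For three variables this groups the 25 acyclic graphs into 11 classes. The collider \(X\rightarrow Z\leftarrow Y\) sits alone in its class, and the complete graphs form one class of 3! = 6 members.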
Algorithms have been developed that use conditional independence tests to determine the Markov equivalence class of causal structures consistent with a given dataset. For example, the PC-algorithm [41] was developed on the basis of exactly the set of assumptions just discussed (Markov, faithfulness, acyclicity and causal sufficiency) and uses a sequence of carefully selected (conditional) independence tests to both identify as much as possible about the causal structure and to perform as few tests as possible. In a certain sense, the PC-algorithm is complete: it extracts all information about the underlying causal structure that is available in the statements of conditional (in)dependence. Or more formally, this bound can be characterized in terms of a limiting result due to Geiger and Pearl [8] and Meek [26]:
Theorem 1
(Markov completeness) For linear Gaussian and for multinomial causal relations, an algorithm that identifies the Markov equivalence class is complete.
That is, if the causal relations between the causes and effects in G can be characterized either by a linear Gaussian relation of the form \(x_i = \sum _{j \ne i} a_jx_j+ \epsilon _i\) with \(\epsilon _i\sim N(\mu _i, \sigma _i^2)\) or by conditional distributions \(P(X_i \mid pa(X_i))\) that are multinomial, then the PC-algorithm, which in the large sample limit identifies the Markov equivalence class of the true causal model, identifies as much as there is to identify about the underlying causal model.
 1.
One could weaken the assumptions, thereby (in general) increasing the underdetermination of what one will be able to discover about the underlying causal structure. For example, the FCI-algorithm [41] drops the assumption of causal sufficiency and allows for unmeasured common causes of the observed variables; the CCD-algorithm [36] drops the assumption of acyclicity and allows for feedback, and the SAT-based causal discovery methods discussed below can drop both assumptions. Alternatively, Zhang and Spirtes [49] have worked on weakening the assumption of faithfulness, with corresponding algorithms presented in a paper in this issue. In all cases, the aim of these more general approaches is to develop causal discovery methods that identify as much as possible about the underlying causal relations.
 2.
The limits to causal discovery described in Theorem 1 apply to restricted cases—multinomials and linear Gaussian parameterizations. One can exclude these cases and ask what happens when the distributions are not linear Gaussian or not multinomial. We consider several such approaches below.
 3.
One could consider more general data collection setups to help reduce the underdetermination. For example, one could consider the inclusion of specific experimental data to reduce the underdetermination or use additional “overlapping” datasets that share some but perhaps not all the observed variables (see [44] for an overview).
3 Linear non-Gaussian models
Theorem 2
(Linear non-Gaussian) Under the assumption of causal Markov, acyclicity and a linear non-Gaussian parameterization (Eq. 2), the causal structure can be uniquely determined.
Not even faithfulness is required here. Thus, merely the assumption that the causal relations are linear and that the added noise is anything but Gaussian guarantees in the large sample limit that the true causal model can be uniquely identified.
Theorem 3
Taking the contrapositive and substituting the variables of the above example, if x and \(\epsilon _y\) are independent, nondegenerate random variables that are not normally distributed, then the two linear combinations y and \(\tilde{\epsilon }_X\) (Eqs. 3 and 5) are not independent. That is, if we mistakenly fit a backwards model to data that in fact came from a forwards model, then we would find that y and the residuals on x would be dependent, i.e., \(y \not\perp\!\!\!\perp \tilde{\epsilon }_X\), despite the fact that the independence is required by d-separation on the backwards model. In other words, we would notice our mistake and would be able to correctly identify the true (in this case, forwards) model. Of course, this only proves the point for two variables, but the more general proofs can be found in [39], along with some alternative graphical demonstrations that may help the intuition underlying this identifiability result. It should also be noted that the Darmois-Skitovich theorem underlies the method of Independent Component Analysis [20].
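The two-variable argument can be checked numerically. The sketch below is an illustration under assumptions of my choosing: the true model is x → y with unit coefficient, the non-Gaussian case uses uniform distributions, and dependence between regressor and residual is measured by the correlation of their centered squares. That statistic is a crude stand-in for a proper independence test (a value near zero does not prove independence), and the function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000


def residual(target, regressor):
    """Residual of a least-squares line fit of target on regressor."""
    slope, intercept = np.polyfit(regressor, target, 1)
    return target - (slope * regressor + intercept)


def squared_corr(u, v):
    """Correlation of centered squares; near zero when u and v are independent."""
    u2 = (u - u.mean()) ** 2
    v2 = (v - v.mean()) ** 2
    return np.corrcoef(u2, v2)[0, 1]


def direction_scores(x, y):
    """Dependence between regressor and residual for both candidate models."""
    forward = squared_corr(x, residual(y, x))   # fit y = a*x + residual
    backward = squared_corr(y, residual(x, y))  # fit x = b*y + residual
    return forward, backward


# Non-Gaussian case: uniform cause and uniform noise.
x_u = rng.uniform(-1, 1, n)
y_u = x_u + rng.uniform(-1, 1, n)
uniform_fwd, uniform_bwd = direction_scores(x_u, y_u)

# Gaussian case: same structure, normal cause and normal noise.
x_g = rng.normal(size=n)
y_g = x_g + rng.normal(size=n)
gauss_fwd, gauss_bwd = direction_scores(x_g, y_g)

print(f"uniform:  forward {uniform_fwd:+.3f}, backward {uniform_bwd:+.3f}")
print(f"gaussian: forward {gauss_fwd:+.3f}, backward {gauss_bwd:+.3f}")
```

In the uniform case only the forwards fit leaves a residual that looks independent of its regressor, so the direction can be read off. In the Gaussian case both directions pass the check, which is the two-variable face of the limit stated in Theorem 1.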
These powerful identifiability results have been implemented in causal discovery algorithms that go by the acronym of LinGaM, for Linear non-Gaussian Models, and have been generalized (with slight weakenings of the identifiability) to settings where either causal sufficiency [15] or acyclicity [23] is dropped, or where the data generating process satisfies the LinGaM assumptions, but the actual data is the result of an invertible nonlinear transformation, resulting in the so-called post-nonlinear model [50, 51].
4 Nonlinear additive noise models
We know (from the previous section) that when the \(f_j\) are linear, then identifiability requires that the error distributions are non-Gaussian. But one can ask what the conditions for unique identifiability of the causal structure are when the \(f_j\) are nonlinear (and there are no restrictions other than non-degeneracy on the error distributions). Identifiability results of this kind are developed in Hoyer et al. [14] and Mooij et al. [27]. The authors characterize a very intricate condition (I will here only refer to it as the Hoyer condition) on the relation between the function f, the noise distribution and the parent distribution,^{6} and provide the following theorem:
Theorem 4
(nonlinear additive noise) Under the assumption of Markov, acyclicity and causal sufficiency and a nonlinear additive noise parameterization (Eq. 6), unless the Hoyer condition is satisfied, the true causal structure can be uniquely identified.

If the (additive) error distributions are all Gaussian, then the only functional form that satisfies the Hoyer condition is linearity, otherwise the model is uniquely identifiable.

If the (additive) error distributions are non-Gaussian, then there exist (rather contrived) functions that satisfy the Hoyer condition, but in general the model is uniquely identifiable.

If the functions are linear, but the (additive) error distributions are non-Gaussian, then there does not exist a linear backwards model (this is the result of the LinGaM approach of the previous section), but there exist cases where one can fit a nonlinear backwards model [51].
Again, an understanding of these results may be aided with a simple example of two variables (taken from [14]). Fig. 4a–c show first the data from a linear Gaussian model. As the “cuts” through the data indicate, no matter whether one fits the forwards or the backwards model, a Gaussian distribution of the residuals can be found that is independent of the value of the respective cause (x in the forwards, and y in the backwards model). However, panels d–f show that this no longer is true if the true model is in fact a nonlinear Gaussian (forwards) model: While the error distribution is independent of the value of the cause in the (correct) forwards model, the error distribution on x is dependent on the value of y if one attempts to construct a backwards model, i.e., we have \(\tilde{\epsilon }_X \not\perp\!\!\!\perp y\), when in fact an independence is required for the backwards model to be true.
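The same kind of check can be run for a nonlinear model with Gaussian noise. The sketch below is not the example from [14]: it assumes a quadratic forwards model chosen because its behaviour is easy to verify, fits both directions with a cubic polynomial (a restricted stand-in for nonparametric regression), and again uses the correlation of centered squares as a rough dependence measure. All names and parameter values are assumptions of this illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000

# Hypothetical forwards model: y = f(x) + Gaussian noise with f(x) = x^2.
x = rng.normal(size=n)
y = x ** 2 + 0.5 * rng.normal(size=n)


def poly_residual(target, regressor, degree=3):
    """Residual of a least-squares polynomial fit of target on regressor."""
    coeffs = np.polyfit(regressor, target, degree)
    return target - np.polyval(coeffs, regressor)


def squared_corr(u, v):
    """Correlation of centered squares; near zero when u and v are independent."""
    u2 = (u - u.mean()) ** 2
    v2 = (v - v.mean()) ** 2
    return np.corrcoef(u2, v2)[0, 1]


# Forwards model: residual of y given x should look independent of x.
forward_dep = squared_corr(x, poly_residual(y, x))
# Backwards model: residual of x given y stays dependent on y.
backward_dep = squared_corr(y, poly_residual(x, y))

print(f"forward dependence:  {forward_dep:+.3f}")
print(f"backward dependence: {backward_dep:+.3f}")
```

In the forwards direction the residual is essentially the injected noise and shows no dependence on x. In the backwards direction the spread of the residual grows with y, so no additive-noise model of x given y fits, and the asymmetry identifies the direction.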
Causal discovery algorithms have been developed for these settings (see the papers) and the identifiability results have been generalized [35], including to certain types of discrete distributions (see next section). There have, to my knowledge, not been extensions to the causally insufficient or cyclic case.
In light of the identifiability results of this section and the previous one it is ironic that so much of structural equation modeling has historically focused on the linear Gaussian case. The identifiability results mentioned here indicate that this focus on computationally simple models came at the expense of the identifiability of the underlying causal model. So in cases when the true causal model is known, then linear Gaussian parameterizations make the computation of causal effects very easy, but for the identifiability of the model in the first place, the linear Gaussian case is about as bad as it could be.
5 Restrictions on multinomial distributions
Naturally, one can also consider the possibilities of avoiding the limitations placed on causal discovery by Theorem 1 with respect to discrete distributions. This has been a much less explored direction of inquiry, possibly due to the difficulty of estimating specific features of discrete distributions, especially when the state space is finite. Alternatively, the domain of application of discrete distributions may provide only much weaker grounds for the justification of assumptions that pick out specific discrete distributions. The multinomial distribution therefore provides a useful unconstrained model, yet causal identifiability is limited to the Markov equivalence class.
Instead of considering additive noise models, Park & Raskutti [29] consider discrete variables with Poisson distributions. Again, the causal structure can be identified as long as the variables have nonzero variances in specific settings (see their Theorem 3.1 for the precise condition). The key idea that drives the identifiability result in this case is overdispersion. For a variable X that is marginally Poisson distributed, we have \(E(X) = Var(X)\), but for a variable \(Y \mid X\) that is conditionally Poisson distributed, we have \(Var(Y) > E(Y)\). The argument is nicely illustrated with the simple bivariate example on p. 3 in [29].
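The overdispersion of the effect follows from the law of total variance: if \(Y \mid X\) is Poisson, then \(Var(Y) = E(Var(Y \mid X)) + Var(E(Y \mid X)) = E(Y) + Var(E(Y \mid X))\), which exceeds \(E(Y)\) whenever the conditional mean varies with X. The simulation below is a sketch with a made-up rate function, not the example from [29]; the function name `dispersion` is illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000

x = rng.poisson(3.0, size=n)   # X is marginally Poisson with rate 3
y = rng.poisson(1.0 + x)       # Y | X is Poisson with hypothetical rate 1 + X


def dispersion(v):
    """Variance-to-mean ratio; equal to 1 for a Poisson variable."""
    return v.var() / v.mean()


disp_x = dispersion(x)
disp_y = dispersion(y)
print(f"Var(X)/E(X) = {disp_x:.3f}")
print(f"Var(Y)/E(Y) = {disp_y:.3f}")
```

With this rate function \(E(Y) = 4\) and \(Var(Y) = 4 + 3 = 7\), so the ratio for Y is 1.75 while the ratio for X is 1. The variable whose variance matches its mean is the candidate root cause, which is the asymmetry the identifiability argument exploits.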
To my knowledge, there is very little work (other than some subcases of the additive noise models referred to above) that has developed general restrictions to enable identifiability of the causal structure for discrete models with finite state spaces, even though it is known that the assumption of a so-called “noisy-OR” parameterization enables in some cases identifiability beyond that of Markov equivalence.
6 Experiments and background knowledge
The previous several sections have considered the challenge of causal discovery in terms of finding weak generic assumptions about the nature of the underlying causal system that will enable or at least aid the identifiability of the true causal model. But for any concrete problem of causal discovery in application, the search space of candidate causal models will often not include all possible causal structures over the set of variables in the first place, but be highly constrained by available background knowledge concerning, e.g., particular causal pathways, time ordering, tier orderings of variables (i.e., that some subsets of variables come before others) or even less specific prior knowledge about, say, the edge density or the connectivity of the true causal structure. This type of background knowledge can similarly aid the identifiability of the causal model, possibly even without making additional assumptions about the functional form of the causal relations.
Once all the information has been encoded in constraints in propositional logic, one can use standard Boolean SAT(isfiability) solvers to determine solutions consistent with the joint set of constraints. The nice feature of using these solvers is that they are entirely domain general and highly optimized. Consequently, with a suitably general encoding one can integrate heterogeneous information from a variety of different sources into the discovery procedure.
A solver will return either one solution consistent with the constraints (that is, one assignment of truth values to the atomic propositional variables, which in turn specify one graph) or it can return only the truth value for those atomic variables that have the same truth value in all the solutions consistent with the constraints. A so-called “backbone” of the constraints specifies those features of the causal graph that are determined in light of the constraints.
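The notions of a solution and of the backbone can be illustrated without an actual SAT solver. The sketch below replaces the solver with brute-force enumeration over all truth assignments, which only works for a toy problem. It assumes one atomic proposition per possible directed edge over three variables, and a hypothetical set of constraints: acyclicity, the background knowledge that X precedes the other variables in time, and a given pattern of adjacencies. None of the names refer to an existing solver API.

```python
from itertools import product

NODES = ("X", "Y", "Z")
# One atom per ordered pair: atom (a, b) is true iff the edge a -> b is present.
ATOMS = [(a, b) for a in NODES for b in NODES if a != b]


def acyclic(g):
    """True if the edges switched on in g contain no directed cycle."""
    edges = [e for e, present in g.items() if present]
    remaining = set(NODES)
    while remaining:
        sources = {v for v in remaining
                   if not any(b == v and a in remaining for a, b in edges)}
        if not sources:
            return False
        remaining -= sources
    return True


def adjacent(g, a, b):
    return g[(a, b)] or g[(b, a)]


# Hypothetical constraints standing in for encoded data and background knowledge.
CONSTRAINTS = [
    acyclic,
    lambda g: not g[("Y", "X")] and not g[("Z", "X")],  # X comes first in time
    lambda g: adjacent(g, "X", "Y"),
    lambda g: adjacent(g, "Y", "Z"),
    lambda g: not adjacent(g, "X", "Z"),
]

# Brute-force stand-in for a SAT solver: keep every satisfying assignment.
solutions = []
for values in product([False, True], repeat=len(ATOMS)):
    g = dict(zip(ATOMS, values))
    if all(constraint(g) for constraint in CONSTRAINTS):
        solutions.append(g)

# Backbone: atoms that take the same truth value in every solution.
backbone = {atom: solutions[0][atom] for atom in ATOMS
            if all(s[atom] == solutions[0][atom] for s in solutions)}

print(len(solutions), "solutions")
for atom, value in backbone.items():
    print(f"{atom[0]} -> {atom[1]}: {value}")
```

Two graphs satisfy these constraints, \(X\rightarrow Y\rightarrow Z\) and \(X\rightarrow Y\leftarrow Z\). The backbone therefore fixes the edge \(X\rightarrow Y\) and the absence of any edge between X and Z, but leaves the orientation between Y and Z open.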
While these SAT-based approaches are incredibly versatile in terms of the information they can integrate into the search procedure, and while they can achieve remarkably accurate results, they do not yet scale as well as other causal discovery algorithms. But there are several comments worth making in this regard: (1) The runtime of a constraint optimization using standard SAT-based solvers has a very high variance; many instances can be resolved in seconds while some can take vastly longer. (2) The runtime is highly dependent on the set of constraints available and the search spaces they are applied to; for example [19] used a SAT-based method for causal discovery in the highly constrained domain of subsampled time series and were able to scale to around 70 variables. (3) We can expect significant improvements in the scalability with the development of more efficient encodings and the parallelization of the computation. (4) One can always explore the accuracy/speed trade-off and settle for a more scalable method with less accurate or less informative output. And finally, (5) if one is actually doing causal discovery on a specific application, one might be willing to wait for a week for the supercomputer to return a good result.
There is another aspect in which the SAT-based approach to causal discovery opens new doors: Previous methods have focused on the identification of the causal structure or some general representation of the equivalence class of causal structures. SAT-based methods do not output the equivalence class of causal structures explicitly, but rather represent it implicitly in terms of the constraints in the solver. So instead of requesting as output a “best” causal structure or an equivalence class, one can also query specific aspects of the underlying causal system. This is particularly useful if one is only interested in a specific pathway or the relations among a subset of variables. In that case, one need not compute the entire equivalence class but can query the solver directly to establish what is determined about the question of interest. Magliacane et al. [24] have taken this approach to only investigate the ancestral relations in a causal system and Hyttinen et al. [17] used a query-based approach to check the conditions for the applications of the rules of the do-calculus [31] when the true graph is unknown.
7 Outlook
This article has highlighted some of the approaches to causal discovery and attempted to fit them together in terms of their motivations and in light of the formal limits to causal discovery that are known. This article is by no means exhaustive and I encourage the reader to pursue other review articles such as Spirtes and Zhang [42] to gain a more complete overview. Moreover, there are many questions concerning comparative efficiency, finite sample performance, robustness, etc. that I have not even touched on. Nevertheless, I hope to have shown that there is a vast array of different methods grounded on a whole set of different assumptions such that the reader may reasonably have some hope to find a method suitable (or adaptable) to their area of application. One almost paradigmatic application of a causal discovery method is illustrated in the article by Stekhoven et al. [43]. It exemplifies how a causal discovery method was applied to observational gene expression data to select candidate causes of the onset of flowering of the plant Arabidopsis thaliana. Once candidate causes had been identified, the researchers actually planted specimens in which the genes that the causal discovery method had determined to be relevant had been knocked out, so that the causal hypothesis was put to the experimental test. I think it is fair to say that the results were positive.

Dynamics and time series Many areas of scientific investigation describe systems in terms of sets of dynamical equations. How can these results be integrated with the methods for causal discovery in time series? (See e.g., [3, 4, 21, 40, 48].)

Variable construction Standard causal discovery methods (such as the ones discussed in this article) take as input a statistical data set of measurements of well-defined causal variables. The goal is to find the causal relations between them. But how are these causal variables identified or constructed in the first place? Often we have sensor-level data but assume that the relevant causal interactions occur at a higher scale of aggregation. Sometimes we only have aggregate measurements of causal interactions at a finer scale. (See e.g., [1, 2, 38].)

Relational data In many cases there can be, in addition to the causal relations, a dependence structure among the causal variables that is not due to the causal relations but to relational features among the causal variables, e.g., whether an actor is in a movie, or which friendship relations are present. In this case, we need methods that can disentangle the dependencies due to the relational structure from the dependencies due to causality, and there may be causal effects from the relations to the individuals and vice versa. (See e.g., [25, 37].)
Footnotes
 1. See a discussion of this example in Scientific American [22].
 2. In a somewhat counterintuitive usage of terms, a vertex is also its own ancestor and its own descendant, but not its own parent or child.
 3. In order to separate out limitations and sources of error in the overall inference it can be helpful to make the following three-way distinction: Statistical inference concerns the inference from data to the generating distribution or properties of the generating distribution, such as parameter values or (in)dependence relations. Causal discovery concerns the inference of identifying as much as possible about the causal structure given the statistical quantities, such as a probability distribution or its features. Causal inference concerns the determination of quantitative causal effects given the causal structure and associated statistical quantities. Of course, these three inference steps are not always completely separable and there are plenty of interesting approaches that combine them.
 4. This example is taken from [12].
 5. Especially with regard to the assumption of acyclicity it is worth noting that very subtle issues arise both about what exactly we mean when we allow for causal cycles, and how one may infer something about a system in which there are such feedback loops. The interested reader is encouraged to pursue the references on cyclic models mentioned below.
 6. An explicit statement of the condition is omitted here as it requires a fair bit of notation and no further insight is gained by just stating it. The intrigued reader should refer to the original paper, which is a worthwhile read in any case.
Notes
Acknowledgements
I am very grateful to the organizers of the 2016 KDD Causal Discovery Workshop for encouraging me to put together and write up this overview. I am also very grateful to two anonymous reviewers who made several suggestions to improve the presentation and who alerted me to additional important papers that I was not aware of before.
References
 1. Chalupka, K., Perona, P., Eberhardt, F.: Visual causal feature learning. In: Proceedings of UAI (2015)
 2. Chalupka, K., Perona, P., Eberhardt, F.: Multi-level cause-effect systems. In: Proceedings of AISTATS (2016)
 3. Dash, D.: Restructuring dynamic causal systems in equilibrium. In: Proceedings of AISTATS (2005)
 4. Dash, D., Druzdzel, M.: Caveats for causal reasoning with equilibrium models. In: European Conference on Symbolic and Quantitative Approaches to Reasoning and Uncertainty, pp. 192–203. Springer, Berlin (2001)
 5. Eberhardt, F., Scheines, R.: Interventions and causal inference. Philos. Sci. 74(5), 981–995 (2007)
 6. Fisher, R.: The design of experiments. Hafner (1935)
 7. Frydenberg, M.: The chain graph Markov property. Scand. J. Stat. 17, 333–353 (1990)
 8. Geiger, D., Pearl, J.: On the logic of causal models. In: Proceedings of UAI (1988)
 9. Gillispie, S., Perlman, M.: The size distribution for Markov equivalence classes of acyclic digraph models. Artif. Intell. 141(1), 137–155 (2002)
 10. Glymour, C.: Markov properties and quantum experiments. In: Demopoulos, W., Pitowsky, I. (eds.) Physical Theory and Its Interpretation: Essays in Honor of Jeffrey Bub. Springer, Berlin (2006)
 11. He, Y., Jia, J., Yu, B.: Counting and exploring sizes of Markov equivalence classes of directed acyclic graphs. J. Mach. Learn. Res. 16, 2589–2609 (2015)
 12. Hitchcock, C.: Causation. In: Psillos, S., Curd, M. (eds.) The Routledge Companion to Philosophy of Science. Routledge, London (2008)
 13. Hitchcock, C.: Probabilistic causality. In: Stanford Encyclopedia of Philosophy. The Metaphysics Research Lab, Stanford University (2010). https://plato.stanford.edu/cite.html
 14. Hoyer, P., Janzing, D., Mooij, J., Peters, J., Schölkopf, B.: Nonlinear causal discovery with additive noise models. In: Koller, D., Schuurmans, D., Bengio, Y., Bottou, L. (eds.) Advances in Neural Information Processing Systems, vol. 21, pp. 689–696. Curran Associates Inc. (2008)
 15. Hoyer, P., Shimizu, S., Kerminen, A., Palviainen, M.: Estimation of causal effects using linear non-Gaussian causal models with hidden variables. Int. J. Approx. Reason. 49, 362–378 (2008)
 16. Hyttinen, A., Eberhardt, F., Järvisalo, M.: Constraint-based causal discovery: conflict resolution with answer set programming. In: Proceedings of UAI (2014)
 17. Hyttinen, A., Eberhardt, F., Järvisalo, M.: Do-calculus when the true graph is unknown. In: Proceedings of UAI (2015)
 18. Hyttinen, A., Hoyer, P., Eberhardt, F., Järvisalo, M.: Discovering cyclic causal models with latent variables: a general SAT-based procedure. In: Proceedings of UAI, pp. 301–310. AUAI Press (2013)
 19. Hyttinen, A., Plis, S., Järvisalo, M., Eberhardt, F., Danks, D.: Causal discovery from subsampled time series data by constraint optimization. In: Proceedings of PGM (2016)
 20. Hyvärinen, A., Karhunen, J., Oja, E.: Independent Component Analysis, vol. 46. Wiley, London (2004)
 21. Jantzen, B.: Projection, symmetry, and natural kinds. Synthese 192(11), 3617–3646 (2015)
 22. Klatsky, A.: Drink to your health? Scientific American 288(2), 75–81 (2003)
 23. Lacerda, G., Spirtes, P., Ramsey, J., Hoyer, P.O.: Discovering cyclic causal models by independent components analysis. In: Proceedings of UAI, pp. 366–374 (2008)
 24. Magliacane, S., Claassen, T., Mooij, J.: Ancestral causal inference. arXiv:1606.07035 (2016)
 25. Maier, M., Marazopoulou, K., Arbour, D., Jensen, D.: A sound and complete algorithm for learning causal models from relational data. In: Proceedings of UAI (2013)
 26. Meek, C.: Strong completeness and faithfulness in Bayesian networks. In: Proceedings of UAI, pp. 411–418. Morgan Kaufmann Publishers Inc. (1995)
 27. Mooij, J., Janzing, D., Peters, J., Schölkopf, B.: Regression by dependence minimization and its application to causal inference in additive noise models. In: Proceedings of ICML, pp. 745–752 (2009)
 28. Nyberg, E., Korb, K.: Informative interventions. In: Williamson, J. (ed.) Causality and Probability in the Sciences. College Publications (2006)
 29. Park, G., Raskutti, G.: Learning large-scale Poisson DAG models based on overdispersion scoring. In: Advances in Neural Information Processing Systems, pp. 631–639 (2015)
 30. Pearl, J.: Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, Los Altos (1988)
 31. Pearl, J.: Causality. Oxford University Press, Oxford (2000)
 32. Pearl, J., Verma, T.: Equivalence and synthesis of causal models. In: Proceedings of Sixth Conference on Uncertainty in Artificial Intelligence, pp. 220–227 (1991)
 33. Peters, J., Janzing, D., Schölkopf, B.: Identifying cause and effect on discrete data using additive noise models. In: Proceedings of AISTATS, pp. 597–604 (2010)
 34. Peters, J., Janzing, D., Schölkopf, B.: Causal inference on discrete data using additive noise models. IEEE Trans. Pattern Anal. Mach. Intell. 33(12), 2436–2450 (2011)
 35. Peters, J., Mooij, J., Janzing, D., Schölkopf, B.: Identifiability of causal graphs using functional models. In: Proceedings of UAI, pp. 589–598. AUAI Press (2011)
 36. Richardson, T.: Feedback models: Interpretation and discovery. Ph.D. thesis, Carnegie Mellon University (1996)
 37. Schulte, O., Khosravi, H., Kirkpatrick, A., Gao, T., Zhu, Y.: Modelling relational statistics with Bayes nets. Mach. Learn. 94(1), 105–125 (2014)
 38. Shalizi, C., Moore, C.: What is a macrostate? Subjective observations and objective dynamics. arXiv:cond-mat/0303625 (2003)
 39. Shimizu, S., Hoyer, P., Hyvärinen, A., Kerminen, A.: A linear non-Gaussian acyclic model for causal discovery. J. Mach. Learn. Res. 7, 2003–2030 (2006)
 40. Sokol, A., Hansen, N.: Causal interpretation of stochastic differential equations. Electron. J. Probab. 19(100), 1–24 (2014)
 41. Spirtes, P., Glymour, C., Scheines, R.: Causation, Prediction and Search, 2nd edn. MIT Press, Cambridge (2000)
 42. Spirtes, P., Zhang, K.: Causal discovery and inference: concepts and recent methodological advances. Appl. Inform. 3, 3 (2016). doi: 10.1186/s40535-016-0018-x
 43. Stekhoven, D., Moraes, I., Sveinbjörnsson, G., Hennig, L., Maathuis, M., Bühlmann, P.: Causal stability ranking. Bioinformatics 28(21), 2819–2823 (2012)
 44. Tillman, R., Eberhardt, F.: Learning causal structure from multiple datasets with similar variable sets. Behaviormetrika 41(1), 41–64 (2014)
 45. Triantafillou, S., Tsamardinos, I.: Constraint-based causal discovery from multiple interventions over overlapping variable sets. J. Mach. Learn. Res. 16, 2147–2205 (2015)
 46. Triantafillou, S., Tsamardinos, I., Tollis, I.G.: Learning causal structure from overlapping variable sets. In: Proceedings of AISTATS, pp. 860–867. JMLR (2010)
 47. Uhler, C., Raskutti, G., Bühlmann, P., Yu, B.: Geometry of the faithfulness assumption in causal inference. Ann. Stat. 41(2), 436–463 (2013)
 48. Voortman, M., Dash, D., Druzdzel, M.: Learning why things change: the difference-based causality learner. arXiv preprint arXiv:1203.3525 (2012)
 49. Zhang, J., Spirtes, P.: The three faces of faithfulness. Synthese 193(4), 1011–1027 (2016)
 50. Zhang, K., Chan, L.W.: Extensions of ICA for causality discovery in the Hong Kong stock market. In: International Conference on Neural Information Processing, pp. 400–409. Springer, Berlin (2006)
 51. Zhang, K., Hyvärinen, A.: On the identifiability of the post-nonlinear causal model. In: Proceedings of UAI, pp. 647–655. AUAI Press (2009)