Introduction

Matching is a widely used non-experimental method of evaluation that can be used to estimate the average effect of a treatment or programme intervention. The method compares the outcomes of programme participants with those of matched non-participants, where matches are chosen on the basis of similarity in observed characteristics. One of the main advantages of matching estimators is that they typically do not require specifying the functional form of the outcome equation and are therefore not susceptible to misspecification bias along that dimension. Traditional matching estimators pair each programme participant with a single matched non-participant (see, for example, Rosenbaum and Rubin 1983), whereas more recently developed estimators pair programme participants with multiple non-participants and use weighted averaging to construct the matched outcomes.

We next define some notation and discuss how matching estimators solve the evaluation problem. Much of the treatment effect literature is built on the potential outcomes framework of Fisher (1935), exposited more recently in Rubin (1974) and Holland (1986). The framework assumes that there are two potential outcomes, denoted (Y0, Y1) that represent the states of being without and with treatment. An individual can be in only one state at a time, so only one of the outcomes is observed. The outcome that is not observed is termed a counterfactual outcome. The treatment impact for an individual is

$$ \Delta ={Y}_1-{Y}_0, $$

which is not directly observable. Assessing the impact of a programme intervention requires making an inference about what outcomes would have been observed in the no-programme state. Let D = 1 for persons who participate in the programme and D = 0 for persons who do not. The D = 1 sample often represents a select group of persons who were deemed eligible for a programme, applied to it, got accepted into it and decided to participate in it. The outcome that is observed is Y = DY1 + (1 − D)Y0.

Before considering different parameters of interest and their estimation, we first consider what is available directly from the data. The conditional distributions F(Y1| X, D = 1) and F(Y0| X, D = 0) can be recovered from the observations on Y1 and Y0, but not the joint distributions F(Y0, Y1| X, D = 1), F(Y0, Y1| X) or the impact distribution, F(Δ| X, D = 1). Because of this missing data problem, researchers often aim instead to recover some features of the impact distribution, such as its mean. The parameter that is most commonly the focus of evaluation studies is the mean impact of treatment on the treated, TT = E(Y1 − Y0| D = 1), which gives the benefit of the programme to programme participants. (If the outcome were earnings and the TT parameter exceeded the average cost of the programme, then the programme might be considered to at least cover its costs.)
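The missing data problem can be made concrete with a small numerical sketch. The four records below are hypothetical and list both potential outcomes for each person, which real data never do; the names `people`, `observed_outcome`, `tt` and `naive` are illustrative only.

```python
# Hypothetical toy data: each tuple is (Y0, Y1, D). In real data only one
# potential outcome per person is observed; both are listed for illustration.
people = [
    (10.0, 14.0, 1),
    (12.0, 15.0, 1),
    (8.0, 9.0, 0),
    (6.0, 8.0, 0),
]


def observed_outcome(y0, y1, d):
    """Y = D*Y1 + (1 - D)*Y0: the outcome the analyst actually sees."""
    return d * y1 + (1 - d) * y0


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# TT averages the (normally unobservable) individual impacts of participants.
tt = mean(y1 - y0 for y0, y1, d in people if d == 1)

# A naive contrast uses only the observed outcomes of the two groups.
naive = (mean(observed_outcome(*p) for p in people if p[2] == 1)
         - mean(observed_outcome(*p) for p in people if p[2] == 0))

# The gap between the two is E(Y0 | D=1) - E(Y0 | D=0), the selection bias.
selection_bias = naive - tt
```

In this example the naive contrast overstates TT because participants would have had higher outcomes than non-participants even without the programme.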

Matching estimators typically assume that there exists a set of observed characteristics Z such that outcomes are independent of programme participation conditional on Z. That is, it is assumed that the outcomes (Y0,Y1) are independent of participation status D conditional on Z,

$$ \left({Y}_0,{Y}_1\right)\perp \perp D\mid Z. $$
(1)

The independence condition can be equivalently represented as Pr(D = 1| Y0, Y1, Z) = Pr(D = 1| Z), or E(D| Y0, Y1, Z) = E(D| Z). In the terminology of Rosenbaum and Rubin (1983), treatment assignment is ‘strongly ignorable’ given Z. It is also assumed that for all Z there is a positive probability of either participating (D = 1) or not participating (D = 0) in the programme: that is,

$$ 0<\Pr \left(D=1|Z\right)<1. $$
(2)

This assumption is required so that matches for D = 0 and D = 1 observations can be found. If assumptions (1) and (2) are satisfied, then the problem of determining mean programme impacts can be solved by substituting the Y0 distribution observed for non-participants matched on Z for the missing participant Y0 distribution.

The above assumptions are overly strong if the parameter of interest is the mean impact of treatment on the treated (TT), in which case a weaker conditional mean independence assumption on Y0 suffices (see Heckman et al. 1998a, b):

$$ E\left({Y}_0|Z,D=1\right)=E\left({Y}_0|Z,D=0\right)=E\left({Y}_0|Z\right). $$
(3)

Furthermore, when TT is the parameter of interest, the condition 0 < Pr (D = 1| Z) is also not required, because that condition is only needed to guarantee a participant analogue for each non-participant. The TT parameter requires only

$$ \Pr \left(D=1|Z\right)<1. $$
(4)

Under these assumptions, the mean impact of the programme on programme participants can be written as

$$ {\displaystyle \begin{array}{l}{\Delta}_{TT}=E\left({Y}_1-{Y}_0|D=1\right)\\ {}\ \ \ \ \ \ \ =E\left({Y}_1|D=1\right)-{E}_{Z\mid D=1}\left\{{E}_Y\left({Y}_0|D=1,Z\right)\right\}\hfill \\ {}\ \ \ \ \ \ \ =E\left({Y}_1|D=1\right)-{E}_{Z\mid D=1}\left\{{E}_Y\left(Y|D=0,Z\right)\right\},\hfill \end{array}} $$

where the second term can be estimated from the mean outcomes of the comparison group matched on Z. (The notation EZ|D = 1 denotes that the expectation is taken with respect to the f(Z| D = 1) density.)
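When Z is discrete, the expression above can be computed by simple cell matching. The following is a minimal sketch under that assumption; the function name `tt_exact_match` and the toy records are hypothetical.

```python
from collections import defaultdict


def tt_exact_match(records):
    """Cell-matching estimate of TT; records is a list of (z, d, y)."""
    treated = defaultdict(list)
    control = defaultdict(list)
    for z, d, y in records:
        (treated if d == 1 else control)[z].append(y)

    total, n1 = 0.0, 0
    for z, ys in treated.items():
        if z not in control:
            # A participant cell with no non-participants has Pr(D=1|Z)=1.
            raise ValueError(f"no comparison observations with Z={z!r}")
        y0_hat = sum(control[z]) / len(control[z])  # estimates E(Y | D=0, Z=z)
        total += sum(y - y0_hat for y in ys)
        n1 += len(ys)
    # Averaging over participants integrates over the f(Z | D=1) density.
    return total / n1


# Hypothetical data: cell "c" holds only non-participants, which is harmless
# for TT because no participant needs a match from it.
records = [
    ("a", 1, 10.0), ("a", 1, 12.0), ("b", 1, 20.0),
    ("a", 0, 8.0), ("a", 0, 6.0), ("b", 0, 15.0), ("c", 0, 100.0),
]
estimate = tt_exact_match(records)
```

The sketch raises an error only when a participant has no comparison observation in the same cell, which mirrors the one-sided support requirement for TT.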

Assumption (3) implies that D does not help predict values of Y0 conditional on Z, which rules out selection into the programme directly on values of Y0. However, there is no similar restriction imposed on Y1, so the method does allow individuals who expect to experience higher levels of Y1 to select into the programme on the basis of that information. For estimating the TT parameter, matching methods allow selection into treatment to be based on possibly unobserved components of the anticipated programme impact, but only in so far as the programme participation decisions are based on the unobservable determinants of Y1 and not those of Y0.

The matching method also requires that the distribution of the matching variables, Z, not be affected by whether the treatment is received. For example, age, gender, and race would generally be valid matching variables, but marital status may not be if it were potentially affected by receipt of the programme. To see why this assumption is necessary, consider the term

$$ {E}_{Z\mid D=1}\left\{{E}_Y\left(Y|D=0,Z\right)\right\}={\int}_{z\in Z}{\int}_{y\in Y}y\;f\left(y|D=0,z\right)f\left(z|D=1\right)\,\mathrm{d}y\,\mathrm{d}z. $$

It uses the f(z| D = 1) conditional density to represent the density that would also have been observed in the no treatment (D = 0) state, which rules out the possibility that receipt of treatment changes the density of Z. Variables that are likely to be affected by the treatment or programme intervention cannot be used in the set of matching variables.

With non-experimental data, there may or may not exist a set of observed conditioning variables for which (1) and (2), or (3) and (4), hold. A finding of Heckman et al. (1997) and Heckman et al. (1996, 1998a, b) in their application of matching methods to data from the Job Training Partnership Act (JTPA) programme is that (2) and (4) were not satisfied, because no match could be found for a fraction of the participants. If there are regions where the support of Z does not overlap for the D = 1 and D = 0 groups, then matching is justified only when performed over the region of common support. The estimated treatment effect must then be defined conditionally on the region of overlap. Some methods for empirically determining the overlap region are described below.

Matching estimators can be difficult to implement when the set of conditioning variables Z is large. If Z are discrete, small-cell problems may arise. If Z are continuous and the conditional mean E(Y0| D = 0, Z) is estimated nonparametrically, then convergence rates will be slow due to the so-called curse of dimensionality problem. Rosenbaum and Rubin (1983) provide a theorem that can be used to address this dimensionality problem. They show that for random variables Y and Z and a discrete random variable D

$$ {\displaystyle \begin{array}{ll}\hfill & E\left(D|Y,\Pr \left(D=1|Z\right)\right)\\ {}& =E\left(E\left(D|Y,Z\right)|Y,\Pr \left(D=1|Z\right)\right),\hfill \end{array}} $$

so that

$$ {\displaystyle \begin{array}{l}E\left(D|Y,Z\right)=E\left(D|Z\right)\\ {}\Rightarrow E\left(D|Y,\Pr \left(D=1|Z\right)\right)=E\left(D|\Pr \left(D=1|Z\right)\right).\hfill \end{array}} $$

This result implies that, when Y0 outcomes are independent of programme participation conditional on Z, they are also independent of participation conditional on the probability of participation, P(Z) = Pr (D = 1| Z). That is, when matching on Z is valid, matching on the summary statistic Pr(D = 1| Z) (the propensity score) is also valid. Provided that P(Z) can be estimated parametrically (or semiparametrically at a rate faster than the nonparametric rate), matching on the propensity score reduces the dimensionality of the matching problem to that of a univariate problem. For this reason, much of the literature on matching focuses on propensity score matching methods. (Heckman et al. 1998a, b, and Hahn 1998, consider whether it is better in terms of efficiency to match on P(Z) or on Z directly.)

With the use of the Rosenbaum and Rubin (1983) theorem, the matching procedure can be broken down into two stages. In the first stage, the propensity score Pr(D = 1| Z) is estimated, using a binary discrete choice model. (Options for the first-stage estimation include, for example, a parametric logit or probit model or a semiparametric estimator, such as the semiparametric least squares estimator of Ichimura (1993), the maximum score estimator of Manski (1973), the smoothed maximum score estimator of Horowitz (1992), or the semiparametric maximum likelihood estimator of Klein and Spady (1993). If P(Z) were estimated using a fully nonparametric method, then the curse of dimensionality problem would reappear.) In the second stage, individuals are matched on the basis of their predicted probabilities of participation.
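As an illustration of the first stage, the sketch below fits a parametric logit by Newton-Raphson for a single scalar conditioning variable. It is one possible implementation under simplifying assumptions (an intercept plus one covariate, a fixed number of iterations, no convergence check); the names `fit_logit` and `propensity` and the data are hypothetical.

```python
import math


def propensity(z, b0, b1):
    """Logit model for Pr(D=1 | z) with intercept b0 and slope b1."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * z)))


def fit_logit(z, d, iterations=25):
    """Maximum likelihood logit via Newton-Raphson (scalar covariate z)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        p = [propensity(zi, b0, b1) for zi in z]
        # Score (gradient of the log-likelihood).
        g0 = sum(di - pi for di, pi in zip(d, p))
        g1 = sum(zi * (di - pi) for zi, di, pi in zip(z, d, p))
        # Information matrix entries, with weights p(1 - p).
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * zi for wi, zi in zip(w, z))
        h11 = sum(wi * zi * zi for wi, zi in zip(w, z))
        det = h00 * h11 - h01 * h01
        # Newton step: beta += H^{-1} g, using the 2x2 inverse.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Hypothetical sample: participation rate 1/4 when z = 0 and 3/4 when z = 1.
z = [0, 0, 0, 0, 1, 1, 1, 1]
d = [0, 0, 0, 1, 0, 1, 1, 1]
b0, b1 = fit_logit(z, d)
scores = [propensity(zi, b0, b1) for zi in z]  # inputs to the second stage
```

The fitted values in `scores` are the predicted participation probabilities on which the second-stage matching would be carried out.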

We next describe a simple model of the programme participation decision to illustrate the kinds of assumptions needed to justify matching. (This model is similar to an example given in Heckman et al. 1999.) Assume that an individual chooses whether to apply to a training programme on the basis of the expected benefits. He or she compares the expected earnings streams with and without participating, taking into account opportunity costs and net of some random training cost ε, which may include a psychic component expressed in monetary terms. The participation decision is made at time t = 0 and the training programme lasts for periods 1 through τ, during which time earnings are zero. The information set used to determine expected earnings is denoted by W, which might include, for example, earnings and employment history. The participation model is

$$ D=1\ \ \mathrm{if}\ E\left(\underset{j=\tau }{\overset{T}{\Sigma}}\frac{Y_{1j}}{{\left(1+r\right)}^j}-\underset{k=1}{\overset{T}{\Sigma}}\frac{Y_{0k}}{{\left(1+r\right)}^k}|W\right)>\varepsilon +{Y}_{00},\ \ \ \mathrm{else}\ D=0. $$

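The terms on the right-hand side of the inequality are assumed to be known to the individual but not to the econometrician. Let η(W) denote the expected net earnings gain on the left-hand side of the inequality, so that D = 1 when ε + Y00 < η(W).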

$$ {\displaystyle \begin{array}{l}\mathrm{If}\;f\left({Y}_{0k}|\varepsilon +{Y}_{00},X\right)=f\left({Y}_{0k}|X\right),\hfill \\ {}\begin{array}{ll}\hfill & \mathrm{then}E\left({Y}_{0k}|X,D=1\right)\\ {}& =E\left({Y}_{0k}|X,\varepsilon +{Y}_{00}<\eta (W)\right)\hfill \\ {}& =E\left({Y}_{0k}|X\right),\hfill \end{array}\end{array}} $$

which would justify application of a matching estimator. This assumption places restrictions on the correlation structure of the earnings residuals. For example, the assumption would not be plausible if X = W and Y00 = Y0k, because knowing that a person selected into the programme (D = 1) would likely be informative about subsequent earnings. We could assume, however, a model for earnings

$$ {Y}_{0k}=\varphi (X)+{\nu}_{0k}, $$

for example one in which ν0k follows an MA(q) process with q < k, which would imply that Y0k and Y00 are uncorrelated conditional on X. The matching method does not require that everything in the information set be known, but it does assume sufficient information to make the selection on observables assumption plausible.

Cross-Sectional Matching Methods

For notational simplicity, let P = P(Z). A prototypical propensity score matching estimator takes the form

$$ {\displaystyle \begin{array}{r}{\widehat{\alpha}}_M=\frac{1}{n_1}\underset{i\in {I}_1\cap {S}_P}{\Sigma}\left[{Y}_{1i}-\widehat{E}\left({Y}_{0i}|D=1,{P}_i\right)\right]\\ {}\\ {}\widehat{E}\left({Y}_{0i}|D=1,{P}_i\right)=\underset{j\in {I}_0}{\Sigma}W\left(i,j\right){Y}_{0j},\end{array}} $$
(5)

where I1 denotes the set of programme participants, I0 the set of non-participants, and SP the region of common support (see below for ways of constructing this set); n1 is the number of persons in the set I1 ∩ SP. The match for each participant i ∈ I1 ∩ SP is constructed as a weighted average over the outcomes of non-participants, where the weights W(i, j) depend on the distance between Pi and Pj. Define a neighbourhood C(Pi) for each i in the participant sample. Neighbours for i are non-participants j ∈ I0 for whom Pj ∈ C(Pi). The persons matched to i are those people in set Ai where Ai = {j ∈ I0 | Pj ∈ C(Pi)}. We describe a number of alternative matching estimators below that differ in how the neighbourhood is defined and in how the weights W(i, j) are constructed.

Alternative Ways of Constructing Matched Outcomes

Nearest-Neighbour Matching

Traditional, pairwise matching, also called nearest-neighbour matching, sets:

$$ C\left({P}_i\right)=\underset{j}{\min}\left\Vert {P}_i-{P}_j\right\Vert, j\in {I}_0. $$

That is, the non-participant with the value of Pj that is closest to Pi is selected as the match and Ai is a singleton set. The estimator can be implemented either with or without replacement. When matching is performed with replacement, the same comparison group observation can be used repeatedly as a match. A drawback of matching without replacement is that the final estimate will usually depend on the initial ordering of the treated observations for which the matches were selected.

Caliper matching (Cochran and Rubin 1973) is a variation of nearest neighbour matching that attempts to avoid ‘bad’ matches (those for which Pj is far from Pi) by imposing a tolerance on the maximum distance ‖Pi − Pj‖ allowed. That is, a match for person i is selected only if ‖Pi − Pj‖ < ε, j ∈ I0, where ε is a pre-specified tolerance. Treated persons for whom no matches can be found within the caliper are excluded from the analysis, which is one way of imposing a common support condition. A drawback of caliper matching is that it is difficult to know a priori what choice for the tolerance level is reasonable.
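Nearest-neighbour matching with replacement, with an optional caliper, can be sketched as follows. The function name `nn_match_tt`, the tuple layout and the toy propensity scores are assumptions made for illustration.

```python
def nn_match_tt(treated, controls, caliper=None):
    """Single nearest-neighbour matching with replacement.

    treated, controls: lists of (propensity_score, outcome).
    Returns (estimate, number_of_participants_matched).
    """
    diffs = []
    for p_i, y_i in treated:
        # The comparison observation closest in propensity score.
        p_j, y_j = min(controls, key=lambda c: abs(c[0] - p_i))
        if caliper is not None and abs(p_j - p_i) >= caliper:
            continue  # no match within the tolerance: drop this participant
        diffs.append(y_i - y_j)
    if not diffs:
        raise ValueError("no participant could be matched")
    return sum(diffs) / len(diffs), len(diffs)


# Hypothetical scores and outcomes.
treated = [(0.30, 10.0), (0.60, 12.0), (0.90, 20.0)]
controls = [(0.25, 7.0), (0.55, 9.0), (0.58, 10.0)]

without_caliper = nn_match_tt(treated, controls)
with_caliper = nn_match_tt(treated, controls, caliper=0.1)
```

In this example the participant with score 0.90 has no close comparison observation, so imposing the caliper removes a poor match and changes the estimate, which then refers only to the matched participants.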

Stratification or Interval Matching

In this variant of matching, the common support of P is partitioned into a set of intervals, and average treatment impacts are calculated by simple averaging within each interval. A weighted average of the interval impact estimates, using the fraction of the D = 1 population in each interval for the weights, provides an overall average impact estimate. Implementing this method requires a decision on how wide the intervals should be. Dehejia and Wahba (1999) implement interval matching using intervals that are selected such that the mean values of the estimated Pi and Pj are not statistically different from each other within intervals.
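A minimal sketch of interval matching follows, assuming the interval boundaries are supplied by the user (the data-driven interval choice of Dehejia and Wahba is not implemented). The name `stratified_tt` and the data are hypothetical.

```python
def stratified_tt(data, edges):
    """Interval-matching estimate of TT.

    data:  list of (propensity_score, d, outcome)
    edges: increasing interval boundaries; intervals are [lo, hi)
    """
    weighted_sum, n_treated_used = 0.0, 0
    for lo, hi in zip(edges[:-1], edges[1:]):
        y1 = [y for p, d, y in data if lo <= p < hi and d == 1]
        y0 = [y for p, d, y in data if lo <= p < hi and d == 0]
        if not y1 or not y0:
            continue  # interval lies outside the common support
        impact = sum(y1) / len(y1) - sum(y0) / len(y0)
        # Weight each interval by its number of participants.
        weighted_sum += len(y1) * impact
        n_treated_used += len(y1)
    return weighted_sum / n_treated_used


# Hypothetical data with two intervals, [0, 0.5) and [0.5, 1).
data = [
    (0.1, 1, 5.0), (0.2, 0, 3.0), (0.3, 0, 5.0),
    (0.6, 1, 10.0), (0.7, 1, 12.0), (0.8, 0, 8.0),
]
estimate = stratified_tt(data, [0.0, 0.5, 1.0])
```

Here the interval impacts are 1 and 3, with one and two participants respectively, so the overall estimate is their participant-weighted average.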

Kernel and Local Linear Matching

More recently developed matching estimators construct a match for each programme participant using a weighted average over multiple persons in the comparison group. Consider, for example, the nonparametric kernel matching estimator, given by

$$ {\widehat{\alpha}}_{KM}=\frac{1}{n_1}\underset{i\in {I}_1}{\Sigma}\left\{{Y}_{1i}-\frac{\sum_{j\in {I}_0}{Y}_{0j}G\left(\frac{P_j-{P}_i}{a_n}\right)}{\sum_{k\in {I}_0}G\left(\frac{P_k-{P}_i}{a_n}\right)}\right\} $$

where G( ) is a kernel function and an is a bandwidth parameter. (See Heckman et al. 1997a, b, 1998a, b and Heckman et al., 1998a, b.) In terms of Eq. (5), the weighting function, W(i,j), is equal to

$$ \frac{G\left(\frac{P_j-{P}_i}{a_n}\right)}{\sum_{k\in {I}_0}G\left(\frac{P_k-{P}_i}{a_n}\right)}. $$

For a kernel function whose support is bounded between −1 and 1 (so that G(s) = 0 for |s| > 1), the neighbourhood is

$$ C\left({P}_i\right)=\left\{\left|\frac{P_i-{P}_j}{a_n}\right|\le 1\right\},j\in {I}_0. $$

Under standard conditions on the bandwidth and kernel,

$$ \frac{\sum_{j\in {I}_0}{Y}_{0j}G\left(\frac{P_j-{P}_i}{a_n}\right)}{\sum_{k\in {I}_0}G\left(\frac{P_k-{P}_i}{a_n}\right)} $$

is a consistent estimator of E(Y0| D = 1, Pi). (Specifically, we require that G(·) integrates to one and has mean zero, and that \( {a}_n\to 0 \) and \( n{a}_n\to \infty \) as n → ∞. One example of a kernel function is the quartic kernel, given by \( G(s)=\frac{15}{16}{\left({s}^2-1\right)}^2 \) if |s| < 1, G(s) = 0 otherwise.)
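The kernel matching estimator with the quartic kernel can be sketched as below. The bandwidth is taken as given rather than chosen by any data-driven rule, and the names `quartic` and `kernel_match_tt` and the toy data are hypothetical.

```python
def quartic(s):
    """Quartic kernel: (15/16)(s^2 - 1)^2 on |s| < 1, zero elsewhere."""
    return (15.0 / 16.0) * (s * s - 1.0) ** 2 if abs(s) < 1.0 else 0.0


def kernel_match_tt(treated, controls, bandwidth):
    """Kernel matching estimate of TT; inputs are lists of (score, outcome)."""
    diffs = []
    for p_i, y_i in treated:
        weights = [quartic((p_j - p_i) / bandwidth) for p_j, _ in controls]
        total = sum(weights)
        if total == 0.0:
            continue  # no comparison observation within one bandwidth
        # Kernel-weighted average of comparison outcomes near p_i.
        y0_hat = sum(w * y_j for w, (_, y_j) in zip(weights, controls)) / total
        diffs.append(y_i - y0_hat)
    return sum(diffs) / len(diffs)


# Hypothetical data: the control at 0.95 lies outside the neighbourhood.
treated = [(0.5, 10.0)]
controls = [(0.4, 6.0), (0.6, 8.0), (0.95, 100.0)]
estimate = kernel_match_tt(treated, controls, bandwidth=0.2)
```

The two controls at equal distance from the participant receive equal weight, while the distant control receives zero weight because the quartic kernel has bounded support.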

Heckman et al. (1997) also propose a generalized version of kernel matching, called local linear matching. Writing Gij = G((Pj − Pi)/an), the local linear weighting function is given by

$$ \ \ \ W\left(i,j\right)=\frac{G_{ij}{\sum}_{k\in {I}_0}{G}_{ik}{\left({P}_k-{P}_i\right)}^2-\left[{G}_{ij}\left({P}_j-{P}_i\right)\right]\left[{\sum}_{k\in {I}_0}{G}_{ik}\left({P}_k-{P}_i\right)\right]}{\sum_{j\in {I}_0}{G}_{ij}{\sum}_{k\in {I}_0}{G}_{ik}{\left({P}_k-{P}_i\right)}^2-{\left({\sum}_{k\in {I}_0}{G}_{ik}\left({P}_k-{P}_i\right)\right)}^2}. $$
(6)

As demonstrated in research by Fan (1992a, b), local linear estimation has some advantages over standard kernel estimation. These advantages include a faster rate of convergence near boundary points and greater robustness to different data design densities (see Fan 1992a, b). Thus, local linear regression would be expected to perform better than kernel estimation in cases where the non-participant observations on P fall on one side of the participant observations.
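The boundary behaviour can be illustrated numerically. The sketch below computes local linear weights of the form in Eq. (6) for a participant whose comparison observations all lie on one side, and compares the result with a kernel-weighted average. The quartic kernel, the bandwidth and the exactly linear outcome function are assumptions chosen for the example.

```python
def quartic(s):
    """Quartic kernel with support on |s| < 1."""
    return (15.0 / 16.0) * (s * s - 1.0) ** 2 if abs(s) < 1.0 else 0.0


def local_linear_weights(p_i, p_controls, bandwidth):
    """Local linear weights W(i, j) for one participant score p_i."""
    g = [quartic((p - p_i) / bandwidth) for p in p_controls]
    dx = [p - p_i for p in p_controls]
    s0 = sum(g)
    s1 = sum(gk * dk for gk, dk in zip(g, dx))
    s2 = sum(gk * dk * dk for gk, dk in zip(g, dx))
    denom = s0 * s2 - s1 * s1
    if denom == 0.0:
        raise ValueError("need at least two distinct scores in the window")
    return [(gj * s2 - gj * dj * s1) / denom for gj, dj in zip(g, dx)]


# All comparison scores lie above the participant's score of 0.5.
p_controls = [0.6, 0.7, 0.8]
y_controls = [2.0 + 10.0 * p for p in p_controls]  # hypothetical linear E(Y0|P)

w_ll = local_linear_weights(0.5, p_controls, bandwidth=0.5)
ll_fit = sum(w * y for w, y in zip(w_ll, y_controls))

g = [quartic((p - 0.5) / 0.5) for p in p_controls]
kernel_fit = sum(gj * y for gj, y in zip(g, y_controls)) / sum(g)
```

Because the local linear weights sum to one and annihilate the linear term, `ll_fit` recovers the value of the linear function at 0.5, whereas `kernel_fit` is an average of outcomes that all lie above it.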

To implement the matching estimator given by Eq. (5), the region of common support SP needs to be determined. The common support region can be estimated by

$$ {\widehat{S}}_P=\left\{P:\widehat{f}\left(P|D=1\right)>0\ \mathrm{and}\;\widehat{f}\left(P|D=0\right)>0\right\}, $$

where \( \widehat{f}\left(P|D=d\right),d\in \left\{0,1\right\} \) are standard nonparametric density estimators. To ensure that the densities are bounded away from zero rather than merely positive, they are required to exceed zero by a threshold amount, determined using a ‘trimming level’ q. That is, after excluding any P points for which the estimated density is zero, an additional small percentage of the remaining P points is excluded for which the estimated density is positive but very low. The set of eligible matches is thus given by

$$ \ \ \ {\widehat{S}}_q=\left\{P\in {\widehat{S}}_P:\widehat{f}\left(P|D=1\right)>{c}_q\;\mathrm{and}\;\widehat{f}\left(P|D=0\right)>{c}_q\right\}, $$

where cq is the density cut-off level that satisfies

$$ \underset{c_q}{\sup}\frac{1}{2J}\underset{\left\{i\in {I}_1\cap {\widehat{S}}_P\right\}}{\Sigma}\left\{1\left(\widehat{f}\left({P}_i|D=1\right)<{c}_q\right)+1\left(\widehat{f}\left({P}_i|D=0\right)<{c}_q\right)\right\}\le q. $$

Here, J is the cardinality of the set of observed values of P that lie in \( {I}_1\cap {\widehat{S}}_P \). That is, matches are constructed only for the programme participants for which the propensity scores lie in \( {\widehat{S}}_q \).
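The trimming rule can be sketched as follows, taking the density estimates at each participant's propensity score as given (the density estimator itself is not shown). This is one literal reading of the formulas above; the function name `trim_common_support` and the numbers are hypothetical.

```python
def trim_common_support(f1, f0, q):
    """Apply the two-step trimming rule at the participants' scores.

    f1[i], f0[i]: estimated densities f(P_i | D=1) and f(P_i | D=0).
    q: trimming level, 0 <= q < 1.
    Returns (indices kept, cut-off c_q).
    """
    if not 0.0 <= q < 1.0:
        raise ValueError("q must lie in [0, 1)")
    # Step 1: keep points where both estimated densities are positive.
    positive = [i for i in range(len(f1)) if f1[i] > 0.0 and f0[i] > 0.0]
    if not positive:
        return [], None
    # Step 2: the largest c_q such that at most a share q of the 2J pooled
    # density values fall strictly below it.
    pooled = sorted([f1[i] for i in positive] + [f0[i] for i in positive])
    c_q = pooled[int(q * len(pooled))]
    kept = [i for i in positive if f1[i] > c_q and f0[i] > c_q]
    return kept, c_q


# Hypothetical density estimates for five participants.
f1 = [0.0, 0.5, 0.8, 1.0, 1.2]
f0 = [0.3, 0.05, 0.9, 0.7, 0.0]
kept, c_q = trim_common_support(f1, f0, q=0.2)
```

Participants 0 and 4 are removed in the first step because one density estimate is zero; participant 1 is removed in the second step because its densities do not exceed the cut-off.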

The above estimators are representative of the matching estimators in common use. They can be easily adapted to estimate other parameters of interest, such as the average effect of treatment on the untreated (UT = E(Y1 − Y0| D = 0, X)), or the average treatment effect (ATE = E(Y1 − Y0| X)), which is just a weighted average of treatment on the treated (TT) and treatment on the untreated (UT).

The recent literature has also developed alternative matching estimators that employ different weighting schemes to increase efficiency. See, for example, Hahn (1998) and Hirano et al. (2003) for estimators that attain the semiparametric efficiency bound. The methods are not described in detail here, because those studies focus on the ATE and not on the average effect of treatment on the treated (TT) parameter. Heckman, Ichimura and Todd (1998) develop a regression-adjusted version of the matching estimator, which replaces Y0j as the dependent variable with the residual from a regression of Y0j on a vector of exogenous covariates. The estimator uses a Robinson (1988) type estimation approach to incorporate exclusion restrictions: that is, that some of the conditioning variables in an equation for the outcomes do not enter into the participation equation or vice versa. In principle, imposing exclusion restrictions can increase efficiency. In practice, though, researchers have not observed much gain from using the regression-adjusted matching estimator. Some alternatives to propensity score matching are discussed in Diamond and Sekhon (2005).

When Does Bias Arise in Matching?

The success of a matching estimator depends on the availability of observable data to construct the conditioning set Z, such that (1) and (2) are satisfied. Suppose only a subset Z0 ⊂ Z of the required variables is observed. The propensity score matching estimator based on Z0 then converges to

$$ \ \ \ {\alpha}_M^{\prime }={E}_{P\left({Z}_0\right)\mid D=1}\left(E\left({Y}_1|P\left({Z}_0\right),D=1\right)-E\left({Y}_0|P\left({Z}_0\right),D=0\right)\right). $$
(7)

The bias for the parameter of interest, E(Y1 − Y0| D = 1), is

$$ {\mathrm{bias}}_M=E\left({Y}_0|D=1\right)-{E}_{P\left({Z}_0\right)\mid D=1}\left\{E\left({Y}_0|P\left({Z}_0\right),D=0\right)\right\}. $$

There is no way of a priori choosing the set of Z variables to satisfy the matching condition or of testing whether a particular set meets the requirements. In rare cases, where data are available on a randomized social experiment, it is sometimes possible to ascertain the bias (see, for example, Heckman et al. 1997a, b; Dehejia and Wahba 1999, 2002; Smith and Todd 2005).

Difference-in-Difference Matching Estimators

The estimators described above assume that, after conditioning on a set of observable characteristics, outcomes are conditionally mean independent of programme participation. However, for a variety of reasons there may be systematic differences between participant and non-participant outcomes, even after conditioning on observables, which could lead to a violation of the identification conditions required for matching. Such differences may arise, for example, because of programme selectivity on unmeasured characteristics or because of levels differences in outcomes that might arise when participants and non-participants reside in different local labour markets or if the survey questionnaires used to gather the data differ in some ways across groups.

A difference-in-differences (DID) matching strategy, as defined in Heckman et al. (1997) and Heckman et al. (1998a, b), allows for temporally invariant differences in outcomes between participants and non-participants. This type of estimator matches on the basis of differences in outcomes using the same weighting functions described above. The propensity score DID matching estimator requires that

$$ {\displaystyle \begin{array}{ll}\hfill & E\left({Y}_{0t}-{Y}_{0{t}^{\prime }}|P,D=1\right)\\ {}& =E\left({Y}_{0t}-{Y}_{0{t}^{\prime }}|P,D=0\right),\hfill \end{array}} $$

where t and t′ are time periods after and before the programme enrolment date. This estimator also requires the support condition given above, which must now hold in both periods t and t′. The local linear difference-in-difference estimator is given by

$$ {\widehat{\alpha}}_{DM}=\frac{1}{n_1}\underset{i\in {I}_1\cap {S}_P}{\Sigma}\left\{\left({Y}_{1 ti}-{Y}_{0{t}^{\prime }i}\right)-\underset{j\in {I}_0\cap {S}_P}{\Sigma}W\left(i,j\right)\left({Y}_{0 tj}-{Y}_{0{t}^{\prime }j}\right)\right\}, $$

where the weights correspond to the local linear weights defined above. If repeated cross-section data are available, instead of longitudinal data, the estimator can be implemented as

$$ {\widehat{\alpha}}_{DM}=\frac{1}{n_{1t}}\underset{i\in {I}_{1t}\cap {S}_P}{\Sigma}\left\{{Y}_{1 ti}-\underset{j\in {I}_{0t}\cap {S}_P}{\Sigma}W\left(i,j\right){Y}_{0 tj}\right\}-\frac{1}{n_{1{t}^{\prime }}}\underset{i\in {I}_{1{t}^{\prime }}\cap {S}_P}{\Sigma}\left\{{Y}_{0{t}^{\prime }i}-\underset{j\in {I}_{0{t}^{\prime }}\cap {S}_P}{\Sigma}W\left(i,j\right){Y}_{0{t}^{\prime }j}\right\}, $$

where \( {I}_{1t},{I}_{1{t}^{\prime }},{I}_{0t},{I}_{0{t}^{\prime }} \) denote the treatment and comparison group data-sets in each time period.
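The longitudinal version of the estimator can be sketched as below. For brevity the sketch substitutes simple kernel weights for the local linear weights, and it assumes all observations lie in the common support; the function names and the data, which build in a fixed level gap between the groups, are hypothetical.

```python
def quartic(s):
    """Quartic kernel with support on |s| < 1."""
    return (15.0 / 16.0) * (s * s - 1.0) ** 2 if abs(s) < 1.0 else 0.0


def kernel_weights(p_i, p_controls, bandwidth):
    """Normalized kernel weights W(i, j) (a stand-in for local linear ones)."""
    g = [quartic((p - p_i) / bandwidth) for p in p_controls]
    total = sum(g)
    if total == 0.0:
        raise ValueError("no comparison observation within one bandwidth")
    return [gj / total for gj in g]


def did_match(treated, controls, bandwidth):
    """DID matching; inputs are lists of (score, y_before, y_after)."""
    p_c = [p for p, _, _ in controls]
    change_c = [after - before for _, before, after in controls]
    diffs = []
    for p_i, before_i, after_i in treated:
        w = kernel_weights(p_i, p_c, bandwidth)
        matched_change = sum(wj * cj for wj, cj in zip(w, change_c))
        diffs.append((after_i - before_i) - matched_change)
    return sum(diffs) / len(diffs)


def cross_section_match(treated, controls, bandwidth):
    """Cross-sectional matching on post-programme outcomes only."""
    p_c = [p for p, _, _ in controls]
    diffs = []
    for p_i, _, after_i in treated:
        w = kernel_weights(p_i, p_c, bandwidth)
        diffs.append(after_i - sum(wj * c[2] for wj, c in zip(w, controls)))
    return sum(diffs) / len(diffs)


# Participants sit 10 units above matched non-participants before the programme.
treated = [(0.4, 20.0, 27.0), (0.6, 22.0, 29.0)]
controls = [(0.4, 10.0, 12.0), (0.6, 12.0, 14.0)]
did = did_match(treated, controls, bandwidth=0.1)
cross = cross_section_match(treated, controls, bandwidth=0.1)
```

In this constructed example the cross-sectional estimate absorbs the pre-existing level gap of 10, while the DID estimate differences it out.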

Finally, the DID matching estimator allows selection into the programme to be based on anticipated gains from the programme in the sense that D can help predict the value of Y1 given P. However, the method assumes that D does not help predict changes \( {Y}_{0t}-{Y}_{0{t}^{\prime }} \) conditional on a set of observables (Z) used in estimating the propensity score. In their analysis of the effectiveness of matching estimators, Smith and Todd (2005) found difference-in-difference matching estimators to perform much better than cross-sectional methods in cases where participants and non-participants were drawn from different regional labour markets and/or were given different survey questionnaires.

Matching When the Data are Choice-Based Sampled

The samples used in evaluating the impacts of programmes are often choice-based, with programme participants oversampled relative to their frequency in the population of persons eligible for the programme. Under choice-based sampling, weights are generally required to consistently estimate the probabilities of programme participation. (See, for example, Manski and Lerman 1977, for discussion of weighting for logistic regressions.) When the weights are unknown, Heckman and Todd (1995) show that with a slight modification matching methods can still be applied, because the odds ratio (P/(1 − P)) estimated using a logistic model with incorrect weights (that is, ignoring the fact that samples are choice-based) is a scalar multiple of the true odds ratio, which is itself a monotonic transformation of the propensity scores. Therefore, matching can proceed on the (misweighted) estimate of the odds ratio (or on the log odds ratio).
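The scalar-multiple property of the odds ratio can be checked numerically with Bayes' rule. The sketch below assumes hypothetical sampling rates for participants and non-participants and computes the participation probability that would be observed in the resulting choice-based sample; it does not reproduce the estimation procedure of Heckman and Todd.

```python
def odds(p):
    """Odds ratio P / (1 - P)."""
    return p / (1.0 - p)


def sampled_propensity(p, rate_participants, rate_nonparticipants):
    """Pr(D=1 | Z) within a choice-based sample, by Bayes' rule.

    p is the population propensity score; the two rates are the (assumed)
    probabilities that a participant or non-participant enters the sample.
    """
    numerator = p * rate_participants
    return numerator / (numerator + (1.0 - p) * rate_nonparticipants)


# Participants are oversampled tenfold relative to non-participants.
true_scores = [0.1, 0.3, 0.7]
sample_scores = [sampled_propensity(p, 0.5, 0.05) for p in true_scores]

# The ratio of sample odds to population odds is the same for every score.
ratios = [odds(ps) / odds(p) for ps, p in zip(sample_scores, true_scores)]
```

Since the sample odds are a constant multiple of the population odds, ranking observations by either gives the same ordering, which is why matching on the misweighted odds (or log odds) remains feasible.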

Using Balancing Tests to Check the Specification of the Propensity Score Model

As described earlier, the propensity score matching estimator requires the outcome variable to be mean independent of the treatment indicator conditional on the propensity score, P(Z). An important consideration in implementation is how to choose Z. Unfortunately, there is no theoretical basis for choosing a particular set Z to satisfy the identifying assumptions, and the set is not necessarily the most inclusive one.

To guide the selection of Z, there is some accumulated empirical evidence on how bias estimates depend on the choice of Z in particular applications. For example, Heckman et al. (1998a, b), Heckman et al. (1997) and Lechner (2001) show that the choice of variables included in Z can make a substantial difference to the estimator’s performance. These papers found that biases tended to be higher when the participation equation was estimated using a cruder set of conditioning variables. One approach adopted is to select the set Z to maximize the percentage of people correctly classified under the model. Another finding in these papers is that the matching estimators performed best when the treatment and control groups were located in the same geographic area and when the same survey instrument was administered to both treatments and controls to ensure comparable measurement of outcomes.

Rosenbaum and Rubin (1983) suggest a method to aid in the specification of the propensity score model. The method does not provide guidance in choosing which variables to include in Z, but can help to determine which interactions and higher-order terms to include in the model for a given Z set. They note that for the true propensity score, the following holds:

$$ Z\perp \perp D\mid \Pr \left(D=1|Z\right), $$

or equivalently E(D| Z, Pr(D = 1| Z)) = E(D| Pr(D = 1| Z)). The basic intuition is that, after conditioning on Pr(D = 1|Z), additional conditioning on Z should not provide new information about D. If after conditioning on the estimated values of Pr(D = 1|Z) there is still dependence on Z, this suggests misspecification in the model used to estimate Pr(D = 1|Z). The theorem holds for any Z, including sets Z that do not satisfy the conditional independence condition required to justify matching. As such, the theorem is not informative about what set of variables to include in Z.

This result motivates a specification test for Pr(D = 1|Z), that is, a test of whether there are differences in Z between the D = 1 and D = 0 groups after conditioning on P(Z). The test has been implemented in the literature in a number of ways (see, for example, Eichler and Lechner 2002; Dehejia and Wahba 1999, 2002; Smith and Todd 2005; Diamond and Sekhon 2005).
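One simple way to look for remaining differences in Z is to compare covariate means between participants and non-participants within intervals of the estimated propensity score. The sketch below reports only the within-interval mean differences and is not any specific test from the studies cited; a formal version would add a test statistic. The function name `balance_by_stratum` and the data are hypothetical.

```python
def balance_by_stratum(data, edges):
    """Within-interval differences in the mean of one covariate.

    data:  list of (estimated_score, d, z)
    edges: increasing interval boundaries; intervals are [lo, hi)
    Returns a list of (lo, hi, mean_z_treated - mean_z_control).
    """
    results = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        z1 = [z for p, d, z in data if lo <= p < hi and d == 1]
        z0 = [z for p, d, z in data if lo <= p < hi and d == 0]
        if z1 and z0:  # skip intervals that lack one of the groups
            results.append((lo, hi, sum(z1) / len(z1) - sum(z0) / len(z0)))
    return results


# Hypothetical data: balanced in the lower interval, unbalanced in the upper.
data = [
    (0.2, 1, 1.0), (0.3, 1, 3.0), (0.1, 0, 2.0), (0.4, 0, 2.0),
    (0.6, 1, 5.0), (0.9, 1, 7.0), (0.7, 0, 4.0),
]
balance = balance_by_stratum(data, [0.0, 0.5, 1.0])
```

A sizeable difference within an interval, as in the upper interval here, would suggest adding interactions or higher-order terms in Z to the propensity score model.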

Assessing the Variability of Matching Estimators

The distribution theory for the cross-sectional and difference-in-difference kernel and local linear matching estimators described above is derived in Heckman et al. (1998). However, implementing the asymptotic standard error formulae can be cumbersome, so standard errors for matching estimators are often instead generated using bootstrap resampling methods. (See Efron and Tibshirani 1993, for an introduction to bootstrap methods, and Horowitz 2003, for a recent survey of bootstrapping in econometrics.) A recent paper by Abadie and Imbens (2006a) shows that standard bootstrap resampling methods are not valid for assessing the variability of nearest neighbour estimators, but can be applied to assess the variability of kernel or local linear matching estimators for a suitably chosen bandwidth. Abadie and Imbens (2006b) present alternative standard error formulae for assessing the variability of nearest neighbour matching estimators.
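A bootstrap standard error for a kernel matching estimator might be sketched as follows. The sketch makes several simplifying assumptions: propensity scores are treated as known rather than re-estimated in each replication, participants and non-participants are resampled separately, and the bandwidth is fixed. All names and data are hypothetical.

```python
import random
import statistics


def quartic(s):
    """Quartic kernel with support on |s| < 1."""
    return (15.0 / 16.0) * (s * s - 1.0) ** 2 if abs(s) < 1.0 else 0.0


def kernel_match_tt(treated, controls, bandwidth):
    """Kernel matching estimate of TT, or None if nobody can be matched."""
    diffs = []
    for p_i, y_i in treated:
        w = [quartic((p_j - p_i) / bandwidth) for p_j, _ in controls]
        total = sum(w)
        if total > 0.0:
            y0_hat = sum(wj * yj for wj, (_, yj) in zip(w, controls)) / total
            diffs.append(y_i - y0_hat)
    return sum(diffs) / len(diffs) if diffs else None


def bootstrap_se(treated, controls, bandwidth, replications=200, seed=0):
    """Standard deviation of the estimator across bootstrap resamples."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(replications):
        # Resample each group with replacement, keeping group sizes fixed.
        t_star = rng.choices(treated, k=len(treated))
        c_star = rng.choices(controls, k=len(controls))
        estimate = kernel_match_tt(t_star, c_star, bandwidth)
        if estimate is not None:
            estimates.append(estimate)
    return statistics.stdev(estimates)


# Hypothetical (score, outcome) pairs.
treated = [(0.3, 10.0), (0.5, 13.0), (0.6, 12.0), (0.7, 16.0)]
controls = [(0.2, 7.0), (0.4, 9.0), (0.5, 8.0), (0.6, 11.0), (0.8, 12.0)]
se = bootstrap_se(treated, controls, bandwidth=1.0)
```

In applied work the first-stage propensity score model would normally be re-estimated within each replication so that its sampling variability is reflected in the standard error.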

Applications

There have been numerous evaluations of matching estimators in recent decades. For a survey of many applications in the context of evaluating the effects of labour market programmes, see Heckman et al. (1999). More recently, propensity score matching estimators have been used in evaluating the impacts of a variety of programme interventions in developing countries. Jalan and Ravallion (1999) assess the impact of a workfare programme in Argentina (the Trabajar programme), and Jalan and Ravallion (2003) study the effects of public investments in piped water on child health outcomes in rural India. Galiani et al. (2005) use difference-in-difference matching methods to analyse the effects of privatization of water services on child mortality in Argentina. Other applications include Gertler et al. (2004) in a study of the effects of parental death on child outcomes, Lavy (2004) in a study of the effects of a teacher incentive programme in Israel on student performance, Angrist and Lavy (2001) in a study of the effects of teacher training on children’s test scores in Israel, and Chen and Ravallion (2003) in a study of a poverty reduction project in China.

Behrman et al. (2004) use a modified version of a propensity score matching estimator to evaluate the effects of a preschool programme in Bolivia on child health and cognitive outcomes. They identify programme effects by comparing children with different lengths of duration in the programme, using matching to control for selectivity into alternative durations. Also, see Imbens (2000) and Hirano and Imbens (2004) for an analysis of the role of the propensity score with continuous treatments. Lechner (2001) extends propensity score analysis to the case of multiple treatments.

See Also