Matching Estimators

Todd, Petra E.

doi:10.1057/978-1-349-95189-5_2104

Reference work entry
First Online: 01 January 2018

Abstract

Matching is a popular method for evaluating the effects of programme or other treatment interventions. This article reviews recent developments in the econometric literature on matching estimators, including the assumptions required to justify their application, different ways of implementing the estimators and some recent empirical applications.

Introduction

Matching is a widely used non-experimental method of evaluation that can be used to estimate the average effect of a treatment or programme intervention. The method compares the outcomes of programme participants with those of matched non-participants, where matches are chosen on the basis of similarity in observed characteristics. One of the main advantages of matching estimators is that they typically do not require specifying the functional form of the outcome equation and are therefore not susceptible to misspecification bias along that dimension. Traditional matching estimators pair each programme participant with a single matched non-participant (see, for example, Rosenbaum and Rubin 1983), whereas more recently developed estimators pair programme participants with multiple non-participants and use weighted averaging to construct the matched outcomes.

We next define some notation and discuss how matching estimators solve the evaluation problem. Much of the treatment effect literature is built on the potential outcomes framework of Fisher (1935), exposited more recently in Rubin (1974) and Holland (1986). The framework assumes that there are two potential outcomes, denoted (Y₀, Y₁) that represent the states of being without and with treatment. An individual can be in only one state at a time, so only one of the outcomes is observed. The outcome that is not observed is termed a counterfactual outcome. The treatment impact for an individual is

$$ \Delta ={Y}_1-{Y}_0, $$

which is not directly observable. Assessing the impact of a programme intervention requires making an inference about what outcomes would have been observed in the no-programme state. Let D = 1 for persons who participate in the programme and D = 0 for persons who do not. The D = 1 sample often represents a select group of persons who were deemed eligible for a programme, applied to it, got accepted into it and decided to participate in it. The outcome that is observed is Y = DY₁ + (1 − D)Y₀.

Before considering different parameters of interest and their estimation, we first consider what is available directly from the data. The conditional distributions F(Y₁| X, D = 1) and F(Y₀| X, D = 0) can be recovered from the observations on Y₁ and Y₀, but not the joint distributions F(Y₀, Y₁| X, D = 1), F(Y₀, Y₁| X) or the impact distribution, F(Δ| X, D = 1). Because of this missing data problem, researchers often aim instead to recover some features of the impact distribution, such as its mean. The parameter that is most commonly the focus of evaluation studies is the mean impact of treatment on the treated, TT = E(Y₁ − Y₀| D = 1), which gives the benefit of the programme to programme participants. (If the outcome were earnings and the TT parameter exceeded the average cost of the programme, then the programme might be considered to at least cover its costs.)
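As a minimal illustration of the missing data problem, the Python sketch below uses a hypothetical toy population in which both potential outcomes are written down, something that is never possible with real data. The numbers and variable names are assumptions made purely for illustration. The sketch builds the observed outcome Y = DY₁ + (1 − D)Y₀ and compares the infeasible TT with a naive difference in observed means.

```python
# Hypothetical toy population: (Y0, Y1, D). Real data never reveal both
# potential outcomes for the same person; this is for illustration only.
people = [
    (10.0, 14.0, 1),
    (12.0, 15.0, 1),
    (8.0, 9.0, 0),
    (6.0, 8.0, 0),
]

# Observed outcome: Y = D*Y1 + (1 - D)*Y0
observed = [d * y1 + (1 - d) * y0 for y0, y1, d in people]


def mean(values):
    return sum(values) / len(values)


# Infeasible benchmark: TT = E(Y1 - Y0 | D = 1)
tt = mean([y1 - y0 for y0, y1, d in people if d == 1])

# Naive contrast of observed means: E(Y | D = 1) - E(Y | D = 0)
naive = (mean([y for y, (_, _, d) in zip(observed, people) if d == 1])
         - mean([y for y, (_, _, d) in zip(observed, people) if d == 0]))

# Selection bias term: E(Y0 | D = 1) - E(Y0 | D = 0)
selection_bias = (mean([y0 for y0, _, d in people if d == 1])
                  - mean([y0 for y0, _, d in people if d == 0]))
```

In this toy population the naive contrast equals the TT plus the selection bias term, which is the term that matching tries to remove by conditioning on observed characteristics.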

Matching estimators typically assume that there exists a set of observed characteristics Z such that the outcomes (Y₀, Y₁) are independent of participation status D conditional on Z,

$$ \left({Y}_0,{Y}_1\right)\perp \perp D\mid Z. $$

(1)

The independence condition can be equivalently represented as Pr(D = 1| Y₀, Y₁, Z) = Pr(D = 1| Z), or E(D| Y₀, Y₁, Z) = E(D| Z). In the terminology of Rosenbaum and Rubin (1983), treatment assignment is ‘strongly ignorable’ given Z. It is also assumed that for all Z there is a positive probability of either participating (D = 1) or not participating (D = 0) in the programme: that is,

$$ 0<\Pr \left(D=1|Z\right)<1. $$

(2)

This assumption is required so that matches for D = 0 and D = 1 observations can be found. If assumptions (1) and (2) are satisfied, then the problem of determining mean programme impacts can be solved by substituting the Y₀ distribution observed for non-participants matched on Z for the missing Y₀ distribution of participants.

The above assumptions are overly strong if the parameter of interest is the mean impact of treatment on the treated (TT), in which case a weaker conditional mean independence assumption on Y₀ suffices (see Heckman et al. 1998a, b):

$$ E\left({Y}_0|Z,D=1\right)=E\left({Y}_0|Z,D=0\right)=E\left({Y}_0|Z\right). $$

(3)

Furthermore, when TT is the parameter of interest, the condition 0 < Pr (D = 1| Z) is also not required, because that condition is only needed to guarantee a participant analogue for each non-participant. The TT parameter requires only

$$ \Pr \left(D=1|Z\right)<1. $$

(4)

Under these assumptions, the mean impact of the programme on programme participants can be written as

$$ \begin{aligned}{\Delta}_{TT}&=E\left({Y}_1-{Y}_0|D=1\right)\\ &=E\left({Y}_1|D=1\right)-{E}_{Z\mid D=1}\left\{{E}_Y\left({Y}_0|D=1,Z\right)\right\}\\ &=E\left({Y}_1|D=1\right)-{E}_{Z\mid D=1}\left\{{E}_Y\left(Y|D=0,Z\right)\right\},\end{aligned} $$

where the second term can be estimated from the mean outcomes of the comparison group matched on Z. (The notation E_{Z|D = 1} denotes that the expectation is taken with respect to the f(Z| D = 1) density.)
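When Z is discrete, the sample analogue of this expression is exact matching within cells. The following is a minimal Python sketch under that assumption; the function name `tt_exact_match` and the data are hypothetical and are not taken from the literature cited here.

```python
from collections import defaultdict


def tt_exact_match(data):
    """Estimate TT by exact matching on a discrete Z.

    data: list of (y, d, z) tuples with observed outcome y, participation
    indicator d (0 or 1) and a hashable cell label z.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for y, d, z in data:
        if d == 0:
            sums[z] += y
            counts[z] += 1
    # Sample analogue of E(Y | D = 0, Z = z) for each cell z
    cell_mean_y0 = {z: sums[z] / counts[z] for z in counts}

    gaps = []
    for y, d, z in data:
        if d == 1:
            if z not in cell_mean_y0:
                # Condition (4) fails in the sample: no comparison unit here
                raise ValueError(f"no non-participants with Z = {z!r}")
            gaps.append(y - cell_mean_y0[z])
    # Averaging over participants integrates over f(Z | D = 1)
    return sum(gaps) / len(gaps)


# Hypothetical data: (outcome, participation, cell)
sample = [
    (10.0, 1, "a"), (12.0, 1, "a"), (20.0, 1, "b"),
    (8.0, 0, "a"), (6.0, 0, "a"),
    (15.0, 0, "b"), (17.0, 0, "b"), (13.0, 0, "b"),
]
estimate = tt_exact_match(sample)
```

Because the average is taken over participants, each comparison cell mean is implicitly weighted by the share of participants in that cell, which mirrors the E_{Z|D = 1} operator in the expression above.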

Assumption (3) implies that D does not help predict values of Y₀ conditional on Z, which rules out selection into the programme directly on values of Y₀. However, there is no similar restriction imposed on Y₁, so the method does allow individuals who expect to experience higher levels of Y₁ to select into the programme on the basis of that information. For estimating the TT parameter, matching methods allow selection into treatment to be based on possibly unobserved components of the anticipated programme impact, but only in so far as the programme participation decisions are based on the unobservable determinants of Y₁ and not those of Y₀.

The matching method also requires that the distribution of the matching variables, Z, not be affected by whether the treatment is received. For example, age, gender, and race would generally be valid matching variables, but marital status may not be if it were potentially affected by receipt of the programme. To see why this assumption is necessary, consider the term

$$ {E}_{Z\mid D=1}\left\{{E}_Y\left(Y|D=0,Z\right)\right\}={\int}_{z\in Z}{\int}_{y\in Y}y\;f\left(y|D=0,z\right)f\left(z|D=1\right)\,\mathrm{d}y\,\mathrm{d}z. $$

It uses the f(z| D = 1) conditional density to represent the density that would also have been observed in the no treatment (D = 0) state, which rules out the possibility that receipt of treatment changes the density of Z. Variables that are likely to be affected by the treatment or programme intervention cannot be used in the set of matching variables.

With non-experimental data, there may or may not exist a set of observed conditioning variables for which (1) and (2), or (3) and (4), hold. A finding of Heckman et al. (1997) and Heckman et al. (1996, 1998a, b) in their application of matching methods to data from the Job Training Partnership Act (JTPA) programme is that (2) and (4) were not satisfied, because no match could be found for a fraction of the participants. If there are regions where the support of Z does not overlap for the D = 1 and D = 0 groups, then matching is justified only when performed over the region of common support. The estimated treatment effect must then be defined conditionally on the region of overlap. Some methods for empirically determining the overlap region are described below.

Matching estimators can be difficult to implement when the set of conditioning variables Z is large. If Z are discrete, small-cell problems may arise. If Z are continuous and the conditional mean E(Y₀| D = 0, Z) is estimated nonparametrically, then convergence rates will be slow due to the so-called curse of dimensionality problem. Rosenbaum and Rubin (1983) provide a theorem that can be used to address this dimensionality problem. They show that for random variables Y and Z and a discrete random variable D

$$ E\left(D|Y,\Pr \left(D=1|Z\right)\right)=E\left(E\left(D|Y,Z\right)|Y,\Pr \left(D=1|Z\right)\right), $$

so that

$$ E\left(D|Y,Z\right)=E\left(D|Z\right)\ \Rightarrow \ E\left(D|Y,\Pr \left(D=1|Z\right)\right)=E\left(D|\Pr \left(D=1|Z\right)\right). $$

This result implies that, when Y₀ outcomes are independent of programme participation conditional on Z, they are also independent of participation conditional on the probability of participation, P(Z) = Pr (D = 1| Z). That is, when matching on Z is valid, matching on the summary statistic Pr(D = 1| Z) (the propensity score) is also valid. Provided that P(Z) can be estimated parametrically (or semiparametrically at a rate faster than the nonparametric rate), matching on the propensity score reduces the dimensionality of the matching problem to that of a univariate problem. For this reason, much of the literature on matching focuses on propensity score matching methods. (Heckman et al. 1998a, b, and Hahn 1998, consider whether it is better in terms of efficiency to match on P(X) or on X directly.) With the use of the Rosenbaum and Rubin (1983) theorem, the matching procedure can be broken down into two stages. In the first stage, the propensity score Pr(D = 1| Z) is estimated, using a binary discrete choice model. (Options for the first-stage estimation include, for example, a parametric logit or probit model or a semiparametric estimator, such as semiparametric least squares – Ichimura 1993 – maximum score – Manski 1973 – smoothed maximum score – Horowitz 1992 – or semiparametric maximum likelihood – Klein and Spady 1993. If P(Z) were estimated using a fully nonparametric method, then the curse of dimensionality problem would reappear.) In the second stage, individuals are matched on the basis of their predicted probabilities of participation.
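The first stage can be sketched with a parametric logit. The code below is a minimal illustration that assumes a single covariate and an intercept and fits the model by Newton-Raphson using only the Python standard library; the function names and the data are hypothetical, and an applied study would normally rely on an established statistics package instead.

```python
import math


def fit_logit(z, d, iterations=25):
    """Fit Pr(D = 1 | z) = 1 / (1 + exp(-(b0 + b1*z))) by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * zi))) for zi in z]
        # Score (gradient of the log-likelihood)
        g0 = sum(di - pi for di, pi in zip(d, p))
        g1 = sum((di - pi) * zi for di, pi, zi in zip(d, p, z))
        # Entries of the information matrix
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * zi for wi, zi in zip(w, z))
        h11 = sum(wi * zi * zi for wi, zi in zip(w, z))
        det = h00 * h11 - h01 * h01
        # Newton step: beta <- beta + H^{-1} g
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def propensity(z_value, b0, b1):
    """Predicted probability of participation at z_value."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * z_value)))


# Hypothetical data: one covariate and a participation indicator
z = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
d = [0, 0, 1, 0, 1, 0, 1, 1]
b0, b1 = fit_logit(z, d)
scores = [propensity(zi, b0, b1) for zi in z]
```

The predicted values in `scores` are the estimated propensity scores that the second stage would use as the matching variable.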

We next describe a simple model of the programme participation decision to illustrate the kinds of assumptions needed to justify matching. (This model is similar to an example given in Heckman et al. 1999.) Assume that an individual chooses whether to apply to a training programme on the basis of the expected benefits. He or she compares the expected earnings streams with and without participating, taking into account opportunity costs and net of some random training cost ε, which may include a psychic component expressed in monetary terms. The participation decision is made at time t = 0 and the training programme lasts for periods 1 through τ, during which time earnings are zero. The information set used to determine expected earnings is denoted by W, which might include, for example, earnings and employment history. The participation model is

$$ D=1\ \ \mathrm{if}\ E\left(\underset{j=\tau }{\overset{T}{\Sigma}}\frac{Y_{1j}}{{\left(1+r\right)}^j}-\underset{k=1}{\overset{T}{\Sigma}}\frac{Y_{0k}}{{\left(1+r\right)}^k}|W\right)>\varepsilon +{Y}_{00},\ \ \ \mathrm{else}\ D=0. $$

The terms in the right-hand side of the inequality are assumed to be known to the individual but not to the econometrician.

$$ \begin{aligned}&\mathrm{If}\;f\left({Y}_{0k}|\varepsilon +{Y}_{00},X\right)=f\left({Y}_{0k}|X\right),\ \mathrm{then}\\ &E\left({Y}_{0k}|X,D=1\right)=E\left({Y}_{0k}|X,\varepsilon +{Y}_{00}<\eta (W)\right)=E\left({Y}_{0k}|X\right),\end{aligned} $$

which would justify application of a matching estimator. (Here η(W) denotes the expected earnings gain on the left-hand side of the participation inequality.) This assumption places restrictions on the correlation structure of the earnings residuals. For example, the assumption would not be plausible if X = W and Y₀₀ = Y_0k, because knowing that a person selected into the programme (D = 1) would likely be informative about subsequent earnings. We could assume, however, a model for earnings

$$ {Y}_{0k}=\varphi (X)+{\nu}_{0k}, $$

where, for example, ν_0k follows an MA(q) process with q < k, which would imply that Y_0k and Y₀₀ are uncorrelated conditional on X. The matching method does not require that everything in the information set be known, but it does assume sufficient information to make the selection on observables assumption plausible.

Cross-Sectional Matching Methods

For notational simplicity, let P = P(Z). A prototypical propensity score matching estimator takes the form

$$ {\displaystyle \begin{array}{r}{\widehat{\alpha}}_M=\frac{1}{n_1}\underset{i\in {I}_1\cap {S}_P}{\Sigma}\left[{Y}_{1i}-\widehat{E}\left({Y}_{0i}|D=1,{P}_i\right)\right]\\ {}\\ {}\widehat{E}\left({Y}_{0i}|D=1,{P}_i\right)=\underset{j\in {I}_0}{\Sigma}W\left(i,j\right){Y}_{0j},\end{array}} $$

(5)

where I₁ denotes the set of programme participants, I₀ the set of non-participants, and S_P the region of common support (see below for ways of constructing this set); n₁ is the number of persons in the set I₁ ∩ S_P. The match for each participant i ∈ I₁ ∩ S_P is constructed as a weighted average over the outcomes of non-participants, where the weights W(i, j) depend on the distance between P_i and P_j. Define a neighbourhood C(P_i) for each i in the participant sample. Neighbours for i are non-participants j ∈ I₀ for whom P_j ∈ C(P_i). The persons matched to i are those people in set A_i where A_i = {j ∈ I₀| P_j ∈ C(P_i)}. We describe a number of alternative matching estimators below that differ in how the neighbourhood is defined and in how the weights W(i, j) are constructed.

Alternative Ways of Constructing Matched Outcomes

Nearest-Neighbour Matching

Traditional, pairwise matching, also called nearest-neighbour matching, sets:

$$ C\left({P}_i\right)=\underset{j}{\min}\left\Vert {P}_i-{P}_j\right\Vert, j\in {I}_0. $$

That is, the non-participant with the value of P_j that is closest to P_i is selected as the match and A_i is a singleton set. The estimator can be implemented either matching with or without replacement. When matching is performed with replacement, the same comparison group observation can be used repeatedly as a match. A drawback of matching without replacement is that the final estimate will usually depend on the initial ordering of the treated observations for which the matches were selected.
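A minimal Python sketch of pairwise matching with replacement follows. It assumes the propensity scores have already been estimated; the function name `nn_match_tt` and the data are hypothetical.

```python
def nn_match_tt(treated, comparison):
    """Nearest-neighbour matching with replacement on the propensity score.

    treated, comparison: lists of (propensity_score, outcome) pairs.
    """
    gaps = []
    for p_i, y_i in treated:
        # The match is the non-participant whose score is closest to p_i;
        # the same comparison unit may be reused for several participants.
        p_j, y_j = min(comparison, key=lambda unit: abs(unit[0] - p_i))
        gaps.append(y_i - y_j)
    return sum(gaps) / len(gaps)


# Hypothetical (score, outcome) pairs
treated = [(0.6, 10.0), (0.3, 8.0)]
comparison = [(0.55, 7.0), (0.2, 5.0), (0.35, 6.0)]
estimate = nn_match_tt(treated, comparison)
```

Because matching is with replacement, the result does not depend on the order in which participants are processed. A without-replacement variant would need to remove each selected comparison unit from the pool, which is what makes it order dependent.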

Caliper matching (Cochran and Rubin 1973) is a variation of nearest neighbour matching that attempts to avoid ‘bad’ matches (those for which P_j is far from P_i) by imposing a tolerance on the maximum distance ‖P_i − P_j‖ allowed. That is, a match for person i is selected only if ‖P_i − P_j‖ < ε, j ∈ I₀, where ε is a pre-specified tolerance. Treated persons for whom no matches can be found within the caliper are excluded from the analysis, which is one way of imposing a common support condition. A drawback of caliper matching is that it is difficult to know a priori what choice for the tolerance level is reasonable.
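The caliper rule can be sketched as a small change to pairwise matching. The Python code below is an illustrative sketch only; the function name, the caliper value of 0.1 and the data are assumptions.

```python
def caliper_match_tt(treated, comparison, caliper):
    """Nearest-neighbour matching that discards matches outside the caliper.

    Returns the estimate and the number of participants left unmatched.
    """
    gaps, dropped = [], 0
    for p_i, y_i in treated:
        p_j, y_j = min(comparison, key=lambda unit: abs(unit[0] - p_i))
        if abs(p_j - p_i) < caliper:
            gaps.append(y_i - y_j)
        else:
            dropped += 1  # no acceptable match: excluded from the estimate
    if not gaps:
        raise ValueError("no participant has a match within the caliper")
    return sum(gaps) / len(gaps), dropped


# Hypothetical (score, outcome) pairs; the third participant has no close match
treated = [(0.6, 10.0), (0.3, 8.0), (0.95, 20.0)]
comparison = [(0.55, 7.0), (0.2, 5.0), (0.35, 6.0)]
estimate, n_dropped = caliper_match_tt(treated, comparison, caliper=0.1)
```

Returning the number of dropped participants makes it easy to see how the estimate's target population changes as the tolerance is tightened.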

Stratification or Interval Matching

In this variant of matching, the common support of P is partitioned into a set of intervals, and average treatment impacts are calculated through simple averaging within each interval. A weighted average of the interval impact estimates, using the fraction of the D = 1 population in each interval for the weights, provides an overall average impact estimate. Implementing this method requires a decision on how wide the intervals should be. Dehejia and Wahba (1999) implement interval matching using intervals that are selected such that the mean values of the estimated P_i and P_j are not statistically different from each other within intervals.
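A minimal Python sketch of interval matching is given below. The interval boundaries are taken as given rather than chosen by a balancing rule, and intervals lacking either participants or non-participants are skipped; both choices, along with the function name and data, are assumptions made for illustration.

```python
def stratified_tt(data, edges):
    """Interval (stratification) matching on the propensity score.

    data: list of (p, d, y) with score p, participation d and outcome y.
    edges: sorted interval boundaries; each interval is [lo, hi).
    """
    weighted_sum, n_treated = 0.0, 0
    for lo, hi in zip(edges[:-1], edges[1:]):
        y1 = [y for p, d, y in data if lo <= p < hi and d == 1]
        y0 = [y for p, d, y in data if lo <= p < hi and d == 0]
        if not y1 or not y0:
            continue  # interval lies outside the common support
        impact = sum(y1) / len(y1) - sum(y0) / len(y0)
        # Weight each interval by its number of participants
        weighted_sum += len(y1) * impact
        n_treated += len(y1)
    return weighted_sum / n_treated


# Hypothetical (score, participation, outcome) triples
data = [
    (0.2, 1, 10.0), (0.3, 0, 6.0), (0.4, 0, 8.0),
    (0.7, 1, 20.0), (0.8, 1, 22.0), (0.6, 0, 15.0),
]
estimate = stratified_tt(data, edges=[0.0, 0.5, 1.0])
```

Here the first interval contributes an impact of 3 with one participant and the second an impact of 6 with two participants, so the weighted average is 5.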

Kernel and Local Linear Matching

More recently developed matching estimators construct a match for each programme participant using a weighted average over multiple persons in the comparison group. Consider, for example, the nonparametric kernel matching estimator, given by

$$ {\widehat{\alpha}}_{KM}=\frac{1}{n_1}\underset{i\in {I}_1}{\Sigma}\left\{{Y}_{1i}-\frac{\sum_{j\in {I}_0}{Y}_{0j}G\left(\frac{P_j-{P}_i}{a_n}\right)}{\sum_{k\in {I}_0}G\left(\frac{P_k-{P}_i}{a_n}\right)}\right\} $$

where G(·) is a kernel function and a_n is a bandwidth parameter. (See Heckman et al. 1997a, b, 1998a, b.) In terms of Eq. (5), the weighting function, W(i,j), is equal to

$$ \frac{G\left(\frac{P_j-{P}_i}{a_n}\right)}{\sum_{k\in {I}_0}G\left(\frac{P_k-{P}_i}{a_n}\right)}. $$

For a kernel function with support bounded between −1 and 1, the neighbourhood is

$$ C\left({P}_i\right)=\left\{\left|\frac{P_i-{P}_j}{a_n}\right|\le 1\right\},j\in {I}_0. $$

Under standard conditions on the bandwidth and kernel,

$$ \frac{\sum_{j\in {I}_0}{Y}_{0j}G\left(\frac{P_j-{P}_i}{a_n}\right)}{\sum_{k\in {I}_0}G\left(\frac{P_k-{P}_i}{a_n}\right)} $$

is a consistent estimator of E(Y₀| D = 1, P_i). (Specifically, we require that G(·) integrates to one, has mean zero and that a_n → 0 as n → ∞ and na_n → ∞. One example of a kernel function is the quartic kernel, given by $ G(s)=\frac{15}{16}{\left({s}^2-1\right)}^2 $ if |s| < 1, G(s) = 0 otherwise.)
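The kernel matching estimator with the quartic kernel can be sketched in a few lines of Python. The bandwidth of 0.2, the function names and the data are assumptions for illustration, and participants with no comparison unit inside the neighbourhood are simply skipped here.

```python
def quartic(s):
    """Quartic kernel: (15/16) * (s^2 - 1)^2 on |s| < 1, zero otherwise."""
    return (15.0 / 16.0) * (s * s - 1.0) ** 2 if abs(s) < 1.0 else 0.0


def kernel_match_tt(treated, comparison, bandwidth):
    """Kernel matching on the propensity score.

    treated, comparison: lists of (propensity_score, outcome) pairs.
    """
    gaps = []
    for p_i, y_i in treated:
        weights = [quartic((p_j - p_i) / bandwidth) for p_j, _ in comparison]
        total = sum(weights)
        if total == 0.0:
            continue  # no comparison unit within the neighbourhood of p_i
        # Kernel-weighted average of comparison outcomes
        y0_hat = sum(w * y for w, (_, y) in zip(weights, comparison)) / total
        gaps.append(y_i - y0_hat)
    if not gaps:
        raise ValueError("no participant has comparison units in range")
    return sum(gaps) / len(gaps)


# Hypothetical data: the comparison unit at 0.9 lies outside the neighbourhood
treated = [(0.5, 10.0)]
comparison = [(0.4, 4.0), (0.6, 8.0), (0.9, 100.0)]
estimate = kernel_match_tt(treated, comparison, bandwidth=0.2)
```

The two comparison units at equal distance from the participant receive equal weight, and the distant unit receives zero weight, so the matched outcome is the simple average of 4 and 8.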

Heckman et al. (1997) also propose a generalized version of kernel matching, called local linear matching. With G_ij = G((P_j − P_i)/a_n), the local linear weighting function is given by

$$ W\left(i,j\right)=\frac{G_{ij}{\sum}_{k\in {I}_0}{G}_{ik}{\left({P}_k-{P}_i\right)}^2-\left[{G}_{ij}\left({P}_j-{P}_i\right)\right]\left[{\sum}_{k\in {I}_0}{G}_{ik}\left({P}_k-{P}_i\right)\right]}{\sum_{j\in {I}_0}{G}_{ij}{\sum}_{k\in {I}_0}{G}_{ik}{\left({P}_k-{P}_i\right)}^2-{\left({\sum}_{k\in {I}_0}{G}_{ik}\left({P}_k-{P}_i\right)\right)}^2}. $$

(6)

As demonstrated in research by Fan (1992a, b), local linear estimation has some advantages over standard kernel estimation. These advantages include a faster rate of convergence near boundary points and greater robustness to different data design densities (see Fan 1992a, b). Thus, local linear regression would be expected to perform better than kernel estimation in cases where the non-participant observations on P fall on one side of the participant observations.
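The boundary behaviour can be illustrated numerically. The Python sketch below computes local linear weights from kernel values and compares the resulting estimate with a plain kernel average when all comparison scores lie on one side of the participant's score. The quartic kernel, the bandwidth, the function names and the data are assumptions made for illustration.

```python
def quartic(s):
    return (15.0 / 16.0) * (s * s - 1.0) ** 2 if abs(s) < 1.0 else 0.0


def local_linear_weights(p_i, p_comparison, bandwidth):
    """Local linear weights W(i, j) for a participant with score p_i."""
    g = [quartic((p_j - p_i) / bandwidth) for p_j in p_comparison]
    s0 = sum(g)
    s1 = sum(gk * (pk - p_i) for gk, pk in zip(g, p_comparison))
    s2 = sum(gk * (pk - p_i) ** 2 for gk, pk in zip(g, p_comparison))
    denom = s0 * s2 - s1 ** 2
    if denom == 0.0:
        raise ValueError("need at least two distinct scores in the neighbourhood")
    return [(gj * s2 - gj * (pj - p_i) * s1) / denom
            for gj, pj in zip(g, p_comparison)]


# Hypothetical one-sided design: every comparison score exceeds p_i = 0.5,
# and the comparison outcome is exactly linear in the score.
p_i = 0.5
p_comparison = [0.55, 0.6, 0.7]
y_comparison = [2.0 + 3.0 * p for p in p_comparison]

w = local_linear_weights(p_i, p_comparison, bandwidth=0.3)
local_linear_fit = sum(wj * yj for wj, yj in zip(w, y_comparison))

g = [quartic((p - p_i) / 0.3) for p in p_comparison]
kernel_fit = sum(gj * yj for gj, yj in zip(g, y_comparison)) / sum(g)
```

In this constructed example the local linear fit recovers the value of the linear function at 0.5 (that is, 3.5), whereas the kernel average is pulled towards the outcomes of the comparison units that all lie to one side.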

To implement the matching estimator given by Eq. (5), the region of common support S_P needs to be determined. The common support region can be estimated by

$$ {\widehat{S}}_P=\left\{P:\widehat{f}\left(P|D=1\right)>0\ \mathrm{and}\;\widehat{f}\left(P|D=0\right)>0\right\}, $$

where $ \widehat{f}\left(P|D=d\right),d\in \left\{0,1\right\} $ are standard nonparametric density estimators. To ensure that the densities are bounded away from zero, it is required that they exceed zero by a threshold amount, determined using a ‘trimming level’ q. That is, after excluding any P points for which the estimated density is zero, an additional small percentage of the remaining P points is excluded for which the estimated density is positive but very low. The set of eligible matches is thus given by

$$ \ \ \ {\widehat{S}}_q=\left\{P\in {\widehat{S}}_P:\widehat{f}\left(P|D=1\right)>{c}_q\;\mathrm{and}\;\widehat{f}\left(P|D=0\right)>{c}_q\right\}, $$

where c_q is the density cut-off level that satisfies

$$ \underset{c_q}{\sup}\ \frac{1}{2J}\sum_{\left\{i\in {I}_1\cap {\widehat{S}}_P\right\}}\left\{1\left(\widehat{f}\left({P}_i|D=1\right)<{c}_q\right)+1\left(\widehat{f}\left({P}_i|D=0\right)<{c}_q\right)\right\}\le q. $$

Here, J is the cardinality of the set of observed values of P that lie in $ {I}_1\cap {\widehat{S}}_P $. That is, matches are constructed only for the programme participants for which the propensity scores lie in $ {\widehat{S}}_q $.

The estimators above are representative of the matching estimators in common use. They can be easily adapted to estimate other parameters of interest, such as the average effect of treatment on the untreated (UT = E(Y₁ − Y₀| D = 0, X)), or the average treatment effect (ATE = E(Y₁ − Y₀| X)), which is just a weighted average of treatment on the treated (TT) and treatment on the untreated (UT).

The recent literature has also developed alternative matching estimators that employ different weighting schemes to increase efficiency. See, for example, Hahn (1998) and Hirano et al. (2003) for estimators that attain the semiparametric efficiency bound. The methods are not described in detail here, because those studies focus on the ATE and not on the average effect of treatment on the treated (TT) parameter. Heckman, Ichimura and Todd (1998) develop a regression-adjusted version of the matching estimator, which replaces Y_0j as the dependent variable with the residual from a regression of Y_0j on a vector of exogenous covariates. The estimator uses a Robinson (1988) type estimation approach to incorporate exclusion restrictions: that is, that some of the conditioning variables in an equation for the outcomes do not enter into the participation equation or vice versa. In principle, imposing exclusion restrictions can increase efficiency. In practice, though, researchers have not observed much gain from using the regression-adjusted matching estimator. Some alternatives to propensity score matching are discussed in Diamond and Sekhon (2005).

When Does Bias Arise in Matching?

The success of a matching estimator depends on the availability of observable data to construct the conditioning set Z, such that (1) and (2) are satisfied. Suppose only a subset Z₀ ⊂ Z of the required variables is observed. The propensity score matching estimator based on Z₀ then converges to

$$ \ \ \ {\alpha}_M^{\prime }={E}_{P\left({Z}_0\right)\mid D=1}\left(E\left({Y}_1|P\left({Z}_0\right),D=1\right)-E\left({Y}_0|P\left({Z}_0\right),D=0\right)\right). $$

(7)

The bias for the parameter of interest, E(Y₁ − Y₀| D = 1), is

$$ {\mathrm{bias}}_M=E\left({Y}_0|D=1\right)-{E}_{P\left({Z}_0\right)\mid D=1}\left\{E\left({Y}_0|P\left({Z}_0\right),D=0\right)\right\}. $$
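The bias expression follows from subtracting the parameter of interest from the limit in Eq. (7). As a short check, using only the fact that averaging E(Y₁| P(Z₀), D = 1) over the participants' distribution of P(Z₀) gives E(Y₁| D = 1):

$$ \begin{aligned}{\alpha}_M^{\prime }-E\left({Y}_1-{Y}_0|D=1\right)&=\left[E\left({Y}_1|D=1\right)-{E}_{P\left({Z}_0\right)\mid D=1}\left\{E\left({Y}_0|P\left({Z}_0\right),D=0\right)\right\}\right]-\left[E\left({Y}_1|D=1\right)-E\left({Y}_0|D=1\right)\right]\\ &=E\left({Y}_0|D=1\right)-{E}_{P\left({Z}_0\right)\mid D=1}\left\{E\left({Y}_0|P\left({Z}_0\right),D=0\right)\right\}={\mathrm{bias}}_M.\end{aligned} $$

The Y₁ terms cancel, so the bias depends only on how well the matched comparison group reproduces the participants' no-treatment outcomes.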

There is no way of choosing a priori the set of Z variables to satisfy the matching condition or of testing whether a particular set meets the requirements. In rare cases, where data are available on a randomized social experiment, it is sometimes possible to ascertain the bias (see, for example, Heckman et al. 1997a, b; Dehejia and Wahba 1999, 2002; Smith and Todd 2005).

Difference-in-Difference Matching Estimators

The estimators described above assume that, after conditioning on a set of observable characteristics, outcomes are conditionally mean independent of programme participation. However, for a variety of reasons there may be systematic differences between participant and non-participant outcomes, even after conditioning on observables, which could lead to a violation of the identification conditions required for matching. Such differences may arise, for example, because of programme selectivity on unmeasured characteristics or because of levels differences in outcomes that might arise when participants and non-participants reside in different local labour markets or if the survey questionnaires used to gather the data differ in some ways across groups.

A difference-in-differences (DID) matching strategy, as defined in Heckman et al. (1997) and Heckman et al. (1998a, b), allows for temporally invariant differences in outcomes between participants and non-participants. This type of estimator matches on the basis of differences in outcomes using the same weighting functions described above. The propensity score DID matching estimator requires that

$$ {\displaystyle \begin{array}{ll}\hfill & E\left({Y}_{0t}-{Y}_{0{t}^{\prime }}|P,D=1\right)\\ {}& =E\left({Y}_{0t}-{Y}_{0{t}^{\prime }}|P,D=0\right),\hfill \end{array}} $$

where t and t′ are time periods after and before the programme enrolment date. This estimator also requires the support condition given above, which must now hold in both periods t and t′. The local linear difference-in-difference estimator is given by

$$ {\widehat{\alpha}}_{DM}=\frac{1}{n_1}\sum_{i\in {I}_1\cap {S}_P}\left\{\left({Y}_{1 ti}-{Y}_{0{t}^{\prime }i}\right)-\sum_{j\in {I}_0\cap {S}_P}W\left(i,j\right)\left({Y}_{0 tj}-{Y}_{0{t}^{\prime }j}\right)\right\}, $$

where the weights correspond to the local linear weights defined above. If repeated cross-section data are available, instead of longitudinal data, the estimator can be implemented as

$$ {\widehat{\alpha}}_{DM}=\frac{1}{n_{1t}}\sum_{i\in {I}_{1t}\cap {S}_P}\left\{{Y}_{1 ti}-\sum_{j\in {I}_{0t}\cap {S}_P}W\left(i,j\right){Y}_{0 tj}\right\}-\frac{1}{n_{1{t}^{\prime }}}\sum_{i\in {I}_{1{t}^{\prime }}\cap {S}_P}\left\{{Y}_{0{t}^{\prime }i}-\sum_{j\in {I}_{0{t}^{\prime }}}W\left(i,j\right){Y}_{0{t}^{\prime }j}\right\}, $$

where $ {I}_{1t},{I}_{1{t}^{\prime }},{I}_{0t},{I}_{0{t}^{\prime }} $ denote the treatment and comparison group data-sets in each time period.

Finally, the DID matching estimator allows selection into the programme to be based on anticipated gains from the programme in the sense that D can help predict the value of Y₁ given P. However, the method assumes that D does not help predict changes $ {Y}_{0t}-{Y}_{0{t}^{\prime }} $ conditional on a set of observables (Z) used in estimating the propensity score. In their analysis of the effectiveness of matching estimators, Smith and Todd (2005) found difference-in-difference matching estimators to perform much better than cross-sectional methods in cases where participants and non-participants were drawn from different regional labour markets and/or were given different survey questionnaires.
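The longitudinal version of the estimator can be sketched in Python as follows. For brevity the sketch uses kernel weights with a quartic kernel rather than the local linear weights, and it takes the propensity scores as given; these simplifications, the function names and the data are assumptions for illustration.

```python
def quartic(s):
    return (15.0 / 16.0) * (s * s - 1.0) ** 2 if abs(s) < 1.0 else 0.0


def did_kernel_match(treated, comparison, bandwidth):
    """Difference-in-differences matching with kernel weights.

    treated, comparison: lists of (score, y_before, y_after) triples.
    """
    gaps = []
    for p_i, before_i, after_i in treated:
        weights = [quartic((p_j - p_i) / bandwidth) for p_j, _, _ in comparison]
        total = sum(weights)
        if total == 0.0:
            continue  # no comparison unit close enough to p_i
        # Weighted average before-after change among matched non-participants
        change0 = sum(w * (after - before)
                      for w, (_, before, after) in zip(weights, comparison)) / total
        gaps.append((after_i - before_i) - change0)
    if not gaps:
        raise ValueError("no participant has comparison units in range")
    return sum(gaps) / len(gaps)


# Hypothetical panel: participants start from a higher outcome level
treated = [(0.5, 10.0, 16.0)]
comparison = [(0.45, 4.0, 6.0), (0.55, 5.0, 7.0)]
estimate = did_kernel_match(treated, comparison, bandwidth=0.2)
```

In this constructed example a cross-sectional comparison of post-programme outcomes would be contaminated by the level difference between the groups, whereas differencing removes it and leaves an estimate of 4.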

Matching When the Data are Choice-Based Sampled

The samples used in evaluating the impacts of programmes are often choice-based, with programme participants oversampled relative to their frequency in the population of persons eligible for the programme. Under choice-based sampling, weights are generally required to consistently estimate the probabilities of programme participation. (See, for example, Manski and Lerman 1977, for discussion of weighting for logistic regressions.) When the weights are unknown, Heckman and Todd (1995) show that with a slight modification matching methods can still be applied, because the odds ratio (P/(1 − P)) estimated using a logistic model with incorrect weights (that is, ignoring the fact that samples are choice-based) is a scalar multiple of the true odds ratio, which is itself a monotonic transformation of the propensity scores. Therefore, matching can proceed on the (misweighted) estimate of the odds ratio (or on the log odds ratio).
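The scalar-multiple property can be seen from Bayes' rule. As a sketch, let π denote the population share of participants, π* the share of participants in the choice-based sample, and P*(Z) the participation probability implied by the unweighted sample; these three symbols are notation introduced here for illustration. Assuming that sampling is random within the D = 1 and D = 0 groups, so that the densities f(Z| D) are the same in the sample as in the population,

$$ \frac{P^{\ast }(Z)}{1-{P}^{\ast }(Z)}=\frac{f\left(Z|D=1\right){\pi}^{\ast }}{f\left(Z|D=0\right)\left(1-{\pi}^{\ast}\right)}=\frac{\pi^{\ast}\left(1-\pi \right)}{\left(1-{\pi}^{\ast}\right)\pi}\cdot \frac{f\left(Z|D=1\right)\pi }{f\left(Z|D=0\right)\left(1-\pi \right)}=\frac{\pi^{\ast}\left(1-\pi \right)}{\left(1-{\pi}^{\ast}\right)\pi}\cdot \frac{P(Z)}{1-P(Z)}. $$

The leading factor does not depend on Z. In log odds the two quantities therefore differ only by an additive constant, so distances between observations measured in log odds are unchanged by the misweighting.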

Using Balancing Tests to Check the Specification of the Propensity Score Model

As described earlier, the propensity score matching estimator requires the outcome variable to be mean independent of the treatment indicator conditional on the propensity score, P(Z). An important consideration in implementation is how to choose Z. Unfortunately, there is no theoretical basis for choosing a particular set Z to satisfy the identifying assumptions, and the set is not necessarily the most inclusive one.

To guide the selection of Z, there is some accumulated empirical evidence on how bias estimates depend on the choice of Z in particular applications. For example, Heckman et al. (1998a, b), Heckman et al. (1997) and Lechner (2001) show that the choice of variables included in Z can make a substantial difference to the estimator’s performance. These papers found that biases tended to be higher when the participation equation was estimated using a cruder set of conditioning variables. One approach adopted is to select the set Z to maximize the percentage of people correctly classified under the model. Another finding in these papers is that the matching estimators performed best when the treatment and control groups were located in the same geographic area and when the same survey instrument was administered to both treatments and controls to ensure comparable measurement of outcomes.

Rosenbaum and Rubin (1983) suggest a method to aid in the specification of the propensity score model. The method does not provide guidance in choosing which variables to include in Z, but can help to determine which interactions and higher-order terms to include in the model for a given Z set. They note that for the true propensity score, the following holds:

$$ Z\perp \perp D\mid \Pr \left(D=1|Z\right), $$

or equivalently E(D| Z, Pr(D = 1| Z)) = E(D| Pr(D = 1| Z)). The basic intuition is that, after conditioning on Pr(D = 1|Z), additional conditioning on Z should not provide new information about D. If after conditioning on the estimated values of Pr(D = 1|Z) there is still dependence on Z, this suggests misspecification in the model used to estimate Pr(D = 1|Z). The theorem holds for any Z, including sets Z that do not satisfy the conditional independence condition required to justify matching. As such, the theorem is not informative about what set of variables to include in Z.

This result motivates a specification test for Pr(D = 1|Z), that is, a test of whether there are differences in Z between the D = 1 and D = 0 groups after conditioning on P(Z). The test has been implemented in the literature in a number of ways (see, for example, Eichler and Lechner 2002; Dehejia and Wahba 1999, 2002; Smith and Todd 2005; Diamond and Sekhon 2005).
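One simple way to operationalize such a check is to compare the mean of a covariate between participants and non-participants within intervals of the estimated propensity score. The Python sketch below computes an unequal-variance t-statistic per interval. It is only one of several possible implementations and is not claimed to reproduce any of the cited papers; the function name, intervals and data are hypothetical.

```python
import math
from statistics import mean, variance


def balance_t_stats(data, edges):
    """t-statistics for differences in mean Z within propensity score intervals.

    data: list of (p, d, z) with score p, participation d and covariate z.
    edges: sorted interval boundaries; each interval is [lo, hi).
    """
    results = {}
    for lo, hi in zip(edges[:-1], edges[1:]):
        z1 = [z for p, d, z in data if lo <= p < hi and d == 1]
        z0 = [z for p, d, z in data if lo <= p < hi and d == 0]
        if len(z1) < 2 or len(z0) < 2:
            continue  # too few observations to compute a sample variance
        se = math.sqrt(variance(z1) / len(z1) + variance(z0) / len(z0))
        if se == 0.0:
            continue  # no variation in Z within this interval
        results[(lo, hi)] = (mean(z1) - mean(z0)) / se
    return results


# Hypothetical data: Z is balanced in the first interval but not the second
data = [
    (0.2, 1, 1.0), (0.3, 1, 3.0), (0.1, 0, 1.0), (0.4, 0, 3.0),
    (0.6, 1, 10.0), (0.7, 1, 12.0), (0.8, 0, 1.0), (0.9, 0, 3.0),
]
t_stats = balance_t_stats(data, edges=[0.0, 0.5, 1.0])
```

A large statistic in some interval, as in the second interval here, would suggest adding interactions or higher-order terms in Z to the propensity score model and re-estimating.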

Assessing the Variability of Matching Estimators

The distribution theory for the cross-sectional and difference-in-difference kernel and local linear matching estimators described above is derived in Heckman et al. (1998). However, implementing the asymptotic standard error formulae can be cumbersome, so standard errors for matching estimators are often instead generated using bootstrap resampling methods. (See Efron and Tibshirani 1993, for an introduction to bootstrap methods, and Horowitz 2003, for a recent survey of bootstrapping in econometrics.) A recent paper by Abadie and Imbens (2006a) shows that standard bootstrap resampling methods are not valid for assessing the variability of nearest neighbour estimators, but can be applied to assess the variability of kernel or local linear matching estimators for a suitably chosen bandwidth. Abadie and Imbens (2006b) present alternative standard error formulae for assessing the variability of nearest neighbour matching estimators.
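A bootstrap standard error for a kernel matching estimator can be sketched as follows. The sketch resamples participants and non-participants separately and, as a simplification, treats the propensity scores as given rather than re-estimating them in each replication. It uses a Gaussian kernel so that the weights never sum to zero in a resample. These choices, together with the function names, the number of replications, the seed and the data, are assumptions for illustration.

```python
import math
import random
import statistics


def kernel_tt(treated, comparison, bandwidth):
    """Kernel matching estimate of TT with Gaussian kernel weights."""
    gaps = []
    for p_i, y_i in treated:
        weights = [math.exp(-0.5 * ((p_j - p_i) / bandwidth) ** 2)
                   for p_j, _ in comparison]
        y0_hat = sum(w * y for w, (_, y) in zip(weights, comparison)) / sum(weights)
        gaps.append(y_i - y0_hat)
    return sum(gaps) / len(gaps)


def bootstrap_se(treated, comparison, bandwidth, reps=200, seed=0):
    """Standard deviation of the estimator across bootstrap resamples."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        # Resample each group with replacement, keeping group sizes fixed
        t_star = [rng.choice(treated) for _ in treated]
        c_star = [rng.choice(comparison) for _ in comparison]
        estimates.append(kernel_tt(t_star, c_star, bandwidth))
    return statistics.stdev(estimates)


# Hypothetical (score, outcome) pairs
treated = [(0.3, 8.0), (0.5, 10.0), (0.6, 13.0), (0.7, 12.0)]
comparison = [(0.2, 5.0), (0.35, 6.0), (0.5, 7.5), (0.55, 7.0), (0.65, 9.0)]
se = bootstrap_se(treated, comparison, bandwidth=0.2)
```

Fixing the seed makes the result reproducible. In applied work the first-stage estimation would usually be repeated inside the loop so that the standard error also reflects sampling variation in the estimated scores.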

Applications

There have been numerous evaluations of matching estimators in recent decades. For a survey of many applications in the context of evaluating the effects of labour market programmes, see Heckman et al. (1999). More recently, propensity score matching estimators have been used in evaluating the impacts of a variety of programme interventions in developing countries. Jalan and Ravallion (1999) assess the impact of a workfare programme in Argentina (the Trabajar programme), and Jalan and Ravallion (2003) study the effects of public investments in piped water on child health outcomes in rural India. Galiani et al. (2005) use difference-in-difference matching methods to analyse the effects of privatization of water services on child mortality in Argentina. Other applications include Gertler et al. (2004) in a study of the effects of parental death on child outcomes, Lavy (2004) in a study of the effects of a teacher incentive programme in Israel on student performance, Angrist and Lavy (2001) in a study of the effects of teacher training on children’s test scores in Israel, and Chen and Ravallion (2003) in a study of a poverty reduction project in China.

Behrman et al. (2004) use a modified version of a propensity score matching estimator to evaluate the effects of a preschool programme in Bolivia on child health and cognitive outcomes. They identify programme effects by comparing children who spent different lengths of time in the programme, using matching to control for selectivity into alternative durations. See also Imbens (2000) and Hirano and Imbens (2004) for an analysis of the role of the propensity score with continuous treatments. Lechner (2001) extends propensity score analysis to the case of multiple treatments.

Bibliography

Abadie, A., and G. Imbens. 2006a. On the failure of the bootstrap for matching estimators. Technical working paper no. 325. Cambridge, MA: NBER.
Abadie, A., and G. Imbens. 2006b. Large sample properties of matching estimators for average treatment effects. Econometrica 74: 235–267.
Angrist, J., and V. Lavy. 2001. Does teacher training affect pupil learning? Evidence from matched comparisons in Jerusalem public schools. Journal of Labor Economics 19: 343–369.
Behrman, J., Y. Cheng, and P. Todd. 2004. Evaluating preschool programs when length of exposure to the program varies: A nonparametric approach. The Review of Economics and Statistics 86: 108–132.
Chen, S., and M. Ravallion. 2003. Hidden impact? Ex-post evaluation of an antipoverty program. Policy research working paper no. 3049. Washington, DC: World Bank.
Cochran, W., and D. Rubin. 1973. Controlling bias in observational studies. Sankhya 35: 417–446.
Dehejia, R., and S. Wahba. 1999. Causal effects in non-experimental studies: Reevaluating the evaluation of training programs. Journal of the American Statistical Association 94: 1053–1062.
Dehejia, R., and S. Wahba. 2002. Propensity score matching methods for nonexperimental causal studies. The Review of Economics and Statistics 84: 151–161.
Diamond, A., and J.S. Sekhon. 2005. Genetic matching for estimating causal effects: A general multivariate matching method for achieving balance in observational studies. Working paper, Department of Political Science, Berkeley.
Efron, B., and R. Tibshirani. 1993. An introduction to the bootstrap. New York: Chapman and Hall.
Eichler, M., and M. Lechner. 2002. An evaluation of public employment programmes in the East German state of Sachsen-Anhalt. Labour Economics 9: 143–186.
Fan, J. 1992a. Design adaptive nonparametric regression. Journal of the American Statistical Association 87: 998–1004.
Fan, J. 1992b. Local linear regression smoothers and their minimax efficiencies. Annals of Statistics 21: 196–216.
Fisher, R.A. 1935. Design of experiments. New York: Hafner.
Friedlander, D., and P. Robins. 1995. Evaluating program evaluations: New evidence on commonly used nonexperimental methods. American Economic Review 85: 923–937.
Galiani, S., P. Gertler, and E. Schargrodsky. 2005. Water for life: The impact of the privatization of water services on child mortality in Argentina. Journal of Political Economy 113: 83–120.
Gertler, P., D. Levine, and M. Ames. 2004. Schooling and parental death. The Review of Economics and Statistics 86: 211–225.
Hahn, J. 1998. On the role of the propensity score in efficient estimation of average treatment effects. Econometrica 66: 315–331.
Heckman, J., and P. Todd. 1995. Adapting propensity score matching and selection models to choice-based samples. Manuscript, Department of Economics, University of Chicago.
Heckman, J., H. Ichimura, J. Smith, and P. Todd. 1996. Sources of selection bias in evaluating social programs: An interpretation of conventional measures and evidence on the effectiveness of matching as a program evaluation method. Proceedings of the National Academy of Sciences 93: 13416–13420.
Heckman, J., J. Smith, and N. Clements. 1997a. Making the most out of social experiments: Accounting for heterogeneity in programme impacts. Review of Economic Studies 64: 487–536.
Heckman, J., H. Ichimura, and P. Todd. 1997b. Matching as an econometric evaluation estimator: Evidence from evaluating a job training program. Review of Economic Studies 64: 605–654.
Heckman, J., H. Ichimura, J. Smith, and P. Todd. 1998a. Characterizing selection bias using experimental data. Econometrica 66: 1017–1098.
Heckman, J., H. Ichimura, and P. Todd. 1998b. Matching as an econometric evaluation estimator. Review of Economic Studies 65: 261–294.
Heckman, J., R. LaLonde, and J. Smith. 1999. The economics and econometrics of active labor market programs. In Handbook of labor economics, ed. O. Ashenfelter and D. Card, Vol. 3A. Amsterdam: North-Holland.
Hirano, K., and G. Imbens. 2004. The propensity score with continuous treatments. In Applied Bayesian modeling and causal inference from incomplete data perspectives, ed. A. Gelman and X.L. Meng. New York: Wiley.
Hirano, K., G. Imbens, and G. Ridder. 2003. Efficient estimation of average treatment effects using the estimated propensity score. Econometrica 71: 1161–1189.
Holland, P.W. 1986. Statistics and causal inference (with discussion). Journal of the American Statistical Association 81: 945–970.
Horowitz, J.L. 1992. A smoothed maximum score estimator for the binary response model. Econometrica 60: 505–532.
Horowitz, J.L. 2003. The bootstrap. In Handbook of econometrics, ed. J.J. Heckman and E.E. Leamer, Vol. 5. Amsterdam: North-Holland.
Ichimura, H. 1993. Semiparametric least squares and weighted SLS estimation of single index models. Journal of Econometrics 58: 71–120.
Imbens, G. 2000. The role of the propensity score in estimating dose-response functions. Biometrika 87: 706–710.
Jalan, J., and M. Ravallion. 1999. Efficient estimation of average treatment effects: Evidence for Argentina’s Trabajar program. Policy research working paper. Washington, DC: World Bank.
Jalan, J., and M. Ravallion. 2003. Does piped water reduce diarrhea for children in rural India? Journal of Econometrics 112: 153–173.
Klein, R.W., and R.H. Spady. 1993. An efficient semiparametric estimator for binary response models. Econometrica 61: 387–422.
LaLonde, R. 1986. Evaluating the econometric evaluations of training programs with experimental data. American Economic Review 76: 604–620.
Lavy, V. 2002. Evaluating the effects of teachers’ group performance incentives on pupil achievement. Journal of Political Economy 110: 1286–1387.
Lavy, V. 2004. Performance pay and teachers’ effort, productivity and grading ethics. Working paper no. 10622. Cambridge, MA: NBER.
Lechner, M. 2001. Identification and estimation of causal effects of multiple treatments under the conditional independence assumption. In Econometric evaluations of active labor market policies in Europe, ed. M. Lechner and F. Pfeiffer. Heidelberg: Physica.
Manski, C. 1973. Maximum score estimation of the stochastic utility model of choice. Journal of Econometrics 3: 205–228.
Manski, C., and S. Lerman. 1977. The estimation of choice probabilities from choice-based samples. Econometrica 45: 1977–1988.
Robinson, P. 1988. Root-N consistent nonparametric regression. Econometrica 56: 931–954.
Rosenbaum, P., and D. Rubin. 1983. The central role of the propensity score in observational studies for causal effects. Biometrika 70: 41–55.
Rosenbaum, P., and D. Rubin. 1985. Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. American Statistician 39: 33–38.
Rubin, D.B. 1974. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology 66: 688–701.
Silverman, B.W. 1986. Density estimation for statistics and data analysis. London: Chapman and Hall.
Smith, J., and P. Todd. 2005. Does matching overcome LaLonde’s critique of nonexperimental estimators? Journal of Econometrics 125: 305–353.

Author information

Authors and Affiliations

http://link.springer.com/referencework/10.1057/978-1-349-95121-5
Petra E. Todd

About this entry

Cite this entry

Todd, P.E. (2018). Matching Estimators. In: The New Palgrave Dictionary of Economics. Palgrave Macmillan, London. https://doi.org/10.1057/978-1-349-95189-5_2104

DOI: https://doi.org/10.1057/978-1-349-95189-5_2104
Published: 15 February 2018
Publisher Name: Palgrave Macmillan, London
Print ISBN: 978-1-349-95188-8
Online ISBN: 978-1-349-95189-5
eBook Packages: Economics and Finance; Reference Module Humanities and Social Sciences; Reference Module Business, Economics and Social Sciences
