Abstract
The rapid development of statistical causal inference in the era of big data, commonly seen in public health applications, can be hindered by computational barriers. In this chapter we discuss a practical computing barrier in statistical causal inference, using optimal pair matching as an example, and offer a novel solution that constructs a stratification tree based on exact matching and propensity scores. We demonstrate the implementation of this method with a large observational study of obstetric unit closures in Philadelphia from 1995 to 2003, with 59 observed covariates for each of the 132,786 birth deliveries and 5,998,111 potential controls. Algorithms and R program code are also provided for interested readers.
References
Hansen, B.B., Klopfer, S.O.: Optimal full matching and related designs via network flows. J. Comput. Graph. Stat. 15, 609–627 (2006)
Rosenbaum, P.R.: Observational Studies. Springer Series in Statistics. Springer, New York (2002)
Rosenbaum, P.R.: Design of Observational Studies. Springer, New York (2010)
Rosenbaum, P.R., Rubin, D.B.: Reducing bias in observational studies using subclassification on the propensity score. J. Am. Stat. Assoc. 79, 516–524 (1984)
Zhang, K., Small, D.S., Lorch, S., Srinivas, S., Rosenbaum, P.R.: Using split samples and evidence factors in an observational study of neonatal outcomes. J. Am. Stat. Assoc. 106, 511–524 (2011)
Acknowledgements
Zhang’s research is partially supported by NSF DMS-1309619, DMS-1613112, and IIS-1633212. Chen’s research was supported in part by NIH grants from the Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD, R01HD075635, PIs: Xinguang Chen and Ding-Geng Chen). This material was also partially based upon work supported by the NSF under Grant DMS-1127914 to the Statistical and Applied Mathematical Sciences Institute. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Zhang thanks Dylan S. Small for very helpful suggestions.
Appendix: R Code for Propensity Score Stratification
The following R function opt_pstrat implements the PSS algorithm in Step 3 described above and forms the subclasses. The function takes three arguments as inputs:
1. indicator: A binary vector that takes value 1 for treated units and 0 for control units.
2. pscore: A vector of the propensity scores of the units, in the same order as indicator.
3. sizemax: A preset tolerance level on the size of the distance matrix, that is, the number of treated units times the number of control units in a subclass. The default value is 9,000,000.
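The inputs above can be prepared along the following lines. This is a minimal sketch on simulated data, not the study's analysis: the covariates x1 and x2, the logistic model, and the helper dist_size are assumptions made for illustration. The last lines repeat the same size arithmetic for the counts quoted in the abstract, treating the 132,786 deliveries as the treated group, to show how far a single distance matrix would exceed the default sizemax.

```r
# Simulated toy data; every name and value here is illustrative only.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rbinom(n, 1, 0.4)
indicator <- rbinom(n, 1, plogis(-1 + 0.8 * x1 + 0.5 * x2))

# Propensity scores from a logistic regression (an assumed model choice).
ps_model <- glm(indicator ~ x1 + x2, family = binomial())
pscore   <- as.numeric(fitted(ps_model))

# Hypothetical helper: number of entries in a treated-by-control distance matrix.
dist_size <- function(ind) sum(ind == 1) * sum(ind == 0)
toy_size  <- dist_size(indicator)

# The same arithmetic with the counts given in the abstract.
full_size <- 132786 * 5998111   # about 7.96e11 entries
sizemax   <- 9000000            # default tolerance
full_size / sizemax             # roughly 88,000 times the tolerance
```

With these counts, one distance matrix for the whole study would hold roughly 88,000 times as many entries as the default tolerance allows, which is the kind of gap that motivates splitting the data into subclasses before matching.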
The function opt_pstrat returns a list with the following values:
1. flag: This output is 1 when the stratification finishes with all treated units placed in subclasses, and 2 otherwise.
2. num_pstrata: This output is the number of subclasses formed.
3. cutoffs: This output gives the cutoff points, in increasing order, where the subclasses are split. The last cutoff is the largest propensity score.
4. t.pstrata: This output is a vector listing the number of treated units in each subclass formed.
5. c.pstrata: This output is a vector listing the number of control units in each subclass formed.
6. prodsize.pstrata: This output is a vector listing the size of the distance matrix in each subclass formed.
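To show how the returned cutoffs relate to the counts in t.pstrata and c.pstrata, the sketch below assigns units to subclasses from a vector of cutoffs. The scores, indicators, and cutoffs are hypothetical toy values, and the use of findInterval is one possible way to do the assignment, not necessarily the authors' approach. It assumes, as in the function's code, that a subclass contains the units whose score is at least its lower cutoff and below the next cutoff, with the top subclass also containing the largest score.

```r
# Hypothetical toy values (not from the study).
pscore    <- c(0.10, 0.25, 0.30, 0.45, 0.50, 0.60, 0.75, 0.90)
indicator <- c(0,    0,    1,    0,    0,    1,    0,    0)
cutoffs   <- c(0.30, 0.60, 0.90)  # ascending; last entry is max(pscore)

# Subclass k holds units with cutoffs[k] <= pscore < cutoffs[k + 1];
# rightmost.closed = TRUE keeps the unit at max(pscore) in the top subclass.
subclass <- findInterval(pscore, cutoffs, rightmost.closed = TRUE)
subclass[subclass == 0] <- NA     # below the lowest cutoff: in no subclass

# Treated and control counts per subclass, and the distance-matrix sizes.
t_count <- as.vector(tapply(indicator, subclass, sum))
c_count <- as.vector(tapply(1 - indicator, subclass, sum))
data.frame(subclass = seq_along(t_count),
           treated  = t_count,
           control  = c_count,
           prodsize = t_count * c_count)
```

Units with scores below the lowest cutoff are left unassigned here; in this toy example they are two controls with no treated unit at or below their scores.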
opt_pstrat <- function(indicator, pscore, sizemax = 9000000) {
  # The first cutoff is the largest score, the upper end of the top subclass.
  cutoffs <- max(pscore)
  t.strata <- NULL
  c.strata <- NULL
  size.strata <- NULL
  indicator_iter <- indicator
  pscore_iter <- pscore
  num_strata_formed <- 0
  # Peel off one subclass per pass, working down from the highest scores.
  while (sum(indicator_iter) > 0 && sum(1 - indicator_iter) > 0) {
    n <- length(indicator_iter)
    t_ind <- which(indicator_iter == 1)
    n_treated <- sum(indicator_iter)
    treated_pscore_iter <- pscore_iter[t_ind]
    # For each treated unit, count the treated units and the control units
    # whose scores are at least as large as its own.
    t_geq_t <- n_treated + 1 - rank(treated_pscore_iter, ties.method = "min")
    c_geq_t <- n + 1 - rank(pscore_iter, ties.method = "min")[t_ind] - t_geq_t
    # A cut is matchable if it leaves at least as many controls as treated.
    matchable <- c_geq_t >= t_geq_t
    if (!any(matchable)) {
      stop("No way to stratify: c_geq_t < t_geq_t.")
    }
    size.dist <- t_geq_t * c_geq_t
    if (min(size.dist[matchable]) > sizemax) {
      print("No way to stratify: min(size.dist) > sizemax.")
      return(list(flag = 2))
    }
    # Take the largest matchable subclass whose distance matrix fits sizemax.
    feasible <- matchable & size.dist <= sizemax
    cutoff.size.ind <- which(feasible & size.dist == max(size.dist[feasible]))[1]
    cutoff <- pscore_iter[t_ind[cutoff.size.ind]]
    cutoffs <- c(cutoffs, cutoff)
    n_t <- sum(treated_pscore_iter >= cutoff)
    n_c <- sum(pscore_iter >= cutoff) - n_t
    t.strata <- c(t.strata, n_t)
    c.strata <- c(c.strata, n_c)
    size.strata <- c(size.strata, n_t * n_c)
    num_strata_formed <- num_strata_formed + 1
    print(num_strata_formed)
    # Keep only the units below the cutoff for the next pass.
    keep <- pscore_iter < cutoff
    indicator_iter <- indicator_iter[keep]
    pscore_iter <- pscore_iter[keep]
  }
  # Results are reversed so that subclasses run from low to high scores.
  out <- list(num_pstrata = num_strata_formed,
              cutoffs = rev(cutoffs),
              t.pstrata = rev(t.strata),
              c.pstrata = rev(c.strata),
              prodsize.pstrata = rev(size.strata))
  if (sum(indicator_iter) == 0) {
    print("Stratification Finished: Treated Units Used Up.")
    return(c(list(flag = 1), out))
  }
  # Otherwise the loop ended because no control units remain.
  print("Stratification Finished: Control Units Used Up. Cannot Form New Strata.")
  return(c(list(flag = 2), out))
}
As described in the main text, the function opt_pstrat is applied when each stratum goes through Step 3. The outputs of this function provide useful information on whether to further split or match within the subclasses.
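To make the counting step inside opt_pstrat concrete, the sketch below works through one pass on a toy data set. The data and the deliberately tiny sizemax of 4 are hypothetical; the rank-based counts follow the function's code, and the brute-force counts t_check and c_check are added here only as a sanity check.

```r
# Hypothetical toy data: eight units, two of them treated.
pscore    <- c(0.10, 0.25, 0.30, 0.45, 0.50, 0.60, 0.75, 0.90)
indicator <- c(0,    0,    1,    0,    0,    1,    0,    0)
sizemax   <- 4  # tiny on purpose, so that a split is needed

t_ind     <- which(indicator == 1)
n         <- length(indicator)
n_treated <- length(t_ind)

# Rank-based counts of treated and control units at or above each treated score.
t_geq_t <- n_treated + 1 - rank(pscore[t_ind], ties.method = "min")
c_geq_t <- n + 1 - rank(pscore, ties.method = "min")[t_ind] - t_geq_t

# Brute-force versions of the same counts, for comparison.
t_check <- sapply(t_ind, function(i) sum(pscore >= pscore[i] & indicator == 1))
c_check <- sapply(t_ind, function(i) sum(pscore >= pscore[i] & indicator == 0))

# Candidate cuts: enough controls, and a distance matrix within the tolerance.
size_dist <- t_geq_t * c_geq_t
feasible  <- c_geq_t >= t_geq_t & size_dist <= sizemax

# The chosen cutoff is the treated score giving the largest feasible subclass.
best   <- which(feasible & size_dist == max(size_dist[feasible]))[1]
cutoff <- pscore[t_ind[best]]
```

Here cutting at the lower treated score would give a 2 by 4 distance matrix, which exceeds the toy tolerance, so the first subclass is cut at the higher treated score and contains one treated unit and two controls.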
Copyright information
© 2016 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Zhang, K., Chen, DG. (2016). Overcoming the Computing Barriers in Statistical Causal Inference. In: He, H., Wu, P., Chen, DG. (eds) Statistical Causal Inferences and Their Applications in Public Health Research. ICSA Book Series in Statistics. Springer, Cham. https://doi.org/10.1007/978-3-319-41259-7_7
DOI: https://doi.org/10.1007/978-3-319-41259-7_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-41257-3
Online ISBN: 978-3-319-41259-7
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)