ABSTRACT: Nonexperimental or “quasi-experimental” evaluation methods, in which researchers use treatment and comparison groups without randomly assigning subjects to the groups, are often proposed as substitutes for randomized trials. Yet, nonexperimental (NX) methods rely on untestable assumptions. To assess these methods in the context of welfare, job training, and employment services programs, we synthesized the results of 12 design replication studies, case studies that try to replicate experimental impact estimates using NX methods. We interpret the difference between experimental and NX estimates of the impacts on participants’ annual earnings as an estimate of bias in the NX estimator. We found that NX methods sometimes came close to replicating experiments, but were often substantially off, in some cases by several thousand dollars. The wide variation in bias estimates has three sources. It reflects variation in the bias of NX methods as well as sampling variability in both the experimental and NX estimators. We identified several factors associated with smaller bias; for example, comparison groups being drawn from the same labor market as the treatment population and pre-program earnings being used to adjust for individual differences. We found that matching methods, such as those using propensity scores, were not uniformly better than more traditional regression modeling. We found that specification tests were successful at eliminating some of the worst performing NX impact estimates. These findings suggest ways to improve a given NX research design, but do not provide strong assurance that such research designs would reliably replicate any particular well-run experiment. If a single NX estimator cannot reliably replicate an experimental one, perhaps several estimators pertaining to different study sites, time periods, or methods might do so on average. We therefore examined the extent to which positive and negative bias estimates cancel out. 
We found that this did happen for the training and welfare programs we examined, but only when we looked across a wide range of studies, sites, and interventions. When we looked at individual interventions, the bias estimates did not always cancel out. We failed to identify an aggregation strategy that consistently removed bias while answering a focused question about earnings impacts of a program. The lessons of this exercise suggest that the empirical evidence from the design replication literature can be used, in the context of training and welfare programs, to improve NX research designs, but on its own cannot justify their use. More design replication would be necessary to determine whether aggregation of NX evidence is a valid approach to research synthesis.
Authors: Steven Glazerman, Dan Levy, David Myers
Publication Date: 2016
Rights: Draft submitted to the Annals of the Academy of Political and Social Sciences: Not for citation or quotation