Earlier this year I read Tim Ogden’s book, “Experimental Conversations: Perspectives on Randomized Trials in Development Economics”*. With interviews of the “randomistas” (e.g. Michael Kremer, Esther Duflo, Abhijit Banerjee, Dean Karlan), the “skeptics” (e.g. Angus Deaton), and folks not typically associated with RCTs (e.g. Tyler Cowen), it was an interesting book to read. One of the “dogs that didn’t bark” in the book was the statement that RCTs are the gold standard.
I’m not entirely sure how this idea started, but almost every popular press discussion of RCTs in development research states in one way or another that RCTs are the so-called [quote] gold standard [unquote]. I think this is an important concept to unpack because, like most things, it isn’t completely correct or incorrect. I’ll take each of these notions one at a time.
RCTs are the “gold standard”
When it comes to causal identification, totally random variation in program participation or some other factor is a standard that applied researchers always strive for and hope to replicate as closely as possible by using quasi-experimental methods. Almost every other identification strategy, in some way, approximates the same sort of random variation an RCT manufactures for researchers.
Take instrumental variables, for example: Want to understand the impact of charter schools on education outcomes, but worried that comparisons between those attending charter schools and those who don’t might be biased by unobserved differences in ability, motivation, and/or family circumstances? The solution, in part, is to take advantage of a setting that uses a random lottery to assign charter school attendance offers to wait-listed families. Outcomes from this lottery can be used as an instrument for charter school attendance, and comparing lottery winners with losers should yield an unbiased impact estimate. Indeed, this is exactly what Angrist et al. (JLE, 2016) do.
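The logic of the lottery instrument can be sketched with a small simulation (all numbers here are invented for illustration, not taken from the Angrist et al. study): attendance is confounded by unobserved ability, but a random offer shifts attendance without touching ability, so scaling the offer’s effect on scores by its effect on attendance (the Wald/IV estimator) recovers the causal effect that the naive comparison misses:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Unobserved confounder: ability raises both charter attendance
# and test scores, so a naive comparison is biased upward.
ability = rng.normal(size=n)

# Randomized lottery offer: independent of ability by design.
offer = rng.integers(0, 2, size=n)

# Attendance responds to the offer (the "first stage") and to ability.
p_attend = 0.1 + 0.6 * offer + 0.2 * (ability > 0)
attend = (rng.uniform(size=n) < p_attend).astype(float)

# Test scores: the true causal effect of attending is 2.0 (by construction).
score = 2.0 * attend + 1.5 * ability + rng.normal(size=n)

# Naive comparison of attenders vs. non-attenders: biased upward,
# because attenders have higher ability on average.
naive = score[attend == 1].mean() - score[attend == 0].mean()

# Wald/IV estimator: reduced form divided by first stage.
reduced_form = score[offer == 1].mean() - score[offer == 0].mean()
first_stage = attend[offer == 1].mean() - attend[offer == 0].mean()
wald = reduced_form / first_stage

print(f"naive: {naive:.2f}, IV: {wald:.2f}")  # IV lands close to 2.0
```

Because the offer is randomized, it is uncorrelated with ability, which is what licenses dividing the two differences in means.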
Another example is the regression discontinuity design. Say we want to understand the impact of a program that applies only to individuals over some age threshold. Since this threshold is arbitrary, individuals on either side of it are fairly similar and comparable. So, rather than compare outcomes of everyone above and below the threshold, we compare outcomes only of people “close to” the threshold. This is exactly what Duflo (WBER, 2003) implements to evaluate the impact of a cash transfer program in South Africa.
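The same idea fits in a few lines (a toy simulation with made-up numbers, not Duflo’s data): outcomes trend smoothly in age, the program adds a jump at the cutoff, and comparing narrow bands on either side isolates that jump, while comparing everyone above versus below does not:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

age = rng.uniform(40, 80, size=n)
cutoff = 60
eligible = age >= cutoff

# Outcomes trend smoothly in age; eligibility adds a jump of 3.0
# at the cutoff (the true program effect, by construction).
y = 0.1 * age + 3.0 * eligible + rng.normal(size=n)

# Naive comparison of everyone above vs. below the cutoff mixes
# the program effect with the smooth age trend.
naive = y[eligible].mean() - y[~eligible].mean()

# RD comparison: only observations within one year of the cutoff,
# where the age trend contributes almost nothing.
band = 1.0
just_below = y[(age >= cutoff - band) & (age < cutoff)]
just_above = y[(age >= cutoff) & (age < cutoff + band)]
rd = just_above.mean() - just_below.mean()

print(f"naive: {naive:.2f}, RD: {rd:.2f}")  # RD sits near the true jump of 3.0
```

In practice researchers fit local regressions on each side rather than raw means, but the comparison of narrow bands captures the core intuition.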
Okay, so in this sense, RCTs act as a standard that helps applied researchers think through alternative identification strategies when pure randomization is either infeasible or unethical.
RCTs are not the “gold standard”
In recent years, however, this “gold standard” argument has been taken to mean that we really cannot be sure of anything unless we have performed an RCT. This clearly isn’t true since (a) sometimes randomization is impossible or unethical, and, this being the case, (b) other methods (such as difference-in-differences, regression discontinuity, instrumental variables, or even simple differences) can perform just as well, and maybe better, in specific circumstances. As an extreme example, consider how we learn about the life-saving impact of parachutes on sky-divers.
Most who think about impact evaluation will be fully aware of the perils of using simple pre-post observational study designs. Selection into who participates in something may bias the measured outcome and, therefore, the observed correlation may not truly be causal. But in the case of understanding the impact of parachutes on the mortality of sky-divers, a simple pre-post observational study performs just fine. When we observe a sky-diver who jumps out of a plane with a parachute and notice that they survived the landing, we can conclude that the parachute “worked”, that is, it caused the sky-diver to survive the stunt. We don’t need to randomly assign parachutes to sky-divers. This is a good thing, because randomly allocating parachutes to sky-divers counts as a pretty terrible idea.
The point of this (rather extreme) example is that some identification strategies are better suited to some situations than others. Even simple pre-post differences can provide helpful evidence to inform life-saving policies and products. So, I’m not trying to argue that calling RCTs the gold standard is totally wrong, but I’m also not so sure it is entirely correct either. In his interview in “Experimental Conversations”, Dean Yang suggests that RCTs are just one tool in the toolbox of the applied researcher. Pick your identification strategies wisely and think carefully before you randomize.
*In an act of shameless self-promotion, I (along with Marc F. Bellemare) wrote a book review of Tim’s book and published it in the American Journal of Agricultural Economics.