Last week, I came across this paper by Fiona Burlig, Louis Preonas, and Matt Woerman recently published in the Journal of Development Economics. It is a paper that seems broadly applicable, so I’ll highlight the key points.
(Note: Fiona has a really nice blog post, from 2017, that discusses all of this in much more detail than I will here.)
In graduate school econometrics, we all learn about performing ex-ante power calculations. These calculations allow researchers to design experiments that can detect an effect if one actually exists. In the most basic case, they assume one wave of data with a treatment group and a control group. In most applied research, however, our experiments are a little more complicated, and this can undermine the accuracy of ex-ante power calculations.
In particular, McKenzie (2012) shows that when we are interested in outcomes with low autocorrelation (i.e., noisy outcomes) over time, taking multiple measurements of these outcomes allows the researcher to average out this noise, thereby increasing statistical power. Moreover, for outcomes with low autocorrelation, spreading a given sample across multiple survey waves yields more statistical power than the standard design with a single baseline and a single follow-up. In a world with binding budget constraints, McKenzie (2012) is a super-informative paper. (I still remember reading every word of it several years ago.) When we are interested in noisy outcomes, such as business profits or household expenditures, it may be cost-effective to collect additional waves of data on a single unit rather than include additional units in the study. Additionally, McKenzie provides formulas useful for designing experiments with “more T” and performing ex-ante power calculations.
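To make the "averaging out noise" intuition concrete, here is a minimal sketch in Python. It is not taken from McKenzie's paper or from any package. It assumes a post-treatment-only comparison of means, equal-sized arms, independent units, an outcome with standard deviation sigma, and a constant correlation rho between any two waves for the same unit. Under those assumptions the variance of the estimated effect with r follow-up waves is (2 * sigma^2 / n) * (1 + (r - 1) * rho) / r. The function names and the example numbers are my own, and the minimum detectable effect (MDE) uses a normal approximation.

```python
from math import sqrt
from statistics import NormalDist


def post_only_variance(n_per_arm, r, sigma, rho):
    """Variance of a difference in means of r-wave post-treatment averages.

    Assumes equal arm sizes and a constant within-unit correlation rho
    between any two waves (an illustrative assumption, not a general result).
    """
    return (2 * sigma**2 / n_per_arm) * (1 + (r - 1) * rho) / r


def mde(variance, alpha=0.05, power=0.80):
    """Minimum detectable effect for a two-sided test, normal approximation."""
    z = NormalDist().inv_cdf
    return (z(1 - alpha / 2) + z(power)) * sqrt(variance)


# Hypothetical design: 200 units per arm, outcome SD normalized to 1.
for rho in (0.2, 0.9):
    one_wave = mde(post_only_variance(200, 1, 1.0, rho))
    four_waves = mde(post_only_variance(200, 4, 1.0, rho))
    shrink = 100 * (1 - four_waves / one_wave)
    print(f"rho={rho}: MDE {one_wave:.3f} -> {four_waves:.3f} ({shrink:.0f}% smaller)")
```

With a noisy outcome (low rho), going from one to four follow-up waves shrinks the MDE substantially; with a highly persistent outcome (high rho), the extra waves buy very little.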
Burlig et al. (2020), however, highlight an embedded assumption in the power calculation formulas in McKenzie (2012): these formulas assume constant autocorrelation. The problem, as documented by Bertrand, Duflo, and Mullainathan (2004), is that within-unit autocorrelation in panel data is often heterogeneous (i.e., not constant). Moreover, failing to account for this detail (e.g., by clustering estimates of standard errors) can lead to biased standard errors in ex-post panel data estimates. The central point of Burlig et al. (2020) is that this is true not only in ex-post estimation but also in ex-ante power calculations.
So, how big of a deal is this?
The figure above answers this question and also shows what can be done to address the problem. The first column shows the consequences of failing to account for heterogeneous autocorrelation and not clustering standard errors properly. The middle column shows the consequences of failing to account for heterogeneous autocorrelation in the power calculation even when standard errors are clustered properly. Finally, the last column shows the benefits of using the “serial-correlation-robust” power calculation formula when clustering standard errors properly.
The results are quite stark. Even when clustering standard errors (as suggested by Bertrand et al. 2004), even moderate levels of autocorrelation can lead to experiments being either over- or underpowered. This is problematic because experiments are costly, and we should try our best to design experiments that are neither overpowered (and thus wasteful) nor underpowered (and thus uninformative).
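A rough way to see the mechanism is to compute the variance of a simple difference-in-differences estimator directly from a within-unit covariance matrix, then compare a non-constant correlation structure with the constant-correlation shortcut. The sketch below is my own illustration, not the formula from Burlig et al. (2020) and not what pcpanel does internally. It assumes equal-sized arms, independent units, m pre-treatment and r post-treatment waves, and, purely as an example of non-constant autocorrelation, AR(1) errors with a hypothetical parameter gamma. All function names are made up for this example.

```python
def ar1_cov(T, sigma, gamma):
    """Within-unit covariance under AR(1): correlation decays with the lag."""
    return [[sigma**2 * gamma ** abs(s - t) for t in range(T)] for s in range(T)]


def equicorr_cov(T, sigma, rho):
    """Within-unit covariance with the same correlation rho at every lag."""
    return [[sigma**2 * (1.0 if s == t else rho) for t in range(T)] for s in range(T)]


def mean_offdiag_corr(cov):
    """Average correlation over all distinct pairs of waves."""
    T = len(cov)
    total = sum(cov[s][t] / (cov[s][s] * cov[t][t]) ** 0.5
                for s in range(T) for t in range(T) if s != t)
    return total / (T * (T - 1))


def dd_variance(cov, m, r, n_per_arm):
    """Variance of (post mean - pre mean), differenced across two equal arms."""
    w = [-1.0 / m] * m + [1.0 / r] * r  # weights on the m pre and r post waves
    q = sum(w[s] * cov[s][t] * w[t] for s in range(m + r) for t in range(m + r))
    return 2 * q / n_per_arm


# Hypothetical design: 5 pre waves, 5 post waves, 200 units per arm.
m, r, n = 5, 5, 200
true_cov = ar1_cov(m + r, sigma=1.0, gamma=0.5)
rho_bar = mean_offdiag_corr(true_cov)           # what a single "rho" would summarize
shortcut_cov = equicorr_cov(m + r, sigma=1.0, rho=rho_bar)

var_true = dd_variance(true_cov, m, r, n)
var_shortcut = dd_variance(shortcut_cov, m, r, n)
print(f"variance under AR(1): {var_true:.5f}")
print(f"variance under constant-rho shortcut: {var_shortcut:.5f}")
```

In this particular configuration the constant-correlation shortcut understates the variance, so a design based on it would be underpowered. Other covariance structures can push the error in the opposite direction, which is consistent with the over- or underpowered pattern described above.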
The best news is the authors made a Stata package, pcpanel, which will help all of us perform ex-ante serial-correlation-robust power calculations when designing our experiments.