A recent trend in quantitative political science is the increasing attention to the multiple testing problem in hypothesis testing. As we examine more and more dependent variables or subgroups in a given study, the potential for falsely rejecting at least one null hypothtesis due to chance increases. Despite this being a widely recognized problem in statistical theory, relatively few studies implement corrections that adjust p-values in order to guard against false positives. My hunch is that scholars fail to adjust for multiple tests because the most easily implemented tests (such as the Bonferroni adjustment or the Holm adjustment) can reduce statistical power dramatically and more forgiving resampling-based tests proposed in the literature are not easy to implement.
Because I’ve needed to implement these types of adjustments in my own work recently, I’ve written a R package called
multitestr that implements one of the more popular resampling-based multiple testing adjustments proposed by Westfall and Young (1993). If certain assumptions are met1, this “free step down” bootstrap procedure controls the family wise error rate (FWER)2 like the more famous Bonerroni or similar procedures, but will be more powerful because it does not assume that individual p-values are independent of each other.
The package has been developed with linear regression in mind, since that is the workhorse of social science statistics. Currently, the main function
boot_stepdown handles multiple outcomes and multiple subgroups. The function accepts weights and can adjust inference for “clustered” errors. One can choose to use the nonparametric “pairs” or parametric “wild” variants of the bootstrap.
It’s still in “beta”, so feedback is welcome. You can install from Github with the following commands:
Technical and implementation details, as well as more examples, can be found in the package Vignette.
Here is a quick demonstration of how it works. I replicate the famous study by Casey, Glennerster, and Miguel (2012) that helped popularize the use of pre-analysis plans in economics. The study estimates the effect of a “community-driven development” program on 12 composite outcomes in Sierra Leone.
The user must specify a list of
null_formulas, where the
full_formulas are the full regression models being tested and
null_formulas are models under the null hypothesis. In this particular case, the
full_formulas are formulas with 12 different outcome variables (
z_score_12), the treatment variable (
t), pre-treatment covariates (
road), and strata fixed effects (
null_formulas are the corresponding models without a treatment variable, since the null hyptothesis is that the treatment had no effect.
library("multitestr") # Replicate Casey et al 2012, Table II Column 3 set.seed(123) # Create a list of formulas with different dependent variables F <- lapply(sprintf("zscore_%d ~ 0 + t + tothhs + road + ward", 1:12), as.formula) #Run the Westfall and Young boostrap step down procedure pvals <- boot_stepdown(full_formulas = F, null_formulas = lapply(F, update, . ~ . - t), data = gobifo, coef_list = "t", #this parameter specifies the coefficient of interest nboots = 1000, parallel = FALSE, boot_type = "pairs", pb = FALSE) dplyr::mutate_if(pvals, is.numeric, round, 3)
## Hypothesis Variable bs_pvalues_unadjusted bs_pvalues_adjusted ## 1 Hypothesis 1 t 0.001 0.001 ## 2 Hypothesis 2 t 0.001 0.001 ## 3 Hypothesis 3 t 0.001 0.001 ## 4 Hypothesis 4 t 0.738 0.981 ## 5 Hypothesis 5 t 0.938 0.981 ## 6 Hypothesis 6 t 0.134 0.677 ## 7 Hypothesis 7 t 0.359 0.915 ## 8 Hypothesis 8 t 0.452 0.915 ## 9 Hypothesis 9 t 0.312 0.910 ## 10 Hypothesis 10 t 0.043 0.348 ## 11 Hypothesis 11 t 0.839 0.981 ## 12 Hypothesis 12 t 0.357 0.915
In the function output,
bs_pvalues_unadjusted are the boostrap p-values that do not adjust for multiple testing, while
bs_pvalues_adjusted have been adjusted using the stepdown method. As is evident, adjusted p-values can be larger than their corresponding unadjusted p-values.
Further documentation can be found here and please feel free to contact me regarding bugs or feature requests.
Casey, Katherine, Rachel Glennerster, and Edward Miguel. 2012. “Reshaping Institutions: Evidence on Aid Impacts Using a Preanalysis Plan.” The Quarterly Journal of Economics 127 (4). MIT Press: 1755–1812.
Westfall, P, and S Young. 1993. “Resampling-Based Multiple Testing: Examples and Methods for P-Value Adjustment.” New York: Wiley.