A recent trend in quantitative political science is the increasing attention to the *multiple testing* problem in hypothesis testing. As we examine more and more dependent variables or subgroups in a given study, the potential for falsely rejecting at least one null hypothtesis due to chance increases. Despite this being a widely recognized problem in statistical theory, relatively few studies implement corrections that adjust p-values in order to guard against false positives. My hunch is that scholars fail to adjust for multiple tests because the most easily implemented tests (such as the Bonferroni adjustment or the Holm adjustment) can reduce statistical power dramatically and more forgiving resampling-based tests proposed in the literature are not easy to implement.

Because I’ve needed to implement these types of adjustments in my own work recently, I’ve written a *R* package called ** multitestr** that implements one of the more popular resampling-based multiple testing adjustments proposed by Westfall and Young (1993). If certain assumptions are met

^{1}, this “free step down” bootstrap procedure controls the family wise error rate (FWER)

^{2}like the more famous Bonerroni or similar procedures, but will be more powerful because it does not assume that individual p-values are independent of each other.

The package has been developed with linear regression in mind, since that is the workhorse of social science statistics. Currently, the main function `boot_stepdown`

handles multiple outcomes and multiple subgroups. The function accepts weights and can adjust inference for “clustered” errors. One can choose to use the nonparametric “pairs” or parametric “wild” variants of the bootstrap.

It’s still in “beta”, so feedback is welcome. You can install from Github with the following commands:

```
install.packages("devtools")
devtools::install_github("fdhidalgo/multitestr")
```

Technical and implementation details, as well as more examples, can be found in the package Vignette.

Here is a quick demonstration of how it works. I replicate the famous study by Casey, Glennerster, and Miguel (2012) that helped popularize the use of pre-analysis plans in economics. The study estimates the effect of a “community-driven development” program on 12 composite outcomes in Sierra Leone.

The user must specify a list of `full_formulas`

and `null_formulas`

, where the `full_formulas`

are the full regression models being tested and `null_formulas`

are models under the null hypothesis. In this particular case, the `full_formulas`

are formulas with 12 different outcome variables (`z_score_1`

…`z_score_12`

), the treatment variable (`t`

), pre-treatment covariates (`tothhs`

and `road`

), and strata fixed effects (`ward`

). The `null_formulas`

are the corresponding models without a treatment variable, since the null hyptothesis is that the treatment had no effect.

```
library("multitestr")
# Replicate Casey et al 2012, Table II Column 3
set.seed(123)
# Create a list of formulas with different dependent variables
F <- lapply(sprintf("zscore_%d ~ 0 + t + tothhs + road + ward", 1:12),
as.formula)
#Run the Westfall and Young boostrap step down procedure
pvals <- boot_stepdown(full_formulas = F,
null_formulas = lapply(F, update, . ~ . - t),
data = gobifo,
coef_list = "t", #this parameter specifies the coefficient of interest
nboots = 1000,
parallel = FALSE,
boot_type = "pairs",
pb = FALSE)
dplyr::mutate_if(pvals, is.numeric, round, 3)
```

```
## Hypothesis Variable bs_pvalues_unadjusted bs_pvalues_adjusted
## 1 Hypothesis 1 t 0.001 0.001
## 2 Hypothesis 2 t 0.001 0.001
## 3 Hypothesis 3 t 0.001 0.001
## 4 Hypothesis 4 t 0.738 0.981
## 5 Hypothesis 5 t 0.938 0.981
## 6 Hypothesis 6 t 0.134 0.677
## 7 Hypothesis 7 t 0.359 0.915
## 8 Hypothesis 8 t 0.452 0.915
## 9 Hypothesis 9 t 0.312 0.910
## 10 Hypothesis 10 t 0.043 0.348
## 11 Hypothesis 11 t 0.839 0.981
## 12 Hypothesis 12 t 0.357 0.915
```

In the function output, `bs_pvalues_unadjusted`

are the boostrap p-values that do not adjust for multiple testing, while `bs_pvalues_adjusted`

have been adjusted using the stepdown method. As is evident, adjusted p-values can be larger than their corresponding unadjusted p-values.

Further documentation can be found here and please feel free to contact me regarding bugs or feature requests.

## References

Casey, Katherine, Rachel Glennerster, and Edward Miguel. 2012. “Reshaping Institutions: Evidence on Aid Impacts Using a Preanalysis Plan.” *The Quarterly Journal of Economics* 127 (4). MIT Press: 1755–1812.

Westfall, P, and S Young. 1993. “Resampling-Based Multiple Testing: Examples and Methods for P-Value Adjustment.” New York: Wiley.