What is new?
- Controlling the false positive rate to address multiplicity of tests in health studies can result in logical inconsistencies and opportunities for abuse.
- Errors in hypothesis test conclusions depend on the frequency of the truth of null hypotheses being tested.
- False discovery rate control procedures do not suffer from the philosophical challenges evident with Bonferroni-type procedures.
- Health researchers may benefit from relying on false discovery rate control in studies with multiple tests.
We are now in an age of scientific inquiry where health and medical studies routinely collect large amounts of data. These studies typically involve the researcher attempting to draw many inferential conclusions through numerous hypothesis tests. Researchers are typically advised to perform some type of significance-level adjustment to account for the increased probability of reporting false positive results across multiple tests. Such adjustments are designed to control study-wide error rates and lower the probability of falsely rejecting true null hypotheses. The most commonly understood downside of these procedures is the loss of power to detect real effects. Arguments have been put forth over the years about whether adjustments for controlling study-wide error rates should be made, with plenty of advocates on each side. It appears doubtful that researchers will coalesce behind a unified point of view any time soon.
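As a minimal sketch of the kind of significance-level adjustment described above, the Bonferroni procedure compares each p-value against a threshold of the overall significance level divided by the number of tests (the p-values below are hypothetical, chosen purely for illustration):

```python
# Sketch of the Bonferroni adjustment for m tests at study-wide level alpha.
# The p-values are hypothetical, not data from the article.

def bonferroni_reject(p_values, alpha=0.05):
    """Return booleans (same order as input): True where the null is rejected."""
    m = len(p_values)
    threshold = alpha / m  # each test is judged at the stricter level alpha / m
    return [p <= threshold for p in p_values]

p_values = [0.001, 0.02, 0.04, 0.30]
print(bonferroni_reject(p_values))  # only 0.001 survives alpha/4 = 0.0125
```

The stricter per-test threshold is what controls the study-wide false positive rate, and also what produces the loss of power noted above: the 0.02 and 0.04 results would each be significant at the conventional 0.05 level but fail the adjusted threshold.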
Significance-level adjustments that control study-wide error rates are still common in peer-reviewed health studies. An examination of recent issues of several highly cited medical and health journals (Journal of the American Medical Association, New England Journal of Medicine, Annals of Internal Medicine, and Medical Care) reveals abundant use of multiple-test adjustments that control study-wide error rates: We found 191 articles published in 2012 to 2013 making some adjustment for multiple testing, with 102 (53.4%) performing the Bonferroni correction or another study-wide error adjustment. Other studies reported explicitly, and almost apologetically, that they had not performed an adjustment, and some even reported consequences of not having adjusted for multiple tests.
Despite the continued popularity of multiple-test adjustments in health studies that control false positive error rates, we argue that controlling the false discovery rate [1] is an attractive alternative. The false discovery rate is the expected fraction of tests declared statistically significant in which the null hypothesis is actually true. The false discovery rate can be contrasted with the false positive rate, which is the expected fraction of tests with true null hypotheses that are mistakenly declared statistically significant. In other words, the false positive rate is the probability of rejecting a null hypothesis given that it is true, while the false discovery rate is the probability that a null hypothesis is true given that the null hypothesis has been rejected.
Table 1 illustrates the distinction between the false positive and false discovery rates. Suppose that a set of tests can be cross-classified into a 2 × 2 table according to truth of the hypotheses (whether the null hypothesis is true or not), and the decision made based on the data (whether to reject the null hypothesis or not). Let a be the fraction of tests with true null hypotheses that are not rejected, b be the fraction of tests with true null hypotheses that are mistakenly rejected, c be the fraction of tests with false null hypotheses that are mistakenly not rejected, and d be the fraction of tests with false null hypotheses that are rejected. Assuming these fractions can be viewed as long-run rates, the false positive rate is computed as b/(a + b), whereas the false discovery rate is computed as b/(b + d). Although the numerators of these fractions are the same, the denominator of the false positive rate is the rate of encountering true null hypotheses, and the denominator of the false discovery rate is the overall rate of rejecting null hypotheses.
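The Table 1 computation can be sketched numerically. The fractions a, b, c, and d below are hypothetical long-run rates invented for illustration; they show how the two error rates can differ sharply even though they share a numerator:

```python
# Sketch of the Table 1 computation with hypothetical long-run fractions
# (a, b, c, d are illustrative numbers, not data from the article).
a = 0.85  # true null, not rejected
b = 0.05  # true null, mistakenly rejected (false positive)
c = 0.02  # false null, mistakenly not rejected
d = 0.08  # false null, rejected (true discovery)

false_positive_rate = b / (a + b)   # rejections among tests with true nulls
false_discovery_rate = b / (b + d)  # true nulls among all rejections

print(round(false_positive_rate, 3))   # 0.056
print(round(false_discovery_rate, 3))  # 0.385
```

Here the false positive rate looks reassuringly small, yet nearly 4 in 10 reported discoveries are false, because true null hypotheses dominate the set of tests and therefore contribute most of the rejections.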
Conventional hypothesis testing, along with procedures to control study-wide error rates, is set up to limit false positive rates, but not false discovery rates. False discovery rate control has become increasingly standard practice in genomic studies and the analysis of microarray data, where an abundance of testing occurs. Several recent examples of false discovery rate control in health applications include provider profiling [2] and clinical adverse event rates [3], but false discovery rate control has yet to make serious inroads into more general health studies. Of the 191 articles we found in highly cited journals that mention adjustments for multiple tests, only 14 (7.3%) include false discovery rate adjustments.
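A common way to control the false discovery rate is the Benjamini-Hochberg step-up procedure [1]: order the p-values, find the largest rank k such that the kth smallest p-value is at most (k/m)q, and reject the k smallest. The following is a minimal sketch with hypothetical p-values:

```python
# Sketch of the Benjamini-Hochberg step-up procedure for controlling the
# false discovery rate at level q. The p-values below are hypothetical.

def benjamini_hochberg(p_values, q=0.05):
    """Return booleans (same order as input): True where the null is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= (k / m) * q.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            k_max = rank
    # Reject the k_max smallest p-values.
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            reject[i] = True
    return reject

p_values = [0.001, 0.015, 0.02, 0.04, 0.30]
print(benjamini_hochberg(p_values, q=0.05))  # [True, True, True, True, False]
```

With these five p-values, a Bonferroni adjustment at level 0.05 would reject only the test with p = 0.001 (threshold 0.05/5 = 0.01), whereas the Benjamini-Hochberg procedure rejects four tests, illustrating the power advantage that comes from controlling the false discovery rate rather than the study-wide false positive rate.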
This article is intended to remind readers of the fundamental challenges of multiple-test adjustment procedures that control study-wide error rates and to explain why false discovery rate control may be an appealing alternative for drawing statistical inferences in health studies. In doing so, we distinguish between tests that are exploratory and those that are hypothesis-driven. The explanations we present to discourage use of adjustments based on study-wide error rate control are not new; the case has been made strongly over the past 10 to 20 years [4], [5], [6], [7], [8], [9]. Arguments in favor of using false discovery rate control have been made based on power considerations [10], [11], [12], [13], but we are unaware of explanations based on the distinction between exploratory and hypothesis-driven testing.