Review Article
False discovery rate control is a recommended alternative to Bonferroni-type adjustments in health studies

https://doi.org/10.1016/j.jclinepi.2014.03.012

Abstract

Objectives

Procedures for controlling the false positive rate when performing many hypothesis tests are commonplace in health and medical studies. Such procedures, most notably the Bonferroni adjustment, suffer from the problem that error rate control cannot be localized to individual tests and that they do not distinguish between exploratory or data-driven testing and hypothesis-driven testing. Procedures that instead limit the false discovery rate may be a more appealing way to control error rates across multiple tests.

Study Design and Setting

Controlling the false positive rate can lead to philosophical inconsistencies that can negatively impact the practice of reporting statistically significant findings. We demonstrate that the false discovery rate approach can overcome these inconsistencies and illustrate its benefit through an application to two recent health studies.

Results

The false discovery rate approach is more powerful than methods such as the Bonferroni procedure that control false positive rates. In a study that arguably consisted of scientifically driven hypotheses, controlling the false discovery rate identified nearly as many significant results as performing no adjustment at all, whereas the Bonferroni procedure found no significant results.

Conclusion

Although still unfamiliar to many health researchers, the use of false discovery rate control in the context of multiple testing can provide a solid basis for drawing conclusions about statistical significance.

Introduction

What is new?

  • Controlling the false positive rate to address multiplicity of tests in health studies can result in logical inconsistencies and opportunities for abuse.

  • Error rates in hypothesis test conclusions depend on how frequently the null hypotheses being tested are true.

  • False discovery rate control procedures do not suffer from the philosophical challenges evident with Bonferroni-type procedures.

  • Health researchers may benefit from relying on false discovery rate control in studies with multiple tests.

We are now in an age of scientific inquiry in which health and medical studies routinely collect large amounts of data. These studies typically involve the researcher attempting to draw many inferential conclusions through numerous hypothesis tests. Researchers are usually advised to perform some type of significance-level adjustment to account for the increased probability of reporting false positive results across multiple tests. Such adjustments are designed to control study-wide error rates and lower the probability of falsely rejecting true null hypotheses. The most commonly understood downside of these procedures is the loss of power to detect real effects. Arguments have been made over the years about whether adjustments for controlling study-wide error rates should be performed at all, with plenty of advocates on each side. It appears doubtful that researchers will coalesce behind a unified point of view any time soon.

Significance level adjustments that control study-wide error rates are still common in peer-reviewed health studies. An examination of recent issues of several highly cited medical and health journals (Journal of the American Medical Association, New England Journal of Medicine, Annals of Internal Medicine, and Medical Care) reveals an abundant use of multiple-test adjustments that control study-wide error rates: We found 191 articles published in 2012 to 2013 making some adjustment for multiple testing, with 102 (53.4%) performing the Bonferroni or another study-wide error adjustment. Some other studies reported explicitly, and almost apologetically, that they had not performed an adjustment, and some even reported consequences of not having adjusted for multiple tests.

Despite the continued popularity of multiple test adjustments in health studies that control false positive error rates, we argue that controlling the false discovery rate [1] is an attractive alternative. The false discovery rate is the expected fraction of tests declared statistically significant in which the null hypothesis is actually true. The false discovery rate can be contrasted with the false positive rate, which is the expected fraction of tests with true null hypotheses that are mistakenly declared statistically significant. In other words, the false positive rate is the probability of rejecting a null hypothesis given that it is true, while the false discovery rate is the probability that a null hypothesis is true given that the null hypothesis has been rejected.

Table 1 illustrates the distinction between the false positive and false discovery rates. Suppose that a set of tests can be cross-classified into a 2 × 2 table according to truth of the hypotheses (whether the null hypothesis is true or not), and the decision made based on the data (whether to reject the null hypothesis or not). Let a be the fraction of tests with true null hypotheses that are not rejected, b be the fraction of tests with true null hypotheses that are mistakenly rejected, c be the fraction of tests with false null hypotheses that are mistakenly not rejected, and d be the fraction of tests with false null hypotheses that are rejected. Assuming these fractions can be viewed as long-run rates, the false positive rate is computed as b/(a + b), whereas the false discovery rate is computed as b/(b + d). Although the numerators of these fractions are the same, the denominator of the false positive rate is the rate of encountering true null hypotheses, and the denominator of the false discovery rate is the overall rate of rejecting null hypotheses.
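To make the distinction concrete, the following Python sketch computes both rates from a 2 × 2 cross-classification. The fractions a, b, c, and d below are invented for illustration; they are not values from Table 1.

```python
# Sketch: false positive rate vs. false discovery rate from a 2 x 2
# cross-classification of many tests. The fractions are hypothetical
# illustrations, not values from Table 1 of the article.
a = 0.855  # true null, not rejected (correct non-rejection)
b = 0.045  # true null, rejected (false positive)
c = 0.020  # false null, not rejected (missed effect)
d = 0.080  # false null, rejected (true discovery)

false_positive_rate = b / (a + b)   # P(reject | null is true)   -> 0.05
false_discovery_rate = b / (b + d)  # P(null is true | rejected) -> 0.36

print(f"False positive rate:  {false_positive_rate:.3f}")
print(f"False discovery rate: {false_discovery_rate:.3f}")
```

Although each individual true null hypothesis is rejected only 5% of the time in this illustration, more than a third of the rejected hypotheses are true nulls, because true nulls account for 90% of the tests.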

Conventional hypothesis testing, along with procedures to control study-wide error rates, is set up to limit false positive rates, but not false discovery rates. False discovery rate control has become increasingly standard practice in genomic studies and the analysis of microarray data, where an abundance of testing occurs. Recent examples of false discovery rate control in health applications include provider profiling [2] and clinical adverse event rates [3], but false discovery rate control has yet to make serious inroads into more general health studies. Of the 191 articles we found in highly cited journals that mention adjustments for multiple tests, only 14 (7.3%) include false discovery rate adjustments.

This article is intended to remind readers of the fundamental challenges of multiple-test adjustment procedures that control study-wide error rates and explain why false discovery rate control may be an appealing alternative for drawing statistical inferences in health studies. In doing so, we distinguish between tests that are exploratory and those that are hypothesis driven. The explanations we present to discourage use of adjustments based on study-wide error rate control are not new—the case has been made strongly over the past 10 to 20 years [4], [5], [6], [7], [8], [9]. Arguments in favor of using false discovery rate control have been made based on power considerations [10], [11], [12], [13], but we are unaware of explanations based on the distinction between exploratory and hypothesis-driven testing.

Adjustment for multiple testing through false positive rate control

The usual argument to convince researchers that adjustments are necessary when multiple tests are performed is to point out that, without adjustments, the probability of at least one null hypothesis being rejected is larger than acceptable levels. Suppose, for example, that a researcher performs 100 tests at the α = 0.05 significance level in which the null hypothesis is true in every case. If all the tests are independent, then the probability that at least one test would be incorrectly declared statistically significant is 1 − (1 − 0.05)^100 ≈ 0.994.
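This calculation is simple to reproduce; the sketch below assumes independent tests, as in the example above.

```python
# Sketch: study-wide false positive probability for m independent tests
# of true null hypotheses, each performed at significance level alpha.
alpha, m = 0.05, 100

# Probability that at least one of the m tests is falsely declared
# statistically significant.
p_any_false_positive = 1 - (1 - alpha) ** m
print(f"P(at least 1 false positive): {p_any_false_positive:.3f}")  # 0.994

# The Bonferroni adjustment instead tests each hypothesis at alpha / m,
# which caps the study-wide false positive probability near alpha.
p_any_bonferroni = 1 - (1 - alpha / m) ** m
print(f"Same probability under Bonferroni: {p_any_bonferroni:.3f}")  # 0.049
```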

Toward an alternative criterion for assessing significance

To appreciate the basis for the difficulties associated with Bonferroni-type adjustments, we first remind the reader of the process by which null hypotheses are rejected. With conventional hypothesis testing:

  1. A significance level (eg, α = 0.05) is asserted.

  2. A P-value is computed from the data.

  3. If the P-value is less than the significance level, the result is declared statistically significant, and the null hypothesis is rejected.

  4. The researcher then concludes that the null hypothesis is false. (These steps are sketched in code below.)
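A minimal illustration of these four steps, using a two-sample t-test on simulated data (the group means, sample sizes, and seed are arbitrary choices, not an analysis from this article):

```python
# Sketch of the conventional testing procedure enumerated above,
# applied to simulated data from two groups.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05                                 # step 1: assert a significance level
group_a = rng.normal(0.0, 1.0, 50)           # simulated control group
group_b = rng.normal(0.5, 1.0, 50)           # simulated group with a real effect

_, p_value = stats.ttest_ind(group_a, group_b)  # step 2: compute a P-value

if p_value < alpha:                          # steps 3-4: declare significance
    print(f"p = {p_value:.4f}: reject the null hypothesis")
else:
    print(f"p = {p_value:.4f}: fail to reject the null hypothesis")
```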

Inferring the probability of a true null hypothesis

When performing a single hypothesis test, it is nearly impossible to infer anything meaningful about the probability that the null hypothesis is true. However, when performing multiple tests in a study, the distribution of P-values provides information relevant to inferring the frequency of true null hypotheses. This is because the distribution of P-values is a mixture of two components: the distribution of P-values for true null hypotheses, which by construction is uniformly distributed between 0 and 1, and the distribution of P-values for false null hypotheses, which is concentrated near 0.
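This mixture is easy to see in simulation. In the sketch below, P-values come from two-sided z-tests; the 80/20 split between true and false null hypotheses and the three-standard-deviation effect size are arbitrary illustrative choices.

```python
# Sketch: the distribution of P-values across many tests is a mixture of
# a uniform component (true nulls) and a component concentrated near 0
# (false nulls). Mixing proportion and effect size are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
m_true_null, m_false_null = 800, 200   # hypothetical: 80% of nulls true

# True nulls: z-statistics ~ N(0, 1), so their P-values are uniform.
z_null = rng.normal(0.0, 1.0, m_true_null)
# False nulls: z-statistics shifted by a real effect (here 3 SDs).
z_alt = rng.normal(3.0, 1.0, m_false_null)

p_values = 2 * stats.norm.sf(np.abs(np.concatenate([z_null, z_alt])))

# Tabulate: small P-values are heavily enriched by the false nulls.
for lo, hi in [(0.0, 0.05), (0.05, 0.5), (0.5, 1.0)]:
    frac = np.mean((p_values >= lo) & (p_values < hi))
    print(f"Fraction of P-values in [{lo}, {hi}): {frac:.2f}")
```

In runs of this simulation, the interval [0, 0.05) contains far more than 5% of the P-values, and nearly all of the excess comes from the false null hypotheses.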

False discovery rate control

One way to implement such a process is by controlling the false discovery rate [1]. Many health researchers are unaware of the false discovery rate, although it is a natural concept, and one that has important utility for calibrating error rates in hypothesis tests. Among tests that are declared significant in a study, the false discovery rate is the expected fraction of those tests in which the null hypothesis is true. The main goal of false discovery rate control is to set significance levels so that the expected fraction of true null hypotheses among the tests declared significant stays below a specified level.
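Reference [1] is the Benjamini and Hochberg step-up procedure, which orders the P-values and compares each to a threshold that rises with rank. The sketch below implements that rule; the P-values are invented for illustration, and in practice a vetted implementation such as multipletests(..., method='fdr_bh') from the statsmodels package may be preferable.

```python
# Sketch of the Benjamini-Hochberg step-up procedure [1] for controlling
# the false discovery rate at level q. Input P-values are illustrative.
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Return a boolean array marking which hypotheses to reject."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)                  # indices that sort p ascending
    ranked = p[order]
    # Find ranks k (1..m) where the ordered P-value p_(k) <= (k / m) * q.
    below = ranked <= (np.arange(1, m + 1) / m) * q
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])   # largest qualifying rank (0-based)
        reject[order[: k + 1]] = True      # reject all hypotheses up to rank k
    return reject

p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.65, 0.9]
print(benjamini_hochberg(p_vals, q=0.05))
```

With these P-values, the procedure rejects the two smallest, whereas a Bonferroni threshold of 0.05/10 = 0.005 would reject only the smallest.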

Conclusion

Despite the commonplace use of Bonferroni-type significance level adjustments to address the increased probability of mistakenly rejecting true null hypotheses, we argue that such adjustments are difficult to justify on philosophical grounds. Furthermore, if researchers are concerned about being unable to limit the probability of mistaken conclusions among statistically significant results, then Bonferroni-type adjustments based on the multiplicity of tests do not directly address this concern.

Acknowledgments

This article is the result of work supported with resources and the use of facilities at the Bedford VA Medical Center, Bedford, MA, USA.

References

  • K. Schulz et al. Multiplicity in randomized trials I: endpoints and treatments. Lancet (2005).

  • R. Bender et al. Adjusting for multiple testing—when and how? J Clin Epidemiol (2001).

  • D. Allison et al. A mixture model approach for the analysis of microarray gene expression data. Comput Stat Data Anal (2002).

  • Y. Benjamini et al. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B (1995).

  • H.E. Jones et al. Use of the false discovery rate when comparing multiple health care providers. J Clin Epidemiol (2008).

  • D.V. Mehrotra et al. Use of the false discovery rate for evaluating clinical safety data. Stat Methods Med Res (2004).

  • A. Gelman et al. Why we (usually) don't have to worry about multiple comparisons. J Res Educ Eff (2012).

  • D. O'Keefe. Should familywise alpha be adjusted? Hum Commun Res (2003).

  • T. Perneger. What's wrong with Bonferroni adjustments? BMJ (1998).

  • K. Rothman. Adjustments are needed for multiple comparisons. Epidemiology (1990).

  • D. Savitz et al. Multiple comparisons and related issues in the interpretation of epidemiologic data. Am J Epidemiol (1995).

  • S. Nakagawa. A farewell to Bonferroni: the problems of low statistical power and publication bias. Behav Ecol (2004).

  • M. Aickin et al. Adjusting for multiple testing when reporting research results: the Bonferroni vs Holm methods. Am J Public Health (1996).

  • W. Noble. How does multiple testing correction work? Nat Biotechnol (2009).

  • K. Verhoeven et al. Implementing false discovery rate control: increasing your power. Oikos (2005).

  • N. Lazar. The big picture: multiplicity control in large data sets presents new challenges and opportunities. Chance (2012).
Conflict of interest: None.

Funding: None.