False discovery rate control is a recommended alternative to Bonferroni-type adjustments in health studies

doi:10.1016/j.jclinepi.2014.03.012

Journal of Clinical Epidemiology

Volume 67, Issue 8, August 2014, Pages 850-857

https://doi.org/10.1016/j.jclinepi.2014.03.012 Get rights and content

Abstract

Objectives

Procedures for controlling the false positive rate when performing many hypothesis tests are commonplace in health and medical studies. Such procedures, most notably the Bonferroni adjustment, suffer from the problem that error rate control cannot be localized to individual tests, and that these procedures do not distinguish between exploratory and/or data-driven testing vs. hypothesis-driven testing. Instead, procedures derived from limiting false discovery rates may be a more appealing method to control error rates in multiple tests.

Study Design and Setting

Controlling the false positive rate can lead to philosophical inconsistencies that can negatively impact the practice of reporting statistically significant findings. We demonstrate that the false discovery rate approach can overcome these inconsistencies and illustrate its benefit through an application to two recent health studies.

Results

The false discovery rate approach is more powerful than methods like the Bonferroni procedure that control false positive rates. Controlling the false discovery rate in a study that arguably consisted of scientifically driven hypotheses found nearly as many significant results as without any adjustment, whereas the Bonferroni procedure found no significant results.

Conclusion

Although still unfamiliar to many health researchers, the use of false discovery rate control in the context of multiple testing can provide a solid basis for drawing conclusions about statistical significance.

Introduction

What is new?

•
Controlling the false positive rate to address multiplicity of tests in health studies can result in logical inconsistencies and opportunities for abuse.
•
Errors in hypothesis test conclusions depend on the frequency of the truth of null hypotheses being tested.
•
False discovery rate control procedures do not suffer from the philosophical challenges evident with Bonferroni-type procedures.
•
Health researchers may benefit from relying on false discovery rate control in studies with multiple tests.

We are now in an age of scientific inquiry where health and medical studies are routinely collecting large amounts of data. These studies typically involve the researcher attempting to draw many inferential conclusions through numerous hypothesis tests. Researchers are typically advised to perform some type of significance-level adjustment to account for the increased probability of reporting false positive results through multiple tests. Such adjustments are designed to control study-wide error rates and lower the probability of falsely rejecting true null hypotheses. The most commonly understood downside of these procedures is the loss of power to detect real effects. Arguments have been put forth over the years whether adjustments for controlling study-wide error rates should be made, with plenty of advocates on each side of the argument. It appears doubtful that researchers will coalesce behind a unified point of view any time soon.

Significance level adjustments that control study-wide error rates are still common in peer-reviewed health studies. An examination of recent issues of several highly cited medical and health journals (Journal of the American Medical Association, New England Journal of Medicine, Annals of Internal Medicine, and Medical Care) reveals an abundant use of multiple-test adjustments that control study-wide error rates: We found 191 articles published in 2012 to 2013 making some adjustment for multiple testing, with 102 (53.4%) performing the Bonferroni or another study-wide error adjustment. Some other studies reported explicitly, and almost apologetically, that they had not performed an adjustment, and some even reported consequences of not having adjusted for multiple tests.

Despite the continued popularity of multiple test adjustments in health studies that control false positive error rates, we argue that controlling the false discovery rate [1] is an attractive alternative. The false discovery rate is the expected fraction of tests declared statistically significant in which the null hypothesis is actually true. The false discovery rate can be contrasted with the false positive rate, which is the expected fraction of tests with true null hypotheses that are mistakenly declared statistically significant. In other words, the false positive rate is the probability of rejecting a null hypothesis given that it is true, while the false discovery rate is the probability that a null hypothesis is true given that the null hypothesis has been rejected.

Table 1 illustrates the distinction between the false positive and false discovery rates. Suppose that a set of tests can be cross-classified into a 2 × 2 table according to truth of the hypotheses (whether the null hypothesis is true or not), and the decision made based on the data (whether to reject the null hypothesis or not). Let a be the fraction of tests with true null hypotheses that are not rejected, b be the fraction of tests with true null hypotheses that are mistakenly rejected, c be the fraction of tests with false null hypotheses that are mistakenly not rejected, and d be the fraction of tests with false null hypotheses that are rejected. Assuming these fractions can be viewed as long-run rates, the false positive rate is computed as b/(a + b), whereas the false discovery rate is computed as b/(b + d). Although the numerators of these fractions are the same, the denominator of the false positive rate is the rate of encountering true null hypotheses, and the denominator of the false discovery rate is the overall rate of rejecting null hypotheses.

Conventional hypothesis testing, along with procedures to control study-wide error rates, are set up to limit false positive rates, but not false discovery rates. False discovery rate control has become increasingly standard practice in genomic studies and the analysis of micro-array data where an abundance of testing occurs. Several recent examples of false discovery rate control in health applications include provider profiling [2] and clinical adverse event rates [3], but false discovery rate control has yet to make serious in-roads into more general health studies. Of the 191 articles we found in highly cited journals that mention adjustments for multiple tests, only 14 (7.3%) include false discovery rate adjustments.

This article is intended to remind readers of the fundamental challenges of multiple-test adjustment procedures that control study-wide error rates and explain why false discovery rate control may be an appealing alternative for drawing statistical inferences in health studies. In doing so, we distinguish between tests that are exploratory and those that are hypothesis driven. The explanations we present to discourage use of adjustments based on study-wide error rate control are not new—the case has been made strongly over the past 10 to 20 years [4], [5], [6], [7], [8], [9]. Arguments in favor of using false discovery rate control have been made based on power considerations [10], [11], [12], [13], but we are unaware of explanations based on the distinction between exploratory and hypothesis-driven testing.

Section snippets

Adjustment for multiple testing through false positive rate control

The usual argument to convince researchers that adjustments are necessary when multiple tests are performed is to point out that, without adjustments, the probability of at least one null hypothesis being rejected is larger than acceptable levels. Suppose, for example, that a researcher performs 100 tests at the α = 0.05 significance level in which the null hypothesis is true in every case. If all the tests are independent, then the probability that at least one test would be incorrectly

Toward an alternative criterion for assessing significance

To appreciate the basis for the difficulties associated with Bonferroni-type adjustments, we first remind the reader of the process by which null hypotheses are rejected. With conventional hypothesis testing:

1.
A significance level (eg, α = 0.05) is asserted.
2.
A P-value is computed from the data.
3.
If the P-value is less than the significance level, the result is declared statistically significant, and the null hypothesis is rejected.
4.
The researcher then concludes that the null hypothesis is a false

Inferring the probability of a true null hypothesis

When performing a single hypothesis test, it is nearly impossible to infer anything meaningful about the probability the null hypothesis is true. However, when performing multiple tests in a study, the distribution of P-values provides information relevant to inferring the frequency of true null hypotheses. This is because the distribution of P-values is a mixture of two components: The distribution of P-values for true null hypotheses, which by construction is uniformly distributed between 0

False discovery rate control

One way to implement such a process is by controlling the false discovery rate [1]. Many health researchers are unaware of the false discovery rate, although it is a natural concept, and one that has important utility for calibrating error rates in hypothesis tests. Among tests that are declared significant in a study, the false discovery rate is the expected fraction of those tests in which the null hypothesis is true. The main goal of false discovery rate control is to set significance levels

Conclusion

Despite the commonplace use of Bonferroni-type significance level adjustments to address the increased probability of mistakenly rejecting true null hypotheses, we argue that such adjustments are difficult to justify on philosophical grounds. Furthermore, if researchers are concerned about being unable to limit the probability of mistaken conclusions among statistically significant results, then using Bonferroni-type adjustments based on the multiplicity of tests do not directly address this

Acknowledgments

This article is the result of work supported with resources and the use of facilities at the Bedford VA Medical Center, Bedford, MA, USA.

References (33)

K. Schulz et al.
Multiplicity in randomized trials I: endpoints and treatments
Lancet
(2005)
R. Bender et al.
Adjusting for multiple testing—when and how?
J Clin Epidemiol
(2001)
D. Allison et al.
A mixture model approach for the analysis of microarray gene expression data
Comput Stat Data Anal
(2002)
Y. Benjamini et al.
Controlling the false discovery rate: a practical and powerful approach to multiple testing
J R Stat Soc Ser B
(1995)
H.E. Jones et al.
Use of the false discovery rate when comparing multiple health care providers
J Clin Epidemiol
(2008)
D.V. Mehrotra et al.
Use of the false discovery rate for evaluating clinical safety data
Stat Methods Med Res
(2004)
A. Gelman et al.
Why we (usually) don't have to worry about multiple comparisons
J Res Educ Eff
(2012)
D. O'Keefe
Should familywise alpha be adjusted?
Hum Commun Res
(2003)
T. Perneger
What's wrong with Bonferroni adjustments?
BMJ
(1998)
K. Rothman
Adjustments are needed for multiple comparisons
Epidemiology
(1990)

D. Savitz et al.

Multiple comparisons and related issues in the interpretation of epidemiologic data

Amer J Epidemiol

(1995)

S. Nakagawa

A farewell to Bonferroni: the problems of low statistical power and publication bias

Behav Ecol

(2004)

M. Aickin et al.

Adjusting for multiple testing when reporting research results: the Bonferroni vs Holm methods

Am J Public Health

(1996)

W. Noble

How does multiple testing correction work?

Nat Biotechnol

(2009)

K. Verhoeven et al.

Implementing false discovery rate control: increasing your power

Oikos

(2005)

N. Lazar

The big picture: multiplicity control in large data sets presents new challenges and opportunities

Chance

(2012)

Cited by (953)

Effect of an immune challenge and two feed supplements on broiler chicken individual breast muscle protein synthesis rate
2024, Journal of Proteomics
Optimization of broiler chicken breast muscle protein accretion is key for the efficient production of poultry meat, whose demand is steadily increasing. In a context where antimicrobial growth promoters use is being restricted, it is important to find alternatives as well as to characterize the effect of immunological stress on broiler chicken's growth. Despite its importance, research on broiler chicken muscle protein dynamics has mostly been limited to the study of mixed protein turnover. The present study aims to characterize the effect of a bacterial challenge and the feed supplementation of citrus and cucumber extracts on broiler chicken individual breast muscle proteins fractional synthesis rates (FSR) using a recently developed dynamic proteomics pipeline. Twenty-one day-old broiler chickens were administered a single ²H₂O dose before being culled at different timepoints. A total of 60 breast muscle protein extracts from five experimental groups (Unchallenged, Challenged, Control Diet, Diet 1 and Diet 2) were analysed using a DDA proteomics approach. Proteomics data was filtered in order to reliably calculate multiple proteins FSR making use of a newly developed bioinformatics pipeline. Broiler breast muscle proteins FSR uniformly decreased following a bacterial challenge, this change was judged significant for 15 individual proteins, the two major functional clusters identified as well as for mixed breast muscle protein. Citrus or cucumber extract feed supplementation did not show any effect on the breast muscle protein FSR of immunologically challenged broilers. The present study has identified potential predictive markers of breast muscle growth and provided new information on broiler chicken breast muscle protein synthesis which could be essential for improving the efficiency of broiler chicken meat production.
The present study constitutes the first dynamic proteomics study conducted in a farm animal species which has characterized FSR in a large number of proteins, establishing a precedent for biomarker discovery and assessment of health and growth status. Moreover, it has been evidenced that the decrease in broiler chicken breast muscle protein following an immune challenge is a coordinated event which seems to be the main cause of the decreased growth observed in these animals.
Me, my thoughts and I – Personality as a moderator of the effect of thoughts on subjective well-being
2024, Personality and Individual Differences
We study how personality impacts people's experiences of their thoughts in terms of experienced happiness and worthwhileness. Over two weeks, 483 participants completed over 20,000 experience sampling questionnaires including reports of hedonic and eudemonic well-being, and type and content of thoughts. Using multi-level modelling we show that personality traits recorded prior to the start of the study for all participants interact with thought variables to significantly predict experiences of worthwhileness. Openness was the personality trait with the greatest impact on how content and type of thoughts affected worthwhileness. Predictions of happiness were not significantly improved by the addition of interactions between personality and thoughts. Implications for the broader literature on the relationship between personality and well-being are discussed.
Metabolomics perspectives into the co-exposure effect of polycyclic aromatic hydrocarbons and metals on renal function: A meet-in-the-middle approach
2024, Science of the Total Environment
Studies on the dose effects of kidney impairment and metabolomes in co-exposure to polycyclic aromatic hydrocarbons (PAHs) and metals are limited. We aimed to identify overall associations and metabolic perturbations in 130 participants (53 petrochemical workers and 77 controls) exposed to a PAHs-metals mixture in Southern China. The urinary 7 hydroxylated PAHs and 15 metal(loid)s were determined, and serum creatinine, beta-2 microglobulin, and estimated glomerular filtration rate were health outcomes. The liquid chromatography-mass spectrometry-based method was applied to serum metabolomics. Generalized weighted quantile sum (gWQS) regressions were used to estimate the overall dose-response relationships, and pathway analysis, “meet-in-the-middle” approach, and mediation effect analyses were conducted to identify potential metabolites and biological mechanisms linking exposure with nephrotoxic effects. Our results indicated that renal function reduction was associated with a PAHs-metals mixture in a dose-dependent manner, and 1-hydroxynaphthalene and copper were the most predominant contributors among the two families of pollutants. Furthermore, the metabolic disruptions associated with the early onset of kidney impairment induced by the combination of PAHs and metals encompassed pathways such as phenylalanine-tyrosine-tryptophan biosynthesis, phenylalanine metabolism, and alpha-linolenic acid metabolism. In addition, the specifically identified metabolites demonstrated excellent potential as bridging biomarkers connecting the reduction in renal function with the mixture of PAHs and metals. These findings shed light on understanding the overall associations and metabolic mechanism of nephrotoxic effects of co-exposure to PAHs and metals.
Exploring the impact of prenatal perfluoroalkyl and polyfluoroalkyl substances exposure on blood pressure in early childhood: A longitudinal analysis
2024, Ecotoxicology and Environmental Safety
Previous research investigating the correlation between prenatal exposure to per- and polyfluoroalkyl substances (PFAS) and subsequent blood pressure (BP) in offspring has yielded limited and contradictory findings. This study was conducted to investigate the potential relationship between maternal PFAS levels during pregnancy and subsequent BP in early childhood. A total of 129 expectant mothers from the Shanghai Birth Cohort were included in the study. Using high-performance liquid chromatography/tandem mass spectrometry, we measured ten PFAS compounds in maternal plasma throughout the pregnancy. When the children reached the age of 4, we examined their systolic BP (SBP) and diastolic BP (DBP), along with mean arterial pressure (MAP) and pulse pressure (PP). Data interpretation employed multiple linear and logistic regression models, complemented by Bayesian kernel machine regression (BKMR).We found that the majority of PFAS concentrations remained stable during pregnancy. The linear and BKMR models indicated a positive relationship between the PFAS mixture in maternal plasma and offspring's DBP and MAP, with perfluorohexanesulphonic acid (PFHxS) having the most significant influence (PFHxS and DBP [first trimester:β=3.03, 95%CI: (1.01,5.05); second trimester: β=2.35, 95%CI: (0.94,3.75); third trimester: β=2.57, 95%CI:(0.80,4.34)]; MAP [first trimester:β=2.55, 95%CI: (0.64,4.45); second trimester: β=2.28, 95%CI: (0.95,3.61); third trimester: β=2.35, 95%CI:(0.68,4.01)]). Logistic regression highlighted an increased risk of prehypertension and hypertension in offspring with higher maternal PFHxS concentrations during all three trimesters [first trimester: OR=2.53, 95%CI:(1.11,5.79), second trimester: OR=2.05, 95%CI:(1.11,3.78), third trimester: OR=3.08, 95%CI:(1.40,6.79)]. A positive correlation was identified between the half-lives of PFAS and the odds ratio (OR) of prehypertension and hypertension in childhood (β=0.139, P=0.010). In conclusion, this research found maternal plasma PFAS concentrations to be positively associated with BP in offspring, with PFHxS showing the most significant influence. This correlation remained consistent throughout pregnancy, and this effect was proportional to the half-lives of PFAS.
Cumulative Lifetime Violence and Bacterial Vaginosis Infection in Sexually Transmitted Infections: Findings From a Retrospective Cohort Study Among Black Women at Risk for HIV
2024, AJPM Focus
Bacterial vaginosis is the most common vaginal condition among women of reproductive age and has been associated with sexually transmitted infections. This study examines the association between cumulative lifetime violence exposure, bacterial vaginosis, and sexually transmitted infections among Black women at risk for HIV.
HIV-negative Black women in a retrospective cohort study (N=230) completed survey questions on cumulative violence (exposure to sexual or physical abuse before age 18 years and exposure to intimate partner violence or sexual violence [partner or other] after age 18 years and past year), bacterial vaginosis (lifetime and past year), and sexually transmitted infection diagnosis (lifetime and past year). Logistic regression models estimated the associations between cumulative violence, bacterial vaginosis, and sexually transmitted infections. Bacterial vaginosis was examined as a moderator in the association between cumulative violence and sexually transmitted infections.
Many women reported cumulative violence exposure (40%), lifetime bacterial vaginosis diagnosis (53%), and lifetime sexually transmitted infection diagnosis (73%). Cumulative violence experience was significantly associated with increased adjusted odds of lifetime bacterial vaginosis diagnosis (AOR=1.98; 95% CI=1.10, 3.54). Lifetime bacterial vaginosis diagnosis (AOR=2.76; 95% CI=1.45, 5.22) and past-year bacterial vaginosis diagnosis (AOR=2.16; 95% CI=1.14, 4.10) were significantly associated with increased odds of lifetime sexually transmitted infection diagnosis. Lifetime bacterial vaginosis diagnosis (AOR=2.10; 95% CI=1.19, 3.70) and past-year bacterial vaginosis diagnosis (AOR=3.00; 95% CI=1.70, 5.31) were significantly associated with past-year sexually transmitted infection diagnosis. Lifetime bacterial vaginosis infection significantly increased the odds of lifetime sexually transmitted infection diagnosis with increasing cumulative violence exposure.
Our findings support educating and screening Black women who experience cumulative violence for bacterial vaginosis to reduce the risk of untreated bacterial vaginosis and sexually transmitted infections.
Co-exposure of petrochemical workers to noise and mixture of benzene, toluene, ethylbenzene, xylene, and styrene: Impact on mild renal impairment and interaction
2024, Environmental Pollution
Epidemiological evidence concerning effects of simultaneous exposure to noise and benzene, toluene, ethylbenzene, xylene, and styrene (BTEXS) on renal function remains uncertain. In 2020, a cross-sectional study was conducted among 1160 petrochemical workers in southern China to investigate effects of their co-exposure on estimated glomerular filtration rate (eGFR) and mild renal impairment (MRI). Noise levels were assessed using cumulative noise exposure (CNE). Urinary biomarkers for BTEXS were quantified. We found the majority of workers had exposure levels to noise and BTEXS below China's occupational exposure limits. CNE, trans, trans-muconic acid (tt-MA), and the sum of mandelic acid and phenylglyoxylic acid (PGMA) were linearly associated with decreased eGFR and increased MRI risk. We observed U-shaped associations for both N-acetyl-S-phenyl-L-cysteine (SPMA) and o-methylhippuric acid (2-MHA) with MRI. In further assessing the joint effect of BTEXS (β, −0.164 [95% CI, −0.296 to −0.033]) per quartile increase in all BTEXS metabolites on eGFR using quantile g-computation models, we found SPMA, tt-MA, 2-MHA, and PGMA played pivotal roles. Additionally, the risk of MRI associated with tt-MA was more pronounced in workers with lower CNE levels (P = 0.004). Multiplicative interaction analysis revealed antagonisms of CNE and PGMA on MRI risk (P = 0.034). Thus, our findings reveal negative dose-effect associations between noise and BTEXS mixture exposure and renal function in petrochemical workers. With the exception of toluene, benzene, xylene, ethylbenzene, and styrene are all concerning pollutants for renal dysfunction. Effects of benzene, ethylbenzene, and styrene exposure on renal dysfunction were more pronounced in workers with lower CNE.

View all citing articles on Scopus

: Conflict of interest: None.

: Funding: None.

View full text

Review ArticleFalse discovery rate control is a recommended alternative to Bonferroni-type adjustments in health studies

Abstract

Objectives

Study Design and Setting

Results

Conclusion

Introduction

Section snippets

Adjustment for multiple testing through false positive rate control

Toward an alternative criterion for assessing significance

Inferring the probability of a true null hypothesis

False discovery rate control

Conclusion

Acknowledgments

Lancet

J Clin Epidemiol

Comput Stat Data Anal

Controlling the false discovery rate: a practical and powerful approach to multiple testing

J R Stat Soc Ser B

Use of the false discovery rate when comparing multiple health care providers

J Clin Epidemiol

Use of the false discovery rate for evaluating clinical safety data

Stat Methods Med Res

Why we (usually) don't have to worry about multiple comparisons

J Res Educ Eff

Should familywise alpha be adjusted?

Hum Commun Res

What's wrong with Bonferroni adjustments?

BMJ

Adjustments are needed for multiple comparisons

Epidemiology

Multiple comparisons and related issues in the interpretation of epidemiologic data

Amer J Epidemiol

A farewell to Bonferroni: the problems of low statistical power and publication bias

Behav Ecol

Adjusting for multiple testing when reporting research results: the Bonferroni vs Holm methods

Am J Public Health

How does multiple testing correction work?

Nat Biotechnol

Implementing false discovery rate control: increasing your power

Oikos

The big picture: multiplicity control in large data sets presents new challenges and opportunities

Chance

Review Article
False discovery rate control is a recommended alternative to Bonferroni-type adjustments in health studies