Journal of Clinical Epidemiology

Volume 82, February 2017, Pages 71-78.e2

Original Article
Predicting data saturation in qualitative surveys with mathematical models from ecological research

https://doi.org/10.1016/j.jclinepi.2016.10.001

Abstract

Objective

Sample size in surveys with open-ended questions relies on the principle of data saturation. Determining the point of data saturation is complex because researchers have information on only what they have found. The decision to stop data collection is solely dictated by the judgment and experience of researchers. In this article, we present how mathematical modeling may be used to describe and extrapolate the accumulation of themes during a study to help researchers determine the point of data saturation.

Study Design and Setting

The model considers a latent distribution of the probability of elicitation of all themes and infers the accumulation of themes as arising from a mixture of zero-truncated binomial distributions. We illustrate how the model could be used with data from a survey with open-ended questions on the burden of treatment involving 1,053 participants from 34 different countries and with various conditions. The performance of the model in predicting the number of themes to be found with the inclusion of new participants was investigated by Monte Carlo simulations. Then, we tested how the slope of the expected theme accumulation curve could be used as a stopping criterion for data collection in surveys with open-ended questions.
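
To make the idea of theme accumulation concrete, the following minimal sketch simulates how distinct themes build up as participants are added. It is only a forward simulation under stated assumptions, not the authors' fitting procedure: the Beta-distributed elicitation probabilities, the number of themes (120), and the function names `simulate_accumulation` and `expected_accumulation` are all hypothetical choices made for illustration.

```python
import random


def simulate_accumulation(probabilities, n_participants, seed=0):
    """Return the cumulative number of distinct themes after each participant.

    probabilities[j] is the (assumed) chance that any one participant
    elicits theme j, independently of other themes and participants.
    """
    rng = random.Random(seed)
    seen = set()
    curve = []
    for _ in range(n_participants):
        for theme, p in enumerate(probabilities):
            if rng.random() < p:
                seen.add(theme)
        curve.append(len(seen))
    return curve


def expected_accumulation(probabilities, n):
    """Expected number of distinct themes after n participants.

    A theme with probability p is missed by all n participants with
    probability (1 - p) ** n, so it is found with probability 1 - (1 - p) ** n.
    """
    return sum(1.0 - (1.0 - p) ** n for p in probabilities)


# Hypothetical latent distribution: many rare themes, a few common ones.
rng = random.Random(42)
probabilities = [rng.betavariate(0.5, 5.0) for _ in range(120)]

curve = simulate_accumulation(probabilities, 200, seed=1)
print(curve[24], curve[49], curve[99], curve[199])
print(round(expected_accumulation(probabilities, 200), 1))
```

Repeating the simulation with different seeds gives a Monte Carlo picture of how much the observed curve varies around its expectation.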

Results

After the inclusion of initial samples of 25 to 200 participants, the model reliably predicted the number of themes that would be found by doubling the sample size. Mean estimation error ranged from 3% to 1% with simulated data and was <2% with data from the study of the burden of treatment. Sequentially calculating the slope of the expected theme accumulation curve for every five new participants included was a feasible way to weigh the benefit of including these new participants in the study. In our simulations, a stopping criterion based on a value of 0.05 for this slope allowed for identifying 97.5% of the themes while limiting the inclusion of participants eliciting nothing new in the study.
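
The slope-based stopping rule can be sketched as follows. This is an illustration under simplifying assumptions rather than the authors' implementation: it treats the elicitation probabilities as known, whereas in a real study they would have to be estimated from the data collected so far. The function names and the `max_n` safeguard are hypothetical; the step of five participants and the threshold of 0.05 are taken from the text.

```python
def expected_themes(probabilities, n):
    """Expected number of distinct themes found after n participants."""
    return sum(1.0 - (1.0 - p) ** n for p in probabilities)


def stopping_point(probabilities, step=5, threshold=0.05, max_n=10_000):
    """Return the first sample size (a multiple of `step`) at which the
    slope of the expected accumulation curve over the last `step`
    participants drops below `threshold`, or None if it never does
    within `max_n` participants.
    """
    n = step
    while n <= max_n:
        gained = expected_themes(probabilities, n) - expected_themes(probabilities, n - step)
        slope = gained / step  # expected new themes per additional participant
        if slope < threshold:
            return n
        n += step
    return None


# Toy example: 10 themes, each elicited by half of the participants.
toy = [0.5] * 10
n_stop = stopping_point(toy)
print(n_stop)  # 15
print(expected_themes(toy, n_stop) / len(toy))  # share of themes expected by then
```

A slope of 0.05 means that, on average, 20 further participants would be needed to uncover one new theme.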

Conclusion

Mathematical models adapted from ecological research can accurately predict the point of data saturation in surveys with open-ended questions.

Context

Surveys with open-ended questions are a simple design to explore the different aspects of a concept in a given population [1]. This design is popular in many fields, including health research, social science, and marketing. For example, in health research, surveys may help identify the topics that should be addressed in items of patient-reported outcomes [2]. The use of open-ended questions allows respondents to describe with nuance and detail how they perceive the concept under study. By

Methods

We used mathematical modeling to determine the point of data saturation in surveys using open-ended questions. It is important to note that the aim of our work was not to predict the themes, ideas, and meanings that patients may elicit on the topic of interest but rather to estimate how these new ideas are discovered and accumulated across the whole sample of participants during a study.

Performance of the model

In both our study of the burden of treatment and the simulated data sets, the model reliably predicted the number of themes that would be found by doubling the sample size of a study. In our study of the burden of treatment, the prediction errors were <2% (the difference between the expected and observed numbers of themes was at most 2 of 123 themes) with initial samples of 25, 50, 100, and 200 participants (Table 1 and Fig. 1).
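
As a quick check of the reported bound, assuming the error is expressed relative to the 123 themes identified in the study (the symbols $\hat{S}$ for the expected and $S_{\text{obs}}$ for the observed number of themes are notation introduced here, not the authors'):

```latex
\[
\frac{\lvert \hat{S} - S_{\text{obs}} \rvert}{S_{\text{obs}}}
\le \frac{2}{123} \approx 0.016 = 1.6\% < 2\%
\]
```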

The excellent predictive capability of the model was confirmed with the first group

Discussion

In this study, we showed that models used in ecology to determine species richness could help with qualitative research involving surveys with open-ended questions to predict what themes will be discovered with the inclusion of more units of analysis. Determining when to stop data collection is a thorny question asked by both novice and experienced researchers in qualitative research [6]. However, there is a surprising paucity of explicit discussion of this basic issue in textbooks and articles

Conclusions

In surveys with open-ended questions, the point of data saturation and number of participants to include can be estimated with mathematical models from ecological research.

Acknowledgments

The authors thank Laura Smales (BioMedEditing) for editing.

Authors' contributions: V.-T.T., R.P., V.-C.T., and P.R. conceived and designed the experiments. V.-T.T. and R.P. analyzed data. V.-T.T. wrote the first draft of the article. V.-T.T., R.P., V.-C.T., and P.R. contributed to the writing of the article. V.-T.T., R.P., V.-C.T., and P.R. met ICMJE criteria for authorship. V.-T.T., R.P., V.-C.T., and P.R. agreed with article results and conclusions. P.R. is the guarantor, had full access to

References (17)

  • C.B. Terwee et al. Quality criteria were proposed for measurement properties of health status questionnaires. J Clin Epidemiol (2007)
  • H. Jansen. The logic of qualitative survey research and its position in the field of social research methods. Forum Qual Social Res (2010)
  • N. Denzin et al. The discipline and practice of qualitative research
  • B. Glaser et al. The discovery of grounded theory: strategies for qualitative research (1967)
  • G. Guest et al. How many interviews are enough? An experiment with data saturation and variability. Field Methods (2006)
  • S. Baker et al. How many qualitative interviews are enough? Expert voices and early career reflections on sampling and cases in qualitative research (2012)
  • M. Sandelowski. Sample size in qualitative research. Res Nurs Health (1995)
  • K. Ugland et al. The species accumulation curve and estimation of species richness. J Anim Ecol (2003)
There are more references available in the full text version of this article.

Conflict of interest: None.

Funding: This study was funded by the French Health Ministry (PHRC AOM13127). Our team is supported by an academic grant from the program “Equipe espoir de la Recherche,” Fondation pour la Recherche Médicale, Paris, France (no. DEQ20101221475). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the article.
