Elsevier

NeuroImage

Volume 147, 15 February 2017, Pages 952-959

Is the statistic value all we should care about in neuroimaging?

https://doi.org/10.1016/j.neuroimage.2016.09.066

Abstract

Here we address an important issue that has been embedded within the neuroimaging community for a long time: the absence of effect estimates in results reporting in the literature. The statistic value itself, as a dimensionless measure, does not provide information on the biophysical interpretation of a study, and it certainly does not represent the whole picture of a study. Unfortunately, in contrast to standard practice in most scientific fields, effect (or amplitude) estimates are usually not provided in most results reporting in the current neuroimaging publications and presentations. Possible reasons underlying this general trend include (1) lack of general awareness, (2) software limitations, (3) inaccurate estimation of the BOLD response, and (4) poor modeling due to our relatively limited understanding of FMRI signal components. However, as we discuss here, such reporting damages the reliability and interpretability of the scientific findings themselves, and there is in fact no overwhelming reason for such a practice to persist. In order to promote meaningful interpretation, cross validation, reproducibility, meta and power analyses in neuroimaging, we strongly suggest that, as part of good scientific practice, effect estimates should be reported together with their corresponding statistic values. We provide several easily adaptable recommendations for facilitating this process.

Introduction

Just as cartography requires a balance to be struck between the loss of important detail and the exactitude of a map that has “the scale of a mile to the mile” (Carroll, 1889), so too science requires careful extraction and summarization following an experiment. In other words, to present concisely the important components of the data and analyses, an investigator reports the experiment and makes a generalized conclusion based on some supporting evidence: a small condensed set of numbers. The crucial question is: how much, or to what extent, should the investigator compress the information without sacrificing too much? There are arbitrary choices that have to be made, but there are some definite thresholds below which the loss of information is too great for optimal utility.

For example, in a typical statistical analysis, two quantitative results are produced for each effect of interest: the estimate of the amplitude of the effect itself (e.g., a β value from regression analysis or GLM) and the associated statistic (e.g., t or z). The former provides the magnitude of a physical measurement, which is the essence of scientific investigation, while the latter offers statistical substantiation for the effect estimate in the form of a significance level (or confidence interval, the implied range that may contain the effect estimate with a certain likelihood). While the relationship between the two quantities is tight, each conveys distinct information about the result of the experiment; in most scientific disciplines, it is considered unacceptable if only significance is reported (Sullivan and Feinn, 2012): the statistic value serves as auxiliary evidence for the existence of the targeted effect, and it is the effect estimate itself that is the center of investigation as the physical property of interest. For example, suppose that physicists would like to validate the predictions of general relativity (Einstein, 1915) by investigating the gravitational waves from the merger of two black holes. It would be hard to imagine that they would report only a statistic value or the significance of their measurement (e.g., a chance of 1 event per 203,000 years, or a significance level of 3.4×10⁻⁷) without revealing the strength of the signal they have detected (a peak gravitational-wave strain of 1.0×10⁻²¹ in the frequency range of 35 to 250 Hz) (Abbott et al., 2016).
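
To make the distinction between the two quantities concrete, the following is a minimal sketch of a simple regression that returns both the effect estimate and its statistic. The data values, the function name `simple_regression`, and the binary task regressor are hypothetical and chosen only for illustration; the signal is assumed to be scaled so that a baseline of 100 makes the slope read as percent signal change. The critical value 2.447 is the tabulated two-sided 95% value of the t-distribution with 6 degrees of freedom.

```python
import math

def simple_regression(x, y):
    """Least-squares fit of y = a + b*x; returns slope, standard error, t, and df."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx                      # effect estimate (beta)
    intercept = mean_y - slope * mean_x
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    df = n - 2
    se = math.sqrt(rss / df / sxx)         # standard error of the slope
    return slope, se, slope / se, df

# Hypothetical data: 0 = rest, 1 = task; signal scaled to a baseline of 100.
task = [0, 0, 0, 0, 1, 1, 1, 1]
signal = [100.1, 99.8, 100.3, 99.8, 101.2, 100.9, 101.1, 100.8]

beta, se, t_value, df = simple_regression(task, signal)

# Tabulated two-sided 95% critical value of the t-distribution with 6 df.
T_CRIT_DF6 = 2.447
ci_low, ci_high = beta - T_CRIT_DF6 * se, beta + T_CRIT_DF6 * se

print(f"effect estimate = {beta:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f}), "
      f"t({df}) = {t_value:.2f}")
```

Reporting only `t_value` would discard `beta` and its interval, which are the quantities that carry the physical units of the measurement.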

However, within the field of neuroimaging, it has remained the predominant practice to report only statistical maps in publications and presentations, a custom which has been largely (and perplexingly) immune to critical scrutiny. For instance, one typically sees brain results provided as blobs whose color spectrum corresponds to t- or z-values (or occasionally to p-values), and most of the time the underlying degrees of freedom are left out, rendering the statistics even harder to interpret. Similarly, in tabulated results for brain regions, standard reports usually contain the coordinates and statistic value at a single peak voxel (which is itself defined, again, as the maximum of the statistic values, not of the effect estimates, within the region), and the effect estimate at such a peak voxel is rarely reported. The same phenomenon commonly occurs in reporting results of seed-based correlation analyses for resting-state data, where the brain maps and tables usually show the statistic (often z) values rather than the inter-regional correlations themselves.

Recently there have been a number of discussions about the use and misuse of p-values in the scientific community (e.g., Wasserstein and Lazar, 2016; Nuzzo, 2014), and others have been more critical of the “cult” or “obsession” of statistical significance (e.g., Ziliak and McCloskey, 2009). The editors of the journal Basic and Applied Social Psychology have gone so far as to take the seemingly extreme step of no longer accepting papers with p-values, out of concern that the statistics are being used to support lower-quality research (Trafimow, 2014). In a sense, our concern here is related, and addressing it would also alleviate many of these other topical issues, but the concern is specifically focused on the need for including the effect estimate in neuroimaging studies. To frame the discussion here, we quote the six guiding principles on p-values in a recent statement released by the American Statistical Association (ASA) (Wasserstein and Lazar, 2016):

  1. “p-values can indicate how incompatible the data are with a specified statistical model.

  2. p-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.

  3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.

  4. Proper inference requires full reporting and transparency.

  5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.

  6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.”
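
The fifth principle can be illustrated with a small numerical sketch. It uses a normal approximation (a z-test with a known standard deviation) rather than a t-test, and the effect sizes, standard deviation, and sample sizes are hypothetical values picked only to make the point; `two_sided_p` is an illustrative helper, not a function from any neuroimaging package.

```python
import math
from statistics import NormalDist

def two_sided_p(mean_effect, sd, n):
    """Two-sided p-value for a mean effect under a normal approximation."""
    z = mean_effect / (sd / math.sqrt(n))
    return 2 * NormalDist().cdf(-abs(z))

# Hypothetical scenario 1: tiny effect, very large sample.
p_tiny_effect = two_sided_p(mean_effect=0.05, sd=1.0, n=10_000)

# Hypothetical scenario 2: effect ten times larger, small sample.
p_large_effect = two_sided_p(mean_effect=0.5, sd=1.0, n=16)

print(f"effect 0.05, n = 10000: p = {p_tiny_effect:.2e}")
print(f"effect 0.50, n = 16:    p = {p_large_effect:.2e}")
```

Under these assumptions the smaller effect yields by far the smaller p-value, so ranking results by p-value alone would invert their ordering by magnitude.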

We believe that the neuroimaging field needs to move forward by promoting the reporting of effect estimates along with the corresponding statistics. We first discuss the statistical terms in the context of FMRI analyses, highlighting specific features related to that field. We then argue that full reporting in FMRI is necessary and promotes good scientific practice, clarity, reproducibility, and cross-study comparability, and that it allows for proper meta and power analyses. Finally, we provide several recommendations for researchers and software designers to facilitate these “best practices”.

Section snippets

What is the effect estimate in neuroimaging?

In neuroimaging, the ultimate focus is on the physical evidence for the brain's neuronal response, which is typically embodied in the strength of the FMRI BOLD signal. For task-related experiments, the response strength is reflected in the effect estimate (or β value) associated with a task/condition or with a linear combination of β's from multiple tasks, such as the contrast between two tasks. For seed-based correlation analyses with resting-state data, time series correlation

What does a t-statistic value reveal in neuroimaging?

A t-statistic value for an effect estimate is calculated as the latter divided by its standard error, which represents the reliability or accuracy of the effect estimate. Thus, the t-statistic is a mixture of two components, the effect estimate and the noise estimate. However, both components vary across the brain. For example, the variability of BOLD response may partially result from the inhomogeneity of vascularization, and to some extent the variability of the noise level may be caused by
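
The definition above (effect estimate divided by its standard error) can be sketched with two hypothetical voxels. The per-subject effect values and the function name `one_sample_t` are assumptions made for illustration; the second voxel is simply the first scaled by a factor of four, which changes the effect estimate but leaves the dimensionless t-statistic unchanged.

```python
import math
import statistics

def one_sample_t(effects):
    """Return the mean effect, its standard error, and the t-statistic."""
    n = len(effects)
    mean_effect = statistics.mean(effects)
    se = statistics.stdev(effects) / math.sqrt(n)
    return mean_effect, se, mean_effect / se

# Hypothetical per-subject effect estimates (e.g., percent signal change).
voxel_a = [0.2, 0.5, 0.3, 0.6, 0.4, 0.4]
# Same pattern with four times the amplitude and four times the variability.
voxel_b = [4 * e for e in voxel_a]

mean_a, se_a, t_a = one_sample_t(voxel_a)
mean_b, se_b, t_b = one_sample_t(voxel_b)

print(f"voxel A: effect = {mean_a:.2f}, SE = {se_a:.3f}, t = {t_a:.2f}")
print(f"voxel B: effect = {mean_b:.2f}, SE = {se_b:.3f}, t = {t_b:.2f}")
```

A map colored by t-value alone would render these two voxels identically, even though their response amplitudes differ fourfold.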

Practical realities/difficulties of FMRI

There are several features inherent to FMRI acquisition and analysis that present challenges to an investigator interpreting and reporting results. At first glance, some of these may seem to explain the present practices of reporting only statistic values as results. We describe them briefly here, and then discuss how they actually necessitate, rather than discourage, the inclusion of effect estimates in the end.

Limitations of statistical significance testing

Under the methodology of null hypothesis significance testing (NHST), the statistic value is mainly used to determine the statistical significance level of an effect estimate so that the false positive rate is controlled. Once the value surpasses the threshold, the specific value of the statistic is neither as informative nor as important as the response amplitude or effect estimate. The current misplaced focus on statistical significance when reporting a scientific result (Ziliak and McCloskey,

Why is it crucial to report effect estimates?

The effect estimate provides a piece of hard, quantitative evidence in an analysis, and it should be reported as the main finding of a modeled or measured effect (Sullivan and Feinn, 2012). The corresponding statistic or p-value usually indicates the reliability or accuracy of the effect estimate, but it cannot replace the information content of the effect estimate itself. For this reason, the importance of reporting the specific effect estimate under study has been repeatedly emphasized in

Recommendations and conclusion

Scientific investigations usually involve data collection from observational studies or meticulously designed experiments. Raw data with little or no extraction and compression would clutter or even obscure the intended message from the investigator. On the other hand, overly summarized data or missing information would present less convincing conclusions, or, worse, lead to misleading impressions. Statistic values alone do not represent the whole scientific endeavor, and there is no reason to

Acknowledgments

We thank Laurentius Huber for useful discussions on the physical dependence of BOLD on MR acquisition. The research and writing of the paper were supported by the NIMH and NINDS Intramural Research Programs (ZICMH002888) of the NIH/HHS, USA.

References (51)

  • D.C. Van Essen et al. The Human Connectome Project: a data acquisition perspective. NeuroImage (2012)
  • Abbott et al., 2016. Observation of gravitational waves from a binary black hole merger. Phys. Rev. Lett. 116,...
  • M. Baker. First results from psychology's largest reproducibility test. Nature (2015)
  • J. Bohannon. About 40% of economics experiments fail replication survey. Science (2016)
  • Brett, M., Anton, J.-L., Valabregue, R., Poline, J-B., 2002. Region of interest analysis using an SPM toolbox. In: 8th...
  • K.S. Button et al. Power failure: why small sample size undermines the reliability of neuroscience. Nat. Rev. Neurosci. (2013)
  • Calaprice, A., 2010. The Ultimate Quotable Einstein. Princeton University Press, Princeton, NJ, pp. 475 and...
  • L. Carroll. Sylvie and Bruno Concluded (1889)
  • G. Chen et al. Detecting the subtle shape differences in hemodynamic responses at the group level. Front. Neurosci. (2015)
  • Z.E. D'Esposito et al. The effect of normal aging on the coupling of neural activity to the bold hemodynamic response. NeuroImage (1999)
  • Durnez, J., Degryse, J., Moerkerke, B., Seurinck, R., Sochat, V., Poldrack, R.A., Nichols, T.E., 2016. Power and sample...
  • Einstein, A., 1915. Die Feldgleichungen der Gravitation. Sitzungsberichte der Königlich Preußischen Akademie der...
  • S.A. Engel et al. Confidence intervals for FMRI activation maps. PLoS One (2013)
  • K. Friston et al. Nonlinear event-related responses in fMRI. Magn. Reson. Med. (1998)
  • A. Gelman et al. Type S error rates for classical and Bayesian single and multiple comparison procedures. Comput. Stat. (2000)