## The truth wears off (sometimes) and the importance of confidence intervals

A couple of months ago, the New Yorker published an article entitled “The truth wears off: Is there something wrong with the scientific method?” by Jonah Lehrer. The article reviews several examples of a not uncommon fallacy in science: initially overestimating how large an effect some variable X has on another variable Y. It also explores several reasons for such mistakes. Some of Lehrer’s examples of this fallacy, drawn from various scientific disciplines, are:

- The efficacy of second-generation antipsychotics (e.g., Abilify, Zyprexa)
- The effect of describing a picture on later memory for that picture (“verbal overshadowing”)
- The relationship between body symmetry and sexual attractiveness

Working scientists will recognize some, if not all, of the reasons Lehrer cites for the frequency of this problem. These are:

- Regression to the mean (i.e., being fooled by the occasional initial outlier)
- A bias in science against reporting null results (either due to a personal benefit of keeping the null results private or due to the tendency of journals to reject such findings as unimportant or unreliable)
- Conscious and unconscious biases to confirm one’s hypotheses
- Significance chasing (i.e., analyzing a data set in multiple, post hoc ways until a significant result is found)

Another problem, not mentioned by Lehrer, is that scientists sometimes simply report p-values of their effects and neglect to compute or report confidence intervals for the size of the effects. In other words, scientists often simply report that there is good evidence of a relationship between two variables (i.e., the effect has a small p-value) but they fail to provide explicit estimates of the size of the effect.

While there is a close relationship between p-values and confidence intervals (the smaller the p-value, the further the confidence interval lies from the value specified by the null hypothesis), deriving one from the other is often not straightforward. Thus, when only p-values are reported, one gets a clear sense of how strong the evidence is for *some* relationship between two variables, but little sense of how large that relationship is likely to be.
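To make this concrete, here is a minimal Python sketch using scipy. The two samples, the IQ-like scale, and the null mean of 100 are all made up for illustration: the samples yield similar p-values, yet their confidence intervals tell very different stories about the size of the effect.

```python
import numpy as np
from scipy import stats

def mean_ci(x, conf=0.95):
    """Sample mean with a t-based confidence interval for the mean."""
    n, m, sem = len(x), np.mean(x), stats.sem(x)
    half = stats.t.ppf(0.5 + conf / 2, df=n - 1) * sem
    return m, (m - half, m + half)

# Two made-up "post-coffee IQ" samples, tested against a null mean of 100.
small = np.array([113.0, 125.0, 116.0, 122.0])  # n = 4
large = np.tile([89.0, 119.0], 50)              # n = 100

p_small = stats.ttest_1samp(small, 100.0).pvalue
p_large = stats.ttest_1samp(large, 100.0).pvalue

m_small, ci_small = mean_ci(small)
m_large, ci_large = mean_ci(large)

# Both p-values are similar (around 0.01), but the n = 4 interval is far wider.
print(round(p_small, 3), [round(b, 1) for b in ci_small])
print(round(p_large, 3), [round(b, 1) for b in ci_large])
```

With p-values alone the two results look equally informative; the intervals reveal that only the larger sample pins the effect down at all.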

Not reporting confidence intervals is particularly likely to lead to considerable overestimation of effect sizes when studies are “underpowered.” An underpowered study is one in which the sample size is too small for the effects being studied to be reliably detected; consequently, the confidence intervals of the estimated effects are generally quite large (Gelman & Weakliem, 2009). This occurs when effects are of small magnitude (e.g., less than 20% of the standard deviation of the measurement noise) or when the number of comparisons is large (e.g., an analysis of the 40,000 to 500,000 voxels typical of whole-brain fMRI analyses; Vul, Harris, Winkielman, & Pashler, 2009). The latter leads to underpowered studies because corrections for multiple comparisons (e.g., Bonferroni correction or false discovery rate control) lower the significance threshold for each individual comparison and thus widen its confidence interval.
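To see how quickly multiple-comparison corrections widen intervals, consider this rough sketch. It assumes a hypothetical per-voxel measurement with n = 20 observations and a sample standard deviation of 15 (all numbers made up), and computes Bonferroni-adjusted confidence interval half-widths for increasing numbers of comparisons:

```python
import numpy as np
from scipy import stats

n, s = 20, 15.0               # hypothetical sample size and sample SD
sem = s / np.sqrt(n)          # standard error of the mean

halfwidths = {}
for m in (1, 100, 40_000):    # number of simultaneous comparisons
    alpha = 0.05 / m          # Bonferroni-corrected significance threshold
    halfwidths[m] = stats.t.ppf(1 - alpha / 2, df=n - 1) * sem
    print(m, round(halfwidths[m], 1))
```

The half-widths grow steadily with the number of comparisons, even though the data for any single comparison are unchanged.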

When a study is underpowered and one is fortunate enough to detect an effect, the magnitude of the estimated effect will necessarily be much larger than the true effect, since the confidence intervals on the effect are so large. To illustrate, consider a hypothetical relationship between a cup of coffee and IQ test performance. Say that in reality, a cup of coffee increases IQ test performance by 3 points on average. Since the IQ test is designed to have a mean of 100 and a standard deviation of 15 points across the population of test takers, this amounts to a true post-coffee mean of 103 points and an effect size that is 20% of the standard deviation of the measurement noise (a small effect by Cohen’s standards; Cohen, 1988). Imagine that we perform an experiment to determine whether a cup of coffee affects IQ test performance three times, each time using a different number of participants (4, 36, and 100). Imagine also that each time we do the experiment, we find that caffeine does improve test performance and obtain the exact same p-value of 0.01.

These hypothetical results are illustrated in the figure below, which shows that when the sample is small, the effect of coffee is dramatically overestimated. If only the mean effect of coffee and the p-value of 0.01 had been reported, one might interpret this as good evidence that a simple cup of coffee can increase IQ by around 19 points (enough to bump someone of average IQ up to the 90th percentile!). Indeed, one might mistakenly think that the evidence from the experiment with only 4 participants is just as compelling as the evidence from the two larger experiments, since their p-values are equal. However, the confidence intervals on the bar graph help to avoid such a fallacy. Thanks to the confidence intervals, one directly sees that the estimate of the size of the effect is highly imprecise when there are only four participants and that the true effect size might be quite small. Moreover, one can clearly compare the precision of effect size estimation across the three experiments.
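This kind of overestimation (sometimes called the “winner’s curse”) is easy to reproduce by simulation. The sketch below assumes the coffee scenario is analyzed with a one-sample, two-tailed t-test against a null mean of 100 (an assumption on my part; other designs would show the same pattern):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_effect, sd, n, n_sims = 3.0, 15.0, 4, 20_000  # coffee truly adds 3 points

sig_effects = []
for _ in range(n_sims):
    sample = rng.normal(100.0 + true_effect, sd, size=n)
    res = stats.ttest_1samp(sample, 100.0)
    if res.pvalue < 0.05 and res.statistic > 0:    # "coffee helps" detected
        sig_effects.append(sample.mean() - 100.0)

detection_rate = len(sig_effects) / n_sims
mean_sig_effect = float(np.mean(sig_effects))
# Few experiments reach significance, and those that do wildly
# overestimate the 3-point true effect.
print(round(detection_rate, 3), round(mean_sig_effect, 1))
```

Averaged over only the “significant” experiments, the estimated effect is several times the true 3-point effect, just as in the figure above.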

That being said, it is important to point out that when studies are underpowered, it is unlikely that effects will be detected. Indeed, when performing the hypothetical coffee/IQ experiment with only four participants, there is only a 6.9% chance of producing a significant test result (p<0.05, assuming a two-tailed test). However, small effects are likely to be very prevalent in the cognitive sciences due to our often noisy measures of behavior and brain function and due to the large number of comparisons typical of many types of neural data (e.g., fMRI, EEG, optical imaging). Thus, given the likely prevalence of small effects and the large number of studies being executed across the globe, the potential for overestimation of small effect sizes is surely high. Indeed, a well-known meta-analysis of fMRI studies performed by Vul and colleagues (2009) found that the scientific literature was biased to overestimate the magnitude of correlations between fMRI activation and measures of emotion, personality and social cognition (though it is not clear how representative Vul et al.’s sample of fMRI articles is of those fields of neuroscience nor were the authors of those articles necessarily attempting to estimate the size of those correlations).
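For what it’s worth, the 6.9% figure can be checked with a power calculation based on the noncentral t distribution, again assuming a one-sample, two-tailed t-test with Cohen’s d = 0.2:

```python
import numpy as np
from scipy import stats

d, alpha = 3.0 / 15.0, 0.05                # Cohen's d = 0.2, two-tailed test
powers = {}
for n in (4, 36, 100):
    df, ncp = n - 1, d * np.sqrt(n)        # degrees of freedom, noncentrality
    tcrit = stats.t.ppf(1 - alpha / 2, df) # critical t value
    # Power = probability the noncentral t statistic exceeds the critical value
    powers[n] = stats.nct.sf(tcrit, df, ncp) + stats.nct.cdf(-tcrit, df, ncp)
    print(n, round(powers[n], 3))
```

Even with 100 participants, power for this small effect is only around 50%, which is why such effects so often go undetected or, when detected, are overestimated.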

In sum, reporting confidence intervals is a great protection against the not uncommon fallacy of overestimating effect sizes, and one should do so whenever possible. There is no excuse not to add confidence intervals (or at least standard error bars) to figures such as bar graphs or to textual descriptions of average effects in manuscripts. For some analyses, reporting confidence intervals can be difficult due to visualization or technical constraints (for example, when reporting the results of a whole-brain analysis of fMRI data; Kriegeskorte, Lindquist, Nichols, Poldrack, & Vul, 2010). When this is the case, keep in mind that one is only reporting how improbable the observed data would be if the null hypothesis were true and *not* estimating the size of an effect. If cognitive scientists can take such lessons to heart, we’ll surely waste less time chasing effects that eventually fade to unimportance and more quickly discover yet more phenomena that stand the test of time.

-David Groppe

REFERENCES:

Cohen, J. (1988). *Statistical Power Analysis for the Behavioral Sciences* (2nd ed.). Hillsdale, N.J.: Lawrence Erlbaum Associates.

Gelman, A., & Weakliem, D. (2009). Of beauty, sex, and power. *American Scientist*, 97(4). DOI: 10.1511/2009.79.310

Kriegeskorte, N., Lindquist, M., Nichols, T., Poldrack, R., & Vul, E. (2010). Everything you never wanted to know about circular analysis, but were afraid to ask. *Journal of Cerebral Blood Flow & Metabolism*, 30(9), 1551-1557. DOI: 10.1038/jcbfm.2010.86

Vul, E., Harris, C., Winkielman, P., & Pashler, H. (2009). Puzzlingly high correlations in fMRI studies of emotion, personality, and social cognition. *Perspectives on Psychological Science*, 4(3), 274-290. DOI: 10.1111/j.1745-6924.2009.01125.x

FOLLOW UP: Recently, one of the scientists featured in the New Yorker article, psychologist Jonathan Schooler, published an essay in Nature about this fallacy, which he calls the “decline effect.” To reduce the frequency of the decline effect, he advocates the development of open access databases in which scientists post their research programs before beginning a project and all of their eventual results (including null results). You can read the whole article here:

http://www.nature.com/news/2011/110223/full/470437a.html

eeging said this on March 3, 2011 at 12:58 am

A recent example of some experiments whose “truth” will soon wear off is a paper by Daryl Bem that claims to find evidence of precognition. Tal Yarkoni has a nice post on the multiple reasons why Bem’s findings are likely to be false positives:

http://www.talyarkoni.org/blog/2011/01/10/the-psychology-of-parapsychology-or-why-good-researchers-publishing-good-articles-in-good-journals-can-still-get-it-totally-wrong/

It’s edifying to read even if you don’t care about precognition since the mistakes Bem likely made are probably prevalent in the cognitive sciences.

eeging said this on June 5, 2011 at 8:57 pm