Comparing confidence intervals

Often we are interested in the difference between the means of two populations, and whether we can infer from samples drawn from those populations that the means are different.

This is often a silly question: the means of real populations are almost always different, even if the difference is microscopic. More useful would be to estimate the difference and the probability that it is big enough to be of practical importance. See the BEST software for a way to do this in R.

Sometimes we are presented with confidence intervals for each of the means. This happens in particular with the standard packages we use for wildlife data analysis, where the output includes confidence intervals for each coefficient or real value. Can we infer evidence of a difference from confidence intervals in the same way as for a p-value from a test of significance?

Specifically, if the confidence intervals (CIs) don't overlap, does that mean that the difference is statistically significant?

I was prompted to revisit this by a blog post by Wesley, and the discovery of a paper by Payton et al. (2003) on the topic.

Type I error rate

If there is no difference between the populations underlying the two samples, we should nevertheless get a "significant at the 5% level" result 5% of the time. Does the overlapping-CIs test give the proper Type I error rate?

I ran 100,000 simulations with two samples of 15 from a Normal(0, 1) distribution, calculated 95% CIs, and checked for overlap. There was no overlap in 0.5% of the cases, far less than the desired 5%. Still, if 95% CIs don't overlap, you can claim "convincing evidence" of a difference. This is consistent with Schenker & Gentleman's (2001) theoretical results.
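The code linked at the end of this post does this properly; a minimal sketch of the simulation in R might look like this (the function name and arguments are mine, not from the original code):

    overlapTest <- function(nSims = 1e5, n = 15, sd2 = 1, level = 0.95) {
      noOverlap <- logical(nSims)
      tCrit <- qt(1 - (1 - level)/2, df = n - 1)   # t-based CI multiplier
      for(i in seq_len(nSims)) {
        x <- rnorm(n, 0, 1)
        y <- rnorm(n, 0, sd2)
        ci.x <- mean(x) + c(-1, 1) * tCrit * sd(x) / sqrt(n)
        ci.y <- mean(y) + c(-1, 1) * tCrit * sd(y) / sqrt(n)
        # No overlap if either lower bound exceeds the other's upper bound
        noOverlap[i] <- ci.x[1] > ci.y[2] || ci.y[1] > ci.x[2]
      }
      mean(noOverlap)   # Type I error rate of the overlap test
    }

    set.seed(123)
    overlapTest()   # roughly 0.005 with 95% CIs, far below the nominal 0.05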

Payton et al. (2003) show that the error rate depends on the ratio of the standard errors (SEs) of the two samples, but suggest using 84-85% CIs if that ratio is < 2. I reran the simulations with 85% CIs: the error rate was 4.1%, and with one sample from Normal(0, 1) and the other from Normal(0, 2) it was 5.2%.
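With the sketch function above, the 85% version only changes the confidence level (and, for the unequal-SD case, the SD of the second sample):

    set.seed(123)
    overlapTest(level = 0.85)            # around 4%, close to the nominal 5%
    overlapTest(level = 0.85, sd2 = 2)   # around 5% with unequal population SDs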

So, if the ratio of SEs is < 2 (which it usually will be if sample sizes are approximately equal), calculate 85% CIs instead of 95% CIs; if they don't overlap the result is "significant" at approximately the 5% level.

Unfortunately, packages such as PRESENCE and MARK just give 95% CIs, with no option to change this. So what can we do with 95% CIs?

One idea is in the middle row of the diagram below, where the point estimates for each mean lie outside the other confidence interval:

The diagram is based on Ramsey & Schafer (2002, p. 139), but they do stress that "The proper course of action for judging whether two means are equal is to carry out a t-test directly", and they don't link "convincing" or "strong" evidence to a p-value or "significance".

I reran 100,000 simulations with two samples of 15 from a Normal(0, 1) distribution, calculated 95% CIs, and checked to see how frequently mean(x) was outside CI(y) and mean(y) was outside CI(x). This happened 11% of the time, more than the target 5%. With a ratio of SEs of 2, that dropped to 7.4%. With larger sample sizes, these rates got worse. This doesn't seem to be a safe basis for inferences in general.
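A sketch of that check, following the same pattern as the overlap simulation above (again, the name and defaults are mine):

    meansOutsideTest <- function(nSims = 1e5, n = 15, sd2 = 1, level = 0.95) {
      flagged <- logical(nSims)
      tCrit <- qt(1 - (1 - level)/2, df = n - 1)
      for(i in seq_len(nSims)) {
        x <- rnorm(n, 0, 1)
        y <- rnorm(n, 0, sd2)
        ci.x <- mean(x) + c(-1, 1) * tCrit * sd(x) / sqrt(n)
        ci.y <- mean(y) + c(-1, 1) * tCrit * sd(y) / sqrt(n)
        # "Evidence of a difference" if each sample mean lies outside the other CI
        flagged[i] <- (mean(x) < ci.y[1] || mean(x) > ci.y[2]) &&
                      (mean(y) < ci.x[1] || mean(y) > ci.x[2])
      }
      mean(flagged)   # about 0.11 with equal SDs, well above the nominal 0.05
    }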

Comparing overlap-test to t-test

The approach in Wesley's post is different. That post looked at examples where the null hypothesis was false (population means of 0 and 1) and compared the overlap test with the p-value from a t-test for each simulated data set, using only 95% CIs.

All the samples with no overlap of 95% CIs had p < 5%; indeed almost all had p < 1%. So again, no overlap corresponds to convincing evidence of a difference. With larger samples (30 instead of 15) the p-values decreased further.

On the other hand, about 65% of samples with overlapping CIs still had p < 5%, so overlapping CIs do not mean the difference is not significant.

Of the samples with overlapping CIs but mean(x) outside CI(y) and mean(y) outside CI(x), about 80% were "significant", i.e. a t-test on the same data gave p < 5%. But only 80%, not 100%. With larger samples this proportion went up.
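A sketch of this kind of comparison, with population means of 0 and 1 and 95% CIs (the code below is my own illustration, not Wesley's):

    compareTests <- function(nSims = 1e4, n = 15, level = 0.95) {
      noOverlap <- logical(nSims)
      pVal <- numeric(nSims)
      tCrit <- qt(1 - (1 - level)/2, df = n - 1)
      for(i in seq_len(nSims)) {
        x <- rnorm(n, 0, 1)   # null hypothesis is false:
        y <- rnorm(n, 1, 1)   #   population means differ by 1
        ci.x <- mean(x) + c(-1, 1) * tCrit * sd(x) / sqrt(n)
        ci.y <- mean(y) + c(-1, 1) * tCrit * sd(y) / sqrt(n)
        noOverlap[i] <- ci.x[1] > ci.y[2] || ci.y[1] > ci.x[2]
        pVal[i] <- t.test(x, y)$p.value   # Welch t-test by default
      }
      # Proportion of "significant" t-tests among non-overlapping vs overlapping CIs
      c(nonOverlapping = mean(pVal[noOverlap] < 0.05),
        overlapping    = mean(pVal[!noOverlap] < 0.05))
    }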

Conclusion

If 95% confidence intervals don't overlap, you have convincing evidence of a difference. You can't deduce anything if they do overlap.

If they overlap, checking where the sample means lie in relation to the CIs doesn't seem to be a workable solution.

If you can use 85% confidence intervals instead (and SEs are approximately equal), non-overlap corresponds to "significance" at the 5% level.

Code to try this for yourself is here.

References

Payton, M E; M H Greenstone; N Schenker. 2003. Overlapping confidence intervals or standard error intervals: What do they mean in terms of statistical significance? Journal of Insect Science 3:34.

Ramsey, F L; D W Schafer. 2002. The statistical sleuth: a course in methods of data analysis. Duxbury Press, Belmont CA.

Schenker, N; J F Gentleman. 2001. On judging the significance of differences by examining the overlap between confidence intervals. American Statistician 55:182-186.

Updated 27 March 2013 by Mike Meredith