The Mythical Three-Sigma Result

In a recent paper ("The Physics Case for Extended Tevatron Running", arXiv:1011.1532), D. Wood presented the physics case for running the Tevatron 3 years beyond the end of 2011, accumulating a total of 16 fb^{-1} of data, at a total operating cost of $120 million. The big prize that the Tevatron would chase was the discovery of the Standard Model Higgs. Figure 5 showed the sensitivity that would be achieved: over the entire preferred mass range, the experiments would be able to detect a Higgs particle at no less than 3 sigma significance. Let us recap: we would spend $120 million to achieve a 3 sigma result.

Also recently (arXiv:1104.0699), CDF published a result hinting at the detection of a new particle in the existing data. If real, it would signal the existence of new physics beyond the Standard Model. The "bump" in the energy histogram had a significance of, again, something like 3 sigma. However, soon thereafter D0 published its version of the same measurement and found nothing. The two groups went off to reconcile their differences.

I then asked my collider experiment friends how it is that one 3-sigma result is worth spending $120 million on while another 3-sigma result turns out to be specious. The responses were rather interesting: "No no, the Higgs result would be much more significant than the CDF bump." Really? Why is it that some 3-sigma results are more significant than others? (Here is a blog post that echoes this concept more blithely: "... there are more significant 3 sigmas, less significant 3 sigmas, and astrophysical 3 sigmas.")

What is going on? Here are some issues that seem to be ignored.

  1. How many places did you search? Experiments produce many different data sets, and one is always on the lookout for an unusual outlier in the data that might signify something new. If you search in only one place in the data, the significance of a result is straightforward to compute. However, if you search in dozens of places, you are likely to find an outlier by chance even though it may be of no significance in the end; a proper calculation of significance should incorporate an allowance for the number of places searched. How many experiments do so?

  2. Are your statistics Gaussian? Formally, for a Gaussian distribution of errors, a 3-sigma result will occur by chance of order 0.1% of the time (about 0.13% for a one-sided deviation, 0.27% for a two-sided one). Since many measurements involve the sum of many random variables, the central limit theorem tells us that the error distribution does, indeed, approach a Gaussian; however, deviations from a Gaussian will likely show up most strongly out in the tails of the distribution. A simple case from astrophysics is found in absorption line profiles in stellar atmospheres or in the interstellar medium. To a high degree of accuracy, the absorption line profile is governed by thermal Doppler broadening, which is a very Gaussian process; however, at large optical depths, the wings of the line are dominated by a Lorentz profile, which is the intrinsic absorption cross-section due to quantum mechanical processes. The characteristic width of the Lorentz profile is usually much smaller than that of the Doppler profile, but a Lorentz profile has a power-law shape well away from the line center and eventually dominates over the Gaussian fall-off of the Doppler profile.

  3. What is the "error on your error"? In other words, even if your statistics are Gaussian, how well do you know the true standard deviation? It is rare that one knows it with infinite precision. Often, our knowledge of the significance of a result is limited not by errors in the measurement but by errors in the error.
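
To put rough numbers on the three issues above, here is a minimal Python sketch. It is an illustration only, not a calculation taken from either experiment: the function names, the trial count of 50, and the sample size of 10 are all assumptions chosen for the example, and the searches are assumed to be statistically independent (real analyses rarely are).

```python
import math


def gaussian_two_sided_p(n_sigma):
    """Chance that a Gaussian variable lands more than n_sigma from the mean, either side."""
    return math.erfc(n_sigma / math.sqrt(2.0))


def global_p(local_p, n_trials):
    """Chance of at least one fluctuation this extreme among n_trials independent searches."""
    return 1.0 - (1.0 - local_p) ** n_trials


def lorentz_two_sided_p(n_widths):
    """Chance that a Lorentzian (Cauchy) variable lands more than n_widths half-widths
    (half width at half maximum) from the center, either side."""
    return 1.0 - (2.0 / math.pi) * math.atan(n_widths)


def sigma_fractional_error(n_samples):
    """Approximate fractional uncertainty on a standard deviation estimated
    from n_samples Gaussian measurements: 1 / sqrt(2 (n - 1))."""
    return 1.0 / math.sqrt(2.0 * (n_samples - 1))


if __name__ == "__main__":
    local = gaussian_two_sided_p(3.0)

    # Issue 1: searching in many places (50 is an assumed, illustrative number).
    print("local p at 3 sigma:        ", local)
    print("global p after 50 searches:", global_p(local, 50))

    # Issue 2: heavy tails. Same "3 widths out", very different probability.
    print("Lorentzian p at 3 widths:  ", lorentz_two_sided_p(3.0))

    # Issue 3: error on the error (10 samples is an assumed, illustrative number).
    frac = sigma_fractional_error(10)
    print("fractional error on sigma: ", frac)
    print("3 sigma if sigma was underestimated by that much:", 3.0 / (1.0 + frac))
```

Under these assumed inputs, a 0.27% local probability grows to roughly 13% once 50 independent places have been searched; a Lorentzian tail puts about 20% of the probability beyond three half-widths; and a standard deviation estimated from only 10 measurements is uncertain by about 24%, enough to turn a nominal 3 sigma into about 2.4 sigma.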

Putting all of this together, it is clear that the generic concept of "three-sigma" (or even "n-sigma", whatever the value of n) is of limited meaning: it usually comes with baggage.