The Mythical Three-Sigma Result
In a recent paper ("The Physics Case for Extended Tevatron Running",
arXiv:1011.1532),
D. Wood presented the physics case for running the Tevatron 3 years beyond
the end of 2011, accumulating a total of 16 fb^{-1} of data, at a total
operating cost of $120 million. The big prize that the Tevatron would
chase was the discovery of the Standard Model Higgs. Figure 5 showed the
sensitivity that would be achieved, and it showed that over the entire
preferred mass range, the experiments would be able to detect a Higgs
particle at no less than 3 sigma significance. Let us recap - we would
spend $120 million to achieve a 3 sigma result.
Also recently (arXiv:1104.0699),
CDF published a result hinting at the detection of a new
particle in the existing data. If real, it would signal the existence
of new physics beyond the standard model. The "bump" in the energy
histogram also had a significance of roughly 3 sigma. However,
soon thereafter D0 published its version of the same measurement and found
nothing. The two groups went off to reconcile their differences.
I then asked my collider-experiment friends how it is that
one 3-sigma result is worth spending $120 million on while another 3-sigma
result turns out to be specious. The responses were rather interesting:
"No no, the Higgs result would be much more significant than the CDF bump."
Really? Why is it that some 3-sigma results are more significant than others?
(Here is a blog post that echoes this concept more blithely: "... there
are more significant 3 sigmas, less significant 3 sigmas, and
astrophysical 3 sigmas.")
What is going on? Here are some issues that seem to be ignored.
- How many places did you search? Experiments produce many different data
sets, and one is always on the lookout for an unusual outlier in the
data that might signify something new. If you search in only one place
in the data, the significance of a result is straightforward to compute.
However, if you search in dozens of places, you are likely to find an
outlier by chance even though it may be of no significance in the end;
a proper calculation of significance should incorporate an allowance
for the number of places searched. How many experiments do so?
- Are your statistics Gaussian? Formally, for a Gaussian distribution
of errors, a 3-sigma result will occur by chance about 0.1% of the time
(one-sided; about 0.3% if deviations of either sign count).
Since many measurements involve the sum of many random variables, the
central limit theorem tells us that the error distribution does, indeed,
approach a Gaussian; however, deviations from a Gaussian will likely
show up most strongly out in the tails of the distribution. A simple
case from astrophysics is found in absorption line profiles in stellar
atmospheres or in the interstellar medium. To a high degree of accuracy,
the absorption line profile is governed by thermal Doppler broadening,
which is a very Gaussian process; however, at large optical depths,
the profile is dominated by a Lorentz profile, which is the intrinsic
absorption cross-section due to quantum mechanical processes. The
characteristic width of the Lorentz profile is usually much smaller than that of
the Doppler profile, but a Lorentz profile has a power-law shape
well away from the line center and dominates over the Gaussian
fall-off of the Doppler profile eventually.
- What is the "error on your error"? In other words, even if your statistics
are Gaussian, how well do you know the true standard deviation? It is rare
that one knows it with infinite precision. Often, knowledge of the
significance of a result is limited not by errors in the measurement, but
rather by errors in the error.
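The first point, the number of places searched, is easy to put in rough
numbers. The sketch below is only an illustration: it assumes the searches
are statistically independent, which is rarely true of real histograms, and
the function names are mine rather than anything from the papers cited
above. It converts a local 3-sigma excess into the chance of seeing at
least one such excess anywhere among N searches, and then back into an
equivalent number of sigma.

```python
import math

def gaussian_tail(n_sigma):
    """One-sided probability of a Gaussian fluctuation above n_sigma."""
    return 0.5 * math.erfc(n_sigma / math.sqrt(2.0))

def global_p_value(local_p, n_trials):
    """Chance of at least one fluctuation this extreme in n_trials
    searches, ASSUMING the searches are independent."""
    return 1.0 - (1.0 - local_p) ** n_trials

def sigma_from_p(p, lo=0.0, hi=10.0):
    """Invert gaussian_tail by bisection (valid for 0 < p <= 0.5)."""
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if gaussian_tail(mid) > p:
            lo = mid   # tail still too large: need more sigma
        else:
            hi = mid
    return 0.5 * (lo + hi)

local_p = gaussian_tail(3.0)
for n_trials in (1, 10, 50, 100):
    p = global_p_value(local_p, n_trials)
    print(f"{n_trials:4d} places searched: global p = {p:.4f}, "
          f"equivalent to {sigma_from_p(p):.2f} sigma")
```

Under these assumptions, a 3-sigma outlier found after looking in 50
independent places has a chance probability of several percent, not 0.1%.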
Taken together, these points make it clear that the generic concept of "three-sigma"
(or even "n-sigma", whatever the value of n) is of limited meaning -
it usually comes with baggage.
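Some of that baggage can be weighed with a few lines of arithmetic. The
sketch below is an illustration under assumptions of my own choosing: a
pure Lorentzian whose half-width is one tenth of the Gaussian sigma stands
in for non-Gaussian tails (a real line profile is a convolution of the
two), and a true standard deviation 20% larger than the assumed one stands
in for the "error on the error". Neither number comes from any
measurement.

```python
import math

def gaussian_tail(n_sigma):
    """One-sided probability of a Gaussian fluctuation above n_sigma."""
    return 0.5 * math.erfc(n_sigma / math.sqrt(2.0))

def lorentzian_tail(x, gamma=1.0):
    """One-sided tail probability above x for a Lorentzian (Cauchy)
    distribution with half-width at half-maximum gamma."""
    # atan2(gamma, x) equals pi/2 - atan(x / gamma) without cancellation
    return math.atan2(gamma, x) / math.pi

def tail_with_misjudged_sigma(n_sigma, true_over_assumed):
    """Actual Gaussian tail probability when the true standard deviation
    is true_over_assumed times the value used to quote n_sigma."""
    return gaussian_tail(n_sigma / true_over_assumed)

GAMMA = 0.1          # assumed Lorentzian half-width, in units of sigma
SIGMA_ERROR = 1.2    # assumed ratio of true sigma to estimated sigma

for n in (1.0, 3.0, 5.0):
    print(f"{n:.0f} sigma: Gaussian {gaussian_tail(n):.2e}, "
          f"narrow Lorentzian {lorentzian_tail(n, GAMMA):.2e}, "
          f"Gaussian with misjudged sigma "
          f"{tail_with_misjudged_sigma(n, SIGMA_ERROR):.2e}")
```

Near the core the narrow Lorentzian is negligible, but by 3 sigma its tail
already exceeds the Gaussian one, and a modest misjudgment of sigma changes
the chance probability of a nominal 3-sigma result by a factor of several.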