Friday, October 8, 2010

A Cult of Statistical Significance?

There is an old saying that's made the rounds for many years: "Lies, damned lies, and statistics". The imputation is that the use of statistics more often than not constitutes an abuse of empirical observations and not a proper use. The feeling persists that somehow the one performing the statistical analysis is making fools out of his intended audience.

This finds new meaning in a recent mathematical book, The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice and Lives, by S.T. Ziliak and D. McCloskey. Of course, one can most easily grasp this sort of cult when basic percentages are misused. For example, as reported some years ago, a certain drug trial of a particular statin found that the rate of heart attacks in a control group was 3% vs. 2% for those who took the statin, hence they concluded the statin is: (3% - 2%)/ 3% x 100% = 1/3 x 100% = 33% more effective in reducing heart attacks.

This is one of the most egregious claims ever made. In absolute terms, the difference between the control group and the statin takers is only one percentage point: don't take the statin and 97 of every 100 avoid a heart attack, do take it and 98 of every 100 avoid one. Big deal! Is it worth it to be one of the 98, given the liver damage statins can cause?
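The gap between the relative and absolute figures can be checked in a few lines. The following is a minimal sketch using the 3% and 2% rates quoted above; the function name and the "number needed to treat" figure (the reciprocal of the absolute reduction, a standard epidemiological summary) are additions for illustration and do not come from the trial report.

```python
def risk_reduction(control_rate, treated_rate):
    """Return (absolute reduction, relative reduction, number needed to treat)."""
    absolute = control_rate - treated_rate   # difference in event rates
    relative = absolute / control_rate       # the headline "33%" figure
    nnt = 1.0 / absolute                     # patients treated per event avoided
    return absolute, relative, nnt


absolute, relative, nnt = risk_reduction(0.03, 0.02)
print(f"absolute reduction: {absolute:.1%}")
print(f"relative reduction: {relative:.1%}")
print(f"number needed to treat: {nnt:.0f}")
```

On these rates, roughly 100 people would have to take the statin for one heart attack to be avoided, which is the same information as the "33%" headline seen from the other side.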

In the book, the authors bring up the standard template for much ill-practice: a scientist formulates some null hypothesis, then by means of a "significance test" (e.g. Student's t-test) tries to falsify it, with the result leading to a "p-value" which is taken to determine how likely the null hypothesis is to be true. (Strictly, the p-value is the probability of obtaining data at least as extreme as those observed, given that the null hypothesis is true.)

A key point made by the authors (and also germane to the statin study) is that too many researchers get so obsessive about statistical significance that they omit to ask themselves whether the detected discrepancies are large enough to be of any subject matter significance.

Another specific error which crops up a lot in many studies is conflating the probability of the observed data given the null hypothesis with the probability of the null hypothesis given the data. The latter can't be obtained unless one uses Bayesian statistics, an area still relatively rare in fields from medicine to economics. The error then made is called "the fallacy of the transposed conditional".

This error, by the way, doesn't require lots of numerical data to be exposed. A classic way it appears is for a biblical archaeologist, for example, to locate some "ark" purported to be thousands of years old, which has parts that are dated using radio-isotopes (like C14), and it is then concluded this was the "ark" used by Noah in Genesis. The null hypothesis effectively set out is: "There is no significant difference between the radioactive dating findings, materials found and that expected for Noah's Ark".

But this is precisely conflating the probability of the observed data (say some ark found in Turkey) given the null hypothesis (above) with the probability of the null hypothesis given the data. Thus the claimant certainly commits "the fallacy of the transposed conditional". The key point missed by the claimant (from the Bayesian probability perspective) is that the ark found need not have been "Noah's" but one of many that were commonly used in the era. Hence, the data proffered, whether radioactive or other, is simply insufficient to make a unique identification of Noah's ark (assuming Noah even existed in the first place and wasn't a myth like "Paul Bunyan").
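The difference between the two conditional probabilities can be made concrete with Bayes' theorem. The sketch below assumes just two competing hypotheses (the null and one alternative); the prior and the two likelihoods are invented numbers chosen only to show how far apart P(data | null) and P(null | data) can be, and the function name is hypothetical.

```python
def posterior_null(prior_null, p_data_given_null, p_data_given_alt):
    """P(null | data) by Bayes' theorem, for a null and a single alternative."""
    numerator = p_data_given_null * prior_null
    evidence = numerator + p_data_given_alt * (1.0 - prior_null)
    return numerator / evidence


# Hypothetical inputs: the data look "significant" under the null (0.05),
# but the null was considered quite plausible beforehand (0.9).
p_data_given_null = 0.05
posterior = posterior_null(prior_null=0.9,
                           p_data_given_null=p_data_given_null,
                           p_data_given_alt=0.5)
print(f"P(data | null) = {p_data_given_null:.2f}")
print(f"P(null | data) = {posterior:.2f}")
```

With these made-up inputs the data have only a 5% chance of arising under the null, yet the null still retains close to a 47% posterior probability. Reading the first number as if it were the second is the transposed conditional.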

Now, in the setting of solar flare studies, one commonly forms the null hypothesis by reference to observations that have been previously made over long times and validated, then applies these to more specific regions and sunspot groups. One of the ways one forges the null hypothesis is directly in terms of changing sunspot morphology: say based upon increasing (or decreasing) sunspot area, magnetic field strength, and a quantitative measure called "complexity ratio", which is the ratio of complex sunspot configurations to non-complex ones in a given sunspot group over a defined period.

A fairly straightforward null hypothesis might then be: The frequency of solar flares (F) is independent of sunspot area (A).

If this null hypothesis is valid, then there should not be any significant correlation between the incidence of flares (say occurring per day per sunspot group) and the area of a spot group. However, if a plot is obtained such as shown in Fig. 1, and the Pearson product-moment correlation coefficient is r = 0.75, the null hypothesis is likely invalidated and the opposite holds: There IS a relationship between the area (size) of a spot group and flare frequency.
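For anyone wishing to reproduce such a coefficient, here is a minimal sketch of the Pearson product-moment calculation in plain Python. The area and flare values are invented placeholders, not the data behind Fig. 1, and the function name is an assumption.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical spot-group areas (A) and flares per day (F)
areas = [100, 200, 300, 400, 500, 600, 700, 800]
flares = [0.4, 0.9, 0.7, 1.6, 1.3, 2.2, 1.9, 2.8]
print(f"r = {pearson_r(areas, flares):.2f}")
```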

If I obtain a linear regression equation of the form:

F = m(A) + C

then the testing of the null hypothesis is in fact a test of the statistical hypothesis that the gradient, m, equals some specified value for the population, in this case: m = 0. For an application of statistical significance, one would likely use the t-distribution, for which the standard error of the gradient, s(m), is:

s(m) = [s(y,x)^2 / (s(x)^2 (n - 1))]^1/2

where s(y,x) is the standard error of the regression line, s(x)^2 is the variance associated with the independent variable, and n = no. of data points. The standard error of the regression line can be computed from:

s(y,x)^2 = {(n - 1)/(n - 2)} (s(y)^2 - m^2 s(x)^2)

where s(y)^2 and s(x)^2 are just the sample variances in the dependent and independent variables, respectively.

The t-variate or t-statistic is then:

t = (m - 0)/ s(m)

which has a t-distribution with (n - 2) degrees of freedom. With the relevant value of t, it is possible to determine the p-value: the probability of obtaining a t at least this extreme if the null hypothesis were true. The null hypothesis is rejected when p falls below a chosen threshold. Here, the smaller the value of p, the stronger the evidence against the null hypothesis. For t-values in terms of solar flare parameters, the generally accepted norm is p = 0.025 or less, which translates to significance levels of 2½% or less at either end of the distribution.
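The whole chain, from data to gradient to t to tail probability, can be sketched in plain Python. This is an illustration under stated assumptions, not the procedure used for any published flare study: it uses the usual least-squares expressions for the gradient and its standard error, the data are invented, the function names are hypothetical, and the tail area of the t-distribution is found by simple numerical integration (Simpson's rule) rather than from tables.

```python
import math


def slope_t_test(xs, ys):
    """Least-squares fit F = m*A + C; returns (m, C, s_m, t, degrees of freedom)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    c = mean_y - m * mean_x
    resid_var = (syy - m * m * sxx) / (n - 2)  # squared standard error of the line
    s_m = math.sqrt(resid_var / sxx)           # standard error of the gradient
    return m, c, s_m, (m - 0.0) / s_m, n - 2


def t_pdf(x, df):
    """Probability density of Student's t with df degrees of freedom."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    return math.exp(log_coef) / math.sqrt(df * math.pi) * (1 + x * x / df) ** (-(df + 1) / 2)


def one_tail_p(t, df, steps=20000):
    """Area in one tail beyond |t|, by Simpson's rule on [0, |t|] (steps must be even)."""
    h = abs(t) / steps
    total = t_pdf(0.0, df) + t_pdf(abs(t), df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return max(0.0, 0.5 - total * h / 3)


# Hypothetical spot-group areas (A) and flares per day (F)
areas = [100, 200, 300, 400, 500, 600, 700, 800]
flares = [0.4, 0.9, 0.7, 1.6, 1.3, 2.2, 1.9, 2.8]
m, c, s_m, t, df = slope_t_test(areas, flares)
p = one_tail_p(t, df)
print(f"m = {m:.5f}, C = {c:.3f}, s(m) = {s_m:.5f}, t = {t:.2f}, df = {df}")
print(f"one-tail p = {p:.5f}; reject m = 0 at the 2.5% level: {p <= 0.025}")
```

A perfectly linear data set would give a zero residual variance and so an undefined t, which this sketch does not guard against.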

Another way to pose this is in terms of the "confidence level". If the computed t-variate has a corresponding p <= 0.025 in each tail, the null hypothesis is rejected at the 95% confidence level. If the limits were changed to p equal to or less than 0.05, we'd reduce the confidence level to 90%. These t-variates and p-values are used in conjunction with the correlation coefficient to ensure one doesn't make the error of making a significance test too powerful. Thus in the real world of solar phenomena we'd not expect that m = 0 in practice but rather m ~ 0, i.e. close to it. The aim of testing then is to ensure the deviation in gradient is sufficient to make the cut for actual significance. (This is another reason one needs to factor in variances, or the magnitudes of the error bars, as well.)
All the above is presented to show that all solar data can easily be checked and confirmed via outside sources. There's no hidden mumbo jumbo or "baffle with bullshit". Once a person has the data, he can similarly ramp up his computer, plug the values in, get the graph of F vs. A, get the standard errors, the variances, and gradient (m) as well as t-values, p-values and confidence levels and ascertain if the claim is verified.
Of course, not all significance findings are easily obtained! In the case of Poisson investigations, these are not so straightforward. (In these investigations, one typically may look at "no flare" days for particular spot classes and assess whether any null hypothesis makes the mark.) Fig. 2 shows an example of one Poisson distribution obtained for Beta-class no-flare days in 1980. To find significance in these cases, one usually resorts to the chi-squared distribution and chi-squared test. One may also examine different Poisson distributions, including standard and modified.
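As a rough illustration of the chi-squared approach, the sketch below fits a standard Poisson distribution to a table of daily flare counts (the "no flare" days being the zero bin) and computes the goodness-of-fit statistic. The counts are invented and are not the 1980 Beta-class data of Fig. 2; the function name and the pooling of the upper tail into one bin are assumptions made for the example. The critical value 5.991 is the standard 5% point of chi-squared with 2 degrees of freedom.

```python
import math


def poisson_chi_squared(freq, tail_from):
    """Chi-squared goodness of fit of daily counts to a Poisson distribution.

    freq maps flares-per-day -> number of days; counts >= tail_from are pooled.
    Returns (estimated mean, chi-squared statistic, degrees of freedom).
    """
    n_days = sum(freq.values())
    lam = sum(k * days for k, days in freq.items()) / n_days  # sample mean
    observed = [freq.get(k, 0) for k in range(tail_from)]
    observed.append(sum(days for k, days in freq.items() if k >= tail_from))
    probs = [math.exp(-lam) * lam ** k / math.factorial(k) for k in range(tail_from)]
    probs.append(1.0 - sum(probs))                            # pooled upper tail
    expected = [n_days * p for p in probs]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 2  # bins, minus 1, minus 1 for the estimated mean
    return lam, chi2, df


# Hypothetical tally: 22 no-flare days, 21 one-flare days, and so on
days_by_count = {0: 22, 1: 21, 2: 11, 3: 4, 4: 2}
lam, chi2, df = poisson_chi_squared(days_by_count, tail_from=3)
print(f"mean = {lam:.2f} flares/day, chi-squared = {chi2:.3f}, df = {df}")
print("Poisson null rejected at 5%:", chi2 > 5.991)
```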
The bottom line here is that any "cult of statistical significance" which undermines integrity in assorted studies is usually a function of the studies themselves, or the people conducting them. Once people play by the rules, and use open methods (and data available to all), there is no excuse for formulating any "cults", statistical or other. If, however, people such as prescription drug makers can hide behind "proprietary information" to avoid sharing their statistical data, then anything goes.
