For example, in observing the Sun over a period of months or years, one can accumulate data on changing sunspot area and on the occurrence of specific types of solar flares (e.g. different optical and morphological classes). When these data are assembled into categories (e.g. A for sunspot area, F for flare frequency, etc.) and fitted statistically, they can describe a rudimentary empirical system.

If I do a linear regression to find a fit for F (flare frequency, in flares per day) vs. A (the area of a sunspot group), I may obtain a statistical function of the form:

F = S1·A + C

where S1 is the slope of the fitted line and C is the intercept on the F (vertical) axis. This is a mathematical representation of an empirical relationship. The mathematical form is dictated by how well the data fit, which is a matter of applying standard statistical tests. The empirical system that results is not a matter of “belief” but simply of the degree of goodness of fit for the data.
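For the curious, a least-squares fit of this form can be reproduced in a few lines of Python. This is only a sketch: the area and flare-count values below are invented for illustration, not taken from any real data set.

```python
# Least-squares fit of flare frequency F (flares/day) vs. sunspot group area A.
# NOTE: the data values below are invented purely for illustration.

A = [100.0, 250.0, 400.0, 550.0, 700.0]   # sunspot group areas (arbitrary units)
F = [0.4, 1.1, 1.5, 2.3, 2.8]             # flares per day

n = len(A)
mean_A = sum(A) / n
mean_F = sum(F) / n

# Slope S1 and intercept C of the best-fit line F = S1*A + C
S1 = (sum((a - mean_A) * (f - mean_F) for a, f in zip(A, F))
      / sum((a - mean_A) ** 2 for a in A))
C = mean_F - S1 * mean_A

print(f"F = {S1:.4f}*A + {C:.4f}")
```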

In acquiring the data via random sample, and then assembling it, a precise statistical procedure called a “least squares regression” yields the line that best fits the data points. The beauty is that once the raw data are available, and key parameters such as r, the **Pearson product-moment correlation coefficient**, are known, one can easily ascertain whether any trickery is afoot.

In my seminal paper on the linkage of sunspot morphology and geo-effective solar flare occurrence (**Solar Physics**, Vol. 88, p. 137), for example, I found for the case of major flares (those with a sudden ionospheric importance > +2) that r = 0.838. This is almost as good as it gets, given a perfect correlation is 1.0 while a total non-correlation is 0.0. Even better, the t-variate was computed to be t = 7.83 and the P-value was < 0.0001. For non-statisticians, the P-value denotes the smallest significance level at which the null hypothesis can be rejected. (The null hypothesis in this case would be: “Sunspot area has no relation to flare occurrence.”)
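For readers who want to verify such figures themselves, the t-variate for a correlation coefficient follows directly from r and the sample size n. A minimal sketch, in which the sample size n = 28 is an assumption chosen purely for illustration (the actual n is in the paper, not quoted here):

```python
# t-statistic for testing whether a Pearson correlation r differs from zero:
#   t = r * sqrt(n - 2) / sqrt(1 - r^2),  with n - 2 degrees of freedom.
import math

r = 0.838   # correlation coefficient quoted in the text
n = 28      # ASSUMED sample size, for illustration only

t = r * math.sqrt(n - 2) / math.sqrt(1 - r * r)
print(f"t = {t:.2f}")
```

With SciPy available, the corresponding one-sided P-value would follow from Student's t distribution as `scipy.stats.t.sf(t, n - 2)`.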

I set all this out as background, to provide a basis by which to judge a new study that apparently uses correlation to ascertain that “taller people tend to have higher IQs than shorter people”.

According to a Denver Post story (*'Link Between Height and IQ: Size, Smarts Scale Same Biological Ladder'*, 9/8, p. 14A), researchers at the University of Colorado have “cracked the connection between height and IQ”. The article calls the correlation “modest” but then cites the correlation coefficient as “0.1” - which is the next thing to a bad joke! If I had presented a correlation coefficient that low for my thesis, even before submitting a paper to Solar Physics, I’d have been advised by my supervisor to look for another project - and that before being laughed out of his office!

The same Post article states:

*“there is an overlap of the genes to define size and intelligence. The other factor is that taller people are likelier to mate with smarter people and vice versa, producing children that are likelier to have both traits.”*

According to one of the CU researchers, an assistant professor named Matthew Keller, while these relationships have long been theorized, “it has always been very difficult to tease apart the two potential genetic reasons”.

Hmmmmm… perhaps because there are NONE, and you are like Don Quixote, tilting at windmills of your own mind.

Now, let’s consider this further. We have an independent variable we designate as IQ (and let’s be clear that this is still a controversial measure) and two dependent variables - call them height (h) and mating propensity (m), say the willingness of a similarly heighted smart person to hitch to another. With two dependent variables in play, a multiple regression correlation is what is sought.
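A multiple regression of this kind can be sketched with NumPy's least-squares solver. Everything here is hypothetical: the data values are invented, and the variable roles simply follow the setup above, with IQ regressed on h and m:

```python
# Multiple regression of IQ on height (h) and "mating propensity" (m).
# NOTE: all data values below are invented purely for illustration.
import numpy as np

h  = np.array([160.0, 165.0, 170.0, 175.0, 180.0, 185.0])  # height, cm
m  = np.array([0.2, 0.5, 0.3, 0.7, 0.6, 0.9])              # hypothetical propensity score
iq = np.array([98.0, 101.0, 99.0, 103.0, 102.0, 106.0])    # IQ scores

# Design matrix with a leading column of ones for the intercept term
X = np.column_stack([np.ones_like(h), h, m])
coef, *_ = np.linalg.lstsq(X, iq, rcond=None)

# Multiple correlation R: correlation between observed and fitted values
fitted = X @ coef
R = np.corrcoef(iq, fitted)[0, 1]
V = 100 * R**2   # coefficient of multiple determination, in percent
print(f"R = {R:.3f}, V = {V:.1f}%")
```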

In this case one will also wish to look at the coefficient of multiple determination, which is basically given by V = 100·r² %, where again r is the Pearson product-moment correlation coefficient - but this time for the multiple regression. For r = 0.1, V = 1%, which means only 1 percent of the variance is accounted for by the regression on the dependent variables. The other 99% is down to random factors - i.e. precisely what the null hypothesis asserts! In other words, the Post’s claim (which I am certain was also vetted by the study’s authors) that though the correlation is weak, “it is highly unlikely to be due to chance”, is total balderdash. In fact, it’s nearly **ALL due to chance**!
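The arithmetic for the Post's quoted coefficient is trivial to check:

```python
# Coefficient of determination V = 100 * r^2 (percent of variance explained),
# using the r = 0.1 correlation quoted in the Post article.
r = 0.1
V = 100 * r ** 2
print(f"V = {V:.0f}%")   # V = 1%
```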

At least the prime researcher, Keller, did concede in the article that “the correlation is very insignificant”, which raises the question of why bring it up at all. For fanfare and media attention? It would have been better not to go down that path, certainly not without supplying the t-variate and P-value too! (Or, alternatively, emphasizing to the Post not to hype or make much out of the so-called correlation!) Besides, Keller also admits the “findings were not the study’s true purpose”. So again, why run to the press with it, when you ought to know the media is chronically lazy and ill-equipped to translate any findings less obvious than 2 + 2 = 4?

At the end of the day, what we have with this height-IQ malarkey is just one more example of statistics misapplied and run amok. Much like the case of Sir Cyril Burt, who tried to show, using phony statistics (in his case, from almost too-perfect correlations), that class and IQ are linked. Born into a lower class? Well, then you’re more likely to be a dummy, fit only for manual work!

What we need to start doing is hold all these forms of statistical research, including and especially economics, to higher standards - at least as high as those in the physical sciences like physics and astronomy.
