This post also contains technical material about statistical inference. Thinking about statistical inference tends to cause pain in the brain, so for those who don’t want to struggle through the technical stuff, the important points are:
- The statement ‘The result is not statistically significant at the 0.05 level’ does not imply that there is no effect, only that the data do not rule out that there is no effect. The corollary of ‘The result is not statistically significant at the 0.05 level’ is ‘We are still not that sure whether there is an effect or not.’ (‘Absence of evidence is not evidence of absence.’)
- Statements about statistical significance are statements about the data, not about the effect. With enough data, a tiny difference of no practical importance can be statistically significant, and a practically important difference can fail to reach statistical significance if there is only a small amount of data.
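The second point can be sketched in a few lines of Python (all the numbers below are made up for illustration): the same small difference between two group means gives a large p-value with small samples and a tiny one with large samples.

```python
import math

def two_sided_p(mean_diff, sigma, n_per_group):
    """Two-sided p-value for a difference between two group means,
    assuming a known standard deviation and a normal approximation."""
    z = abs(mean_diff) / (sigma * math.sqrt(2 / n_per_group))
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

# Hypothetical numbers: the same small difference (0.1 standard deviations)
# is nowhere near significant with 50 subjects per group...
p_small = two_sided_p(0.1, 1.0, 50)
# ...but is highly significant with 5000 per group.
p_large = two_sided_p(0.1, 1.0, 5000)
```

The difference hasn’t changed; only the amount of data has.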
I had gained the impression from my undergraduate textbook on probability and statistics that hypothesis testing was a routine procedure based on ideas that had been around since at least the 1930s. This is true. It is also true that there is considerable controversy.
As I understand it (and I’m not a statistician, so please comment if you know about these issues) one controversy goes back to a dispute between Fisher and Neyman, basically about whether it is necessary to consider Type II errors. In a significance test you calculate the probability, p, that you would have observed an effect of at least the size that you did observe on the assumption that there is no effect. If this probability is small you reject the hypothesis that there is no effect. If you decide to reject the ‘no-effect’ (null) hypothesis whenever p is less than 0.05 then in the long run the rate at which you will incorrectly reject the null hypothesis (Type I error) is 1 in 20. Alternatively, you can tell people the actual value of p that you obtained.
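Here is a minimal sketch of such a significance test in Python, using a made-up coin-tossing example rather than anything from the Fisher–Neyman dispute itself:

```python
import math

def p_at_least(n, k, p0=0.5):
    """Probability of observing at least k successes in n trials if the
    null hypothesis (true success probability p0) holds: a one-sided p-value."""
    return sum(math.comb(n, i) * p0**i * (1 - p0)**(n - i)
               for i in range(k, n + 1))

# Hypothetical data: 60 heads in 100 tosses of a supposedly fair coin.
p = p_at_least(100, 60)  # about 0.028, so reject the null at the 0.05 level
```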
A statement of statistical significance is a statement about the data, not about the hypothesis. It tells you how confident you can be about rejecting the null hypothesis on the basis of the data you have obtained. In some circumstances, for example, a quality controller in a factory deciding whether to reject or accept a shipment of components, you would need to know not only the probability of being fooled by fluctuations into rejecting the ‘no-effect’ hypothesis when it is true (rejecting a shipment when it is OK) but also that of failing to reject the ‘no-effect’ hypothesis when there actually is an effect (Type II error) (accepting a shipment with an unacceptable number of defects). The Neyman-Pearson approach allows you to construct a test that, for a given probability of rejecting the null hypothesis when it is true, minimizes the probability of accepting it when it is false. (More accurately, it minimizes the probability of incorrectly rejecting an alternative hypothesis. The assumption is that if you are going to take action on the basis of the test then you will either reject the null hypothesis, implicitly accepting the alternative, or reject the alternative hypothesis, implicitly accepting the null.) You can always make the probability of a Type II error smaller by accepting a higher probability of a Type I error. In real-world applications there will usually be arguments about relative costs and benefits to inform the choice. Fisher took the view that this procedure was incorrect in science:
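To make the Type I / Type II trade-off concrete, here is a small simulation sketch. The effect size, sample size and significance thresholds are arbitrary choices of mine, purely for illustration:

```python
import math
import random
import statistics

random.seed(42)
Z_CRIT = {0.05: 1.645, 0.20: 0.842}  # one-sided normal critical values

def rejects_null(sample, mu0, sigma, alpha):
    """Reject H0 (mean = mu0) when the one-sided z statistic is large."""
    z = (statistics.fmean(sample) - mu0) / (sigma / math.sqrt(len(sample)))
    return z > Z_CRIT[alpha]

def error_rates(alpha, true_effect=0.5, n=25, trials=4000):
    # Type I: rejecting when the null (mean 0, sd 1) really holds.
    type1 = sum(rejects_null([random.gauss(0, 1) for _ in range(n)], 0, 1, alpha)
                for _ in range(trials)) / trials
    # Type II: failing to reject when the alternative (mean 0.5) really holds.
    type2 = sum(not rejects_null([random.gauss(true_effect, 1) for _ in range(n)],
                                 0, 1, alpha)
                for _ in range(trials)) / trials
    return type1, type2

t1_strict, t2_strict = error_rates(0.05)
t1_loose, t2_loose = error_rates(0.20)
# Accepting a higher Type I rate (0.20 vs 0.05) buys a lower Type II rate.
```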
“It is important that the scientific worker introduces no cost functions for faulty decisions, as it is reasonable and necessary to do with an Acceptance Procedure. To do so would imply that the purposes to which new knowledge was to be put were known and capable of evaluation. If, however, scientific findings are communicated for the enlightenment of other free minds, they may be put sooner or later to the service of a number of purposes, of which we can know nothing.”
[Statistical Methods and Scientific Inference, 1956]
In this view you either reject the null hypothesis with some probability of error or you regard the null hypothesis as not (yet) proved wrong. This is basically the logical point that you can use data only to disprove a hypothesis not to prove it. It does not matter how much data you have that are consistent with a hypothesis, there is always the possibility that you will eventually encounter data that disprove it. In this case you can’t make the error of accepting a hypothesis when it is false because you would never accept any hypothesis. However, when people do need to choose what action to take based on data they act as though they accept a particular hypothesis.
One of the points of contention has been that people don’t just need to know whether there is an effect or not; they also need to know how big it might be. This is important even if you have not been able to reject the null hypothesis. For example, if you are comparing the rate of death from heart attacks for patients undergoing treatment A with that for patients undergoing treatment B, it would be quite important to know that the data that gave a high (greater than 0.05) probability of error for rejecting the null hypothesis gave a not much higher probability of error for rejecting the hypothesis that one rate was twice the other. In other words, how well does the test discriminate between the null hypothesis and other hypotheses? This information can be given as a confidence interval (the interval that you expect to include the true value of a parameter in 95% of cases) or as a calculation of the power, which measures how well the test discriminates between different possibilities.
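A sketch of the heart-attack example, with invented numbers, shows how a confidence interval conveys what the test does and does not rule out:

```python
import math

def rate_diff_ci(events_a, n_a, events_b, n_b, z=1.96):
    """Normal-approximation 95% confidence interval for the difference
    between two event rates. All numbers used below are made up."""
    pa, pb = events_a / n_a, events_b / n_b
    se = math.sqrt(pa * (1 - pa) / n_a + pb * (1 - pb) / n_b)
    diff = pa - pb
    return diff - z * se, diff + z * se

# Hypothetical trial: 30/200 deaths on treatment A, 20/200 on treatment B.
lo, hi = rate_diff_ci(30, 200, 20, 200)
# The interval includes 0 (so not significant at the 0.05 level), but it
# also includes a difference as large as the whole treatment-B rate:
# the test simply does not discriminate well with this amount of data.
```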
Why do we tie ourselves up in logical knots calculating the probability that an effect of at least the size we have observed would have been observed if the hypothesis we are trying to show is false were actually true? Why don’t we just calculate the probability that, for example, two means are different, given the data we have observed? This is the heart of one of the great divides of twentieth-century, and now twenty-first-century, science. In the theory of classical statistical inference the probability that two means are different is not a meaningful concept. The two means are fixed; we just don’t know what they are. It is the estimates of the means that we calculate from the data that have probability distributions.

There is an alternative method for statistical inference based on Bayes’ Theorem. There are two problems with this method. First, it requires us to interpret probability as a degree of belief. This upsets many scientists because one person’s degree of belief in a statement might be different from another’s. They prefer an objective definition of probability. Theirs is called the frequentist position because a popular interpretation of probability is as the long-run frequency in a large number of trials; for example, if you keep tossing a fair coin long enough, the ratio (number of tails / number of tosses) will be close to 1/2. Those who favour the degree-of-belief interpretation are known as Bayesians, for obvious reasons. The other problem with the Bayesian approach is that it involves using the data to calculate a final distribution for the parameter of interest, called the posterior distribution, from an initial assumed distribution, called the prior distribution. Unfortunately, there is no generally accepted method for choosing a prior distribution. This does not matter much if you have lots of data, because then the prior does not have much influence on the posterior.
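The last point can be sketched with the standard conjugate Beta-binomial example (the priors and data here are invented): two people with quite different priors end up with nearly the same posterior once there is enough data.

```python
def beta_posterior(prior_a, prior_b, successes, failures):
    """Conjugate update: a Beta(a, b) prior plus binomial data gives
    a Beta(a + successes, b + failures) posterior."""
    return prior_a + successes, prior_b + failures

def beta_mean(a, b):
    return a / (a + b)

data = (600, 400)  # hypothetical: 600 heads, 400 tails

flat = beta_posterior(1, 1, *data)        # a 'know nothing' prior
sceptic = beta_posterior(50, 50, *data)   # a strong prior that the coin is fair

# With this much data the two posterior means nearly agree.
m_flat, m_sceptic = beta_mean(*flat), beta_mean(*sceptic)
```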
(If you are not familiar with statistical inference you may at this point wonder why people doing classical statistics do not need to assume a distribution in order to calculate p. The reason is that a wonderful mathematical result called the ‘Central Limit Theorem’ implies that the sample mean approximately follows a normal distribution when the sample size is large, almost regardless of the distribution of the individual observations.)
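A quick simulation sketch of the Central Limit Theorem, using a deliberately skewed distribution (the sample sizes are arbitrary):

```python
import random
import statistics

random.seed(1)

def sample_means(n, reps):
    """Means of repeated samples from a markedly non-normal distribution:
    the exponential with mean 1 and standard deviation 1."""
    return [statistics.fmean(random.expovariate(1.0) for _ in range(n))
            for _ in range(reps)]

means = sample_means(50, 2000)
# The CLT says these means are approximately Normal(1, 1/sqrt(50) ~ 0.14),
# even though the underlying distribution is strongly skewed.
avg = statistics.fmean(means)
spread = statistics.stdev(means)
```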
In the context of using statistics on academic appointments in a university department to inform action, the weaknesses of the Bayesian approach could be regarded as strengths. People act because they believe something to be true, not because they have failed to reject the hypothesis that it is false. Also, in most departments there will be a range of pre-existing views on whether recruitment is biased in favour of men – a few who are convinced it is, a few who are convinced it is fair, some who have no idea, and possibly a small group who believe that women are favoured over men. Some of these views will be strongly held. Why not make these differences explicit by incorporating them in different priors?
[If some of the above sounds vaguely like something you once learnt in a stats course try ‘The Cartoon Guide to Statistics’ by Larry Gonick and Woollcott Smith, Collins, 1993, ISBN-13: 978-0062731029.]