This post gets a bit technical, so for number-phobes the point is this: interpreting numerical data involves value judgements. Specifically, is it better to be fooled by random fluctuations into devoting time and resources to fixing a process that is not broken, or to risk not fixing a process that is broken?
When scientists start thinking about the issues related to women in science their first instinct is to collect numerical data. Natural scientists tend to want to measure a baseline, make an intervention and then re-measure to see if the intervention made any difference. Social scientists tend to want to measure everything they can think of and then do a multi-variate statistical analysis. This is natural. It is what we have been trained to do. We also often hear 'We must have hard data, not just anecdote'.
Indeed, we do need good data but we should be aware of the limitations of the statistical approach. Let's take the example of a university department that wants to check whether its recruitment procedures are unbiased. If it is a life sciences department it might expect that 50% of academic appointments would be women, so in this case we can define unbiased recruitment as meaning that 50% of appointments are women. Mathematically the problem of determining whether this recruitment process is unbiased is equivalent to determining whether a coin is equally likely to land heads or tails. Here are the results of tossing a 10p piece ten times: HTTTHHTTTT. Seven of the ten tosses turned up tails. Is my coin-tossing biased? Seven out of ten seems quite large. What happens if I try tossing the coin twenty times? Here is the result of that experiment: THTHHHHTTT HHHHHHHHHH, which is five tails out of twenty tosses. At this point I wondered whether there was something about the way I toss coins that biased against tails so I tried another twenty tosses with the following results: HTHTTTHTHT TTTHHTTTTT or fourteen out of twenty. So from ten tosses there were 70% tails, from the first twenty tosses 25% tails and from the second twenty tosses 70% tails. Overall there were 26 tails out of 50 tosses or 52%. How can we decide whether or not my coin tossing is biased when the results are so variable?
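The tallies quoted above are easy to check mechanically. The following is a minimal sketch in Python; the names `SEQUENCES`, `count_tails` and `summarise` are illustrative labels, not part of the original experiment.

```python
# The three toss sequences quoted above, with the spaces removed.
SEQUENCES = {
    "first ten": "HTTTHHTTTT",
    "first twenty": "THTHHHHTTT" + "HHHHHHHHHH",
    "second twenty": "HTHTTTHTHT" + "TTTHHTTTTT",
}


def count_tails(sequence):
    """Return (number of tails, number of tosses) for a string of H and T."""
    return sequence.count("T"), len(sequence)


def summarise(sequences):
    """Return the per-run (tails, tosses) counts and the overall totals."""
    counts = {name: count_tails(seq) for name, seq in sequences.items()}
    total_tails = sum(tails for tails, _ in counts.values())
    total_tosses = sum(tosses for _, tosses in counts.values())
    return counts, (total_tails, total_tosses)


if __name__ == "__main__":
    counts, (tails, tosses) = summarise(SEQUENCES)
    for name, (t, n) in counts.items():
        print(f"{name}: {t} tails out of {n} ({100 * t / n:.0f}%)")
    print(f"overall: {tails} tails out of {tosses} ({100 * tails / tosses:.0f}%)")
```

Running it reproduces the swing from 70% to 25% and back to 70% tails across the three runs, even though the pooled figure sits close to 50%.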
Conventional (Fisher) hypothesis testing says that we should calculate the probability, p, of seeing an effect at least as large as that observed, on the assumption that the null hypothesis (in this case, that my coin tossing is unbiased) is true. If p is less than some threshold, conventionally taken to be 0.05 (or 0.01 or 0.001), then the null hypothesis is rejected. This test reduces the risk of us being fooled by random fluctuations into thinking that my coin tossing is biased when it is not. Coin tossing follows a binomial distribution, shown in the graph for N = 10 and N = 50, and the relevant probabilities are P(7 or more tails out of 10 tosses) = 0.17, P(5 or fewer tails out of 20 tosses) = 0.02, P(14 or more tails out of 20 tosses) = 0.06 and P(26 or more tails out of 50 tosses) = 0.44. So, at the 0.05 level, 70% of ten tosses is not statistically significant, 25% of twenty tosses is statistically significant, 70% of twenty tosses is not statistically significant and overall 52% of fifty tosses is not statistically significant.
Calculating statistical significance guards against what statisticians call a type I error, in this case, believing that my coin tossing is biased when it is not. There is another possible type of error, which statisticians call a type II error: concluding that my coin tossing is unbiased when, in fact, it is biased. This type of error is not particularly important in many contexts, for example, my coin-tossing. However, if we are talking about appointments in a university department we might be very concerned if it was concluded that the appointments process is unbiased when, in fact, it is biased. The Neyman-Pearson procedure divides the possible outcomes of a measurement into two regions: the rejection region, where the null hypothesis is rejected, and the acceptance region, where it is accepted.
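The four tail probabilities quoted for the Fisher test can be reproduced from the binomial distribution using nothing but the Python standard library. This is a sketch under the assumption of independent tosses of a fair coin; the helper names `binom_pmf`, `prob_at_least` and `prob_at_most` are introduced here for illustration.

```python
from math import comb


def binom_pmf(k, n, p=0.5):
    """Probability of exactly k tails in n independent tosses, P(tail) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def prob_at_least(k, n, p=0.5):
    """One-sided upper tail: P(k or more tails in n tosses)."""
    return sum(binom_pmf(j, n, p) for j in range(k, n + 1))


def prob_at_most(k, n, p=0.5):
    """One-sided lower tail: P(k or fewer tails in n tosses)."""
    return sum(binom_pmf(j, n, p) for j in range(0, k + 1))


if __name__ == "__main__":
    print(f"P(7 or more tails in 10)  = {prob_at_least(7, 10):.2f}")
    print(f"P(5 or fewer tails in 20) = {prob_at_most(5, 20):.2f}")
    print(f"P(14 or more tails in 20) = {prob_at_least(14, 20):.2f}")
    print(f"P(26 or more tails in 50) = {prob_at_least(26, 50):.2f}")
```

These are one-sided probabilities, matching the way the figures are stated in the text. A two-sided test would roughly double each value, which is itself one more judgement call in reading the same data.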
For example, for ten coin tosses we might reject the hypothesis that the coin is unbiased if we get two or fewer tails (P(2 or fewer tails in 10 tosses) = 0.055) or if we get eight or more tails (P(8 or more tails in 10 tosses) = 0.055). In this case the probability of rejecting the null hypothesis when it is true is 0.11. The probability of accepting the null hypothesis when it is actually false depends on the true value of the parameter p of the binomial distribution, that is, the probability of getting a tail on any one toss. In this case, the probability of accepting the hypothesis p = 0.5 is 0.62 when the true p is 0.3, 0.82 when p is 0.4, 0.89 when p is 0.5, 0.82 when p is 0.6 and 0.62 when p is 0.7. So we have a better than even chance of accepting that the coin-tossing is unbiased when in fact the probability of getting a tail on any toss is anywhere between 0.3 and 0.7. If we change the rejection criterion to three or fewer tails or seven or more tails, then the probability of rejecting the hypothesis when it is true is 0.34, and there is a better than even chance of accepting the unbiased hypothesis when the probability of getting a tail on any toss is anywhere between 0.4 and 0.6.
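The trade-off between the two kinds of error can be tabulated for both rejection rules described above. The sketch below assumes ten independent tosses and treats the acceptance region as an inclusive range of tail counts; `accept_probability` is a hypothetical helper name, not a library function.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k tails in n independent tosses, P(tail) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def accept_probability(lowest, highest, n, p):
    """P(lowest <= tails <= highest): the outcome lands in the acceptance region."""
    return sum(binom_pmf(k, n, p) for k in range(lowest, highest + 1))


if __name__ == "__main__":
    n = 10
    # Acceptance regions for the two rules: reject on <=2 or >=8 tails,
    # and reject on <=3 or >=7 tails.
    for lowest, highest in [(3, 7), (4, 6)]:
        type_one = 1 - accept_probability(lowest, highest, n, 0.5)
        print(f"accept {lowest} to {highest} tails: "
              f"P(reject | unbiased) = {type_one:.2f}")
        for true_p in (0.3, 0.4, 0.6, 0.7):
            type_two = accept_probability(lowest, highest, n, true_p)
            print(f"  true p = {true_p}: P(accept unbiased) = {type_two:.2f}")
```

Widening the rejection region lowers the chance of accepting a biased process but raises the chance of rejecting an unbiased one. The code only makes that exchange visible; it cannot say which side of it matters more.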
So, when we try to interpret our ‘hard fact’ that seven out of ten coin tosses came up tails as an inference about whether the coin-tossing process is unbiased, the ‘hard fact’ disappears into a morass of value judgements: do we prefer a higher probability of rejecting the hypothesis that my coin-tossing is unbiased when it actually is unbiased, or a higher probability of accepting that hypothesis when my coin-tossing is actually biased? In the case of academic appointments the judgement becomes: do we prefer a higher probability of devoting time and resources to fixing a process that is not broken, or is it more important to be confident that the appointments process is not biased? We could, of course, use a larger number of tosses or appointments. The problem with this approach for academic appointments is that, while it takes a few minutes to toss a coin fifty times, a department with fifty academic staff and a turnover of 5% a year would take four years to make ten new appointments and twenty years to make fifty. This does not seem a recipe for quick identification and correction of problems.
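The waiting times quoted above follow from simple arithmetic. As a rough sketch, assume that turnover is constant from year to year and that every departure is replaced by exactly one new appointment; both are simplifying assumptions, and `years_to_accumulate` is an illustrative name.

```python
def years_to_accumulate(appointments, staff, annual_turnover):
    """Years needed to make `appointments` hires when only departures are replaced.

    Assumes a constant department size and a constant annual turnover rate.
    """
    hires_per_year = staff * annual_turnover
    return appointments / hires_per_year


if __name__ == "__main__":
    for target in (10, 50):
        years = years_to_accumulate(target, staff=50, annual_turnover=0.05)
        print(f"{target} appointments: about {years:.0f} years")
```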
After working for many years as a research physicist, I became a part-time project officer with the Women in Science, Engineering and Technology Initiative (WiSETI) at the University of Cambridge in the UK. I've also been a member of the steering group of the Cambridge AWiSE network for women in SET. I am now based in New Zealand.