I was alerted to this post, 'Women and BigLaw: Yes, Virginia, it is Sexism' on The Mama Bee by my friend, Suzanne Doyle-Morris, via LinkedIn. It reminded me of a naturenews story, 'Data show extent of sexism in physics', describing a study by Sherry Towers of post-docs at Fermilab (http://arxiv.org/abs/0804.2026) that showed women were only one third as likely to be allocated conference talks as their male colleagues.

Two reactions particularly interested me. One was quoted in the news story and is from a male particle physicist who commented that the small numbers in the study (48 men and 9 women) were not enough to prove systematic bias, though Towers used an analysis that took into account the very small number of women. This illustrates one of the problems with investigating reasons for the slow progress of women in science: there are few women so it is difficult to obtain statistically significant results.

The other was the first comment on the post, which contained the observation: '...women in physics are generally harder working than male colleagues and are great co-workers in terms of encouragement, diligence, and backup support. They do not, however, contribute a great deal of original ideas and rigorous logical analysis to the research. Female judgment seems to more emotionally biased.' In a subsequent comment the same person complained 'I am amazed at the furious response to my comment.' The view that 'women are diligent but lack originality' is not uncommon among physicists, so the author of the comment may well have been surprised to discover that not everyone agrees.

What I find uncomfortable is the condemnation evident in some of the responses. Most people have unconscious biases. I once found myself assuming that the reader of a book about electronics would be male, and this was on the train returning from a meeting about women in engineering. Rather than attacking people for erroneous beliefs we should educate ourselves and others about the effects of unconscious bias and design our processes to minimize its effects.

The previous post on The Mama Bee, 'Wake Up, Ivory Tower: We Need You!', is also interesting. The author makes a plea for academic researchers on work-life balance to be more cognizant of the actual needs of employees who would benefit from flexible working. In particular, the author notes that people who need flexible working are often not in an ideal position to fight for it. It seems to me that a similar situation pertains to women in science. Academic researchers carry out projects that are often of limited relevance to women who are working in science and engineering, while data that would be useful are not obtainable. For example, is there gender bias at the step from post-doc to independent scientist, such as the award of a Royal Society University Research Fellowship? Do women who work part-time in science find it harder to progress?

## Tuesday, March 30, 2010

## Monday, March 29, 2010

### A plea

At the bottom of my posts you will see five boxes labelled: 'I agree', 'Interesting', 'I have no opinion', 'I have reservations', 'I disagree'. I would be very grateful if you would click on one of these reaction boxes. It only takes a second. Of course, if you disagree with me ideally I would like to know why but I recognise that you probably do not have time to frame a response.

Thanks, Esther

### A Diversion

Last week we spent a couple of days visiting the Catlins, an area on the south east of the South Island of New Zealand (www.catlins-nz.com). What does this have to do with science? Well, the pictures are:

- A petrified log from the Jurassic at Curio Bay
- Lake Wilkie, which is a 'bog lake' trapped between sand dunes and a cliff since the last ice age. The vegetation around the lake is a good demonstration of plant succession as the lake is gradually replaced by forest.
- The Purakaunui Falls, which are photogenic.

There are more pictures at picasaweb.

## Sunday, March 21, 2010

### A Digression on Inference

In the previous post I used standard hypothesis testing techniques to show that interpreting data requires value judgements. The following are some musings set off by that exercise.

This post also contains technical material about statistical inference. Thinking about statistical inference tends to cause pain in the brain so for those who don’t want to struggle through the technical stuff the important points are:

- The statement ‘The result is not statistically significant at the 0.05 level’ does not imply that there is no effect, only that the data do not rule out that there is no effect. The corollary of ‘The result is not statistically significant at the 0.05 level’ is ‘We are still not that sure whether there is an effect or not.’ (‘Absence of evidence is not evidence of absence.’)
- Statements about statistical significance are statements about the data not about the effect. With enough data a tiny difference of no practical importance can be statistically significant and a practically important difference can be not statistically significant if there is only a small amount of data.
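The second point can be illustrated with a short calculation. This is only a sketch under stated assumptions: the counts are made up for illustration, the function name `two_sided_p_value` is hypothetical, and it uses a normal approximation to the binomial distribution, which is rough for very small samples.

```python
import math

def two_sided_p_value(tails: int, tosses: int, p_null: float = 0.5) -> float:
    """Approximate two-sided p-value for an observed proportion of tails.

    Uses the normal approximation to the binomial distribution, so the
    result is only a rough guide when the number of tosses is small.
    """
    observed = tails / tosses
    standard_error = math.sqrt(p_null * (1 - p_null) / tosses)
    z = abs(observed - p_null) / standard_error
    # For a standard normal variable Z, P(|Z| >= z) = erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2))

# Hypothetical data: a large difference in a small sample ...
small_sample = two_sided_p_value(tails=7, tosses=10)               # 70% tails
# ... and a tiny difference in a very large sample.
large_sample = two_sided_p_value(tails=502_000, tosses=1_000_000)  # 50.2% tails

print(f"7 of 10:              p = {small_sample:.3f}")
print(f"502,000 of 1,000,000: p = {large_sample:.2e}")
```

With these made-up numbers the 70% result is not significant at the 0.05 level while the 50.2% result is, even though the first difference is far larger in practical terms.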

I had gained the impression from my undergraduate textbook on probability and statistics that hypothesis testing was a routine procedure based on ideas that had been around since at least the 1930s. This is true. It is also true that there is considerable controversy.

As I understand it (and I’m not a statistician, so please comment if you know about these issues) one controversy goes back to a dispute between Fisher and Neyman, basically about whether it is necessary to consider Type II errors. In a significance test you calculate the probability, p, that you would have observed an effect of at least the size that you did observe on the assumption that there is no effect. If this probability is small you reject the hypothesis that there is no effect. If you decide to reject the ‘no-effect’ (null) hypothesis whenever p is less than 0.05 then in the long run the rate at which you will incorrectly reject the null hypothesis (Type I error) is 1 in 20. Alternatively, you can tell people the actual value of p that you obtained.

A statement of statistical significance is a statement about the data, not about the hypothesis. It tells us how confident we can be about rejecting the null hypothesis on the basis of the data we have obtained. In some circumstances, for example, a quality controller in a factory deciding whether to reject or accept a shipment of components, you would need to know not only the probability of being fooled by fluctuations into rejecting the ‘no-effect’ hypothesis when it is true (rejecting a shipment when it is OK) but also the probability of failing to reject the ‘no-effect’ hypothesis when there actually is an effect (Type II error: accepting a shipment with an unacceptable number of defects). The Neyman-Pearson approach allows you to construct a test that, for a given probability of rejecting the null hypothesis when it is true, minimizes the probability of accepting it when it is false. (More accurately, it minimizes the probability of incorrectly rejecting an alternative hypothesis. The assumption is that if you are going to take action on the basis of the test then you will either reject the null hypothesis, implicitly accepting the alternative, or you will reject the alternative hypothesis, implicitly accepting the null hypothesis.) You can always make the probability of Type II error smaller by accepting a higher probability of Type I error. In real-world applications there will usually be arguments about relative costs and benefits to inform the choice. Fisher took the view that this procedure was incorrect in science:

“It is important that the scientific worker introduces no cost functions for faulty decisions, as it is reasonable and necessary to do with an Acceptance Procedure. To do so would imply that the purposes to which new knowledge was to be put were known and capable of evaluation. If, however, scientific findings are communicated for the enlightenment of other free minds, they may be put sooner or later to the service of a number of purposes, of which we can know nothing.”

[Statistical Methods and Scientific Inference, 1956]

In this view you either reject the null hypothesis with some probability of error or you regard the null hypothesis as not (yet) proved wrong. This is basically the logical point that you can use data only to disprove a hypothesis, not to prove it. It does not matter how much data you have that are consistent with a hypothesis; there is always the possibility that you will eventually encounter data that disprove it. In this case you can’t make the error of accepting a hypothesis when it is false because you would never accept any hypothesis. However, when people do need to choose what action to take based on data they act as though they accept a particular hypothesis.

One of the points of contention has been that people don’t just need to know whether there is an effect or not; they also need to know how big it might be. This is important even if you have not been able to reject the null hypothesis. For example, if you are comparing the rate of death from heart attacks for patients undergoing treatment A with that for patients undergoing treatment B it would be quite important to know that the data that gave a high (greater than 0.05) probability of error for rejecting the null hypothesis gave a not much higher probability of error for rejecting the hypothesis that one rate was twice the other. In other words, how well does the test discriminate between the null hypothesis and other hypotheses? This information should be given as a confidence interval (an interval constructed so that it includes the true value of a parameter in 95% of cases) or a calculation of the power, which measures how well the test discriminates between different possibilities.
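As a rough illustration of what a confidence interval adds, here is a minimal sketch that computes an approximate 95% interval for a proportion. The choice of the Wilson score interval, the function name `wilson_interval` and the example counts are my assumptions for illustration, not something prescribed by the discussion above.

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate confidence interval for a proportion (Wilson score interval).

    z = 1.96 corresponds to roughly 95% coverage.
    """
    p_hat = successes / trials
    denominator = 1 + z**2 / trials
    centre = (p_hat + z**2 / (2 * trials)) / denominator
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2))
        / denominator
    )
    return centre - half_width, centre + half_width

# Illustrative counts: 7 out of 10 and 26 out of 50.
few_low, few_high = wilson_interval(7, 10)
many_low, many_high = wilson_interval(26, 50)

print(f"7 of 10:  {few_low:.2f} to {few_high:.2f}")
print(f"26 of 50: {many_low:.2f} to {many_high:.2f}")
```

Both intervals include 0.5, but the interval for ten observations is much wider, which shows how little a small sample tells us about the size of any effect.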

Why do we tie ourselves up in logical knots calculating the probability that an effect of at least the size we have observed would have been observed if the hypothesis we are trying to show is false were true? Why don’t we just calculate the probability that, for example, two means are different, given the data we have observed? This is the heart of one of the great divides of twentieth-century, and now twenty-first-century, science. In the theory of classical statistical inference the probability that two means are different is not a meaningful concept. The two means are fixed; we just don’t know what they are. It is the estimates of the means that we calculate from the data that have probability distributions.

There is an alternative method for statistical inference based on Bayes’ theorem. There are two problems with this method. First, it requires us to interpret probability as meaning a degree of belief. This upsets many scientists because one person’s degree of belief in a statement might be different from another’s. They prefer an objective definition of probability. This is called the frequentist position because a popular interpretation of probability is as the long-term frequency in a large number of trials; for example, if you keep tossing a fair coin long enough the ratio (Number of tails / Number of tosses) would be close to 1/2. Those who favour the degree of belief interpretation are known as Bayesians, for obvious reasons. The other problem with the Bayesian approach is that it involves using the data to calculate a final distribution for the parameter of interest, called the posterior distribution, from an initial assumed distribution, called the prior distribution. Unfortunately, there is no accepted method for choosing a prior distribution. This does not matter if you have lots of data because then the prior does not have much influence on the posterior.

(If you are not familiar with statistical inference you may at this point wonder why people doing classical statistics do not need to assume a distribution in order to calculate p. The reason is that there is a wonderful mathematical result called the ‘Central Limit Theorem’ which implies that, under mild conditions, the sample mean approximately follows a normal distribution when the sample size is large.)

In the context of using statistics on academic appointments in a university department to inform action, the weaknesses of the Bayesian approach could be regarded as strengths. People act because they believe something to be true not because they have failed to reject the hypothesis that it is false. Also, in most departments there will be a range of pre-existing views on whether recruitment is biased in favour of men – a few who are convinced it is, a few who are convinced it is fair, some who have no idea and possibly a small group who believe that women are favoured over men. Some of these views will be strongly held. Why not make these differences explicit by incorporating them in different priors?
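To make the idea of different priors concrete, here is a minimal sketch using the standard Beta-binomial update: a Beta(a, b) prior on the proportion of appointments going to women, combined with observed appointments, gives a Beta(a + women, b + men) posterior. The particular priors, their labels and the appointment counts are hypothetical, chosen only to illustrate the pre-existing views described above.

```python
def posterior_mean(prior_a: float, prior_b: float, women: int, appointments: int) -> float:
    """Posterior mean of the proportion of women under a Beta(prior_a, prior_b) prior."""
    posterior_a = prior_a + women
    posterior_b = prior_b + (appointments - women)
    return posterior_a / (posterior_a + posterior_b)

# Hypothetical priors representing different views within a department.
priors = {
    "no idea (uniform prior)": (1, 1),
    "convinced recruitment is fair": (50, 50),
    "convinced men are favoured": (3, 7),
}

# Hypothetical data: 3 women among 10 appointments.
women, appointments = 3, 10

posterior_means = {
    view: posterior_mean(a, b, women, appointments)
    for view, (a, b) in priors.items()
}

for view, mean in posterior_means.items():
    print(f"{view}: posterior mean proportion of women = {mean:.2f}")
```

With so few appointments the strongly held prior barely moves, which makes explicit how much the conclusion depends on the view someone started with.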

[If some of the above sounds vaguely like something you once learnt in a stats course try ‘The Cartoon Guide to Statistics’ by Larry Gonick and Woollcott Smith, Collins, 1993, ISBN-13: 978-0062731029.]

## Thursday, March 4, 2010

### The Myth of Hard Facts

This post gets a bit technical so for number-phobes the point is that interpreting numerical data involves value judgements, specifically about whether it is better to be fooled by random fluctuations into devoting time and resources to fixing a process that is not broken or to risk not fixing a process that is broken.

When scientists start thinking about the issues related to women in science their first instinct is to collect numerical data. Natural scientists tend to want to measure a baseline, make an intervention and then re-measure to see if the intervention made any difference. Social scientists tend to want to measure everything they can think of and then do a multi-variate statistical analysis. This is natural. It is what we have been trained to do. We also often hear 'We must have hard data, not just anecdote'.

Indeed, we do need good data but we should be aware of the limitations of the statistical approach. Let's take the example of a university department that wants to check whether its recruitment procedures are unbiased. If it is a life sciences department it might expect that 50% of academic appointments would be women, so in this case we can define unbiased recruitment as meaning that 50% of appointments are women. Mathematically the problem of determining whether this recruitment process is unbiased is equivalent to determining whether a coin is equally likely to land heads or tails. Here are the results of tossing a 10p piece ten times: HTTTHHTTTT. Seven of the ten tosses turned up tails. Is my coin-tossing biased? Seven out of ten seems quite large. What happens if I try tossing the coin twenty times? Here is the result of that experiment: THTHHHHTTT HHHHHHHHHH, which is five tails out of twenty tosses. At this point I wondered whether there was something about the way I toss coins that biased against tails so I tried another twenty tosses with the following results: HTHTTTHTHT TTTHHTTTTT or fourteen out of twenty. So from ten tosses there were 70% tails, from the first twenty tosses 25% tails and from the second twenty tosses 70% tails. Overall there were 26 tails out of 50 tosses or 52%. How can we decide whether or not my coin tossing is biased when the results are so variable?
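The tallies quoted above can be checked with a few lines of code. This is just a sketch: the toss sequences are copied from the text, while the labels and the helper name `tally` are mine.

```python
# Toss sequences as recorded in the text (H = heads, T = tails).
runs = {
    "ten tosses": "HTTTHHTTTT",
    "first twenty tosses": "THTHHHHTTT HHHHHHHHHH",
    "second twenty tosses": "HTHTTTHTHT TTTHHTTTTT",
}

def tally(sequence: str) -> tuple[int, int]:
    """Return (number of tails, number of tosses), ignoring spaces."""
    tosses = sequence.replace(" ", "")
    return tosses.count("T"), len(tosses)

total_tails = 0
total_tosses = 0
for label, sequence in runs.items():
    tails, tosses = tally(sequence)
    total_tails += tails
    total_tosses += tosses
    print(f"{label}: {tails} tails out of {tosses} ({tails / tosses:.0%})")

print(f"overall: {total_tails} tails out of {total_tosses} ({total_tails / total_tosses:.0%})")
```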

Conventional (Fisher) hypothesis testing says that we should calculate the probability, p, of seeing an effect at least as large as that observed on the assumption that the null hypothesis (in this case, that my coin tossing is unbiased) is true. If p is less than some value, conventionally taken to be 0.05 (or 0.01 or 0.001), then the null hypothesis is rejected. This test reduces the risk of us being fooled by random fluctuations into thinking that my coin tossing is biased when it is not. Coin tossing follows a binomial distribution, shown in the graph for N = 10 and N = 50, and the relevant one-sided probabilities are P(7 or more tails out of 10 tosses) = 0.17, P(5 or fewer tails out of 20 tosses) = 0.02, P(14 or more tails out of 20 tosses) = 0.06, P(26 or more tails out of 50 tosses) = 0.44. So, 70% of ten tosses is not statistically significant, 25% of twenty tosses is statistically significant, 70% of twenty tosses is not statistically significant and overall 52% of fifty tosses is not statistically significant.

Calculating statistical significance guards against what statisticians call a type I error, in this case, believing that my coin tossing is biased when it is not. There is another possible type of error, which statisticians call a type II error, which is concluding that my coin tossing is unbiased when, in fact, it is biased. This type of error is not particularly important in many contexts, for example, my coin-tossing. However, if we are talking about appointments in a university department we might be very concerned if it was concluded that the appointments process is unbiased when, in fact, it is biased.

The Neyman-Pearson procedure divides the possible outcomes of a measurement into two regions: the rejection region, where the null hypothesis is rejected, and the acceptance region, where it is accepted. For example, for ten coin tosses we might reject the hypothesis that the coin is unbiased if we get two or fewer tails (P(2 or fewer tails in 10 tosses) = 0.055) or if we get eight or more tails (P(8 or more tails in 10 tosses) = 0.055). In this case the probability of rejecting the null hypothesis when it is true is 0.11. The probability of accepting the null hypothesis when it is actually false depends on what the true value of the parameter p of the binomial distribution is. In this case, the probability of accepting p = 0.5 is 0.62 when the true p is 0.3, 0.82 when p is 0.4, 0.89 when p is 0.5, 0.82 when p is 0.6 and 0.62 when p is 0.7. So we have a better than even chance of accepting that the coin-tossing is unbiased when in fact the probability of getting a tail on any toss is anywhere between 0.3 and 0.7. If we change the rejection criterion to three or fewer tails or seven or more tails then the probability of rejecting when the hypothesis is true is 0.34 and there is a better than even chance of accepting the unbiased hypothesis when the probability of getting a tail on any toss is anywhere between 0.4 and 0.6.
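The binomial probabilities quoted above can be recomputed directly from the binomial formula. The following is a minimal sketch; the helper names `binomial_pmf` and `prob_between` are hypothetical, and only the Python standard library is used.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k tails in n tosses when P(tail) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def prob_between(low: int, high: int, n: int, p: float) -> float:
    """Probability that the number of tails lies in low..high inclusive."""
    return sum(binomial_pmf(k, n, p) for k in range(low, high + 1))

# One-sided tail probabilities under the null hypothesis p = 0.5.
upper_7_of_10 = prob_between(7, 10, 10, 0.5)
lower_5_of_20 = prob_between(0, 5, 20, 0.5)
upper_14_of_20 = prob_between(14, 20, 20, 0.5)
upper_26_of_50 = prob_between(26, 50, 50, 0.5)

# Neyman-Pearson style test on ten tosses: accept 'unbiased' for 3 to 7 tails.
type_one_error = 1 - prob_between(3, 7, 10, 0.5)
acceptance = {p: prob_between(3, 7, 10, p) for p in (0.3, 0.4, 0.5, 0.6, 0.7)}

print(f"P(7+ tails of 10)  = {upper_7_of_10:.2f}")
print(f"P(<=5 tails of 20) = {lower_5_of_20:.2f}")
print(f"P(14+ tails of 20) = {upper_14_of_20:.2f}")
print(f"P(26+ tails of 50) = {upper_26_of_50:.2f}")
print(f"P(reject | p = 0.5) = {type_one_error:.2f}")
for p, probability in acceptance.items():
    print(f"P(accept | p = {p}) = {probability:.2f}")
```

Changing the acceptance region in `prob_between(3, 7, ...)` to 4 to 6 tails gives the second rejection criterion discussed in the text.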

So, when we try to interpret our ‘hard fact’ that seven out of ten coin tosses came up tails in terms of making an inference about whether the coin-tossing process is unbiased, the ‘hard fact’ disappears into a morass of value judgements about whether we prefer a higher probability of rejecting the hypothesis that my coin-tossing is unbiased when it actually is unbiased or a higher probability of accepting the hypothesis that my coin-tossing is unbiased when it is actually biased. In the case of academic appointments the judgement becomes: do we prefer a higher probability that we devote time and resources to fixing a process that is not broken, or is it more important that we are confident that the appointments process is not biased? We could, of course, use a larger number of tosses or appointments. The problem with this approach for academic appointments is that while it takes a few minutes to toss a coin fifty times, it would take a department with fifty academic staff four years to make ten new appointments and twenty years to make fifty new appointments if turnover is 5% a year. This does not seem a recipe for quick identification and correction of problems.
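The arithmetic behind the four-year and twenty-year figures is simple enough to sketch. The function name `years_to_make` is hypothetical, and the calculation assumes that every departure is replaced by exactly one new appointment and that turnover stays constant.

```python
def years_to_make(appointments: int, staff: int = 50, annual_turnover: float = 0.05) -> float:
    """Years needed to make a given number of appointments at steady turnover."""
    appointments_per_year = staff * annual_turnover
    return appointments / appointments_per_year

years_for_ten = years_to_make(10)
years_for_fifty = years_to_make(50)

print(f"10 appointments: {years_for_ten:.0f} years")
print(f"50 appointments: {years_for_fifty:.0f} years")
```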
