When I started working for WiSETI in Cambridge in 2003 I was employed on a project specifically aimed at increasing the number of women applying for positions in science and engineering. At the time I thought that evaluating such a project would be easy. We knew what proportion of applicants for lecturing positions were female before the project started; all we would need to do was monitor that proportion during and after the project and see whether it increased. As the project progressed I came to see how naïve this expectation was.
The problems are:
- Small numbers. How do you know whether an observed increase is due to the intervention or is just a random fluctuation? (The first sketch after this list illustrates how large those fluctuations can be.)
- If you aggregate data from different departments, are you doing so in a meaningful way? You expect more applications from women for a position in Botany than for one in Computer Science, and some areas within a subject may have more women than others. (The second sketch after this list shows how pooling can mislead.)
- Measures are not evenly applied across departments. Some departments are enthusiastic and follow advice, some are enthusiastic but do their own thing, some are not enthusiastic but go through the motions, and some are not enthusiastic and ignore advice. (And these descriptions are points along a continuum, not categories.)
- Other relevant variables change during the course of the project – legislation changes, policies change, heads of department and departmental administrators retire and are replaced, and new nurseries open. It is impossible to be sure that you are comparing apples with apples.
- Is the proportion of job applicants who are female even the right quantity to monitor? What matters in the end is how many women are appointed. Does increasing the number of female applicants from, say, 5 out of 35 to, say, 10 out of 40 actually increase the likelihood that a woman is appointed? Or does it just mean that five extra women have devoted a considerable amount of their precious time to writing a job application, and another group of people have had to spend their time reading those applications? If relatively few women are applying for lecturing positions in STEM, does this mean that women are establishing themselves as independent researchers and then not applying for academic positions, or that they are not establishing themselves as independent researchers and hence never qualifying for those positions in the first place? Is it that women lack confidence and hence don't apply? Or is it that women tend to be time poor and therefore less likely to spend some of that precious time applying for a job unless they think they have a reasonable chance of success? If women are less likely than men to apply for a job for which they are qualified, is the solution necessarily to persuade women to behave more like men?
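To make the small-numbers point concrete, here is a minimal Python sketch. Every figure in it is invented for illustration (a fixed 15% underlying rate, 40 applications per round): it simulates hiring rounds in which the intervention has no effect at all, to show how widely the observed proportion swings by chance alone.

```python
import random

random.seed(1)

TRUE_RATE = 0.15    # assumed underlying proportion of female applicants
APPLICANTS = 40     # assumed number of applications per hiring round
ROUNDS = 10_000     # number of simulated rounds

# Simulate many rounds in which nothing has changed at all.
counts = [sum(random.random() < TRUE_RATE for _ in range(APPLICANTS))
          for _ in range(ROUNDS)]

proportions = sorted(c / APPLICANTS for c in counts)
low = proportions[int(0.025 * ROUNDS)]
high = proportions[int(0.975 * ROUNDS)]
print(f"95% of rounds fall between {low:.0%} and {high:.0%} female applicants")
# Typically prints a range of roughly 5% to 25%: a rise from 14% to 25%
# in a single round proves nothing on its own.
```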
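And a second sketch, again with invented numbers, of how pooling across departments can mislead (a form of Simpson's paradox): each department's proportion of female applicants rises, yet the pooled figure falls, simply because the mix of vacancies has shifted towards the department with fewer women.

```python
# Hypothetical applicant counts: (women, total applicants) per department.
before = {"Botany": (30, 60), "Computer Science": (10, 100)}
after = {"Botany": (11, 20), "Computer Science": (18, 140)}

for label, data in (("before", before), ("after", after)):
    for dept, (women, total) in data.items():
        print(f"{label:6} {dept:16} {women}/{total} = {women / total:.0%}")
    w = sum(women for women, _ in data.values())
    t = sum(total for _, total in data.values())
    print(f"{label:6} {'pooled':16} {w}/{t} = {w / t:.0%}")
# Botany goes from 50% to 55%, Computer Science from 10% to 13%,
# yet the pooled proportion falls from 25% to 18%.
```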
So, even a project with relatively
well-defined objectives is not necessarily straightforward to evaluate.
If you are trying to decide how to deploy available resources to best effect, you also need to take into account what resources each project consumed. When you are estimating the resources used by a project, do you count time effectively donated to it? How do you compare a project that has a positive effect on a small fraction of participants but can be delivered to a large number of people with a project that makes a difference to most of its participants but can be delivered to only a few?
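A toy calculation, with figures invented purely for illustration, shows why the comparison is not straightforward even before you ask whether the two kinds of "positive effect" are comparable at all:

```python
# Hypothetical projects: (participants reached, fraction who benefit).
projects = {
    "broad": (1000, 0.05),   # shallow effect, wide reach
    "narrow": (40, 0.80),    # deep effect, small reach
}

for name, (reached, benefit_rate) in projects.items():
    print(f"{name:6}: {reached * benefit_rate:.0f} people helped")
# broad : 50 people helped
# narrow: 32 people helped
# The arithmetic is trivial; deciding whether "helped" means the same
# thing in both projects, and at what cost per person, is not.
```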
There are a lot of questions here. I would
really like to see some rigorous discussion of how interventions are evaluated, provided, of course, that this is not at the expense of actually doing something.