Sunday, 13 May 2018

How to survive on Twitter – a simple rule to reduce stress

In recent weeks, I’ve seen tweets from a handful of people I follow saying they are thinking of giving up Twitter because it has become so negative. Of course they are entitled to do so, and they may find that it frees up time and mental space that could be better used for other things. The problem, though, is that I detect a sense of regret. And this is appropriate because Twitter, used judiciously, has great potential for good.

For me as an academic, the benefits include:
·      Finding out about latest papers and other developments relevant to my work
·      Discovering new people with interesting points of view – often these aren’t eminent or well-known and I’d never have come across them if I hadn’t been on social media
·      Being able to ask for advice from experts – sometimes getting a remarkably quick and relevant response
·      Being able to interact with non-academics who are interested in the same stuff as me
·      Getting a much better sense of the diversity of views in the broader community about topics I take for granted – this often influences how I go about public engagement
·      Having fun – there are lots of witty people who brighten my day with their tweets

The bad side, of course, is that some people say things on Twitter that they would not dream of saying to your face. They can be rude, abusive, and cruel, and sometimes mind-bogglingly impervious to reason. We now know that some of them are not even real people – they are just bots set up by those who want to sow discord among those with different political views. So how do we deal with that?

Well, I have a pretty simple rule that works for me, which is that if I find someone rude, obnoxious, irritating or tedious, I mute them. Muting differs from blocking in that the person doesn’t know they are muted. So they may continue hurling abuse or provocations at you, unaware that they are now screaming into the void.

A few years ago, when I first got into a situation where I was attacked by a group of unpleasant alt-right people (who I now realise were probably mostly bots), it didn’t feel right to ignore them, for three reasons:
·      First, they were publicly maligning me, and I felt I should defend myself.
·      Second, we’ve been told to beware the Twitter bubble. If we only interact on social media with those who are like-minded, it can create a totally false impression of what the world is like.
·      Third, walking away from an argument is not a thing a good academic does: we are trained experts in reasoned debate, and our whole instinct is to engage with those who disagree with us, examine what they say and make a counterargument.

But I soon learned that some people on social media don’t play by the rules of academic engagement. They are not sincere in their desire to discuss topics: they have a viewpoint that nothing will change, and they will use any method they can find to discredit an opponent. This includes ad hominem attacks, lying and wilful misrepresentation of what you say.  It's not cowardly to avoid these people: it's just a sensible reaction. So I now just mute anyone where I get a whiff of such behaviour – directed either towards me or anyone else.

The thing is, social media is so different from normal face-to-face interaction, that it needs different rules. Just imagine if you were sitting with friends at the pub, having a chat, and someone barged in and started shouting at you aggressively. Or someone sat down next to you, uninvited, and proceeded to drone on about a very boring topic, impervious to the impact they are having. People may have different ways of extricating themselves from these situations, but one thing you can be sure of: when you next go to the pub, you would not seek these individuals out and try to engage them in discussion.

So my rule boils down to this: Ask yourself, if I was talking to this person in the pub, would I want to prolong the interaction? Or, if there was a button that I could press to make them disappear, would I use it?  Well, on social media, there is such a button, and I recommend taking advantage of it.*

*I should make it clear that there are situations when a person is subject to such a volume of abuse that this isn’t going to be effective. Avoidance of Twitter for a while may be the only sensible option in such cases. My advice is intended for those who aren’t the centre of a vitriolic campaign, but who are turned off Twitter because of the stress it causes to observe or participate in hostile Twitter exchanges.

Wednesday, 9 May 2018

My response to the EPA's 'Strengthening Transparency in Regulatory Science'

Incredible things have happened at the US Environmental Protection Agency since Donald Trump was elected. The agency is responsible for creating standards and laws that promote the health of individuals and the environment. During previous administrations it has overseen laws concerned with controlling pollution and regulating carbon emissions. Now, under Administrator Scott Pruitt, the voice of industry and climate scepticism is in the ascendant. 

A new rule that purports to 'Strengthen Transparency in Regulatory Science' has now been proposed - ironically, at a time when the EPA is being accused of a culture of secrecy regarding its own inner workings. Anyone can comment on the rule here: I have done so, but my comment appears to be in moderation, so I am posting it here.

Dear Mr Pruitt

re: Regulatory Science- Docket ID No. EPA-HQ-OA-2018-0259

The proposed rule, ‘Strengthening transparency in regulatory science’ brings together two strands of contemporary scientific activity. On the one hand, there is a trend to make policy more evidence-based and transparent. On the other hand, there has, over the past decade, been growing awareness of problems with how science is being done, leading to research that is not always reproducible (the same results achieved by re-analysis of the data) or replicable (similar results when an experiment is repeated). The proposed rule by the Environmental Protection Agency (EPA) brings these two strands together by proposing that policy should only be based on research that has openly available public data. While this may on the surface sound like a progressive way of integrating these two strands, it rests on an over-simplified view of how science works and has considerable potential for doing harm.

I am writing in a personal capacity, as someone at the forefront of moves to improve reproducibility and replication of science in the UK. I chaired a symposium at the Academy of Medical Sciences on this topic in 2015; this was jointly organised with major UK funders: Wellcome Trust, Medical Research Council and Biotechnology and Biological Sciences Research Council. I am involved in training early career researchers in methods to improve reproducibility, and I am a co-author of Munafò, M. R., et al. (2017). A manifesto for reproducible science. Nature Human Behaviour, 1, 0021. doi:10.1038/s41562-016-0021. I would welcome any move by the US Government that would strengthen research by encouraging adoption of methods to improve science, including making analysis scripts and data open when this is not in conflict with legal/ethical issues. Unfortunately, this proposal will not do that. Instead, it will weaken science by drawing a line in the sand that effectively disregards scientific discoveries prior to the first part of the 21st century when the importance of open data started to be increasingly recognised.

The proposal ignores a key point about how scientific research works: the time scale. Most studies that would be relevant to EPA take years to do, and even longer to filter through to affect policy. Consequences for people and the planet that are relevant to environmental protection are often not immediately obvious: if they were, we would not need research. Recognition, for instance, of the dangers of asbestos, took years because the impacts on health were not immediate. Work demonstrating the connection occurred many years ago and I doubt that the data are anywhere openly available, yet the EPA’s proposed rule would imply that it could be disregarded. Similarly, I doubt there is open data demonstrating the impact of lead in paint or exhaust fumes, or of pesticides such as DDT: does this mean that manufacturers would be free to reintroduce these?

A second point is that scientific advances never depend on a single study: having open scripts and data is one way of improving our ability to check findings, but it is a relatively recent development, and it is certainly not the only way to validate science. The growth of knowledge has always depended on converging evidence from different sources, replications by different scientists and theoretical understanding of mechanisms. Scientific facts become established when the evidence is overwhelming. The EPA proposal would throw out the mass of accumulated scientific evidence from the past, when open practices were not customary – and indeed often not practical before computers for big data were available.

Contemporary scientific research is far from perfect, but the solution is not to ignore it, but to take steps to improve it and to educate policy-makers in how to identify strong science; government needs advisors who have scientific expertise and no conflict of interest, who can integrate existing evidence with policy implications. The ‘Strengthening Transparency’ proposal is short-sighted and dangerous and appears to have been developed by people with little understanding of science. It puts citizens at risk of significant damage – both to health and prosperity -  and it will make the US look scientifically illiterate to the rest of the world.

Yours sincerely

D. V. M. Bishop FMedSci, FBA, FRS

Thursday, 3 May 2018

Power, responsibility and role models in academia

Last week, Robert J. Sternberg resigned as Editor of Perspectives on Psychological Science after a series of criticisms of his behaviour on social media. I first became aware of this issue when Bobbie Spellman wrote a blogpost explaining why she was not renewing her membership of the Association for Psychological Science, noting concerns about Sternberg’s editorial bias and high rate of self-citation, among other issues.

Then a grad student at the University of Leicester, Brendan O’Connor, noted that Sternberg not only had a tendency to cite his own work; he also recycled large portions of written text in his publications. Nick Brown publicised some striking examples on his blog, and Retraction Watch subsequently published an interview with O’Connor explaining the origins of the story.

In discussing his resignation, Sternberg admitted to ‘lapses in judgement and mistakes’ but also reprimanded those who had outed him for putting their concerns online, rather than contacting him directly. A loyal colleague, James C. Kaufman, then came to his defence, tweeting:

The term ‘witch-hunt’ is routinely trotted out whenever a senior person is criticised. (Indeed, it has become one of Donald Trump’s favourite terms to describe attempts to call him out for various misbehaviours). It implies that those who are protesting at wrongdoing are self-important people who are trying to gain attention by whipping up a sense of moral panic about relatively trivial matters.

I find this both irritating and symptomatic of a deep problem in academic life. I do not regard Sternberg’s transgressions as particularly serious: He used his ready access to a publishing platform for self-promotion and self-plagiarism, was discovered, and resigned his editorial position with a rather grumbly semi-apology. If that was all there was to it, I would agree that everyone should move on.  

The problem is with the attitude of senior people such as Kaufman. A key point is missed by those who want to minimise Sternberg’s misbehaviour: He is one of the most successful psychologists in the world, and so to the next generation, he is a living embodiment of what you need to do to become a leader in the field.  So early-career scientists will look at him and conclude that to get to the top you need to bend the rules.

In terms of abuse of editorial power, Sternberg’s behaviour is relatively tame. Consider the case of Johnny Matson, Jeff Sigafoos, Giuliano Lancioni and Mark O’Reilly, who formed a coterie of editors and editorial board members who enhanced their publications and citations by ditching usual practices such as peer review when handling one another’s papers. I documented the evidence for this back in 2015, and there appear to have been no consequences for any of these individuals. You might think it isn’t so important if a load of dodgy papers make it into a few journals, but in this case, there was potential for damage beyond academia: the subject matter concerned developmental disorders, and methods of assessment and intervention were given unjustified credibility by being published in journals that were thought to be peer-reviewed. In addition, the corrosive influence on the next generation of psychologists was all too evident: When I first wrote about this, I was contacted by several early-career people who had worked with the dodgy editors: they confirmed that they were encouraged to adopt similar practices if they wanted to get ahead.

When we turn to abuse of personal power, there have been instances in academia that are much, much worse than editorial misdemeanours – clearly documented cases of senior academics acting as sexual predators on junior staff – see, for instance, here and here. With the #MeToo campaign (another ‘witch-hunt’), things are starting to change, but the recurring theme is that if you are sufficiently powerful you can get away with almost anything.

Institutions that hire top academics seem desperate to cling on to them because they bring in grants and fame.  Of course, accusations need to be fully investigated in a fair and impartial fashion, but in matters such as editorial transgressions, the evidence is there for all to see, and a prompt response is required.

The problem with the academic hierarchy is that at the top there is a great deal of power and precious little responsibility. Those who make it to positions of authority should uphold high professional standards and act as academic role models. At a time when many early-career researchers are complaining that their PIs are encouraging them to adopt bad scientific practices, it’s all the more important that we don’t send the message that you need to act selfishly and cut corners in order to succeed.

I don’t want to see Sternberg vilified, but I do think the onus is now on the academic establishment to follow Bobbie Spellman’s lead and state publicly that his behaviour fell below what we would expect from an academic role model – rather than sweeping it under the carpet or, even worse, portraying him as a victim. 

Saturday, 7 April 2018

Should research funding be allocated at random?

Earlier this week, a group of early-career scientists had an opportunity to quiz Jim Smith, Director of Science at the Wellcome Trust. The ECRs were attending a course on Advanced Methods for Reproducible Science that I ran with Chris Chambers and Marcus Munafo, and Jim kindly agreed to come along for an after-dinner session which started in a lecture room and ended in the bar.

Among other things, he was asked about the demand for small-scale funding. In some areas of science, a grant of £20-30K could be very useful in enabling a scientist to employ an assistant to gather data, or to buy a key piece of equipment. Jim pointed out that from a funder’s perspective, small grants are not an attractive proposition, because the costs of administering them (finding reviewers, running grant panels, etc.) are high relative to the benefits they achieve. And it’s likely that there will be far more applicants for small grants.

This made me wonder whether we might retain the benefits of small grants by dispensing with the bureaucracy. A committee would have to scrutinise proposals to make sure that each met the funder’s remit and was of high methodological quality; provided that were so, the proposal could be entered into a pool, with winners selected at random.
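As a rough illustration of the two-stage idea, here is a minimal Python sketch. The names (`Proposal`, `run_lottery`), the two yes/no triage flags and the toy list of applicants are all assumptions made for the example; they do not describe any funder's actual process.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    applicant: str
    in_remit: bool        # hypothetical committee judgement: fits the funder's remit
    sound_methods: bool   # hypothetical committee judgement: methods are adequate


def run_lottery(proposals, n_awards, seed=None):
    """Light-touch triage followed by a random draw from the eligible pool."""
    rng = random.Random(seed)
    pool = [p for p in proposals if p.in_remit and p.sound_methods]
    # Every proposal that passes triage has the same chance of being funded.
    return rng.sample(pool, min(n_awards, len(pool)))


proposals = [
    Proposal("A", True, True),
    Proposal("B", True, False),   # fails triage on methods
    Proposal("C", True, True),
    Proposal("D", False, True),   # fails triage on remit
    Proposal("E", True, True),
]
winners = run_lottery(proposals, n_awards=2, seed=1)
print([p.applicant for p in winners])
```

The committee's only job here is the pass/fail triage; once a proposal is in the pool, nothing about the applicant influences the draw.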

Implicit in this proposal is the idea that it isn’t possible to rank applications reliably. If a lottery approach meant we ended up funding weak research and denying funds to excellent projects, this would clearly be a bad thing. But ranking of research by committee and/or peer review is notoriously unreliable, and it is hard to compare proposals that span a range of disciplines. Many people feel that funding is already a lottery, albeit an unintentional one, because the same grant that succeeds in one round may be rejected in the next. Interviews are problematic because they mean that a major decision – fund or not – is decided on the basis of a short sample of a candidate’s behaviour, and that people with great proposals but poor social skills may be turned down in favour of glib individuals who can sell themselves more effectively.

I thought it would be interesting to float this idea in a Twitter poll.  I anticipated that enthusiasm for the lottery approach might be higher among those who had been unsuccessful in getting funding, but in fact, the final result was pretty similar, regardless of funding status of the respondent: most approved of a lottery approach, with 66% in favour and 34% against.

As is often the way with Twitter, the poll encouraged people to point me to an existing literature I had not been aware of. In particular, last year, Mark Humphries (@markdhumphries) made a compelling argument for randomness in funding allocations, focusing on the expense and unreliability of current peer review systems. Hilda Bastian and others pointed me to work by Shahar Avin, who has done a detailed scholarly analysis of policy implications for random funding – in the course of which he mentions three funding systems where this has been tried.  In another manuscript, Avin presented a computer simulation to compare explicit random allocation with peer review. The code is openly available, and the results from the scenarios modelled by Avin are provocative in supporting the case for including an element of randomness in funding. (Readers may also be interested in this simulation of the effect of luck on a meritocracy, which is not specific to research funding but has some relevance.) Others pointed to even more radical proposals, such as collective allocation of science funding, giving all researchers a limited amount of funding, or yoking risk to reward.

Having considered these sources and a range of additional comments on the proposal, I think it does look as if it would be worth a funder such as Wellcome Trust doing a trial of random allocation of funding for proposals meeting a quality criterion. As noted by Dylan Wiliam, the key question is whether peer review does indeed select the best proposals. To test this, those who applied for Seed Funding could be randomly directed to either stream A, where proposals undergo conventional evaluation by committee, or stream B, where the committee engages in a relatively light touch process to decide whether to enter the proposal into a lottery, which then decides its fate. Streams A and B could each have the same budget, and their outcomes could be compared a few years later.
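The random direction of applicants to the two streams could be done along the following lines. This is only a sketch: the function name `assign_streams`, the use of numeric applicant IDs and the fixed seed are assumptions for illustration, and a real trial would need a pre-registered allocation procedure.

```python
import random


def assign_streams(applicant_ids, seed=None):
    """Randomly split applicants into two streams of (near-)equal size."""
    rng = random.Random(seed)
    shuffled = list(applicant_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    # Stream A: conventional committee evaluation.
    # Stream B: light-touch triage followed by a lottery.
    return {"A": sorted(shuffled[:half]), "B": sorted(shuffled[half:])}


streams = assign_streams(range(1, 21), seed=2018)
print(len(streams["A"]), len(streams["B"]))
```

Because assignment to a stream does not depend on anything about the applicant or the proposal, later differences in outcome between streams A and B can be attributed to the selection method rather than to who applied.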

One reason I’d recommend this approach specifically for Seed Funding is the disproportionate administrative burden for small grants. There would, in principle, be no reason for not extending the idea to larger grants, but I suspect that the more money is at stake, the greater will be the reluctance to include an explicit element of chance in the funding decision. And, as Shahar Avin noted, very expensive projects need long-term support, which makes a lottery approach unsuitable.

Some of those responding to the poll noted potential drawbacks. Hazel Phillips suggested that random assignment would make it harder to include strategic concerns, such as career stage or importance of topic. But if the funder had particular priorities of this kind, they could create a separate pool for a subset of proposals that met additional criteria and that would be given a higher chance of funding. Another concern was gaming by institutions or individuals submitting numerous proposals in scattergun fashion. Again, I don’t see this as a serious objection, as (a) use of an initial quality triage would weed out proposals that were poorly motivated and (b) applicants could be limited to one proposal per round. Most of the other comments that were critical expressed concerns about the initial triage: how would the threshold for entry into the pool be set?  A triage stage may look as if one is just pushing back the decision-making problem to an earlier step, but in practice, it would be feasible to develop transparent criteria for determining which proposals didn’t get into the pool: some have methodological limitations which mean they couldn’t give a coherent answer to the question they pose; some research questions are ill-formed; others have already been answered adequately -  this blogpost by Paul Glasziou and Iain Chalmers makes a good start in identifying characteristics of research proposals that should not be considered for funding.
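One way a funder could give priority proposals a higher chance of funding while keeping the draw random is a weighted lottery. The sketch below is an assumption about how that might be done, not a description of an existing scheme: the function name `weighted_lottery`, the proposal IDs and the weights (which must all be positive) are hypothetical.

```python
import random


def weighted_lottery(pool, n_awards, seed=None):
    """Draw winners without replacement; `pool` maps proposal id -> weight.

    A proposal with weight 2 is twice as likely to be picked on any given
    draw as a proposal with weight 1. Weights are assumed to be positive.
    """
    rng = random.Random(seed)
    remaining = dict(pool)
    winners = []
    while remaining and len(winners) < n_awards:
        ids = list(remaining)
        pick = rng.choices(ids, weights=[remaining[i] for i in ids], k=1)[0]
        winners.append(pick)
        del remaining[pick]   # each proposal can win at most once
    return winners


# Hypothetical pool: P3 and P4 meet an extra strategic criterion.
pool = {"P1": 1.0, "P2": 1.0, "P3": 2.0, "P4": 2.0}
funded = weighted_lottery(pool, n_awards=2, seed=7)
print(funded)
```

Keying the pool by a single ID per applicant would also enforce the one-proposal-per-round limit, since a dictionary cannot hold the same key twice.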

My view is that there are advantages for the lottery approach over and above the resource issues. First, Avin’s analysis concludes that reliance on peer review leads to a bias against risk-taking, which can mean that novelty and creativity are discouraged. Second, once a proposal was in the pool, there would be no scope for bias against researchers in terms of gender or race – something that can be a particular concern when interviews are used to assess. Third, the impact on the science community is also worth considering. Far less grief would be engendered by a grant rejection if you knew that you had simply been unlucky, rather than that you were judged to be wanting. Furthermore, as noted by Marina Papoutsi, some institutions evaluate their staff in terms of how much grant income they bring in – a process that ignores the strong element of chance that already affects funding decisions. A lottery approach, where the randomness is explicit, would put paid to such practices.


Friday, 9 February 2018

Improving reproducibility: the future is with the young

I've recently had the pleasure of reviewing the applications to a course on Advanced Methods for Reproducible Science that I'm running in April together with Marcus Munafo and Chris Chambers.  We take a broad definition of 'Reproducibility' and cover not only ways to ensure that code and data are available for those who wish to reproduce experimental results, but also focus on how to design, analyse and pre-register studies to give replicable and generalisable findings.

There is a strong sense of change in the air. Last year, most applicants were psychologists, even though we prioritised applications in biomedical sciences, as we are funded by the Biotechnology and Biological Sciences Research Council and European College of Neuropsychopharmacology. The sense was that issues of reproducibility were not so high on the radar of disciplines outside psychology. This year things are different. We again attracted a fair number of psychologists, but we also have applicants from fields as diverse as gene expression, immunology, stem cells, anthropology, pharmacology and bioinformatics.

One thing that came across loud and clear in the letters of application to the course was dissatisfaction with the status quo. I've argued before that we have a duty to sort out poor reproducibility because it leads to enormous waste of time and talent of those who try to build on a glitzy but non-replicable result. I've edited these quotes to avoid identifying the authors, but these comments – all from PhD students or postdocs in a range of disciplines - illustrate my point:
  • 'I wanted to replicate the results of an influential intervention that has been widely adopted. Remarkably, no systematic evidence has ever been published that the approach actually works. So far, it has been extremely difficult to establish contact with initial investigators or find out how to get hold of the original data for re-analysis.' 

  • 'I attempted a replication of a widely-cited study, which failed. Although I first attributed it to a difference between experimental materials in the two studies, I am no longer sure this is the explanation.' 

  • 'I planned to use the methods of a widely cited study for a novel piece of research. The results of this previous study were strong, published in a high impact journal, and the methods apparently straightforward to implement, so this seemed like the perfect approach to test our predictions. Unfortunately, I was never able to capture the previously observed effect.' 

  • 'After working for several years in this area, I have come to the conclusion that much of the research may not be reproducible. Much of it is conducted with extremely small sample sizes, reporting implausibly large effect sizes.' 

  • 'My field is plagued by irreproducibility. Even at this early point in my career, I have been affected in my own work by this issue and I believe it would be difficult to find someone who has not themselves had some relation to the topic.' 

  • 'At the faculty I work in, I have witnessed that many people are still confused about or unaware of the very basics of reproducible research.'

Clearly, we can't generalise to all early-career researchers: those who have applied for the course are a self-selected bunch. Indeed, some of them are already trying to adopt reproducible practices, and to bring about change to the local scientific environment. I hope, though, that what we are seeing is just the beginning of a groundswell of dissatisfaction with the status quo. As Chris Chambers suggested in this podcast, I think that change will come more from the grassroots than from established scientists.

We anticipate that the greater diversity of subjects covered this year will make the course far more challenging for the tutors, but we expect it will also make it even more stimulating and fun than last year (if that is possible!). The course lasts several days and interactions between people are as important as the course content in making it work. I'm pretty sure that the problems and solutions from my own field have relevance for other types of data and methods, but I anticipate I will learn a lot from considering the challenges encountered in other disciplines.

Training early career researchers in reproducible methods does not just benefit them: those who attended the course last year have become enthusiastic advocates for reproducibility, with impacts extending beyond their local labs. We are optimistic that as the benefits of reproducible working become more widely known, the face of science will change so that fewer young people will find their careers stalled because they trusted non-replicable results.

Friday, 12 January 2018

Do you really want another referendum? Be careful what you wish for

Many people in my Twitter timeline have been calling for another referendum on Brexit. Since most of the people I follow regard Brexit as an unmitigated disaster, one can see they are desperate to adopt any measure that might stop it.

Things have now got even more interesting with arch-Brexiteer, Nigel Farage, calling yesterday for another referendum. Unless he is playing a particularly complicated game, he presumably also thinks that his side will win – and with an increased majority that will ensure that Brexit is not disrupted.

Let me be clear. I think Brexit is a disaster. But I really do not think another referendum is a good idea. If there's one thing that the last referendum demonstrated, it is that this is a terrible method for making political decisions on complicated issues.

I'm well-educated and well-read, yet at the time of the referendum, I understood very little about how the EU worked. My main information came from newspapers and social media – including articles such as this nuanced and thoughtful speech on the advantages and disadvantages of EU membership by Theresa May. (The contrast between this and her current mindless and robotic pursuit of extreme Brexit is so marked that I do wonder if she has been kidnapped and brainwashed at some point).

I was pretty sure that it would be bad for me as a scientist to lose opportunities to collaborate with European colleagues, and at a personal level I felt deeply European while also proud of the UK as a tolerant and fair-minded society. But I did not understand the complicated financial, legal, and trading arrangements between the UK and Europe, and I had no idea of possible implications for Northern Ireland – this topic was pretty much ignored by the media that I got my information from. As far as I remember, debates on the topic on the TV were few and far between, and were couched as slanging matches between opposite sides – with Nigel Farage continually popping up to tell us about the dangers of unfettered immigration. I remember arguing with a Brexiteer group in Oxford Cornmarket who were distributing leaflets about the millions that would flow to the NHS if we left the EU, but who had no evidence to back up this assertion. There were some challenges to these claims on radio and TV, but the voices of impartial experts were seldom heard.

After the referendum, there were some stunning interviews with the populace exploring their reasons for voting. News reporters were despatched to Brexit hotspots, where they interviewed jubilant supporters, many of whom stated that the UK would now be cleansed of foreigners and British sovereignty restored. Some of them also mentioned funding of the NHS: the general impression was that being in the EU meant that an emasculated Britain had to put up with foreigners on British soil while at the same time giving away money to foreigners in Europe. The EU was perceived as a big bully that took from us and never gave back, and where the UK had no voice. The reporters never challenged these views, or asked about other issues, such as financial or other benefits of EU membership.

Of course there were people who supported Brexit for sound, logical reasons, but they seemed to be pretty thin on the ground. A substantial proportion of those voting seemed swayed by arguments about decreasing the number of foreigners in the UK and/or spending money on the NHS rather than 'giving it to Europe'.

Remainers who want another referendum seem to think that, now we've seen the reality of the financial costs of Brexit, and the exodus of talented Europeans from our hospitals, schools, and universities, the populace will see through the deception foisted on them in 2016. I wonder. If Nigel Farage wants a referendum, this could simply mean that he is more confident than ever of his ability to manipulate mainstream and social media to play on people's fears of foreigners. We now know more about sophisticated new propaganda methods that can be used on social media, but that does not mean we have adequate defences against them.

The only thing that would make me feel positive about a referendum would be if you had to demonstrate that you understood what you were voting for. You'd need a handful of simple questions about factual aspects of EU membership – and a person's vote would only be counted if these questions were accurately answered. This would, however, disenfranchise a high proportion of voters, and would be portrayed as an attack on democracy. So that is not going to happen. I think there's a strong risk that if we have another referendum, it will either be too close to call, or give the same result as before, and we'll be no further ahead.

But the most serious objection to another referendum is that it is a flawed method for making political decisions. As noted in this blogpost:

(A referendum requires) a complex, and often emotionally charged issue, to be reduced to a binary yes/no question.  When considering a relationship the UK has been in for over 40 years a simple yes/no or “remain/leave” question raises many complex and inter-connected questions that even professional politicians could not fully answer during or after the campaign. The EU referendum required a largely uninformed electorate to make a choice between the status quo and an extremely unpredictable outcome.

Rather than a referendum, I'd like to see decisions about EU membership made by those with considerable expertise in EU affairs who will make an honest judgement about what is in the best interests of the UK. Sadly, that does not seem to be an option offered to us.

Tuesday, 26 December 2017

Using simulations to understand p-values

Intuitive explanations of statistical concepts for novices #4

The p-value is widely used but widely misunderstood. I'll demonstrate this in the context of intervention studies. The key question is how confident can we be that an apparently beneficial effect of treatment reflects a change due to the intervention, rather than arising just through the play of chance. The p-value gives one way of deciding that. There are other approaches, including those based on Bayesian statistics, which are preferred by many statisticians. But I will focus here on the traditional null hypothesis significance testing (NHST) approach, which dominates statistical reporting in many areas of science, and which uses p-values.

As illustrated in my previous blogpost, where our measures include random noise, the distorting effects of chance mean that we can never be certain whether or not a particular pattern of data reflects a real difference between groups. However, we can compute the probability of obtaining data like ours if there really was no effect of intervention.

There are two ways to do this. One way is by simulation. If you repeatedly run the kind of simulation described in my previous blogpost, specifying no mean difference between groups, each time taking a new sample, for each result you can compute a standardized effect size. Cohen's d is the mean difference between groups expressed in standard deviation units, which can be computed by subtracting the group A mean from the group B mean, and dividing by the pooled standard deviation (i.e. the square root of the average of the variances for the two groups). You then see how often the simulated data give an effect size at least as large as the one observed in your experiment.
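As a minimal sketch of the calculation just described (this is not the script used for the figures; the function name cohens_d and the example scores are made up for illustration), Cohen's d for two equal-sized groups could be computed in R like this:

```r
# Hypothetical helper: Cohen's d for two groups of equal size.
# The pooled SD is the square root of the average of the two variances.
cohens_d <- function(a, b) {
  pooled_sd <- sqrt((var(a) + var(b)) / 2)
  (mean(b) - mean(a)) / pooled_sd
}

# Made-up scores, purely to show the function in use
groupA <- c(3, 5, 4, 6, 2)
groupB <- c(5, 7, 6, 8, 4)
cohens_d(groupA, groupB)
```

A positive value means group B scored higher than group A; the size of the value says how many standard deviations apart the two means are.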
Figure 1: Histograms of effect sizes obtained by repeatedly sampling from a population where there is no difference between groups*
Figure 1 shows the distribution of effect sizes for two different studies: the first has 10 participants per group, and the second has 80 per group. For each study, 10,000 simulations were run; on each run, a fresh sample was taken from the population, and the standardized effect size, d, computed for that run. The peak of each distribution is at zero: we expect this, as we are simulating the case of no real difference between groups – the null hypothesis. But note that, though the shape of the distribution is the same for both studies, the scale on the x-axis covers a broader range for the study with 10 per group than the study with 80 per group. This relates to the phenomenon shown in Figure 5 of the previous blogpost, whereby estimates of group means jump around much more when there is a small sample.

The dotted red lines show the cutoff points that identify the top 5%, 1% and 0.1% of the effect sizes. Suppose we ran a study with 10 people and it gave a standardized effect size of 0.3. We can see from the figure that a value in this range is fairly common when there is no real effect: around 25% of the simulations gave an effect size of at least 0.3. However, if our study had 80 people per group, then the simulation tells us this is an improbable result to get if there really is no effect of intervention: only 2.7% of simulations yield an effect size as big as this.
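The simulation behind these percentages can be sketched roughly as follows. This is a reconstruction under stated assumptions rather than the original script: the function name sim_null_d, the seed, and the use of rnorm to stand in for the population are choices made here for illustration.

```r
set.seed(1)  # arbitrary seed so the run can be repeated

# Simulate Cohen's d under the null hypothesis (no true group difference)
sim_null_d <- function(n_per_group, n_sims = 10000) {
  replicate(n_sims, {
    a <- rnorm(n_per_group)  # group A: mean 0, SD 1
    b <- rnorm(n_per_group)  # group B: same population, so no real effect
    (mean(b) - mean(a)) / sqrt((var(a) + var(b)) / 2)
  })
}

d10 <- sim_null_d(10)
d80 <- sim_null_d(80)

# Proportion of null simulations with an effect size of at least 0.3
p10 <- mean(d10 >= 0.3)
p80 <- mean(d80 >= 0.3)
c(n10 = p10, n80 = p80)
```

The two proportions should come out in the region of the values quoted above, though they will differ slightly with the seed and from run to run.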

The p-value is the probability of obtaining a result at least as extreme as the one that is observed, if there really is no difference between groups. So for the study with N = 80, p = .027. Conventionally, a level of p < .05 has been regarded as 'statistically significant', but this is entirely arbitrary. There is an inevitable trade-off between false positives (type I errors) and false negatives (type II errors). If it is very important to avoid false positives, and you do not mind sometimes missing a true effect, then a stringent p-value is desirable. If, however, you do not want to miss any finding of potential interest, even if it turns out to be a false positive, then you could adopt a more lenient criterion.

The comparison between the two sample sizes in Figure 1 should make it clear that statistical significance is not the same thing as practical significance. Statistical significance simply tells us how improbable a given result would be if there was no true effect. The larger the sample size, the smaller the effect size that would be detected at a threshold such as p < .05. Small samples are generally a bad thing, because they only allow us to reliably detect very large effects. But very large samples have the opposite problem: they allow us to detect as 'significant' effects that are so small as to be trivial. The key point is that the researcher who is conducting an intervention study should start by considering how big an effect would be of practical interest, given the cost of implementing the intervention. For instance, you may decide that staff training and time spent on a vocabulary intervention would only be justified if it boosted children's vocabulary by at least 10 words. If you knew how variable children's scores were on the outcome measure, the sample size could then be determined so that the study has a good chance of detecting that effect while minimising false positives. I will say more about how to do that in a future post.
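For a rough sense of what such a calculation looks like, base R's power.t.test function can estimate the sample size needed. Only the 10-word effect comes from the example above; the standard deviation of 20 words and the target of 80% power are assumptions chosen purely for illustration.

```r
# Assumed values for illustration only: SD of 20 words, 80% power, p < .05
res <- power.t.test(delta = 10, sd = 20, sig.level = 0.05, power = 0.80)
n_needed <- ceiling(res$n)  # participants required per group, rounded up
n_needed
```

If children's scores were less variable than assumed here, fewer participants would be needed to detect the same 10-word gain.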

I've demonstrated p-values using simulations in the hope that this will give some insight into how they are derived and what they mean. In practice, we would not normally derive p-values this way, as there are much simpler ways to do this, using statistical formulae. Provided that data are fairly normally distributed, we can use statistical approaches such as ANOVA, t-tests and linear regression to compute probabilities of observed results (see this blogpost). Simulations can, however, be useful in two situations. First, if you don't really understand how a statistic works, you can try running an analysis with simulated data. You can either simulate the null hypothesis by creating data from two groups that do not differ, or you can add a real effect of a given size to one group. Because you know exactly what effect size was used to create the simulated data, you can get a sense of whether particular statistics are sensitive to detect real effects, and how these might vary with sample size.
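Here is a minimal sketch of that first use: simulate two groups with a known effect, run a standard t-test, and check how the t-value relates to the observed effect size. The seed, sample size and effect size are arbitrary choices for illustration.

```r
set.seed(2)  # arbitrary seed for repeatability
n <- 80
true_effect <- 0.3           # known effect size added to group B
a <- rnorm(n)                # group A: no intervention effect
b <- rnorm(n) + true_effect  # group B: shifted by the true effect

# Standard two-group t-test, assuming equal variances (two-tailed p-value)
tt <- t.test(b, a, var.equal = TRUE)

# Observed effect size, and the t-value it implies for equal-sized groups
d_obs <- (mean(b) - mean(a)) / sqrt((var(a) + var(b)) / 2)
c(d = d_obs, t_from_d = d_obs * sqrt(n / 2),
  t = unname(tt$statistic), p = tt$p.value)
```

The t-value from the test matches the one worked out from Cohen's d, which shows that the formula-based test is using the same ingredients as the simulation: the mean difference, the within-group variation and the sample size.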

The second use of simulations is for situations where the assumptions of statistical tests are not met – for instance, if data are not normally distributed, or if you are using a complex design that incorporates multiple interacting variables. If you can simulate a population of data that has the properties of your real data, you can then repeatedly sample from this and compute the probability of obtaining your observed result to get a direct estimate of a p-value, just as was done above.
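As a sketch of this second use, suppose scores follow a strongly skewed distribution. Everything specific here is an assumption for illustration: the exponential population, the group size of 10, and the hypothetical observed difference of 0.5.

```r
set.seed(3)  # arbitrary seed
# Assumed population: strongly skewed scores (exponential), not normal
population <- rexp(100000, rate = 1)

n <- 10                 # participants per group
observed_diff <- 0.5    # hypothetical observed mean difference (B minus A)

# Null distribution: both groups drawn from the same skewed population
null_diffs <- replicate(10000, {
  s <- sample(population, 2 * n)
  mean(s[(n + 1):(2 * n)]) - mean(s[1:n])
})

# Simulated one-tailed p-value: how often chance alone gives a difference this big
p_sim <- mean(null_diffs >= observed_diff)
p_sim
```

The logic is identical to the effect-size simulation earlier; only the population being sampled from has changed.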

The key point to grasp about a p-value is that it tells you how likely your observed evidence is, if the null hypothesis is true. The most widely used threshold is .05: if the p-value in your study is less than .05, then the chance of data at least as extreme as yours arising when the intervention had no effect is less than 1 in 20. You may decide on that basis that it's worth implementing the intervention, or at least investing in the costs of doing further research on it.

The most common mistake is to think that the p-value tells you how likely the null hypothesis is given the evidence. But that is something else. The probability of A (observed data) given B (null hypothesis) is not the same as the probability of B (null hypothesis) given A (observed data). As I have argued in another blogpost, the probability that if you are a man you are a criminal is not high, but if you are a criminal, the probability that you are a man is much higher. This may seem fiendishly complicated, but a concrete example can help.

Suppose Bridget Jones has discovered three weight loss pills: if taken for a month, pill A is a totally ineffective placebo, pill B leads to a modest weight loss of 2 lb, and pill C leads to an average loss of 7 lb. We do studies with three groups of 20 people; in each group, half are given A, B or C and the remainder are untreated controls. We discover that after a month, one of the treated groups has an average weight loss of 3 lb, whereas their control group has lost no weight at all. We don't know which pill this group received. If we run a statistical test, we find the p-value is .45. This means we cannot reject the null hypothesis of no effect – which is what we'd expect if this group had been given the placebo pill, A. But the result is also compatible with the participants having received pills B or C. This is demonstrated in Figure 2, which shows the probability density function for each scenario - in effect, the outline of the histogram. The red dotted line corresponds to our obtained result, and it is clear it is highly probable regardless of which pill was used. In short, this result doesn't tell us how likely the null hypothesis is – only that the null hypothesis is compatible with the evidence that we have.
Figure 2: Probability density function for weight loss pills A, B and C, with red line showing observed result
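The same point can be sketched numerically. The post does not state how variable the weight-loss scores are, so the standard error of 4 lb for the difference between group means is an assumption, chosen because it gives a two-tailed p-value close to the .45 quoted above. Treating each scenario as a normal curve is likewise a simplification.

```r
# Assumed standard error of the mean difference (lb); NOT given in the post
se_diff <- 4
observed <- 3                          # observed mean weight loss (lb)
true_loss <- c(A = 0, B = 2, C = 7)    # true effect of each pill (lb)

# Two-tailed p-value for the observed result under the null (pill A)
p_two_tailed <- 2 * pnorm(-abs(observed / se_diff))

# Height of each scenario's probability density at the observed result
dens <- dnorm(observed, mean = true_loss, sd = se_diff)
names(dens) <- names(true_loss)
relative <- dens / max(dens)           # relative to the best-fitting scenario
round(relative, 2)
```

Under these assumptions the observed result is not much less likely under pill A or pill C than under pill B, so the data alone cannot tell us which pill was given.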

Many statisticians and researchers have argued we should stop using p-values, or at least adopt more stringent levels of p. My view is that p-values can play a useful role in contexts such as the one I have simulated here, where you want to decide whether an intervention is worth adopting, provided you understand what they tell you. It is crucial to appreciate how dependent a p-value is on sample size, and to recognise that the information it provides is limited to telling you whether an observed difference could just be due to chance. In a later post I'll go on to discuss the most serious negative consequence of misunderstanding of p-values: the generation of false positive findings by the use of p-hacking.

*The R script to generate Figures 1 and 2 can be found here.

Thursday, 21 December 2017

Using simulations to understand the importance of sample size

Intuitive explanations of statistical concepts for novices #3

I'll be focusing here on the kinds of stats needed if you conduct an intervention study. Suppose we measured the number of words children could define on a 20-word vocabulary task. Words were selected so that at the start of training, none of the children knew any of them. At the end of 3 months of training, every child in the vocabulary training group (B) knew four words, whereas those in a control group (A) knew three words. If we had 10 children per group, the plot of final scores would look like Figure 1 panel 1.
Figure 1. Fictional data to demonstrate concept of random error (noise)

In practice, intervention data never look like this. There is always unexplained variation in intervention outcomes, and real results look more like panel 2 or panel 3. That is, in each group, some children learn more than average and some less than average. Such fluctuations can reflect numerous sources of uncontrolled variation: for instance, random error will be large if we use unreliable measures, or there may be individual differences in responsiveness to intervention in the people included in the study, as well as things that can fluctuate from day to day or even moment to moment, such as people's mood, health, tiredness and so on.

The task for the researcher is to detect a signal – the effect of intervention – from noise – the random fluctuations. It is important to carefully select our measures and our participants to minimise noise, but we will never eliminate it entirely.

There are two key concepts behind all the statistics we do: (a) data will contain random noise, and (b) when we do a study we are sampling from a larger population. We can make these ideas more concrete through simulation.

The first step is to generate a large quantity of random numbers. Random numbers can be easily generated using the free software package R: if you have this installed, you can follow this demo by typing in the commands shown in italic at your console. R has a command, rnorm, that generates normally distributed random numbers. For instance:

rnorm(10,0,1)

will generate 10 z-scores, i.e. random numbers with mean of 0 and standard deviation of 1. You get new random numbers each time you submit the command (unless you explicitly set something known as the random number seed to be the same each time). Now let's use R to generate 100,000 random numbers, and plot the output in a histogram. Figure 2 can be generated with the commands:

myz = rnorm(100000,0,1)
hist(myz)

Figure 2: Distribution of z-scores simulated with rnorm

This shows that numbers close to zero are most common, and the further we get from zero in either direction, the lower the frequency of the number. The bell-shaped curve is a normal distribution, which we get because we opted to generate random numbers following a normal distribution using rnorm. (Other distributions of random number are also possible; you can see some options here).
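A short sketch can confirm these properties and show what setting the random number seed does. The seed value 123 is arbitrary; any fixed number would work.

```r
set.seed(123)                # any fixed number gives repeatable results
myz <- rnorm(100000, 0, 1)   # 100,000 simulated z-scores
mean(myz)                    # should be close to 0
sd(myz)                      # should be close to 1

# Setting the same seed again reproduces exactly the same numbers
set.seed(123)
first_run <- rnorm(5)
set.seed(123)
second_run <- rnorm(5)
identical(first_run, second_run)
```

Without the set.seed lines, each call to rnorm would give a fresh set of numbers, which is what you want when exploring the play of chance.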

So you might be wondering what we do with this list of numbers. Well, we can simulate experimental data based on this population of numbers by specifying two things:
1. The sample size
2. The effect size – i.e., Cohen's d, the mean difference between groups in standard deviation (SD) units.

Suppose we want two groups, A and B, each with a sample size of 10, where group B has scores that are on average 1 SD larger than group A. First we select 20 values at random from myz:

mydata = sample(myz, 20) 

Next we create a variable, mygroup, that records which group each value belongs to, by combining ten repeats of 'A' with ten repeats of 'B'.

mygroup = c(rep('A', 10), rep('B', 10))  

Next we add the effect size, 1, to the last 10 numbers, i.e. those for group B:

mydata[11:20] = mydata[11:20] + 1 

Now we can plot the individual points clustered by group. First install and activate the beeswarm package to make a nice plot format:

install.packages('beeswarm')
library(beeswarm)

Then you can make the plot with the command:

beeswarm(mydata ~ mygroup) 

The resulting plot will look something like one of the graphs in Figure 3. It won't be exactly the same as any of them because your random sample will be different from the ones we have generated. In fact, this is one point of this exercise: to show you how numbers will vary from one occasion to another when you sample from a population.

If you just repeatedly run these lines, you will see how things vary just by chance:

mydata = sample(myz, 20) 
mydata[11:20] = mydata[11:20] + 1 
beeswarm(mydata ~ mygroup) 

Figure 3: Nine runs of simulated data from 2 groups: A comes from population with mean score of 0 and B from population with mean score of 1
Note how in Figure 3, the difference between groups A and B is far more marked in runs 7 and 9 than in runs 4 and 6, even though each dataset was generated by the same script. This is what is meant by the 'play of chance' affecting experimental data.
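To put numbers on this, the same steps can be wrapped in a function and run nine times, recording the observed difference between group means on each run. The function name one_run and the seed are illustrative choices, not part of the original script.

```r
set.seed(4)                       # arbitrary seed
myz <- rnorm(100000, 0, 1)        # population of z-scores, as above

# One simulated study: 10 per group, true effect of 1 SD added to group B
one_run <- function() {
  mydata <- sample(myz, 20)
  mydata[11:20] <- mydata[11:20] + 1
  mean(mydata[11:20]) - mean(mydata[1:10])   # observed B minus A
}

diffs <- replicate(9, one_run())  # nine runs, like the nine panels
round(diffs, 2)
```

The true difference is 1 on every run, yet the observed differences vary noticeably around that value.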

Now let's look at Figure 4, which gives output from another nine runs of a simulation. This time, some runs were set so that there was a true effect of intervention (by adding .6 to values for group B) and some were set with no difference between groups. Can you tell which simulations were based on a real effect?

Figure 4: Some of these runs were generated with effect size of .6, others had no difference between A and B
The answer is that runs 1, 2, 4, 8 and 9 came from runs where there was a real effect of .6 (which, by the standard of most intervention studies, is a large effect). You may have identified some of these runs correctly, but you may also have falsely selected run 3 as showing an effect. This would be a false positive, where we wrongly conclude there is an intervention effect when the apparent superiority of the intervention group is just down to chance. This type of error is known as a type I error. Run 2 looks like a false negative – we are likely to conclude there is no effect of intervention, when in fact there was one. This is a type II error. One way to remember this distinction is that a type I error is when you think you've won (1) but you haven't.
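How often each type of error occurs can be estimated by simulation. The sketch below uses a t-test with a threshold of p < .05 to decide whether a run 'shows an effect', which is an assumption made here for illustration (the exercise above relied on judging the plots by eye); the function name and seed are also arbitrary.

```r
set.seed(5)  # arbitrary seed

# Proportion of simulated studies reaching p < .05 on a two-group t-test
prop_significant <- function(effect, n = 10, n_sims = 2000) {
  p_values <- replicate(n_sims, {
    a <- rnorm(n)
    b <- rnorm(n) + effect
    t.test(b, a, var.equal = TRUE)$p.value
  })
  mean(p_values < 0.05)
}

type1_rate <- prop_significant(effect = 0)    # no true effect: false positives
power_06   <- prop_significant(effect = 0.6)  # true effect of .6 detected
type2_rate <- 1 - power_06                    # true effect missed
c(type1 = type1_rate, type2 = type2_rate)
```

With only 10 per group, the type II error rate is high: most simulated studies miss an effect that is really there.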

The importance of sample size 
Figures 3 and 4 demonstrate that, when inspecting data from intervention trials, you can't just rely on the data in front of your eyes. Sometimes, they will suggest a real effect when the data are really random (type I error) and sometimes they will fail to reveal a difference when the intervention is really effective (type II error). These anomalies arise because data incorporate random noise, which can generate spurious effects or mask real effects. This masking is particularly problematic when samples are small.

Figure 5 shows two sets of data: the top and bottom panels were derived from the same simulation, the only difference being the sample size: 10 per group in the top panels, and 80 per group in the bottom panels. In both cases, the simulation specified that group B scores were drawn from a population that had higher scores than group A, with an effect size of 0.6. The bold line shows the group average. The figure shows that the larger the sample, the more closely the results from the sample agree with the population from which it was drawn.
Figure 5: Five runs of simulation where true effect size = .6

When samples are small, estimates of the means will jump around much more than when samples are large. Note, in particular, that with the small sample size, on the third run, the mean difference between A and B is overestimated by about 50%, whereas in the fourth run, the mean for B is very close to that for A.

In the population from which these samples are taken the mean difference between A and B is 0.6, but if we just take a sample from this population, by chance we may select atypical cases, and these will have a much larger impact on the observed mean when the sample is small.
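The contrast between small and large samples can be checked with a short simulation. This is a sketch under simple assumptions (normally distributed scores, SD of 1 in both groups); the function name sim_diff, the seed and the number of runs are arbitrary.

```r
set.seed(6)  # arbitrary seed

# Observed mean difference (B minus A) when the true effect is 0.6
sim_diff <- function(n) {
  a <- rnorm(n)
  b <- rnorm(n) + 0.6
  mean(b) - mean(a)
}

diffs10 <- replicate(5000, sim_diff(10))
diffs80 <- replicate(5000, sim_diff(80))

# Both average close to 0.6, but the small-sample estimates are far more spread out
c(mean10 = mean(diffs10), sd10 = sd(diffs10),
  mean80 = mean(diffs80), sd80 = sd(diffs80))
```

The standard deviation of the estimates is much larger with 10 per group than with 80 per group, which is why a single small study can land a long way from the true value.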

In my next post, I will show how we can build on these basic simulations to get an intuitive understanding of p-values.

P.S. Scripts for generating the figures in this post can be found here.

Monday, 27 November 2017

Reproducibility and phonics: necessary but not sufficient

Over a hotel breakfast at an unfeasibly early hour (I'm a clock mutant) I saw two things on Twitter that appeared totally unrelated but which captured my interest for similar reasons.

The two topics were the phonics wars and the reproducibility crisis. For those of you who don't work on children's reading, the idea of phonics wars may seem weird. But sadly, there we have it: those in charge of the education of young minds locked in battle over how to teach children to read. Andrew Old (@oldandrewuk), an exasperated teacher, sounded off this week about 'phonics denialists', who are vehemently opposed to phonics instruction, despite a mountain of evidence indicating this is an important aspect of teaching children to read. He analysed three particular arguments used to defend an anti-phonics stance. I won't summarise the whole piece, as you can read what Andrew says in his blogpost. Rather, I just want to note one of the points that struck a chord with me. It's the argument that 'There's more to phonics than just decoding'. As Andrew points out, those who say this want to imply that those who teach phonics don't want to do anything else.
'In this fantasy, phonics denialists are the only people saving children from 8 hours a day, sat in rows, being drilled in learning letter combinations from a chalkboard while being banned from seeing a book or an illustration.'
This is nonsense: see, for instance, this interview with my colleague Kate Nation, who explains how phonics knowledge is necessary but not sufficient for competent reading.

So what has this got to do with reproducibility in science? Well, another of my favourite colleagues, Dick Passingham, started a little discussion on Twitter - in response to a tweet about a Radiolab piece on replication. Dick is someone I enjoy listening to because he is a fount of intelligence and common sense, but on this occasion, what he said made me a tad irritated:

This has elements of the 'more to phonics than just decoding' style of argument. Of course scientists need to know more than how to make their research reproducible. They need to be able to explore, to develop new theories and to see how to interpret the unexpected. But it really isn't an either/or. Just as phonics is necessary but not sufficient for learning to read, so are reproducible practices necessary but not sufficient for doing good science. Just as phonics denialists depict phonics advocates as turning children into bored zombies who hate books, those trying to fix reproducibility problems are portrayed as wanting to suppress creative geniuses and turn the process of doing research into a tedious and mechanical exercise. The winds of change that are blowing through psychology won't stop researchers being creative, but they will force them to test their ideas more rigorously before going public.

For those, like Dick, who were trained to do rigorous science from the outset, the focus on reproducibility may seem like a distraction from the important stuff. But the incentive structure has changed dramatically in recent decades with the rewards favouring the over-hyped sensational result over the careful, thoughtful science that he favours. The result is an enormous amount of waste - of resources, of time and careers. So I'm not going to stop 'obsessing about the reproducibility crisis.' As I replied rather sourly to Dick:

Friday, 24 November 2017

ANOVA, t-tests and regression: different ways of showing the same thing

Intuitive explanations of statistical concepts for novices #2

In my last post, I gave a brief explainer of what the term 'Analysis of variance' actually means – essentially you are comparing how much variation in a measure is associated with a group effect and how much with within-group variation.

The use of t-tests and ANOVA by psychologists is something of a historical artefact. These methods have been taught to generations of researchers in their basic statistics training, and they do the business for many basic experimental designs. Many statisticians, however, prefer variants of regression analysis. The point of this post is to explain that, if you are just comparing two groups, all three methods – ANOVA, t-test and linear regression – are equivalent. None of this is new but it is often confusing to beginners.

Anyone learning basic statistics probably started out with the t-test. This is a simple way of comparing the means of two groups, and, just like ANOVA, it looks at how big that mean difference is relative to the variation within the groups. You can't conclude anything by knowing that group A has a mean score of 40 and group B has a mean score of 44. You need to know how much overlap there is in the scores of people in the two groups, and that is related to how variable they are. If scores in group A range from 38 to 42 and those in group B range from 43 to 45 we have a massive difference with no overlap between groups – and we don't really need to do any statistics! But if group A ranges from 20 to 60 and group B ranges from 25 to 65, then a 4-point difference in means is not going to excite us. The t-test gives a statistic that reflects how big the mean difference is relative to the within-group variation. What many people don't realise is that the t-test is computationally equivalent to the ANOVA. If you square the value of t from a t-test, you get the F-ratio*.
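The equivalence is easy to check in R. The sketch below uses made-up scores (two groups of 20 with means near 40 and 44 and an assumed SD of 10) and asks for a t-test without the unequal-variance adjustment, so that t squared and F match exactly.

```r
set.seed(7)  # arbitrary seed
# Made-up scores for two groups of 20, for illustration only
score <- c(rnorm(20, mean = 40, sd = 10), rnorm(20, mean = 44, sd = 10))
group <- factor(rep(c('A', 'B'), each = 20))

# t-test with no correction for unequal variances
t_value <- unname(t.test(score ~ group, var.equal = TRUE)$statistic)

# F-ratio from a one-way ANOVA on the same data
f_value <- anova(lm(score ~ group))[1, "F value"]

c(t_squared = t_value^2, F = f_value)
```

Dropping var.equal = TRUE makes R apply its default adjustment for unequal variances, in which case the two numbers will be close but not identical.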

Figure 1: Simulated data from experiments A, B, and C.  Mean differences for two intervention groups are the same in all three experiments, but within-group variance differs

Now let's look at regression. Consider Figure 1. This is similar to the figure from my last post, showing three experiments with similar mean differences between groups, but very different within-group variance. These could be, for instance, scores out of 80 on a vocabulary test. Regression analysis focuses on the slope of the line between the two means, shown in black, which is referred to as b. If you've learned about regression, you'll probably have been taught about it in the context of two continuous variables, X and Y, where the slope b, tells you how much change there is in Y for every unit change in X. But if we have just two groups, b is equivalent to the difference in means.

So, how can it be that regression is equivalent to ANOVA, if the slopes are the same for A, B and C? The answer is that, just as illustrated above, we can't interpret b unless we know about the variation within each group. Typically, when you run a regression analysis, the output includes a t-value that is derived by dividing b by a measure known as the standard error, which is an index of the variation within groups.
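A brief sketch shows both points: with two groups coded 0 and 1, the slope b is the difference in means, and b divided by its standard error gives the same t-value as a t-test. The scores are made up for illustration, with an assumed SD of 10.

```r
set.seed(8)  # arbitrary seed
# Made-up scores for two groups of 20, for illustration only
score <- c(rnorm(20, mean = 40, sd = 10), rnorm(20, mean = 44, sd = 10))
group <- rep(c(0, 1), each = 20)   # 0 = group A, 1 = group B

fit <- summary(lm(score ~ group))$coefficients
b  <- fit['group', 'Estimate']     # slope: change in score from A to B
se <- fit['group', 'Std. Error']   # standard error of the slope

mean_diff <- mean(score[group == 1]) - mean(score[group == 0])
t_test <- t.test(score[group == 1], score[group == 0], var.equal = TRUE)

c(b = b, mean_diff = mean_diff, t_from_regression = b / se,
  t_from_t_test = unname(t_test$statistic))
```

The first two numbers agree, as do the last two, so the regression output contains everything the t-test reports.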

An alternative way to show how it works is to transform data from the three experiments to be on the same scale, in a way that takes into account the within-group variation. We achieve this by transforming the data into z-scores. All three experiments now have the same overall mean (0) and standard deviation (1). Figure 2 shows the transformed data – and you see that after the data have been rescaled in this way, the y-axis now ranges from -3 to +3, and the slope is considerably larger for Experiment C than Experiment A. The slope for z-transformed data is known as beta, or the standardized regression coefficient.

Figure 2: Same data as from Figure 1, converted to z-scores
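The rescaling can be sketched as follows. This is not the script that produced the figures: the three within-group SDs (12, 6 and 2), the 4-point true difference and the seed are assumptions chosen for illustration, and only the scores are converted to z-scores while group stays coded 0 and 1.

```r
set.seed(9)  # arbitrary seed
group <- rep(c(0, 1), each = 20)   # 0 = first group, 1 = second group

# Hypothetical experiments: same 4-point true difference, different noise
make_scores <- function(within_sd) {
  rnorm(40, mean = 40 + 4 * group, sd = within_sd)
}
experiments <- list(A = make_scores(12), B = make_scores(6), C = make_scores(2))

# Slope of the line between group means after converting scores to z-scores
z_slope <- sapply(experiments, function(score) {
  z <- as.numeric(scale(score))    # overall mean 0, SD 1
  unname(coef(lm(z ~ group))[2])
})
round(z_slope, 2)
```

The experiment with the least within-group variation should show the steepest slope once everything is on the same scale, even though the raw mean differences were set to be the same.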

The goal of this blogpost is to give an intuitive understanding of the relationship between ANOVA, t-tests and regression, so I am avoiding algebra as far as possible. The key point is that when you are comparing two groups, t and F are different ways of representing the ratio between variation between groups and variation within groups, and t can be converted into F by simply squaring the value. You can derive t from linear regression by dividing b or beta by its standard error - and this is automatically done by most stats programmes. If you are nerdy enough to want to use algebra to transform beta into F, or to see how Figures 1 and 2 were created, see the script Rftest_with_t_and_b.r here.

How do you choose which statistics to do? For a simple two-group comparison it really doesn't matter and you may prefer to use the method that is most likely to be familiar to your readers. The t-test has the advantage of being well-known – and most stats packages also allow you to make an adjustment to the t-value which is useful if the variances in your two groups are different. The main advantage of ANOVA is that it works when you have more than two groups. Regression is even more flexible, and can be extended in numerous ways, which is why it is often preferred.

Further explanations can be found here:

*It might not be exactly the same if your software does an adjustment for unequal variances between groups, but it should be close. It is identical if no correction is done.