The Back Story on “Can cancer researchers accurately judge whether preclinical reports will reproduce?”

How well can researchers predict whether high-profile preclinical findings will reproduce? This week in PLoS Biology, STREAM reports the results of a study suggesting the answer is “not very well.” You can read about our methods, assumptions, results, claims, etc. in the original report (here) or in various press coverage (here and here). Instead, I will use this blog entry to reflect on how we pulled this paper off.

This was a bear of a study to complete, for many reasons. Studying experts is difficult- partly because, by definition, experts are scarce. They also have limited time. Defining who is and who is not an expert is also difficult. Another challenge is studying basic and preclinical research. Basic and preclinical researchers do not generally follow pre-specified protocols, and they certainly do not register their protocols publicly. This makes it almost impossible to conduct forecasting studies in this realm. We actually tried a forecast study asking PIs to forecast the results of experiments in their own labs (we hope to write up results at a later date); to our surprise, a good many planned experiments were never done, or when they were done, they were done differently than originally intended, rendering the forecasts irrelevant. So when it became clear that the Reproducibility Project: Cancer Biology was a go and that they were working with pre-specified and publicly registered protocols, we leapt at the opportunity.

For our particular study of preclinical research forecasting, there was another challenge. Early on, we were told that the Reproducibility Project: Cancer Biology was controversial. I got a taste of that controversy in many conversations with cancer biologists, including one who described the initiative as “radioactive- people don’t even want to acknowledge it’s there.”

This controversy probably accounts for some of the challenges we faced in recruiting a meaningful sample, and to some extent in peer review. Regarding the former, my sedulous and perseverant postdoc, Danny Benjamin- working together with some great undergraduate research assistants- devised and implemented all sorts of methods to boost recruitment. In the end, we were able to get a good-sized (and, it turns out, representative) sample. But this is a tribute to Danny’s determination.

Our article came in for some pretty harsh comments on initial peer review. In particular, one referee seemed fiendishly hostile to the RP:CB. The reviewer was critical of our focusing on xenograft experiments, which “we now know are impossible to evaluate due to technical reasons.” Yes- that’s right, we NOW know this. What we were trying to determine was if people could predict this!

The reviewer also seemed to pre-judge the replication studies (as well as the very definition of reproducibility, which is very slippery): “we already know that the fundamental biological discovery reported in several of these has been confirmed by other published papers and by drug development efforts in biopharma.” But our survey was not asking people to predict whether fundamental biological discoveries were true. We were asking whether particular experiments- when replicated based on publicly available protocols- could produce the same relationships.

The referee was troubled by our reducing reproducibility to a binary (yes/no). That was something we struggled with in design. But forecasting exercises are only useful insofar as events are verifiable and objective (no point in asking for forecasts if we can’t define the goalposts, or if the goalposts move once we see the results). We toyed with creating a jury to referee reproducibility- and using jury judgments to verify forecasts. But in addition to being almost completely impractical, it would be methodologically dubious: forecasts would- in the end- be forecasts of jury judgments, not of objectively verifiable data. To be a good forecaster, you’d need to peer into the souls of the jurors, as well as the machinery of the experiments themselves. But we were trying to study scientific judgment, not social judgment.

Our paper- in the end- potentially pours gasoline/petrol/das Benzin on a fiery debate about reproducibility (i.e., not only do many studies not reproduce, but scientists also have limited awareness of which studies will reproduce). Yet we caution against facile conclusions. For one, there were some good forecasters in our sample. But perhaps more importantly, ours is one study- one ‘sampling’ of reality subject to all the limitations that come with methodology, chance, and our own very human struggles with bias. In the end- I think the findings are hopeful insofar as they suggest that part of what we need to work on in science is not merely designing and reporting experiments, but learning to make proper inferences (and communicating effectively) about the generalizability of experimental results. Those inferential skills seem to be on display with one of our star forecasters- Yale grad student Taylor Sells (named on our leaderboard)- “We often joke about the situations under which things do work, like it has to be raining and it’s a Tuesday for it to work properly…as a scientist, we’re taught to be very skeptical of even published results… I approached [the question of whether studies would reproduce] from a very skeptical point of view.”
