A couple of weeks ago, I posted a proposal for four academic reforms. Most notably, I suggested that academic journals should pre-accept articles based on their design before results are available to authors, which would reduce the system's bias toward statistically significant findings that fail to replicate in future studies.
Unsurprisingly, other people have had similar ideas. Here are some comments on the related proposals that I've come across:
1. Chris Said, a postdoc at the Center for Neural Science at NYU, emphasizes the importance of funding agencies in promoting replication:
Granting agencies should reward scientists who publish in journals that have acceptance criteria that are aligned with good science. In particular, the agencies should favor journals that devote special sections to replications, including failures to replicate. More directly, the agencies should devote more grant money to submissions that specifically propose replications.
The problem, however, is that most replications will continue to fail given the incentives produced by the current system. That's why I'm most interested in his proposal for funding agencies to give preference to scientists who publish in "outcome-unbiased" journals:
I would like to see some preference given to fully “outcome-unbiased” journals that make decisions based on the quality of the experimental design and the importance of the scientific question, not the outcome of the experiment. This type of policy naturally eliminates the temptation to manipulate data towards desired outcomes.
He makes a convincing argument that NIH, NSF, etc. could play a key role in overcoming the collective action problem inherent to switching to a new system. I still think leading journals could make a contribution on their own, but the funding agencies could play a key role, especially in fields that are heavily grant-driven. In practice, this could mean rewarding both scientists who have published in outcome-unbiased journals in the past and grant proposals that promise to submit the proposed study to such a journal.
2. George Mason economist Robin Hanson proposes results-blind peer review, a potentially more general approach that could be applied to non-experimental data:
I’d add an extra round of peer review. In the first round, all conclusions about signs, amounts, and significance would be blanked out. After a paper had passed the first round, the reviewers would see the full paper. While reviewers might then allow the conclusions to influence their evaluation, they could not as easily hide such bias. Reviewers who rejected on the second round after accepting on the first round would feel pressure to explain what about the actual results, over and above the method, suggested that the paper was poor.
Glymour and Kawachi offered a similar proposal in BMJ in 2005:
We offer a solution to this problem that lies at the disposal of journal editors. Preliminary editorial decisions could be based solely on the peer review of the introduction and methods sections of submitted papers. These two sections deal with the key issues on which editorial decisions would ideally be based: the importance of the research question and the potential for the study design and proposed analyses to inform that question.
Blinding reviewers to the results and discussion sections may pose some challenges to the reviewing process because elements of these later sections are also relevant for editorial decisions. However, these difficulties would probably be outweighed by the benefits of reducing publication bias. Peer reviewers might be asked to make a preliminary recommendation to the editor (reject or continue further review) on the basis of the merit of the study design and proposed data analyses—not on the findings themselves.
If manuscripts pass this initial stage then reviewers could be unblinded to the results and discussion sections. Our proposal could have the additional benefit of improving the clarity and detail of methods sections.
The problem, as a commenter on Hanson's blog notes, is that many reviewers have already read relevant papers in their field or seen talks about them at conferences, especially in social science (which is characterized by very long publication lags). Even if the reviewer has not read the paper in question before being assigned the review, it's often easy to look it up online and find the results. As a result, this approach could only work in fields where papers are not made public before they are published.
3. Columbia's Macartan Humphreys, Raul Sanchez de la Sierra, and Peter van der Windt have proposed "comprehensive registration" for experiments in political science:
Researchers in political science generally enjoy substantial latitude in selecting measures and models for hypothesis testing. Coupled with publication and related biases, this latitude raises the concern that researchers may intentionally or unintentionally select models that yield positive findings, leading to an unreliable body of published research. To combat this problem of "data fishing" in medical studies, leading journals now require preregistration of designs that emphasize the prior identification of dependent and independent variables. However, we demonstrate here that even with this level of advanced specification, the scope for fishing is considerable when there is latitude over selection of covariates, subgroups, and other elements of an analysis plan. These concerns could be addressed through the use of a form of "comprehensive registration." We experiment with such an approach in the context of an ongoing field experiment for which we drafted a complete "mock report" of findings using fake data on treatment assignment. We describe the advantages and disadvantages of this form of comprehensive registration and propose that a comprehensive but non-binding approach be adopted as a first step in registration in political science. Likely effects of a comprehensive but non-binding registration are discussed, the principal advantage being communication rather than commitment, in particular that it generates a clear distinction between exploratory analyses and genuine tests.
Unfortunately, the incentives to engage in this form of registration are weak. The comprehensive report format limits authors' ability to produce the statistically significant findings that reviewers demand and may lead authors to opt out of registration or to shelve non-significant findings. That's why it's essential that pre-accepted articles be offered as an option to authors by top journals.
4. MIT's David Karger has suggested changing the submission requirements for conference papers in computer science so that evaluations of the proposed system are conducted after acceptance, increasing the incentives for evaluation and reducing incentives to report that the evaluation results were successful.
5. Perhaps most notably, Northwestern's Philip Greenland, the past editor of the Archives of Internal Medicine, conducted a pilot study of "mechanisms that might identify and reduce biases," including a two-stage review process:
First, to understand the tendency of authors to submit mostly positive studies, we assessed the percentage of positive articles that authors submitted to the Archives. Of 100 consecutive submitted manuscripts assessed in June and July of 2008, 77% reported significant primary results, based on editors' assessments of the results. If the articles had been categorized based on the authors' interpretation of their analyses, a higher percentage of manuscripts would have fallen into the positive category. Of the manuscripts sent out for external peer review, over 83% of positive studies were accepted by the Archives. Only 3 negative studies were sent to external review, of which only 1 was ultimately accepted. Overall, only 5.3% of all negative studies that were submitted were accepted.
Recognizing that publication bias can result from reviewers' enthusiasm for positive results, we next evaluated the willingness of our 58 most highly rated and prolific peer reviewers to participate in an alternate peer-review process. The proposed hypothetical alternate process involved 2 steps. First, peer reviewers would have access only to a modified abstract containing no mention of results, the full introduction describing the nature of the research question, and a complete "Methods" section to allow an evaluation of the quality of the research. With this information available, the reviewers would be asked to provide a preliminary assessment of the manuscript in the absence of the "Results" section. Following this preliminary assessment, we proposed that reviewers would gain access to the full article, including the "Results" section, and be asked to make a final evaluation of the manuscript. We hypothesized that this 2-stage procedure would force peer reviewers to make an initial evaluation solely based on the quality of the methods and that the result would be a more equitable consideration of well-performed negative studies. Of the 43 respondents, 37 (>86%) stated that they were willing to complete a full review following an abbreviated one as described herein.
We then turned to an assessment of the role of the editorial board. Prior to peer review, editors may decide to reject articles on their face value. Furthermore, editors assign reviewers and render final decisions after receiving reviewer comments. At the Archives, an editorial estimate of study rejection without any external peer review was roughly 70% of all submissions, whereas a JAMA study reported a 50% editorial rejection rate at that journal. These substantial figures suggest that any investigation of publication bias at the journal level ought to begin with, or at least include, the editors. Consequently, the aforementioned alternate review process was applied to the editorial review that occurred prior to outside peer review. In a pilot study, among a selection of submitted articles, a study was characterized as positive if an author's conclusion about his or her primary outcome was portrayed as such. Of the 46 articles examined, 28 were positive, and 18 were negative (with an explicit attempt to oversample negative studies in this pilot research). Ultimately, 36 of the 46 articles (>77%) were rejected, consistent with prior publication decisions at this journal. Of note, editors were consistent in their assessment of a manuscript in both steps of the review process in over 77% of cases. This suggests that most of the time the editors' decision after reviewing the "Methods" section alone does not change after reviewing the full results.
Although this provides some comfort, it is important to look at not only the majority of manuscripts but also the tail ends of the curve, because this is most likely where any bias would lie. In doing so, we found that over 7% of positive articles benefited from editors changing their minds between steps 1 and 2 of the alternate review process, deciding to push forward with peer review after reading the results. By contrast, in this small study, we found that this never occurred with the negative studies. Indeed, 1 negative study, which was originally queued for peer review after an editor's examination of the introduction and "Methods" section, was removed from such consideration after the results were made available.
We admit that these findings are neither conclusive nor definitive but rather a descriptive analysis from a pilot study. Certainly, it is reassuring that the editors were mostly consistent in their opinions regardless of the results. However, in the minority of cases in which bias matters, the influence of the results on the editor's decision to move to peer review and ultimately to publication is still uncertain. There is a dearth of rigorous research on editorial bias and the possible interventions that may attenuate it. The alternate review process piloted at the Archives has never been performed before, to the best of our knowledge, although it has been suggested. Importantly, such a mechanism can be implemented both with editors and peer reviewers, addressing 2 sources of potential bias over which a medical journal can have the most direct impact. The negative trial by Etter et al published in this issue of the Archives was a part of our pilot study. Obviously, the editors supported peer review and publication of this study based on the rigor and quality of its methods alone, and that decision was sustained even when the negative results were revealed to them.
Greenland is to be commended for his willingness to innovate, but the results reported above suggest some of the challenges that a two-stage review system will face and the need for further experimentation by journals. Most disappointingly, the current instructions to reviewers provided by the journal make no mention of the two-stage process, suggesting that the approach has been abandoned by his successors. Let's hope some other journal editor out there is willing to experiment further.
Update 4/30 10:06 AM: One challenge raised by Chris Said via email is how these approaches could be adapted to fields like neuroscience in which articles typically include several studies that build on each other. Here are two possible approaches I've contemplated:
1. The journal offers rounds of results-blind reviewing in which authors propose Study 1, get results, and then come back for a second round of results-blind reviewing. This approach would ensure that each round was fully outcome-unbiased, but would increase the burden on reviewers and editors.
2. An alternate option would be for authors to conduct a set of exploratory studies 1...x and then submit the design and analysis plan for study x+1 on a pre-accepted basis. Readers would then be told that the results of studies 1...x were not pre-specified but that study x+1 was pre-specified.
Also, I've updated the Humphreys item above to include his co-authors on the paper in question (which is not yet available publicly). Finally, see Hanson's followup item here.
A few weeks ago, I noted the arrival of the GSA scandal as the first under President Obama to meet the standard used in my research: a front-page Washington Post story that focuses on the controversy and describes it as a "scandal" in the reporter's own voice. I then taped an interview with NPR's On the Media about my research in which I noted the role that slow news periods play in fomenting scandal and suggested that Obama was vulnerable to executive branch scandal in the period before the fall campaign.
Just two days after the interview was taped, news broke of Secret Service agents hiring prostitutes in Colombia. This controversy quickly became Obama's second scandal according to the standard I've proposed. Indeed, it has racked up six front-page Post stories since April 17 (by comparison, the GSA scandal has had only two).* After years of avoiding scandal, President Obama is learning how easily it can engulf an administration - it's quite a reversal.
* The Post appears to produce different versions of its front page for different editions. To maintain consistency with my research, my posts on Obama scandal coverage in the Post use the articles and page numbers archived in the Nexis news database.
Academia tends to be slow to embrace change, but here are a few ideas that I think are worth considering for improving how we evaluate students, conduct research, and run our journals.
1. The pass/fail first semester
Two of the most significant problems we face in higher education are grade inflation and underprepared students. There are no easy answers to either problem, but one of the best approaches I've seen is the pass/fail first semester used at Swarthmore College (my alma mater). Let me quote from a blog post written by a first-year student there last fall, which I just came across on Google -- it's completely consistent with my experience:
The first semester for every first-year at Swat is pass/fail. I love this system, and it’s one of so many reasons why the approach to academics at Swarthmore is fantastic.
Taking classes pass/fail deemphasizes the importance of grades. That seems obvious, and we heard that over and over again from the administration, our advisors, and upper class students. I didn’t really internalize the significance of that, however, until just recently…
The pass/fail semester helps first-years adjust to college. With some stress removed from academics, there’s more time to focus on other aspects of college: meeting new friends, joining interesting clubs, and trying not to get lost on the way to the fitness center (I had particular trouble with that last one). I’m not saying that this first semester is a breeze, or that it should be. It’s important to learn study habits that work for college, and figuring out how to manage your time is obviously essential (for example, spending one hour online-shopping for every half hour spent reading did not end up working for me). What’s great is being able to adjust without having to simultaneously stress out about grades.
Grades will come next semester, but the class of 2015 will tackle our workload with a greater appreciation for the material learned, and an understanding of the importance of the learning process, not just the grade received at the end of the year. I’m so glad Swarthmore gave us this adjustment period.
The pass/fail semester helps students get excited about learning for learning's sake before worrying about grades, and it provides underprepared students with a chance to catch up before their performance is recorded on their permanent transcript. It's worth considering whether the practice should be adopted both here at Dartmouth and elsewhere in higher education.
2. The pre-accepted article
Academics face intense pressure to publish new findings in top journals. In practice, those incentives create massive publication bias. Social scientists tend to think of medical and scientific journals as being more rigorous, but even most of the results published in those journals tend to fail to replicate. While some fraud may occur, the problem is more likely to be one of self-deception -- as human beings, we're simply too good at rationalizing choices that produce the results we want.
One response to this concern is preregistration of experimental trials -- a practice that is mandated in some areas of medicine and is beginning to be done voluntarily by some social science researchers conducting field experiments (particularly in development economics). The idea is that the author has publicly stated his or her hypotheses before the data have been collected and that the results are therefore less likely to be spurious. The best example of this that I know of is the Oregon Health Insurance Experiment, which publicly archived its analysis plan before any data were available and explicitly labeled all unplanned analyses in their manuscript (PDF).
Unfortunately, preregistration alone will not solve the problem of publication bias. First, authors have little incentive to engage in the practice unless it is mandated by regulators or the journal to which they are submitting. In addition, authors may still make arbitrary choices in how they code, analyze, and present the results of preregistered trials. But most fundamentally, if trial results are more likely to be published when they deliver statistically significant results, then publication bias is still likely to ensue.
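To make that last point concrete, here is a minimal simulation sketch. It rests on assumptions of my own choosing rather than on any particular study: a small true effect of 0.1 standard deviations, 50 observations per study, and a hypothetical filter in which only significantly positive results (z > 1.96) get published. The function name and parameters are illustrative.

```python
import random
import statistics


def simulate_publication_bias(true_effect=0.1, n_per_study=50,
                              n_studies=2000, seed=0):
    """Compare the average estimate across all simulated studies with the
    average among the studies that clear a significance filter."""
    rng = random.Random(seed)
    # Standard error of a sample mean when the outcome has unit variance.
    standard_error = 1 / n_per_study ** 0.5
    all_estimates = []
    published = []
    for _ in range(n_studies):
        sample = [rng.gauss(true_effect, 1.0) for _ in range(n_per_study)]
        estimate = statistics.fmean(sample)
        all_estimates.append(estimate)
        # Assumed filter: only significantly positive results are published.
        if estimate / standard_error > 1.96:
            published.append(estimate)
    return (statistics.fmean(all_estimates),
            statistics.fmean(published),
            len(published) / n_studies)


mean_all, mean_published, share_published = simulate_publication_bias()
print(f"mean estimate, all studies:       {mean_all:.3f}")
print(f"mean estimate, published studies: {mean_published:.3f}")
print(f"share of studies published:       {share_published:.1%}")
```

Under these assumptions, the full set of studies averages out close to the true effect, while every published estimate has to exceed roughly 0.28 just to clear the filter, so the published record overstates the effect even though each individual study was run honestly.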
In the case of experimental data, a better practice would be for journals to accept articles before the study was conducted. The article should be written up to the point of the results section, which would then be populated using a pre-specified analysis plan submitted by the author. The journal would then allow for post-hoc analysis and interpretation by the author that would be labeled as such and distinguished from the previously submitted material. By offering such an option, journals would create a positive incentive for preregistration that would avoid file drawer bias. More published articles would have null findings, but that's how science is supposed to work. A shift to a preregistered article system would also create healthy pressure on authors, editors, and reviewers to (a) focus on topics where we care about the null hypothesis; (b) keep articles short; and (c) make sure studies have enough statistical power to have a high likelihood of capturing the effect of interest (if real).
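As a rough illustration of point (c), the sketch below uses the standard normal-approximation formula for a two-sided comparison of two group means to estimate how many subjects per group a design would need. The function name, the default significance level of 0.05, and the 80% power target are my assumptions for the example, not requirements any journal has stated.

```python
import math
from statistics import NormalDist


def subjects_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate sample size per group for a two-sided, two-sample
    comparison of means, with effect_size in standard deviation units."""
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value
    z_power = NormalDist().inv_cdf(power)          # quantile for target power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


for d in (0.2, 0.5, 0.8):
    print(f"effect size {d}: {subjects_per_group(d)} subjects per group")
```

The required sample grows with the inverse square of the effect size, which is why a reviewer evaluating a design before results exist has good reason to ask what effect the authors realistically expect to detect.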
3. The replication audit
Ideally, every journal should follow the practice of the American Economic Review and require authors to submit a full replication archive before publication. However, my colleague Brian Greenhill has suggested a way that journals or professional associations could go even further to encourage careful research practice: conduct replication audits of a random subset of published articles. At a minimum, these audits would verify that all the results in an article could be replicated. They could conceivably go further in some cases and try to recreate the author's data and results from publicly available sources, re-run lab experiments, etc. when possible. An audit system would of course work best for journals that require replication archives to be made available -- otherwise, it could discourage authors from sharing replication data.
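The random selection step could be as simple as the sketch below. The 10% audit rate, the article identifiers, and the function name are all hypothetical; the one design choice worth noting is the recorded seed, which would let anyone verify after the fact that the audited articles were in fact drawn at random.

```python
import random


def draw_audit_sample(article_ids, fraction=0.1, seed=None):
    """Return a reproducible random subset of articles to audit."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ids = list(article_ids)
    if not ids:
        return []
    rng = random.Random(seed)
    # Audit at least one article, even for a very small volume.
    k = max(1, round(len(ids) * fraction))
    return sorted(rng.sample(ids, k))


# Hypothetical identifiers for one year's published articles.
published = [f"article-{i:03d}" for i in range(1, 41)]
print(draw_audit_sample(published, fraction=0.1, seed=2012))
```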
4. A frequent flier system for journals
Journals depend on the free labor provided by academics in the peer review process. Reviewing is a largely thankless task whose burden falls disproportionately on prominent and public-minded scholars, who receive little credit for the work that they do. As a result, manuscripts are often stuck in review limbo for months, slowing the publication process and stalling both the production of knowledge and the careers of the authors in question. How can we do better?
One idea is to develop a points system for each journal analogous to frequent flier miles. Each review would earn a scholar a certain number of points, with bonuses awarded by editors for especially timely or high-quality reviews. That scholar could then cash in those points when submitting to that journal in order to request a rapid review of his or her own manuscript. The journal would in turn offer those points to reviewers who review the manuscript quickly, helping to speed it through the process. It would not be useful for reviewers who don't submit to the journal in question, but for reviewers and authors who interact with a journal over a period of decades, it could help provide greater incentives for rapid and thoughtful reviewing.
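Here is a minimal sketch of the bookkeeping such a system would need. The class name and the point values (10 points per review, 30 points for a rapid-review request) are invented for illustration, and the step in which the journal passes spent points on to fast reviewers is left out to keep the example short.

```python
from dataclasses import dataclass, field


@dataclass
class ReviewerLedger:
    """Tracks review points for one journal (all values are hypothetical)."""
    points_per_review: int = 10
    rapid_review_cost: int = 30
    balances: dict = field(default_factory=dict)

    def credit_review(self, scholar, bonus=0):
        """Credit a completed review, plus any editor-awarded bonus."""
        earned = self.points_per_review + bonus
        self.balances[scholar] = self.balances.get(scholar, 0) + earned
        return self.balances[scholar]

    def request_rapid_review(self, scholar):
        """Spend points on a rapid review; return False if too few points."""
        balance = self.balances.get(scholar, 0)
        if balance < self.rapid_review_cost:
            return False
        self.balances[scholar] = balance - self.rapid_review_cost
        return True


ledger = ReviewerLedger()
ledger.credit_review("scholar_a")
ledger.credit_review("scholar_a")
ledger.credit_review("scholar_a", bonus=5)  # editor bonus for a timely review
print(ledger.request_rapid_review("scholar_a"), ledger.balances["scholar_a"])
```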
Update 4/27 10:16 AM: Please see my followup post for more on pre-accepted articles.
Also, it turns out that a large group of psychologists is conducting a collaborative replication audit, called The Reproducibility Project, of psychology articles published in top journals in 2008 - see this article in the Chronicle of Higher Education for more about the project.
Finally, I recently discovered that the American Medical Association offers continuing medical education credits to reviewers for Archives of Internal Medicine who "have completed their review in 21 days or less with a rating of good or better." CME credits are presumably not as strong an incentive as faster review of one's own articles, but I assume they're better than nothing.
I've set up a Tumblr version of this blog so that those who use the service can follow me there, reblog posts, etc. Please check it out if you're a Tumblr user!
I have a new column at Columbia Journalism Review on how bored reporters and social media can hype fake controversies and spread misinformation. Here's how it begins:
When Rick Santorum suspended his candidacy for the GOP presidential nomination on Tuesday, he removed any remaining doubt that Mitt Romney would be the Republican presidential nominee. The result is a news vacuum that can easily be filled by spin and misinformation.
Consider the ridiculous debate over comments made on CNN about stay-at-home moms by Hilary Rosen, which dominated the news cycle and the political Twittersphere yesterday. As NBC’s First Read points out, while “manufactured controversies are nothing new in American politics,” what is new “is how much faster and professionalized—due to Twitter and the drive to make something go viral—these manufactured controversies have become.” Such controversies can be especially potent as we enter what First Read calls “silly season.” When few competing stories exist and political reporters are starved for material, any whiff of scandal or controversy can create a feeding frenzy. A bored media is dangerous for politicians.
Read the whole thing for more.
I have a new post up at Columbia Journalism Review on how the media's focus on swing states and voter demographics misses the big picture. Here's how it begins:
With Mitt Romney’s hold on the GOP nomination becoming too obvious to deny, horse race enthusiasts in the political media have quickly shifted to handicapping the general election. Unfortunately, their recent focus on key states and demographics—in particular, the effects of the contraception controversy on women voters in battleground states—threatens to obscure more fundamental factors that are most likely to shape the outcome of the campaign.
Read the whole thing for more.
The first Obama scandal has arrived.
Last May, I wrote a column on how the Obama administration had managed to avoid scandal* for longer than we might otherwise expect:
My research (PDF) on presidential scandals shows that few presidents avoid scandal for as long as he has. In the 1977-2008 period, the longest that a president has gone without having a scandal featured in a front-page Washington Post article is 34 months – the period between when President Bush took office in January 2001 and the Valerie Plame scandal in October 2003. Obama has already made it almost as long despite the lack of a comparable event to the September 11 terrorist attacks. Why?
I attributed Obama's resilience in part to "the number and magnitude of competing news stories" during his tenure, which I show play a key role in the likelihood and severity of presidential scandal (PDF). (See Jonathan Alter's Washington Monthly piece for a discussion of other possible explanations.) However, I predicted that "the likelihood of a presidential or executive branch scandal before the 2012 election are quite high" and that, "[g]iven Obama’s reputation for personal integrity, the controversy will likely concern actions taken within the executive branch."
After that column was published, Obama avoided scandal for longer than I expected. Despite close calls with Solyndra and Operation Fast and Furious, Obama broke George W. Bush's record in October for the longest scandal-free period among presidents in the contemporary era using the measure described above from my research (a front-page Washington Post story focused on a scandal that describes it as such in the reporter's own voice) -- see Elspeth Reeve's coverage at The Atlantic Wire here and here.
Today, however, my predictions were validated when the Washington Post published a front-page story that twice describes a controversy over alleged excessive spending at a General Services Administration conference in Las Vegas as a "scandal." In print, the story ran under the headline "GSA rocked by spending scandal" (PDF). While this controversy seems unlikely to have much staying power or to damage Obama politically, its emergence is consistent with the news cycle theory I advance -- the improving economy and Mitt Romney's impending triumph in the Republican presidential nomination race have reduced the newsworthiness of two stories that have dominated the news in recent months, which in turn increases the likelihood that allegations of unethical or improper behavior will receive prominent coverage. The question now is whether the GSA controversy signals the resumption of scandal politics as usual in Washington.
* I define scandal as a widespread elite perception of wrongdoing. My research analyzes the effects of political and media context on when scandals are thought to have occurred, not whether Obama or other presidents actually engaged in misbehavior (a question that cannot easily be measured or quantified objectively).
I am an assistant professor in the Department of Government at Dartmouth College. I received my Ph.D. from the Department of Political Science at Duke University in 2009 and served as an RWJ Scholar in Health Policy Research at the University of Michigan from 2009 to 2011. I also tweet at @BrendanNyhan and serve as a contributor to The Upshot, a New York Times politics/policy website that is currently in development. Previously, I served as a media critic for Columbia Journalism Review, co-edited Spinsanity, a non-partisan watchdog of political spin, and co-authored All the President's Spin. For more, see my Dartmouth website.