Anna Severin and her colleagues used artificial intelligence to analyse peer-review reports.
Do more-highly cited journals have higher-quality peer review? Reviews are generally confidential and the definition of ‘quality’ is elusive, so this is a difficult question to answer. But researchers who used machine learning to study 10,000 peer-review reports in biomedical journals have tried. They invented proxy measures for quality, which they term thoroughness and helpfulness.
Their work, reported in a preprint article1 in July, found that reviews at journals with higher impact factors seem to devote more attention to a paper’s methods and less to suggesting improvements than do reviews at lower-impact journals. However, the differences between high- and low-impact journals were modest and variability was high. The authors say this suggests that a journal’s impact factor is “a bad predictor for the quality of review of an individual manuscript”.
Anna Severin, who led the study as part of her PhD in science policy and scholarly publishing at the University of Bern and the Swiss National Science Foundation (SNSF), spoke to Nature about this work and other efforts to study peer review on a large scale. Severin is now a health consultant at management consultancy Capgemini Invent in Germany.
The website Publons (owned by analytics firm Clarivate) has a database of millions of reviews, submitted by journals or by academics themselves. They gave us access because they’re interested in better understanding peer-review quality.
There is no definition. My focus groups with scientists, universities, funders and publishers showed me that ‘quality’ peer review means something different to everyone. Authors often want timely suggestions for improving their paper, for instance, whereas editors often want recommendations (with reasons) about whether to publish.
One approach is to use a checklist to systematically score one’s subjective opinion of a review, such as to what extent it comments on a study’s methods, interpretation or other aspects. Researchers have developed the Review Quality Instrument2 and the ARCADIA checklist3. But we couldn’t apply these checklists manually, at scale, to thousands of reviews.
We at the SNSF teamed up with political scientist Stefan Müller at University College Dublin, a specialist in using software to analyse texts, to evaluate the content of reviews using machine learning. We focused on thoroughness (whether sentences could be categorized as commenting on materials and methods, presentation, results and discussion, or the paper’s importance), and helpfulness (if a sentence related to praise or criticism, provided examples or made improvement suggestions).
We randomly picked 10,000 reviews from medical and life-sciences journals, and manually assigned the content of 2,000 sentences from them to none, one or more of these categories. Then we trained a machine-learning model to predict the categories of a further 187,000 sentences.
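As a rough illustration of the kind of sentence-level, multi-label classification this describes, the sketch below uses scikit-learn in Python. The library choice, category names and example sentences are assumptions for demonstration only, not the preprint’s actual model, features or data: a small set of hand-labelled sentences trains a classifier that then predicts one or more content categories for unlabelled sentences.

```python
# Minimal sketch (not the authors' actual pipeline) of multi-label sentence
# classification: TF-IDF features plus one-vs-rest logistic regression.
# All sentences and labels below are invented for illustration.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Hand-labelled training sentences (stand-ins for the ~2,000 coded sentences).
train_sentences = [
    "The sample size is too small to support the conclusions.",
    "Figure 2 is hard to read; consider larger axis labels.",
    "This is an important contribution to the field.",
    "The statistical analysis should use a mixed-effects model instead.",
]
train_labels = [
    {"materials_and_methods", "criticism"},
    {"presentation", "suggestion"},
    {"importance", "praise"},
    {"materials_and_methods", "suggestion"},
]

# Encode the multi-label targets as a binary indicator matrix.
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(train_labels)

# One binary classifier per category, on top of shared TF-IDF features.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(train_sentences, y)

# Predict categories for unlabelled sentences (stand-ins for the ~187,000).
new_sentences = ["Please clarify how participants were recruited."]
predicted = mlb.inverse_transform(model.predict(new_sentences))
print(predicted)
```

The key point the sketch captures is that each sentence can belong to none, one or several categories at once, which is why a one-vs-rest (per-category) setup, rather than a single multi-class label, fits the coding scheme described above.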
Journal impact factor does seem to be associated with peer-review content, and with the characteristics of reviewers. We found that reports provided for higher-impact journals tend to be longer, and the reviewers are more likely to be from Europe and North America. Compared with reviews at lower-impact journals, a greater proportion of the sentences in higher-impact journal reports concerns materials and methods, and a smaller proportion concerns the paper’s presentation or makes suggestions for improving the paper.
But these proportions varied widely even among journals with similar impact factors. So I would say this suggests that impact factor is a bad predictor of the ‘thoroughness’ and ‘helpfulness’ of reviews, which we interpret as proxies for aspects of ‘quality’.
Of course, this technique has limitations: machine learning always labels some sentences incorrectly, although our checks suggest that these errors don’t systematically bias the results. Also, we couldn’t examine whether the claims made in the reviews we coded are actually correct.
One computer-assisted study4 looked at aspects of the tone and sentiment of nearly half a million review texts — finding no link to area of research, type of reviewer or reviewer gender. This was done by members of the European Union-funded ‘PEERE’ research consortium, which has called for more sharing of data on peer review. In a separate study5 of gender bias in some 350,000 reviews, members of the PEERE team found that peer review doesn’t penalize manuscripts from female authors (although this doesn’t mean there is no discrimination against women in academia, the authors add).
Another team worked with the publisher PLOS ONE and examined more than 2,000 reports from its database, looking at aspects including sentiment and tone6.
We think our research is a first step showing that it is possible to assess the thoroughness and helpfulness of a review in a systematic, scalable way.
To improve peer review, training reviewers and giving clear instructions and guidelines on what journals want from a review will be helpful. To study it, a really important step would be to come up with measures of quality peer review that different stakeholders agree on — because different groups think it serves different functions. And making peer-review texts open instead of confidential, as some journals are starting to do, would help with all of this.
doi: https://doi.org/10.1038/d41586-022-02787-5
This interview has been edited for length and clarity.
1. Severin, A. et al. Preprint at https://arxiv.org/abs/2207.09821 (2022).
2. van Rooyen, S., Black, N. & Godlee, F. J. Clin. Epidemiol. 52, 625–629 (1999).
3. Superchi, C. et al. BMJ Open 10, e035604 (2020).
4. Buljan, I., Garcia-Costa, D., Grimaldo, F., Squazzoni, F. & Marušić, A. eLife 9, e53249 (2020).
5. Squazzoni, F. et al. Sci. Adv. 7, eabd0299 (2021).
6. Eve, M. P. et al. Reading Peer Review (Cambridge Univ. Press, 2021).