=Paper=
{{Paper
|id=Vol-3230/paper-03
|storemode=property
|title=Can AI-estimated article quality be used to rank scholarly documents?
|pdfUrl=https://ceur-ws.org/Vol-3230/paper-03.pdf
|volume=Vol-3230
|authors=Mike Thelwall
|dblpUrl=https://dblp.org/rec/conf/birws/Thelwall22
}}
==Can AI-estimated article quality be used to rank scholarly documents?==
Mike Thelwall
University of Wolverhampton, Wolverhampton WV1 1LY, UK
Abstract
This paper discusses the potential for machine learning to predict the quality of scholarly documents to help
rank them in information retrieval systems. Quality-based rankings may help users without the time or
expertise to assess the value of the publications suggested by a system. It is argued that systems that
estimate the quality of documents with a useful degree of accuracy may become possible, given the
increasing availability of reviews and scores online.
A key feature of scholarly information retrieval systems is their ranking algorithms. Users
may focus on the first documents that they see, unless they are attempting a comprehensive
review. The use of citation information to rank scholarly search results is arguably appropriate
for academics because citations are an obvious, but partial, indicator of scholarly uptake or
utility. A document that has been cited a lot is very likely to have been read by many publishing
researchers and found useful enough to cite. In contrast, end users may be more interested in
applied research. For this goal, citations may be less helpful, especially if they tend to point
to basic or methodological papers rather than practical applications. Their value may also be
undermined by attempts to manipulate them (e.g., [1]). For end users, it may therefore be better
to rank papers by quality rather than by citation impact. Ranking-by-quality may also help in
the era of predatory publishing, by pointing end users and junior academics to high quality work
that is relevant to their needs. Both user groups may lack the time or experience to perform
effective quality control on search results. Whilst this issue may be resolved by collaborative
filtering approaches (e.g., [2]), it would be useful to rank documents before they have been seen.
The main reason why academic articles are not ranked by quality in any mainstream scholarly
database may be that such quality scores are not available for most articles. Both journals and
conferences usually make binary publishing decisions (accept/reject) after reviewing and do
not publish a quality assessment or reviewers’ quality scores. However, there are increasingly many
exceptions (e.g., some open peer review conferences, F1000 post-publication ratings [3]), and
post-publication peer review scores for articles may become more common, so there may soon be
enough public peer review scoring data for systems to harness. The
score data may be supplemented by algorithms to classify reviews or post-publication comments
(e.g., [4]) for sentiment, or to detect problematic content in articles (e.g., [5, 6]). An alternative
method of generating article quality scores would be to apply machine learning to a sample of
articles with peer reviews, perhaps from the aforementioned sources, and then use the trained
algorithms to estimate the quality scores for the remainder.
Following on from the above, is it possible and desirable to use machine learning to estimate
the quality of an academic article to support ranking in academic information retrieval systems?
One study has predicted proxy-quality scores for articles with machine learning, using journal
impact (split into thirds) as a proxy for article quality. Using this heuristic, it is possible to
generate proxy quality predictions that are substantially above the baseline in some fields, but
not others [7]. This suggests that automatically detecting quality will be much more difficult in
some fields than in others. Intuitively, quality would be easier to check in more hierarchical fields
with standardised methods, given that deviations from best practice could theoretically be detected. In
contrast, in a humanities field, it might take substantial or wide field knowledge
to judge the quality of outputs.
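For illustration, a minimal sketch of this proxy-quality approach is given below. It is not the system evaluated in [7]: the DataFrame layout, the feature and label names, and the choice of scikit-learn's GradientBoostingClassifier are assumptions made here for the example only.

```python
# Sketch: predict journal impact tertiles (a quality proxy) from bibliometric features.
# Assumes a pandas DataFrame with one row per article; all column names are hypothetical.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

FEATURES = ["normalised_citations", "n_authors", "n_affiliations",
            "n_countries", "article_length", "abstract_readability"]

def train_proxy_quality_model(df: pd.DataFrame):
    """Train a classifier of journal impact tertile (0 = low, 1 = medium, 2 = high)."""
    X, y = df[FEATURES], df["journal_impact_tertile"]  # proxy label, not a true quality score
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)
    model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    baseline = y_test.value_counts(normalize=True).max()  # majority-class accuracy
    return model, accuracy, baseline
```

As argued above, whether such a model exceeds the majority-class baseline by a useful margin appears to depend heavily on the field, so accuracy and baseline should be reported together for each field.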
Preliminary unpublished experiments to predict article quality with machine learning applied
to tens of thousands of human quality scores (high, medium, low) for articles in 27 Scopus broad
fields suggest that the highest accuracy is possible for the following biomedical and physical
science Scopus broad fields: Multidisciplinary; Biochemistry, Genetics and Molecular Biology;
Physics and Astronomy; Chemistry. In contrast, this task is most difficult or impossible in the
following Scopus broad fields: Engineering; Agricultural and Biological Sciences; Psychology;
Social Sciences; Environmental Science; Energy; Arts and Humanities; Dentistry; Nursing; and
Pharmacology & Toxicology. Thus, the potential for harnessing machine learning for article
quality prediction may be restricted to the biomedical and physical sciences.
Based on previous studies on predicting citation counts [8, 9, 10, 11], the following rec-
ommendations are made for a ranking system to reflect article quality in fields where it is
possible.
• Journal impact thirds, quartiles or other groupings can be used as the target of a machine
learning system in fields in which journal impact is a reasonable indicator of article quality
(medicine, health, physical sciences, economics, psychology) but not in areas where
citations have little value (engineering, other social sciences, arts and humanities). This
could be replaced by post-publication or peer review scores when they become available
in sufficient numbers. If this replacement is made, then a journal impact indicator could
become an input.
• Machine learning should be applied to data segmented into narrow coherent fields to
give the algorithms the chance to learn field-specific quality patterns.
• Inputs should be field and year normalised (e.g., not raw citation counts but normalised
variants such as the Mean Normalised Citation Score (MNCS) or the Mean Normalised
Log-transformed Citation Score (MNLCS); see the formulas after this list) so that related
fields and years can be combined to gain sufficient training data.
• Valuable inputs include all of those shown to associate with citation rates:
(normalised) article citations, number of authors, number of institutional affiliations,
article length, number of country affiliations, career publishing statistics of the authors,
and abstract readability.
• Text inputs, such as words and phrases used in the article title and abstract, may reveal
important topics, although these are more relevant to citations than to quality. They may also
point to high-quality methods (e.g., randomised controlled trials) and identify more subtle
indicators of high-quality work, such as appropriate hedging or shared data/code (see the
sketch at the end of this section). If full text can be analysed, then factors like the number
of figures and tables in a paper may be useful in judging the amount of evidence supporting
the article in some fields.
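To make the normalisation recommendation concrete, the per-article quantities underlying the MNCS and MNLCS can be written as follows; the notation is introduced here for illustration rather than taken from the cited sources.

```latex
% Field- and year-normalised citation inputs for article i in field f and year y.
% NCS_i underlies the MNCS; NLCS_i (log-transformed to reduce skew) underlies the MNLCS.
\[
  \mathrm{NCS}_i = \frac{c_i}{\frac{1}{|A_{f,y}|}\sum_{j \in A_{f,y}} c_j},
  \qquad
  \mathrm{NLCS}_i = \frac{\ln(1 + c_i)}{\frac{1}{|A_{f,y}|}\sum_{j \in A_{f,y}} \ln(1 + c_j)}
\]
```

Here c_i is the citation count of article i and A_{f,y} is the set of comparable articles published in the same field f and year y; the MNCS (or MNLCS) of a set of articles is then the mean of its articles' NCS (or NLCS) values, so either per-article score can be used directly as a field- and year-normalised input.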
The above approach is clearly quite citation-dependent but at least moves one step away
from a pure reliance on citations.
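To show how several of the recommendations might fit together in practice, the sketch below combines per-field training, already-normalised numeric inputs and title/abstract text features. It describes no existing system; the column names and the scikit-learn components are assumptions made for illustration only.

```python
# Sketch: one quality classifier per narrow field, combining text and numeric inputs.
# Assumes a pandas DataFrame with hypothetical columns: "narrow_field", "title_abstract",
# field/year-normalised numeric features, and a "quality_score" label (e.g. high/medium/low).
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

NUMERIC = ["nlcs", "n_authors", "n_countries", "abstract_readability"]

def build_pipeline() -> Pipeline:
    """TF-IDF words/phrases from the title and abstract plus normalised numeric inputs."""
    features = ColumnTransformer([
        ("text", TfidfVectorizer(max_features=5000, ngram_range=(1, 2)), "title_abstract"),
        ("numeric", "passthrough", NUMERIC),  # assumed already field/year normalised
    ])
    return Pipeline([("features", features),
                     ("clf", GradientBoostingClassifier(random_state=0))])

def train_per_field(df: pd.DataFrame) -> dict:
    """Fit a separate model for each narrow field so field-specific patterns can be learned."""
    models = {}
    for field, group in df.groupby("narrow_field"):
        models[field] = build_pipeline().fit(group, group["quality_score"])
    return models
```

Training a separate model per narrow coherent field, as recommended above, gives the text features a chance to capture field-specific markers of quality rather than generic topic popularity.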
References
[1] Oriensubulitermes inanis [pseudonym], PubPeer comment, https://pubpeer.com/publications/940C291607CF03969C6A936F8BA5B9#2, 2022.
[2] D. Kershaw, B. Pettit, M. Hristakeva, K. Jack, Learning to rank research articles: A case
study of collaborative filtering and learning to rank in ScienceDirect, in: Proceedings BIR
2020, 2020, pp. 75–88. URL: http://ceur-ws.org/Vol-2591/paper-08.pdf.
[3] M. Thelwall, E. Papas, Z. Nyakoojo, L. Allen, V. Weigert, Automatically detecting open academic review praise and criticism, Online Information Review 44 (2020). doi:https://doi.org/10.1108/OIR-11-2019-0347.
[4] J. L. Ortega, Classification and analysis of PubPeer comments: How a web journal club is used, Journal of the Association for Information Science and Technology (2021). doi:https://doi.org/10.1002/asi.24568.
[5] G. Cabanac, C. Labbé, Prevalence of nonsensical algorithmically generated papers in the
scientific literature, Journal of the Association for Information Science and Technology 72
(2021) 1461–1476.
[6] G. Cabanac, C. Labbé, A. Magazinov, Tortured phrases: A dubious writing style emerging
in science. Evidence of critical issues affecting established journals, 2021. arXiv preprint
arXiv:2107.06751.
[7] M. Thelwall, Can the quality of published academic journal articles be assessed with machine learning?, Quantitative Science Studies (2022). doi:https://doi.org/10.1162/qss_a_00185.
[8] A. Abrishami, S. Aliakbary, Predicting citation counts based on deep neural network
learning techniques, Journal of Informetrics 13 (2019) 485–499.
[9] Y. H. Hu, C. T. Tai, K. E. Liu, C. F. Cai, Identification of highly-cited papers using topic-
model-based and bibliometric features, Journal of Informetrics 14 (2020).
[10] J. Xu, M. Li, J. Jiang, M. Cai, Early prediction of scientific impact based on multi-
bibliographic features and convolutional neural network, IEEE Access (2019) 92248–92258.
[11] T. van Dongen, G. Wenniger, L. Schomaker, SChuBERT: Scholarly document chunks with BERT-encoding boost citation count prediction, 2020. arXiv preprint arXiv:2012.11740.