May Generative AI Be a Reviewer on an Academic Paper?⋆

Haichen Zhou1,∗, Xiaorong Huang1, Hongjun Pu1, and Qi Zhang2

1 National Science Library (Chengdu), Chinese Academy of Sciences, Chengdu, 610041, China
2 Nanjing University, Nanjing, 210023, China


Abstract
The application of artificial intelligence (AI) to academic evaluation is one of the important topics within the academic community. The widespread adoption of technologies such as Generative AI (GenAI) and Large Language Models appears to have introduced new opportunities for academic evaluation. Whether GenAI has the capability to perform academic evaluations, and what differences exist between its abilities and those of human experts, is the primary question that needs to be addressed. In this study, we developed a set of evaluation criteria and processes to investigate 853 post-publication peer-reviewed papers in the field of cell biology, aiming to observe the differences in scoring and comment styles between GenAI and human experts. We found that the scores given by GenAI tend to be higher than those given by experts, and that its evaluation texts lack substantive content. The results indicate that GenAI is currently unable to provide the depth of understanding and subtle analysis provided by human experts.

Keywords
academic evaluation, Generative AI, large language models, Copilot, ChatGPT



1. Introduction

How to use AI for more objective, accurate, and efficient academic evaluation has become an important research topic [1][2]. Generative AI (GenAI) is a novel technology that uses artificial intelligence to generate content in various forms [3][4]. In the context of academic evaluation, GenAI offers a new possibility for automating academic evaluation by generating evaluation content [5]. Comparing the evaluations of human experts with those generated by GenAI is an intuitive way to better understand the effectiveness and reliability of GenAI. However, there is still a lack of research on the quality of the content generated by GenAI and on whether it differs from the content produced by human experts. Answering these questions provides a basis for deciding whether GenAI can match the depth of understanding and subtle analysis provided by human experts, which areas GenAI excels in, and where it may need further improvement. Hence, we focus on analyzing the differences between human expert evaluations and GenAI evaluations. We aim to answer the following research questions:

RQ1: Can GenAI conduct academic evaluations?
RQ2: What differences exist between the scoring results of GenAI and human experts?
RQ3: What differences exist between the evaluation text features of GenAI and human experts?

Joint Workshop of the 5th Extraction and Evaluation of Knowledge Entities from Scientific Documents and the 4th AI + Informetrics (EEKE-AII2024), April 23~24, 2024, Changchun, China and Online
∗ Corresponding author.
zhouhc@clas.ac.cn (H. Zhou); huangxiaorong@clas.ac.cn (X. Huang); puhj@clas.ac.cn (H. Pu); zhang@smail.nju.edu.cn (Q. Zhang)
0000-0002-3366-1951 (H. Zhou); 0000-0002-9164-0585 (X. Huang); 0000-0003-4787-519X (H. Pu); 0000-0001-5401-2275 (Q. Zhang)
© 2024 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).




2. Method

Our research methodology includes the following steps:

    1. Select papers from H1 Connect (connect.h1.co) as research cases and establish selection criteria.
    2. Generate a list of papers to be collected.
    3. Obtain data such as paper title, DOI, expert score, and review text to form the original dataset.
    4. Design evaluation dimensions and a scoring system for the research field of our dataset.
    5. Generate the Copilot question template (prompt).
    6. Use the template to ask questions and collect Copilot scores and evaluation text data.
    7. Compare the differences in scores and texts between Copilot and experts.

2.1. Data Preparation

To minimize the influence of various factors on the evaluation results, such as differences in evaluation standards across fields, newly published papers not yet having received sufficient attention, and differences in evaluation preferences among experts, we limited the research field to Cell Biology. We focused on papers from cell biology published in 2020 that received one evaluation. We collected data on 853 papers (as of May 2022) from H1 Connect, a leading platform for researchers and clinicians seeking expert opinions and insights on the latest life sciences and medical research. We collected key information about each paper, including paper title, authors, journal, DOI, PMID, and recommended score.
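To make the selection criteria concrete, the filter below sketches how the collected metadata could be reduced to the study dataset with pandas. It is illustrative only; the file name and column names (field, year, n_evaluations, and so on) are hypothetical placeholders for the H1 Connect export, not the actual schema used in this study.

import pandas as pd

# Hypothetical columns mirroring the metadata collected from H1 Connect:
# title, authors, journal, doi, pmid, field, year, n_evaluations, expert_score, review_text
papers = pd.read_csv("h1_connect_export.csv")

# Selection criteria described in Section 2.1 (illustrative filter):
# Cell Biology papers published in 2020 with one expert evaluation.
dataset = papers[
    (papers["field"] == "Cell Biology")
    & (papers["year"] == 2020)
    & (papers["n_evaluations"] == 1)
]

dataset.to_csv("original_dataset.csv", index=False)  # 853 papers in our case
print(len(dataset))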

2.2. Question Template Design

We designed an evaluation system specifically for the field of Cell Biology to enhance the relevance and reliability of the content generated by Copilot. By summarizing the review principles of top journals in this field, such as "Nature Reviews Molecular Cell Biology", "Trends in Cell Biology", "The Journal of Cell Biology", "Nature Cell Biology", and "Journal of Molecular Cell Biology", we extracted the following evaluation dimensions. Copilot is required to evaluate each paper on these dimensions, provide a recommendation score, and finally give a comprehensive evaluation.
    After several rounds of testing, the final template (prompt) for querying Copilot was established as follows, with the PubMed ID included to help Copilot accurately target information on the internet:

    I have summarized a set of criteria for evaluating academic papers:
    • Originality: The paper must report novel, innovative and influential research that does not repeat or plagiarize existing work.
    • Accuracy: The paper must follow high standards of experimental design, data analysis and result presentation, without errors, biases or misleading.
    • Conceptual advance: The paper must provide a deep understanding and mechanistic explanation of an important problem or area, not just superficial or incremental improvements.
    • Timeliness: The paper must reflect the current hot topics in the scientific community.
    • Significance: The paper must have immediate or long-term impact and implications.
    I also have a recommended scoring system: 1 star (Good), 2 stars (Very Good), 3 stars (Exceptional). You're acting as a scientist. I'll give you a PubMed ID for the paper. First, please display the title of the paper and search the web site. No abstract is required. Second, please according to my criteria and scoring system for evaluation and scoring the paper; Third, please according to my scoring system for the overall evaluation and scoring of the paper.
    Pubmed ID: XXXXXXXX
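As an illustration of steps 5 and 6, the sketch below shows how such a template could be assembled programmatically for each paper. The helper build_prompt and the constants are our own illustrative names, not part of any Copilot API; in this study the prompt was submitted through the Copilot chat interface, so submission itself is left out.

# Illustrative prompt assembly for one paper (the criteria mirror Section 2.2).
CRITERIA = """\
- Originality: The paper must report novel, innovative and influential research that does not repeat or plagiarize existing work.
- Accuracy: The paper must follow high standards of experimental design, data analysis and result presentation, without errors, biases or misleading.
- Conceptual advance: The paper must provide a deep understanding and mechanistic explanation of an important problem or area, not just superficial or incremental improvements.
- Timeliness: The paper must reflect the current hot topics in the scientific community.
- Significance: The paper must have immediate or long-term impact and implications."""

SCORING = "1 star (Good), 2 stars (Very Good), 3 stars (Exceptional)"

def build_prompt(pmid: str) -> str:
    """Assemble the evaluation prompt for a single paper (illustrative helper)."""
    return (
        "I have summarized a set of criteria for evaluating academic papers:\n"
        f"{CRITERIA}\n"
        f"I also have a recommended scoring system: {SCORING}. "
        "You're acting as a scientist. I'll give you a PubMed ID for the paper. "
        "First, please display the title of the paper and search the web site. No abstract is required. "
        "Second, please according to my criteria and scoring system for evaluation and scoring the paper; "
        "Third, please according to my scoring system for the overall evaluation and scoring of the paper.\n"
        f"Pubmed ID: {pmid}"
    )

# Example: build the prompt for one paper and paste it into the Copilot chat window.
print(build_prompt("XXXXXXXX"))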




2.3. Collection of Evaluation Results

The process of collecting Copilot evaluation results is shown in Figure 1.

Figure 1: Flow chart of collecting Copilot evaluation results.

3. Result and Discussion

RQ1: Can GenAI conduct academic evaluations?
GenAI can conduct academic evaluations and, in terms of form, produce readable results.

RQ2: What differences exist between the scoring results of GenAI and human experts?
Observing the score distribution ratio (Figure 2), papers scored 3 stars by experts account for only 15%, while those scored 2 stars and those scored 1 star each account for around 40%. In contrast, Copilot's 3-star evaluations account for more than 60%, its 2-star evaluations account for 32.72%, and its 1-star evaluations account for less than 1%.

Figure 2: Scoring ratio between Copilot and expert.

Comparing the scoring results of Copilot and the experts (Figure 3), we observe that Copilot's scores are higher overall, indicating that it tends to give higher scores to most papers. The average score given by Copilot is 2.68 stars, while the average score given by experts is 1.76 stars. This is consistent with the experimental results of Mike Thelwall on 51 papers [2]. The fact that over 60% of papers are scored 3 stars by Copilot suggests that it may not yet possess the core ability to accurately distinguish high-value academic papers.

Figure 3: Scoring box plot between Copilot and expert.

Observing the source of the score differences (Figure 4), we find that most papers scored 1 star by experts were scored 3 stars by Copilot, with a small portion scored 2 stars. Nearly half of the papers scored 2 stars by experts were scored 3 stars by Copilot, and the other half were scored 2 stars. Most papers scored 3 stars by experts were also scored 3 stars by Copilot. In summary, the main disagreements occur for papers scored 1 star by experts, while disagreement is lowest for papers scored 3 stars by experts.

Figure 4: The source of score differences between Copilot and expert.
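For reference, the distribution ratios, mean scores, and score cross-tabulation reported above can be reproduced with a few lines of pandas. The sketch below assumes a hypothetical file paired_scores.csv with one row per paper and columns expert_score and copilot_score (1-3 stars); the file and column names are placeholders, not our actual data files.

import pandas as pd

# Hypothetical paired table: one row per paper, expert and Copilot scores (1-3 stars).
scores = pd.read_csv("paired_scores.csv")

# Score distribution ratios (Figure 2).
expert_dist = scores["expert_score"].value_counts(normalize=True).sort_index()
copilot_dist = scores["copilot_score"].value_counts(normalize=True).sort_index()
print(expert_dist, copilot_dist, sep="\n")

# Average scores (Figure 3): 1.76 stars for experts vs. 2.68 stars for Copilot in our data.
print(scores["expert_score"].mean(), scores["copilot_score"].mean())

# Source of score differences (Figure 4): how each expert score maps to Copilot scores.
crosstab = pd.crosstab(scores["expert_score"], scores["copilot_score"], normalize="index")
print(crosstab)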
RQ3: What differences exist between the evaluation text features of GenAI and human experts?
From the perspective of sentences, both the number of sentences and the average sentence length in the Copilot texts are smaller than those in the expert texts, but the difference is not significant (Figure 5).
    From a lexical perspective, the overall proportions of word types in the two groups do not differ significantly, with Copilot tending to use more adjectives (Figure 6 and Appendix A). The high-frequency words used by experts better reflect professionalism and specificity, such as "cell", "protein", and "cancer". In contrast, the high-frequency words used by Copilot are more general, such as "significant" (Table 1).

Figure 5: The number of sentences and the average sentence length between Copilot and expert.

Figure 6: The proportion of word types between Copilot and expert (the meaning of the abbreviations is given in Appendix A).
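The sentence-level and lexical features discussed above can be extracted with spaCy, whose part-of-speech labels are listed in Appendix A. The sketch below is a minimal, illustrative version of that analysis; the sample texts and the choice to measure sentence length in tokens are assumptions, not the exact procedure used in this study.

from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

def text_features(texts):
    """Sentence counts, average sentence length, POS proportions, and top word frequencies."""
    n_sentences = 0
    n_tokens = 0
    pos_counts = Counter()
    word_counts = Counter()
    for doc in nlp.pipe(texts):
        sentences = list(doc.sents)
        n_sentences += len(sentences)
        n_tokens += sum(len(s) for s in sentences)  # sentence length measured in tokens
        for token in doc:
            if token.is_alpha:
                pos_counts[token.pos_] += 1          # ADJ, NOUN, VERB, ... (see Appendix A)
                if not token.is_stop:
                    word_counts[token.lemma_.lower()] += 1
    avg_sentence_length = n_tokens / n_sentences if n_sentences else 0
    total = sum(pos_counts.values())
    pos_ratio = {pos: count / total for pos, count in pos_counts.items()}
    return n_sentences, avg_sentence_length, pos_ratio, word_counts.most_common(10)

# Example with hypothetical review texts; in practice the two lists would hold the
# collected Copilot and expert evaluation texts.
copilot_texts = ["This paper reports novel and significant findings in an important field."]
expert_texts = ["The authors show that the protein regulates cell cycle progression in cancer cells."]
print(text_features(copilot_texts))
print(text_features(expert_texts))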




Table 1
Word counts between Copilot and expert

Word (Copilot)   Counts   Word (Expert)   Counts
paper            2539     cell            1091
research         749      protein         697
novel            677      study           481
report           560      author          436
provide          461      cancer          309
high             444      bind            295
field            421      gene            285
implication      394      expression      244
cell             354      increase        243
significant      335      role            240

4. Conclusion

In this study, we collected post-publication peer-review scores and review texts for 853 papers in the field of Cell Biology from the H1 Connect website. We also obtained scores and evaluation texts for each of these papers from Copilot, based on our designed evaluation criteria and process. By comparing the evaluation results of Copilot and the experts through quantitative analysis and text mining methods, we found that Copilot can score and evaluate papers under specific prompts. From the scoring perspective, there is a significant difference in the scoring patterns of Copilot and the experts: the former tends to give higher star ratings, and the high proportion of 3-star ratings suggests that it does not yet have sufficient ability to judge the actual value of a paper. From the text perspective, Copilot's shorter sentences and more generic wording indicate that its evaluation is still at the stage of imitating the surface features of evaluation texts, and it cannot yet carry out substantive evaluations of the originality, accuracy, and other core elements of a paper.
    Overall, GenAI, represented by Copilot, is currently unable to provide the depth of understanding and subtle analysis offered by human experts. It should not yet be used for academic evaluation, as its tendency to over-evaluate may lead to the proliferation of low-quality academic results [6]. This study has several limitations. First, there is a disparity between the experts and Copilot: unlike the experts, Copilot could not access the complete paper texts. Second, we did not perform iterative testing, nor did we use the mean of multiple expert assessments in the analysis. Third, the evaluation criteria used by Copilot and by the experts are not fully consistent. These limitations may introduce inaccuracies into the research outcomes. We aim to address these deficiencies in the next phase of our analysis.

Acknowledgements

This work was supported by the key project of the innovation fund of the National Science Library (Chengdu), Chinese Academy of Sciences (E3Z0000902). We sincerely appreciate the insightful comments and constructive suggestions provided by the reviewers, which have significantly contributed to the improvement of our manuscript.

References

[1] W. Liang, Y. Zhang, H. Cao, et al., Can large language models provide useful feedback on research papers? A large-scale empirical analysis, 2023. URL: http://arxiv.org/abs/2310.01783.
[2] M. Thelwall, Can ChatGPT evaluate research quality?, 2024. URL: http://arxiv.org/abs/2402.05519.
[3] Bloomberg, Generative AI to Become a $1.3 Trillion Market by 2032, Research Finds, 2023. URL: https://www.bloomberg.com/company/press/generative-ai-to-become-a-1-3-trillion-market-by-2032-research-finds/.
[4] Gartner, Understand and Exploit GenAI with Gartner's New Impact Radar, 2024. URL: https://www.gartner.com/en/articles/understand-and-exploit-gen-ai-with-gartner-s-new-impact-radar.
[5] J. de Winter, Can ChatGPT be used to predict citation counts, readership, and social media interaction? An exploration among 2222 scientific abstracts, Scientometrics (2024).
[6] M. B. Garcia, Using AI tools in writing peer review reports: should academic journals embrace the use of ChatGPT?, Annals of Biomedical Engineering 52 (2024) 139-140.

Appendix

A. Abbreviations and Examples of SpaCy Parts-of-Speech

ADJ -- adjective: *big, old, green*
ADP -- adposition: *in, to, during*
ADV -- adverb: *very, tomorrow, where*
AUX -- auxiliary: *is, has (done), will (do)*
CCONJ -- coordinating conjunction: *and, or, but*
DET -- determiner: *a, an, the*
NOUN -- noun: *girl, cat, tree, air, beauty*
PUNCT -- punctuation: *., (, ), ?*
VERB -- verb: *run, runs, running, eat, ate, eating*