<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Are Large Language Models Better Peer-Reviewers Than Humans? An Early Investigation on OpenReview</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gianluca Bonifazi</string-name>
          <email>g.bonifazi@univpm.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christopher Buratti</string-name>
          <email>c.buratti@pm.univpm.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michele Marchetti</string-name>
          <email>michele.marchetti@univpm.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Federica Parlapiano</string-name>
          <email>f.parlapiano@pm.univpm.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Davide Traini</string-name>
          <email>davide.traini@unimore.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Domenico Ursino</string-name>
          <email>d.ursino@univpm.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Virgili</string-name>
          <email>luca.virgili@univpm.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CHIMOMO, University of Modena and Reggio Emilia</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>DII, Polytechnic University of Marche</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>In recent years, Large Language Models (LLMs) have often been used by paper reviewers, despite this practice being generally prohibited. This has raised, and continues to raise, issues concerning ethics, review reliability, and the risk of review manipulation. Indeed, several arXiv preprints were recently discovered to contain invisible, LLM-targeted instructions designed to persuade an AI reviewer to yield a positive review. In this paper, we propose a systematic analysis of LLMs' review capabilities in this complex and evolving scenario. In particular, we want to address two research questions: (i) How do LLM ratings compare with human ratings?, and (ii) Can hidden positive prompts injected in a manuscript alter an LLM's generated review? To address these questions, we created a dataset of 400 papers from OpenReview. For each paper, this dataset contains the human reviews and scores already present in OpenReview, as well as reviews that we generated with three state-of-the-art LLMs. Our results show that human reviewers assign higher and more widely dispersed scores that clearly distinguish accepted and rejected papers. In contrast, LLM ratings cluster close to their mean value, blurring the distinction between accepted and rejected papers. Furthermore, a negative prompt given by the reviewer makes the LLM lower its scores, while a hidden positive prompt injected by the author often fails to raise scores and, if detected by the LLM, sometimes even triggers lower scores. These results reveal both the potential and the fragility of delegating peer review tasks to LLMs.</p>
      </abstract>
      <kwd-group>
        <kwd>Generative Artificial Intelligence</kwd>
        <kwd>Large Language Model</kwd>
        <kwd>Peer Review</kwd>
        <kwd>Prompt Injection</kwd>
        <kwd>OpenReview</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In recent years, Generative Artificial Intelligence (GenAI) and, in particular, Large Language Models
(LLMs) have begun reshaping both everyday life and professional practice. These systems can now tackle
a wide range of complex tasks. From personalized tutoring to decision support in healthcare [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ],
their rapid spread is opening up many new possibilities while forcing researchers and practitioners to
reconsider established assumptions. Academia has likewise felt the impact of LLMs. In fact, researchers
use these tools at multiple stages of the research process, from drafting manuscripts to polishing prose
and checking references [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]. While this can unlock new opportunities, it also introduces new risks.
For instance, the authors of [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] asked GPT to write abstracts given a title and a target journal. They
demonstrated that GPT can produce scientifically credible abstracts that, however, contain invented
data. Another emerging issue is the use of LLMs to review scientific papers. For instance, the authors
of [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] compared human reviews with GPT-generated reviews for a machine learning conference. They
found that, while GPT can deliver reasonably high-quality feedback, important shortcomings remain.
The authors of [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] highlight GPT’s potential in the review process when evaluating language, enabling
reviewers to focus on content. However, they also acknowledge the risk of generating inaccurate,
irrelevant, or useless comments. Finally, the authors of [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] state that GPT cannot replace human
reviewers, since they did not find a significant overlap between human and GPT reviews.
      </p>
      <p>Despite the well-documented problems associated with using LLMs for paper review, these tools
are being employed more frequently, even though this practice is generally prohibited. This raises
issues involving ethics, review reliability, and vulnerability to manipulation. For instance, at least
17 arXiv preprints were recently found to contain invisible, LLM-targeted instructions designed to
persuade AI reviewers to issue favorable reviews (https://asia.nikkei.com/Business/Technology/Artificial-intelligence/Positive-review-only-Researchers-hide-AI-prompts-in-papers). These instructions included hidden “accept” prompts
in manuscripts to ensure higher scores. This episode shows that some authors are already exploiting
the fact that some reviewers delegate their task to LLMs.</p>
      <p>Our paper is motivated by the examination of this scenario and aims to contribute to the investigation
of this phenomenon. Specifically, it aims to address two research questions. The first (RQ1) asks how
closely scores generated by state-of-the-art LLM reviewers align with those of human reviewers. The
second (RQ2) tests whether a hidden prompt injection embedded by the author or an explicit negative
prompt provided by a reviewer can bias an LLM when it reviews a paper.</p>
      <p>To address these research questions, we built a dataset of 400 papers from OpenReview (https://openreview.net/), a public peer
review platform used by top venues. We first identified A* conferences whose main-track submissions
were accompanied by publicly available peer reviews. These conferences covered the period from
2021 to 2024, spanning the years immediately before and after the release of GPT. For each paper, we
collected the PDF file, the complete set of human reviews, and the corresponding scores. In this way, we
obtained matched text-and-rating data across the entire acceptance spectrum. Next, we obtained three
reviews from LLMs for each paper using the same template employed by humans. For this purpose,
we selected three widely adopted LLMs, i.e., GPT-4o mini, Gemini 1.5 Flash, and Gemini 2.0 Flash. To
answer RQ1, we compared the scores returned by the LLMs with those returned by humans. To answer
RQ2, we asked the three LLMs to review the papers again after providing them with a hidden positive
author prompt and/or a negative reviewer prompt.</p>
      <p>The main results we obtained are the following:
• Human reviewers provide higher and more dispersed scores than LLMs, and their ratings more
clearly distinguish accepted papers from rejected ones. In contrast, LLM scores are very close to
the mean, which blurs this distinction.
• A negative reviewer prompt generally pushes each model toward lower overall ratings.
• A hidden “accept” author prompt is only effective with certain models. Specifically, GPT-4o mini is
particularly susceptible, while Gemini 2.0 Flash, and partially Gemini 1.5 Flash, resist manipulation.</p>
      <p>Interestingly, when the LLM recognized the injection, it penalized the corresponding paper.</p>
      <p>The rest of this paper is organized as follows: Section 2 describes the methodology used to address the
research questions. Section 3 presents the empirical results. Finally, Section 4 draws some conclusions
and highlights some possible future developments of our work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <p>In this section, we describe the methodology used to answer the two research questions of interest for
this paper. In particular, Section 2.1 details the construction of our dataset. Section 2.2 outlines the
methodology used to address RQ1. Finally, Section 2.3 illustrates the procedure employed to address
RQ2.</p>
      <sec id="sec-2-1">
        <title>2.1. Dataset</title>
        <p>To answer our research questions, we built a dataset based on OpenReview. This is an open source
platform that supports transparent scholarly peer review. It makes key steps of the review process, such
as referee reports, author rebuttals, and community comments, publicly accessible under fine-grained
access controls. Top conferences, such as the International Conference on Learning Representations
(ICLR), the Annual Conference on Neural Information Processing Systems (NeurIPS), and the
International Conference on Empirical Methods in Natural Language Processing (EMNLP), rely on this
platform to manage double-blind or open-identity reviews, facilitating threaded discussions and
real-time review tracking.</p>
        <p>To build our dataset, we first identified some major conferences whose main-track submissions were
accompanied by publicly available peer reviews in the period 2021–2024, i.e., in the years immediately
before and after the release of GPT. Specifically, we focused on ICLR and NeurIPS, as they met this
criterion. For each conference-year pair, we randomly selected 25 accepted and 25 rejected papers,
yielding 50 papers per pair and 400 papers in total. Including both accepted and rejected papers allowed
us to capture the entire spectrum of review scores, from the low scores typically assigned to rejected
papers to the high scores generally assigned to the accepted ones. Additionally, we downloaded the
PDF file, the complete set of human reviews, and the post-rebuttal scores for each paper. This ensures
that all the scores we analyzed align with the exact PDF version archived on OpenReview. Finally, we
took the mean of the scores provided by human reviewers.</p>
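<p>The balanced sampling step described above can be sketched in a few lines. This is a hypothetical illustration that uses placeholder paper identifiers in place of the submissions retrieved from OpenReview, since the paper does not detail the retrieval code.</p>

```python
import random

# Hypothetical sketch of the balanced sampling: for each conference-year
# pair, draw 25 accepted and 25 rejected papers (50 per pair, 400 total).
def sample_pair(accepted_ids, rejected_ids, n=25, seed=0):
    rng = random.Random(seed)  # fixed seed for reproducibility
    return rng.sample(accepted_ids, n) + rng.sample(rejected_ids, n)

dataset = {}
for conference in ("ICLR", "NeurIPS"):
    for year in (2021, 2022, 2023, 2024):
        # Placeholder id lists standing in for the real submission pools.
        accepted = [f"{conference}{year}-acc-{i}" for i in range(300)]
        rejected = [f"{conference}{year}-rej-{i}" for i in range(300)]
        dataset[(conference, year)] = sample_pair(accepted, rejected)

print(sum(len(papers) for papers in dataset.values()))  # prints 400
```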
      </sec>
      <sec id="sec-2-2">
        <title>2.2. RQ1: Comparing Human and LLM Reviews</title>
        <p>To answer the first research question, we asked three models, namely GPT-4o mini, Gemini 1.5 Flash,
and Gemini 2.0 Flash, to review each paper in the dataset. To this end, we used a prompt that instructed
the LLM to act as a rigorous A* conference reviewer. The prompt also instructed it to follow a fixed
review template consisting of: (i) a summary; (ii) a score from 1 to 4 for soundness, presentation,
and contribution; (iii) a list of strengths and weaknesses; (iv) an overall score from 1 to 10; (v) the
LLM’s confidence in the topic of the paper to be reviewed. Then, we instructed the LLM to apply
an acceptance rate in line with that of the reference conference, and to directly reject papers that
were not technically sound, were poorly presented, or lacked a substantial and original contribution.
For each combination of conference, year, and model (or human), we computed the mean, standard
deviation, and skewness of the overall scores. These descriptive statistics capture the central tendency,
dispersion, and asymmetry in the review scores provided by humans and LLMs, making cross-year and
cross-conference comparisons straightforward.</p>
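<p>These descriptive statistics can be computed with a short stdlib sketch. The paper does not state which skewness estimator was used, so the plain moment-based (Fisher) definition is assumed here.</p>

```python
import math

def describe(scores):
    """Mean, standard deviation, and moment-based skewness of review scores.

    Minimal sketch; the skewness estimator is an assumption, as the paper
    does not specify which variant it used.
    """
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / n
    std = math.sqrt(var)
    skew = sum((s - mean) ** 3 for s in scores) / (n * std ** 3)
    return mean, std, skew
```

<p>For example, a symmetric sample such as [1, 2, 3, 4, 5] yields a skewness of 0, while a pile of low scores with a few high outliers yields positive skewness, the pattern the paper reports for Gemini 1.5 Flash.</p>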
        <p>
          We applied two non-parametric tests suitable for ordinal and non-normally distributed ratings.
Specifically, we used:
• The Wilcoxon signed-rank test [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] to compare the score distributions returned by human and
LLM reviewers. In particular, the null hypothesis of this test is that the two distributions are
equal; as usual, the null hypothesis is rejected if the corresponding p-value is less than 0.05.
• The Mann–Whitney U test [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] to examine the extent to which the scores assigned to accepted
and rejected papers differed for each conference-year pair. In particular, the null hypothesis of
this test is that the scores assigned to accepted and rejected papers are equal; if the corresponding
p-value is less than 0.05, the null hypothesis is rejected.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. RQ2: Analyzing Prompt Injection and Reviewer Coercions in LLM Reviews</title>
        <p>To answer the second research question, we first created a second version of the PDF file of each paper.
In each page of this file, we embedded a multi-sentence instruction encouraging strong acceptance
of the paper. We made the text white with a six-point font so that it would remain invisible to human
readers while still being parseable by LLMs. This strategy is similar to the one observed in the 17 arXiv
papers mentioned in the Introduction.</p>
        <p>We then asked the LLMs to review the papers under four settings, namely:
1. Original: The LLM received the original PDF file of the paper, and the review request did not
include any forcing.
2. Injected: The LLM received the manipulated PDF file of the paper with the hidden positive author
prompt, but the review request prompt remained neutral.
3. Negative: The LLM received the original PDF file of the paper, but the reviewer provided the LLM
with a prompt requesting it to recommend rejection and assign a low overall score.
4. Negative Injected: This setting involved the use of the modified PDF file, as in the second setting,
and the prompt requesting rejection, as in the third setting.</p>
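<p>The four settings form a 2×2 design over the PDF variant and the reviewer prompt. A sketch of this design follows; the file names and prompt strings are illustrative placeholders, not the exact prompts used in the study.</p>

```python
# 2x2 design for RQ2: PDF variant (original / injected) crossed with
# reviewer prompt (neutral / negative). Strings are placeholders.
NEUTRAL = "Review the paper following the template."
NEGATIVE = NEUTRAL + " Recommend rejection and assign a low overall score."

SETTINGS = {
    "Original":          {"pdf": "original.pdf", "prompt": NEUTRAL},
    "Injected":          {"pdf": "injected.pdf", "prompt": NEUTRAL},
    "Negative":          {"pdf": "original.pdf", "prompt": NEGATIVE},
    "Negative Injected": {"pdf": "injected.pdf", "prompt": NEGATIVE},
}

for name, cfg in SETTINGS.items():
    print(name, cfg["pdf"])
```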
        <p>We summarized the overall ratings for each conference, year, model, and setting by calculating the
mean, standard deviation, and skewness. To determine whether the hidden positive author prompts
and/or the negative reviewer prompts altered the scoring behavior, we performed the Wilcoxon
signed-rank test separately for each conference-reviewer pair.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <p>This section presents the results of our study. Specifically, Section 3.1 details the results for RQ1 and
Section 3.2 reports those for RQ2.</p>
      <sec id="sec-3-1">
        <title>3.1. RQ1: Comparing Human and LLM Reviews</title>
        <p>First, we computed the distribution of the paper scores, grouping the results by conference, in order to
quantify the difference in scores between human and LLM reviewers. Figure 1 shows the area charts of
the paper scores. Papers are divided by conference: 200 relate to ICLR and 200 to NeurIPS.</p>
        <p>From the analysis of this figure, we can see that human reviewers tend to use a wider range of scores
(from 1 to 9) for papers, while LLMs tend to assign ratings within a more limited range (from 3 to 7).
To confirm this first insight, we computed the mean, standard deviation, and skewness of the score
distributions for human and LLM reviewers. The results are shown in Table 1.</p>
        <p>Table 1 (mean and standard deviation of the overall scores). ICLR: Human 5.70 ± 1.33, GPT-4o mini 5.53 ± 0.83, Gemini 1.5 Flash 4.38 ± 0.87, Gemini 2.0 Flash 4.50 ± 1.11. NeurIPS: Human 5.42 ± 0.97, GPT-4o mini 5.50 ± 0.83, Gemini 1.5 Flash 4.87 ± 1.17, Gemini 2.0 Flash 4.88 ± 1.01.</p>
        <p>Table 1 reveals that human reviewers tend to give higher scores than LLM reviewers at both
conferences, with only a marginal exception at NeurIPS, where GPT-4o mini surpasses human reviewers
by a small amount. Human scores also have the lowest negative skewness. GPT-4o mini is the most
consistent model, as indicated by its minimum standard deviation, which suggests that its scores are tightly
clustered around the mean. Gemini 1.5 Flash is the most critical model because it has the lowest mean
and the strongest positive skewness. These results stem from numerous low scores and a few high
scores. Gemini 2.0 Flash is also critical because its scores have a low mean; however, the distribution of
its scores is bell-shaped, as evidenced by its skewness close to 0.</p>
        <p>Figure 1: Area charts of the paper scores (Rating vs. Paper Index) for ICLR and NeurIPS.</p>
        <p>We then performed a two-sided Wilcoxon signed-rank test to compare the score distributions assigned
by human and LLM reviewers. We performed this comparison separately for each conference and for
each LLM. The results are shown in Table 2.</p>
        <sec id="sec-3-1-5">
          <p>Table 2 (Wilcoxon test statistic for the comparison between human and LLM scores). ICLR: GPT-4o mini 7,736.50, Gemini 1.5 Flash 1,535.00, Gemini 2.0 Flash 2,276.00. NeurIPS: GPT-4o mini 8,653.00, Gemini 1.5 Flash 5,670.00, Gemini 2.0 Flash 5,026.50.</p>
          <p>As shown in Table 2, the score distributions returned by LLMs are statistically different from those
returned by humans in almost all cases, as indicated by p-values less than 0.05. The only exception is
GPT-4o mini in NeurIPS. A p-value close to 0.05 was also found for the same model in ICLR. Therefore,
GPT-4o mini is the LLM that provides the most human-like evaluations. Cross-referencing these data
with those in Table 1 reveals that Gemini 1.5 Flash and Gemini 2.0 Flash provide evaluations that differ
significantly from human ones. The scores they assign are significantly lower than those provided by
humans. In contrast, in the case of GPT-4o mini, the scores returned by the LLM are close to those returned
by humans. In particular, the mean scores are slightly lower for ICLR papers and slightly higher for
NeurIPS papers.</p>
          <p>We then refined the analysis by separating accepted papers from rejected ones to see if the score
patterns differed between the two groups. Accepted papers should have received higher scores, while
rejected papers should have received lower scores. To this end, we computed the violin plots of the
scores returned by humans and each LLM, distinguishing between accepted and rejected papers. The
corresponding results are illustrated in Figure 2.</p>
          <p>Figure 2: Violin plots of the scores (Rating) returned by humans and each LLM, split by Decision (Accept vs. Reject).</p>
          <p>As this figure shows, there are significant differences in human scores between accepted and rejected
papers. In contrast, the LLMs’ score distributions largely overlap and cover nearly identical ranges.</p>
          <p>To verify whether the gap in scores between accepted and rejected papers was statistically
significant, we compared the score distributions of accepted and rejected papers for each
conference-year pair. Since the two samples were non-overlapping, we used the Mann–Whitney U test. The results
are presented in Table 3.</p>
          <p>From the analysis of this table, we observe that, in all cases, the score distributions for accepted
and rejected papers returned by humans are statistically different. In contrast, in 62.5% of the cases, the
distributions of scores returned by LLMs for accepted and rejected papers are not statistically different.
These results confirm the observation that LLM reviews show minimal variation across papers, whereas
human reviews show greater divergence between accepted and rejected papers.</p>
        </sec>
        <sec id="sec-3-2">
          <title>3.2. RQ2: Analyzing Prompt Injection and Reviewer Coercions in LLM Reviews</title>
          <p>After comparing the reviews from humans and LLMs, we examined the latter in more detail. First, we
verified whether an LLM’s behavior could be manipulated by embedding a hidden prompt in the PDF
file of a paper, encouraging the LLM reviewer to give a positive review. Since some researchers have
already used this trick, we wanted to verify whether it could fool the LLM into accepting a paper or at
least giving it a higher score.</p>
          <p>To test this hypothesis, we injected positive prompts into the PDF file of each paper and asked the
LLMs to review it again, giving them the modified PDF file as input. Figure 3 presents the distributions
of scores that the LLMs returned for the original papers and those with the injected prompts.</p>
          <p>The analysis of the figure shows that the number of papers with a score of 7 increases significantly
for GPT-4o mini, both in ICLR and in NeurIPS. For instance, the number of papers rated 7 after injection
doubles in ICLR. Moreover, the highest score the model assigns without injection is 7 in both conferences,
whereas, after injection, the model assigns a score of 8 five times. While there are still cases where
the model assigns low ratings, their number decreases with injection. Interestingly, for two papers
submitted to ICLR, the model assigned scores of 1 and 3 after the injection. We therefore checked the
corresponding reviews and found that the model penalized the paper with a score of 1 because it noticed
the author prompt injected into it. Gemini 1.5 Flash tends to assign higher scores when the paper is
injected. In fact, the score of 7 appears 17 times more often in ICLR, while the score of 4 appears 32
times less often. Additionally, there are some scores higher than 7, albeit in a limited number. The same
occurs in NeurIPS, although to a lesser extent. As for Gemini 2.0 Flash, there is no noticeable increase
in model ratings when the paper is injected. There are a few cases where the model assigns a score of 8.
The distributions without and with injection considerably overlap in both conferences, indicating that
the model is not deceived by the injected prompts.</p>
          <p>Figure 3 panels: (a) GPT-4o mini vs. GPT-4o mini with positive injection; (b) Gemini 1.5 Flash vs. Gemini 1.5 Flash with positive injection; (c) Gemini 2.0 Flash vs. Gemini 2.0 Flash with positive injection. Each panel compares the Original and Injected rating distributions.</p>
          <p>To further verify this initial finding, we calculated the mean, standard deviation, and skewness of
the LLMs’ score distributions separately for ICLR and NeurIPS. Additionally, we applied the Wilcoxon
signed-rank test to determine if there were statistically significant differences between the injected and
the original scores. The results are reported in Table 4.</p>
          <p>From the analysis of this table, we can observe that there is a clear increase in the mean with the
prompt injection in the case of GPT-4o mini. We also observe an increase in the standard deviation,
which can be explained by the fact that the model assigns both very high and very low values with
prompt injection, which never happens in the original case. As for Gemini 1.5 Flash, we observe
increases in the mean and standard deviation in both conferences when prompts are injected. This is
because the model’s evaluations include high scores in this last case, which were not present originally.
As we have seen before, Gemini 2.0 Flash is not affected by prompt injection, and its score distributions
without and with it are similar. Examining the Wilcoxon signed-rank test results in the table provides
further confirmation of our previous conclusions. In fact, for GPT-4o mini and Gemini 1.5 Flash, the
p-value is less than 0.05. This allows us to conclude that the distributions of scores without and with
injected prompts are statistically different. Conversely, for Gemini 2.0 Flash, the p-value does not allow
us to reject the null hypothesis, indicating that the distributions of scores without and with injected
prompts are not statistically different.</p>
          <p>After demonstrating that positive author prompt injection can cause an LLM to significantly alter its
score in some cases, we investigated what happens when a reviewer provides the model with a prompt
asking for a negative evaluation of the paper. Figure 4 compares the score distributions provided by
the models in the original case and in the presence of the negative prompt. The analysis of this figure
reveals that the negative prompt causes all the LLMs to provide lower scores. For instance, GPT-4o
mini rates most of the papers with a score of 4 in case of a negative prompt, whereas it often assigns a
score of 5 or 6 in the original case. Moreover, when the model is provided with the negative prompt,
the highest score is 6, which is also assigned in very few cases. In contrast, without a negative prompt,
the highest score provided by the same model is 7. Both Gemini variants assign a score of 3 to many
papers in the presence of a negative prompt. This score is rarely assigned to papers in their original
reviews. This demonstrates the strong influence of the negative reviewer prompt on the evaluation
of these two models. The analysis of the statistics in Table 5 confirms this conclusion. In fact, all the
means are lower with a negative prompt, and the differences between the score distributions without
and with the negative prompt are statistically significant, as evidenced by the p-values less than 0.05.</p>
          <p>Figure 4: Count distributions of the ratings in the Original and Negative settings for each model.</p>
          <p>After analyzing the effects of the hidden positive prompt inserted by the author and the explicit
negative prompt provided by the reviewer separately, we performed a new analysis to determine the
outcome of providing these two prompts simultaneously. To this end, we calculated the score
distributions: (i) with only the negative reviewer prompt, and (ii) with both the hidden positive author
prompt and the negative reviewer prompt. The results are reported in Figure 5.</p>
          <p>The analysis of this figure shows that only GPT-4o mini reacts to the hidden positive author prompt
by softening its evaluation. Specifically, the number of ICLR papers with a score of 4 decreases by about
a third, whereas the number of NeurIPS papers with a score of 4 decreases by about a quarter. Most
papers have a score of 5 or 6, and some even a score of 7 or 8. In contrast, Gemini 2.0 Flash does not seem
influenced by the hidden positive author prompt, as the differences without and with it are minimal.
Gemini 1.5 Flash exhibits intermediate behavior. In fact, in NeurIPS the distributions of scores without
and with a hidden positive author prompt are nearly identical; in ICLR, instead, there is a slight increase
in scores for some papers. These results suggest that, for these two LLMs, the negative reviewer prompt
is much more influential than the hidden positive author prompt, which is often irrelevant.</p>
          <p>Also in this case, we calculated the mean, standard deviation, and skewness of each score distribution
and applied the Wilcoxon signed-rank test. The results are shown in Table 6. From the analysis of this
table, we can see that there is a significant difference in mean values between the two score distributions
under consideration in the case of GPT-4o mini. This difference is smaller for Gemini 1.5 Flash and
much smaller for Gemini 2.0 Flash. Regarding the Wilcoxon signed-rank test, we note that: (i) the
p-values for GPT-4o mini are much lower than 0.05, indicating that the two distributions are statistically
different; (ii) the p-values for Gemini 1.5 Flash and Gemini 2.0 Flash are greater than 0.05, suggesting
that the two distributions are not statistically different. These conclusions confirm the results obtained
from examining Figure 5 and the mean values analyzed above.</p>
          <p>Figure 5 panels compare, for each model, the Negative and Negative Injected settings; panel (c) shows Gemini 2.0 Flash with the negative prompt vs. Gemini 2.0 Flash with the negative prompt and positive injection.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>In this paper, we investigated the behavior of LLMs when used for peer review. To this end, we
constructed a dataset of 400 papers from OpenReview and asked three state-of-the-art LLMs (i.e.,
GPT-4o mini, Gemini 1.5 Flash, and Gemini 2.0 Flash) to review each paper. Our study was guided by
two research questions, namely:
• (RQ1) How do LLM scores compare to human scores?
• (RQ2) Can reviewer prompts or hidden author prompts influence the model’s evaluation?
Our investigation yielded three main insights, namely:
• Human reviewers tend to provide higher and more dispersed scores, which clearly distinguish
accepted papers from rejected ones. In contrast, LLM scores tend to cluster around the mean.
• A negative reviewer prompt generally pushes all the models toward lower overall ratings.
• A hidden positive author prompt raises the scores of some models, notably GPT-4o mini, while
Gemini 2.0 Flash, and partially Gemini 1.5 Flash, resist the manipulation; when a model detects
the injection, it may even penalize the paper.</p>
      <p>These results underscore both the potential and the fragility of delegating peer review tasks to LLMs.</p>
      <p>Our study on the behavior of LLMs when reviewing papers is not an endpoint. In fact, it paves the
way for several future developments. For instance, we plan to shift our focus from numeric ratings to
qualitative outputs to investigate how LLMs describe strengths and weaknesses, and how they present
their overall recommendations. As in this paper, the ultimate goal is to compare the behavior of
LLMs and humans when reviewing a paper. Additionally, we plan to explore new forms of prompts that
may influence LLM behavior. Indeed, rather than hiding instructions within the main PDF file of a paper,
one could inject positive prompts into other fields, such as metadata, references, or supplementary files.
It would be interesting to test whether LLMs can easily be fooled in these cases. Finally, we plan to
explore the potential of “defensive” injections, which would allow authors opposed to AI-based reviewing
to inject prompts designed to halt or confuse an LLM, preventing it from evaluating their paper.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>We acknowledge the support of the project “MEraviglIA - Metodologie didattiche inclusive ed
Intelligenza Artificiale” (J11I24000700009) under the PR Marche FSE+ 2021/2027 funded by Regione Marche.
This work is also partially supported by the project SERICS (CUP H73C22000880001 – PE000000014)
under the MUR National Recovery and Resilience Plan funded by the European Union -
NextGenerationEU.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration of competing interest</title>
      <p>The authors declare that they have no known competing financial interests or personal relationships
that could have appeared to influence the work reported in this paper.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leung</surname>
          </string-name>
          ,
          <article-title>Empowering student self-regulated learning and science education through ChatGPT: A pioneering pilot study</article-title>
          ,
          <source>British Journal of Educational Technology</source>
          <volume>55</volume>
          (
          <year>2024</year>
          )
          <fpage>1328</fpage>
          -
          <lpage>1353</lpage>
          . Wiley Online Library.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Lo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jong</surname>
          </string-name>
          ,
          <article-title>Exploring the application of ChatGPT in ESL/EFL education and related research issues: A systematic review of empirical studies</article-title>
          ,
          <source>Smart Learning Environments</source>
          <volume>11</volume>
          (
          <year>2024</year>
          ) 50. Springer.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Thirunavukarasu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ting</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Elangovan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gutierrez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.F.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ting</surname>
          </string-name>
          ,
          <article-title>Large language models in medicine</article-title>
          ,
          <source>Nature Medicine</source>
          <volume>29</volume>
          (
          <year>2023</year>
          )
          <fpage>1930</fpage>
          -
          <lpage>1940</lpage>
          . Nature.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hosseini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Horbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Holmes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ross-Hellauer</surname>
          </string-name>
          ,
          <article-title>Open Science at the generative AI turn: An exploratory analysis of challenges and opportunities</article-title>
          ,
          <source>Quantitative Science Studies</source>
          <volume>6</volume>
          (
          <year>2025</year>
          )
          <fpage>22</fpage>
          -
          <lpage>45</lpage>
          . MIT Press.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>Eke</surname>
          </string-name>
          ,
          <article-title>ChatGPT and the rise of generative AI: Threat to academic integrity?</article-title>
          ,
          <source>Journal of Responsible Technology</source>
          <volume>13</volume>
          (
          <year>2023</year>
          )
          <fpage>100060</fpage>
          . Elsevier.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Howard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Markov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Dyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pearson</surname>
          </string-name>
          ,
          <article-title>Comparing scientific abstracts generated by ChatGPT to real abstracts with detectors and blinded human reviewers</article-title>
          ,
          <source>NPJ Digital Medicine</source>
          <volume>6</volume>
          (
          <year>2023</year>
          )
          <fpage>75</fpage>
          . Nature.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Robertson</surname>
          </string-name>
          ,
          <article-title>Gpt4 is slightly helpful for peer-review assistance: A pilot study</article-title>
          ,
          <source>arXiv preprint arXiv:2307.05492</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>V.</given-names>
            <surname>Mehta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mathur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Anjali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fiorillo</surname>
          </string-name>
          ,
          <article-title>The application of ChatGPT in the peer-reviewing process</article-title>
          ,
          <source>Oral Oncology Reports</source>
          <volume>9</volume>
          (
          <year>2024</year>
          )
          <fpage>100227</fpage>
          . Elsevier.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Saad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Jenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ariyaratne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Birch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. P.</given-names>
            <surname>Iyengar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Davies</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Vaishya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Botchu</surname>
          </string-name>
          ,
          <article-title>Exploring the potential of ChatGPT in the peer review process: an observational study</article-title>
          ,
          <source>Diabetes &amp; Metabolic Syndrome: Clinical Research &amp; Reviews</source>
          <volume>18</volume>
          (
          <year>2024</year>
          )
          <fpage>102946</fpage>
          . Elsevier.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D.</given-names>
            <surname>Rey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Neuhäuser</surname>
          </string-name>
          ,
          <article-title>Wilcoxon-signed-rank test</article-title>
          , in:
          <source>International encyclopedia of statistical science</source>
          ,
          <year>2011</year>
          , pp.
          <fpage>1658</fpage>
          -
          <lpage>1659</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>T.W.</given-names>
            <surname>MacFarland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.M.</given-names>
            <surname>Yates</surname>
          </string-name>
          ,
          <article-title>Mann-Whitney U test</article-title>
          , in:
          <source>Introduction to nonparametric statistics for the biological sciences using R</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>103</fpage>
          -
          <lpage>132</lpage>
          . Springer.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>