<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Enhancing Essay Argument Persuasiveness Prediction Using a RoBERTa-LSTM Hybrid Model</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Fahad M. Alzaidee</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tommy Yuan</string-name>
          <email>tommy.yuan@york.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Peter Nightingale</string-name>
          <email>peter.nightingale@york.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Khaled El Ebyary</string-name>
          <email>khaled.elebyary@york.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>University of York</institution>
          ,
          <addr-line>Heslington, York YO10 5DD</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Over the past five decades, automated essay scoring has been a significant focus in both research and industry, capturing the interest of the NLP community due to its potential to provide valuable educational tools that save time for educators worldwide. The persuasiveness of arguments is a key aspect of argumentative essay quality. Despite its significance, however, argument persuasiveness has often been overlooked in research, with most studies still in their infancy. In this paper, we introduce several neural models aimed at enhancing the prediction of the persuasiveness score of an argument. Our proposed model improves prediction accuracy compared with the approach suggested in [1].</p>
      </abstract>
      <kwd-group>
<kwd>Natural Language Processing</kwd>
        <kwd>Persuasiveness scoring</kwd>
        <kwd>Argument evaluation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Work on argument persuasiveness, closely related to our study, has been explored by several researchers.
Persing and Ng [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] identify elements that weaken persuasiveness, while Stab and Gurevych [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
investigate the adequacy of argument support. Al Khatib et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] annotate a news editorial corpus with
detailed argumentative discourse units to examine persuasive strategies. Schaefer et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] identify key
factors that influence the persuasiveness of a text, including the usage patterns of argument components,
the structure of the essay, the flow and sequence of argument types, as well as the impact of the essay
prompt and the individual author’s style. Additionally, Wachsmuth et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] identify and annotate
15 dimensions—logical, rhetorical, and dialectical—relevant for automatically evaluating argument
quality. Ke et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
] introduce a bidirectional LSTM model with attention
mechanisms to score metrics such as persuasiveness, specificity, and strength
on their annotated dataset of 102 essays [8]. Toledo et al. [9] publish a new dataset with arguments
annotated for quality and compare arguments in pairs to determine the stronger one. They utilise a
BERT language model to generate numerical representations for the words in both arguments and then
fine-tune it for the classification and ranking tasks. In 2023, another study [10] used the PERSUADE
dataset to predict persuasiveness ratings for discourse elements based on their type labels. Previous
studies, such as [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], that evaluate the overall persuasiveness of an entire essay often provide generalized
feedback, lacking the granularity needed to highlight specific areas for improvement. On the other
hand, the study in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] offers feedback on different traits impacting the persuasiveness of various essay
sections but still demonstrates only modest performance.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>The aspect of persuasiveness in essays has been annotated in several available datasets, such as [11],
[12], and [8]. However, datasets [11] and [12] lack the granularity required to train our model effectively.
Therefore, we decided to use dataset [8], which comprises 102 essays from the Annotated Essays
corpus by Stab and Gurevych [13]. Each essay is annotated with an argument tree that mirrors the natural
flow of argumentative essays, avoiding cycles and maintaining clarity. These trees typically have
three to four levels, beginning with the Major Claim, followed by Claims and Premises that support or
challenge their parent nodes. The dataset includes 1,459 components: 185 Major Claims, 567 Claims,
and 707 Premises, each of which is assigned various score metrics, including persuasiveness, which
is the focus of our work. The Krippendorff's α values for the persuasiveness annotations (0.739 for Major
Claims, 0.701 for Claims, and 0.552 for Premises) indicate that the dataset is well suited for training our
model. The dataset was split into training and testing sets, with the training set comprising 80% of the
essays.</p>
      <p>To address the dataset's limited size, we identified all possible arguments in each set, generating 1,459
distinct arguments. We created two different sequences for each argument: one based on the order of
appearance in its original essay and the other using postorder traversal. Inspired by [14], we enriched
each component with lexical and structural features, as illustrated in Figure ??, increasing the maximum
length from 58 to 85 words. Additionally, we paraphrased each argument component in the training set
five times using a fine-tuned ChatGPT paraphraser based on T5 (Text-to-Text Transfer Transformer). This
resulted in four forms of input data: plain and enriched arguments using their order of appearance, and
plain and enriched arguments using postorder traversal.</p>
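The postorder linearisation described above can be sketched as follows. This is a minimal illustration: the node class, field names, and toy tree are ours, not the dataset's actual format.

```python
# Sketch: linearising an argument tree into a postorder component sequence,
# so that supporting/attacking children precede the component they relate to.

class ArgNode:
    def __init__(self, text, children=None):
        self.text = text
        self.children = children or []

def postorder(node):
    """Visit all children (Premises/Claims) before their parent component."""
    seq = []
    for child in node.children:
        seq.extend(postorder(child))
    seq.append(node.text)
    return seq

# Toy three-level tree: Premises support Claims, Claims support the Major Claim.
tree = ArgNode("Major Claim", [
    ArgNode("Claim 1", [ArgNode("Premise 1"), ArgNode("Premise 2")]),
    ArgNode("Claim 2"),
])

print(postorder(tree))
# ['Premise 1', 'Premise 2', 'Claim 1', 'Claim 2', 'Major Claim']
```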
      <p>Figure: (a) Plain argument; (b) Enriched argument.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Proposed Methodology</title>
      <p>In this study, we design four different neural models and compare their accuracy with two baseline
models. For the baselines, we fine-tuned the Longformer language model and built a second model
based on the Hierarchical BERT Model (HBM) framework [15], designed for classifying long documents
with limited labelled data. In the HBM-based model, we identify the argument components in each
argument and independently convert their tokens, which may be words or subwords, into numerical
vectors with the RoBERTa encoder. We use mean pooling to average these vectors, creating a single
representation for each argument component. The pooled vectors are then input into the
sentence-level Hierarchical BERT encoder, generating an intermediate representation of the entire
argument. We adapted this model to predict persuasiveness scores as continuous values by adding
a sigmoid activation function after the linear layer, multiplying its output by 6, and rounding to the
nearest integer.</p>
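The scoring head just described can be sketched as a scalar function (the function name is ours; in the model this operates on the linear layer's output):

```python
import math

def score_head(logit):
    """Map a raw scalar model output to an integer persuasiveness
    score in [0, 6]: sigmoid, scale by 6, round to nearest integer."""
    prob = 1.0 / (1.0 + math.exp(-logit))  # sigmoid -> (0, 1)
    return int(round(6.0 * prob))          # scale to [0, 6] and round
```

A logit of 0 maps to the midpoint score 3; large negative and positive logits saturate at 0 and 6 respectively.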
      <p>We designed four neural models, illustrated in Figures ?? and ??, each combining a transformer model with
an LSTM layer. RoBERTa and Longformer serve as embedding layers that generate a representation
for each argument component, while the LSTM layer captures the dependencies among
the argument components. We refer to these models as LONG-LSTM, LONG-LSTM-TAG, ROB-LSTM,
and ROB-LSTM-TAG. The term "TAG" in a model name indicates that the model uses a multi-output
approach. In LONG-LSTM and LONG-LSTM-TAG, the tokens of all argument components are embedded
in a single pass using Longformer. The token embeddings are extracted and mean-pooled to create a single
representation for each component, which is then fed into an LSTM layer. In LONG-LSTM, the final
hidden state is passed through a linear layer followed by a sigmoid activation function, producing a
persuasiveness score between 0 and 6: the sigmoid output is scaled by 6 and rounded to the nearest integer,
after which the MAE is computed. In LONG-LSTM-TAG, the hidden state of each encoded argument component
is used to predict that component's persuasiveness score. ROB-LSTM and ROB-LSTM-TAG follow the same process,
except that each argument component is encoded independently using RoBERTa.</p>
      <p>Figure: (a) LONG-LSTM; (b) LONG-LSTM-TAG.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Experiment Setup</title>
      <p>We begin by randomly dividing our training set into five parts and perform five-fold cross-validation. In
each experiment, four parts are used for training and one for development. After each iteration, we test
each model and compute the MAE and PC on the rounded predicted scores. The overall performance of
the models is determined by averaging the metrics from all five iterations.</p>
      <p>Figure: (a) ROB-LSTM; (b) ROB-LSTM-TAG.</p>
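The split-and-rotate procedure above can be sketched as follows (pure Python; the item count and seed are illustrative):

```python
import random

def five_fold(items, seed=0):
    """Yield five (train, dev) splits: shuffle once, partition into five
    folds, then rotate each fold out as the development set in turn."""
    items = list(items)
    random.Random(seed).shuffle(items)
    folds = [items[i::5] for i in range(5)]
    for k in range(5):
        dev = folds[k]
        train = [x for i, f in enumerate(folds) if i != k for x in f]
        yield train, dev

# e.g. 80 training essays (the 80% split mentioned in Section 3)
splits = list(five_fold(range(80)))
```

Per-model metrics would then be averaged over the five (train, dev) iterations.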
      <p>For training the HBM-based model, we used a learning rate of 2 × 10<sup>−5</sup>, a dropout rate of 0.01, 50
epochs, a batch size of 4, and the Adam optimizer with a learning rate decay of 1 × 10<sup>−8</sup>. For the other
models, we used a learning rate of 1 × 10<sup>−3</sup>, a dropout rate of 0.7, 50 epochs with a batch size of 4, and
the Adam optimizer.</p>
      <p>
        To evaluate the models' prediction accuracy, we used the Mean Absolute Error (MAE) and Pearson's
Correlation Coefficient (PC), computed after rounding the predicted scores. MAE measures the average
distance between the predicted and actual scores, while PC reflects the consistency and directionality of the
predictions. We use MAE instead of the Mean Squared Error (MSE) for its equal treatment of errors
and reduced sensitivity to outliers. Additionally, it facilitates direct comparison with the models in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
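The two metrics, with rounding applied before scoring as described, can be sketched as (the function name is ours):

```python
import numpy as np

def mae_and_pc(pred, gold):
    """MAE and Pearson correlation, computed on rounded predictions."""
    pred = np.rint(np.asarray(pred, dtype=float))   # round to nearest integer
    gold = np.asarray(gold, dtype=float)
    mae = np.abs(pred - gold).mean()                # mean absolute error
    pc = np.corrcoef(pred, gold)[0, 1]              # Pearson's r
    return mae, pc
```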
    </sec>
    <sec id="sec-6">
      <title>6. Evaluating the Effectiveness of Models</title>
      <p>Table 1 presents a summary of the evaluation experiments conducted on the test set. The leftmost
column lists the various modeling approaches, while the top row identifies the different types of input
data used. The HBM-based model shows the strongest correlations (PC) for plain arguments (0.309)
and plain arguments (Postorder) (0.322). For rich arguments, the Fine-tuned Longformer achieves the
highest correlation (0.702), while the ROB-LSTM-TAG model has the highest correlation (0.696) for rich
arguments (Postorder).</p>
      <p>The table clearly illustrates that incorporating rich features alongside argument content significantly
improves model performance. Among the models, the ROB-LSTM stands out for its balanced
performance, achieving a low MAE of 0.743 and a high PC of 0.691 in the ‘Rich Argument (Postorder)’ category.
This suggests the potential effectiveness of leveraging hierarchical dependencies between argument
components to predict persuasiveness. We also trained all models on the training set after paraphrasing
each argument component, but this did not lead to improved results when tested on the test set.</p>
      <p>The improvement in the RoBERTa-based models is attributed to generating representations for each
argument component independently, which reduces noise from other components in the same argument.
In contrast, encoding the entire argument using the Longformer introduces more noise. Adding an
LSTM layer further helps by separating the content of the argument component from its contextual
dependency, thus reducing noise.</p>
      <p>
        In comparison to the closest related work by [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which was also trained on the same dataset
and reported an MAE of 0.983 and a PC of 0.353, our ROB-LSTM model demonstrates a substantial
improvement on both metrics.
      </p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion and Future Work</title>
      <p>In this preliminary study, we explored different models to predict the persuasiveness score of
arguments with varying complexity and structures. The RoBERTa-LSTM model demonstrated a balanced
performance where it achieved a low MAE and a relatively high PC. The addition of rich features and
the consideration of hierarchical order relations highlighted the potential benefits of these factors in
improving persuasiveness prediction.</p>
      <p>A significant challenge we face is incorporating the types of relationships between arguments within
an essay. We aim to understand how the persuasiveness of lower-level arguments in the argument tree
affects the overall persuasiveness of related higher-level arguments. This understanding is crucial for
developing a feedback component in our system that effectively leverages these relationships.</p>
      <p>
[8] W. Carlile, N. Gurrapadi, Z. Ke, V. Ng, Give me more feedback: Annotating argument persuasiveness
and related attributes in student essays, in: Proceedings of the 56th Annual Meeting of the
Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 621–631.
[9] A. Toledo, S. Gretz, E. Cohen-Karlik, R. Friedman, E. Venezian, D. Lahav, M. Jacovi, R. Aharonov,
N. Slonim, Automatic argument quality assessment–new datasets and methods, arXiv preprint
arXiv:1909.01007 (2019).
[10] Y. Hicke, T. Tian, K. Jha, C. H. Kim, Automated essay scoring in argumentative writing:
DeBERTeachingAssistant, arXiv preprint arXiv:2307.04276 (2023).
[11] C. Stab, I. Gurevych, Recognizing insufficiently supported arguments in argumentative essays,
in: M. Lapata, P. Blunsom, A. Koller (Eds.), Proceedings of the 15th Conference of the European
Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, Association
for Computational Linguistics, Valencia, Spain, 2017, pp. 980–990. URL: https://aclanthology.org/
E17-1092.
[12] S. Li, V. Ng, ICLE++: Modeling fine-grained traits for holistic essay scoring, in: K. Duh, H. Gomez,
S. Bethard (Eds.), Proceedings of the 2024 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies (Volume 1: Long
Papers), Association for Computational Linguistics, Mexico City, Mexico, 2024, pp. 8465–8486.</p>
      <p>URL: https://aclanthology.org/2024.naacl-long.468. doi:10.18653/v1/2024.naacl-long.468.
[13] C. Stab, I. Gurevych, Annotating argument components and relations in persuasive essays, in:
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics:
Technical Papers, 2014, pp. 1501–1510.
[14] U. Mushtaq, J. Cabessa, Argument classification with bert plus contextual, structural and syntactic
features as text, in: International Conference on Neural Information Processing, Springer, 2022,
pp. 622–633.
[15] J. Lu, M. Henchion, I. Bacher, B. Mac Namee, A sentence-level hierarchical bert model for document
classification with limited labelled data, in: Discovery Science: 24th International Conference, DS
2021, Halifax, NS, Canada, October 11–13, 2021, Proceedings, Springer, 2021, pp. 231–241.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Carlile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Gurrapadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <article-title>Learning to give feedback: Modeling attributes affecting argument persuasiveness in student essays</article-title>
          ,
          <source>in: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>4130</fpage>
          -
          <lpage>4136</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Wambsganss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Niklaus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cetto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Söllner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Handschuh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Leimeister</surname>
          </string-name>
          ,
          <article-title>AL: An adaptive learning support system for argumentation skills</article-title>
          ,
          <source>in: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>14</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>I.</given-names>
            <surname>Persing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <article-title>Why can't you convince me? Modeling weaknesses in unpersuasive arguments</article-title>
          ,
          <source>in: Proceedings of the 26th International Joint Conference on Artificial Intelligence</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>4082</fpage>
          -
          <lpage>4088</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Stab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          <article-title>Recognizing insufficiently supported arguments in argumentative essays</article-title>
          ,
          <source>in: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics</source>
          ,
          <year>2017</year>
          . URL: https://api.semanticscholar.org/CorpusID:6801402.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>K.</given-names>
            <surname>Al Khatib</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wachsmuth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kiesel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hagen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <article-title>A news editorial corpus for mining argumentation strategies</article-title>
          ,
          <source>in: Proceedings of COLING</source>
          <year>2016</year>
          ,
          <source>the 26th International Conference on Computational Linguistics: Technical Papers</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>3433</fpage>
          -
          <lpage>3443</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>Schaefer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Knaebel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stede</surname>
          </string-name>
          ,
          <article-title>Towards fine-grained argumentation strategy analysis in persuasive essays</article-title>
          , in: M. Alshomary, C.-C. Chen, S. Muresan, J. Park, J. Romberg (Eds.),
          <source>Proceedings of the 10th Workshop on Argument Mining</source>
          , Association for Computational Linguistics, Singapore,
          <year>2023</year>
          , pp.
          <fpage>76</fpage>
          -
          <lpage>88</lpage>
          . URL: https://aclanthology.org/2023.argmining-1.8. doi:10.18653/v1/2023.argmining-1.8.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wachsmuth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Naderi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bilu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Prabhakaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. Alberdingk</given-names>
            <surname>Thijm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Hirst</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <article-title>Computational argumentation quality assessment in natural language</article-title>
          ,
          <source>in: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1</source>, <year>2017</year>, pp. <fpage>176</fpage>-<lpage>187</lpage>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>