<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evaluating the Aspect-Category-Opinion-Sentiment Analysis Task on a Custom Dataset</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Loris Di Quilio</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabio Fioravanti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DEc, University of Chieti-Pescara</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this work, we report the results of some experiments with Aspect Based Sentiment Analysis (ABSA) on a dataset consisting of user reviews of products of a manufacturing company operating in the packaging industry. We focus on one of the more challenging ABSA tasks, the Aspect Category Opinion Sentiment task, and compare the results obtained by using three different tools available in the literature. We have also performed experiments for assessing the improvements that could be obtained by using larger models and similarity measures.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>• category (c): a pre-defined category related to a specific domain of interest. For
example, AMBIENCE, PRICE, and FOOD can be categories for the restaurant domain.
• aspect term (a): the specific opinion target explicitly mentioned in the
provided text. For instance, in the sentence “The pizza is delicious but the service is terrible”,
the explicit aspects are “pizza” and “service”. When the aspect is implicit, as in the sentence “it’s
very reasonably priced”, where the subject is not explicitly named, we use a “NULL” label.
• polarity (p): the sentiment orientation expressed towards an aspect
category or an aspect term. Sentiment polarity falls into one of three categories: positive,
negative, or neutral, indicating whether the sentiment is favorable, unfavorable, or neither,
respectively.
• opinion term (o): the word or words used by the opinion holder to convey their
sentiment or feelings about the target entity or aspect. For example, in the sentence “The
pizza is delicious but the service is terrible”, “delicious” and “terrible” are opinion terms,
expressing a positive sentiment toward the pizza and a negative sentiment toward the service, respectively.</p>
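      <p>The four elements above can be bundled into one record per annotation; a minimal Python sketch (the Quad type and its field names are illustrative, not taken from any of the tools discussed later):
```python
from typing import NamedTuple

class Quad(NamedTuple):
    """One ACOS annotation: aspect (a), category (c), polarity (p), opinion (o)."""
    aspect: str      # "NULL" when the target is implicit
    category: str
    polarity: str    # "positive", "negative" or "neutral"
    opinion: str

sentence = "The pizza is delicious but the service is terrible"
quads = [
    Quad("pizza", "food", "positive", "delicious"),
    Quad("service", "service", "negative", "terrible"),
]
print([q.polarity for q in quads])  # ['positive', 'negative']
```
      </p>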
      <p>Among the tasks of Aspect-based Sentiment Analysis that aim to predict a single sentiment
element, there are:
• Aspect Term Extraction (ATE);
• Aspect Category Detection (ACD);
• Opinion Term Extraction (OTE);
• Aspect opinion co-extraction (AOCE);
• Aspect Sentiment Classification (ASC).</p>
      <p>The tasks where multiple sentiment elements are predicted include:
• Aspect-Opinion Pair Extraction (AOPE);
• End-to-End ABSA (E2E-ABSA);
• Aspect Category Sentiment Analysis (ACSA);
• Aspect Sentiment Triplet Extraction (ASTE);
• Aspect Category Sentiment Detection (ACSD);
• Aspect Category Opinion Sentiment (ACOS).</p>
      <p>Below, we show a summary of the tasks using the input sentence “The pizza is delicious
but the service is terrible”.</p>
      <sec id="sec-1-1">
        <title>Task overview</title>
        <p>Task | Input | Output
ATE | sentence | pizza (a), service (a)
ACD | sentence | food (c), service (c)
OTE | sentence | delicious (o), terrible (o)
ASC | sentence, pizza | positive (p)
ASC | sentence, service | negative (p)
AOPE | sentence | {pizza (a), delicious (o)}, {service (a), terrible (o)}
E2E ABSA | sentence | {pizza (a), positive (p)}, {service (a), negative (p)}
ACSA | sentence | {food (c), positive (p)}, {service (c), negative (p)}
ASTE | sentence | {pizza (a), positive (p), delicious (o)}, {service (a), negative (p), terrible (o)}
ACSD | sentence | {food (c), pizza (a), positive (p)}, {service (c), service (a), negative (p)}
ACOS | sentence | {pizza (a), food (c), delicious (o), positive (p)}, {service (a), service (c), terrible (o), negative (p)}</p>
        <p>In this paper, we focus on the ACOS task, which aims at predicting all the
sentiment elements at once, namely category (c), aspect term (a), polarity (p), and opinion term
(o). For the ACOS task, a relatively limited body of research and literature exists. Our primary
objective is to establish an integrated framework that leverages multiple tools for efficient ACOS
task execution.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Dataset and annotation tool</title>
      <p>The dataset used in this work is based on user reviews about skincare and pharmaceutical
products supplied by a manufacturing company. The reviews have been scraped from
e-commerce sites, and some of them have been annotated using an open-source tool named Label
Studio2. The annotations have been curated by one of the authors of the article, with a dual-stage
revision process to ensure their reliability. These annotations differ from the datasets
available in the literature because they include numerous implicit aspects
related to the supplied product, and opinion terms frequently composed of multiple words. The
dataset (Table 2) comprises 756 sentences and 1038 annotations, with the possibility of each
sentence having multiple annotations.</p>
      <sec id="sec-2-5">
        <title>Dataset composition</title>
        <p>Split | Sentences | Annotations
Train | 623 | 881
Test | 133 | 157
Total | 756 | 1038</p>
        <p>The annotations appear to be balanced with regard to sentiment polarity
(p); neutral sentiment is not counted because predicting neutrality is not of interest in this
case. As regards the categories, 13 classes were identified, encompassing both general and
specific aspects of product performance.</p>
        <p>The distribution of classes is mostly balanced, with the exception of the category pertaining
to “general satisfaction of the final consumer” which happens to be the most frequent one.</p>
        <p>For this work, a custom template in Label Studio was built, which allows all elements to be
annotated for each review. In Figure 1 we show an example of a sentence annotated on this
annotation tool: the explicitly mentioned aspect and opinion elements can be directly selected
in the text, while the polarity and the category, which is not shown, can be chosen from the
predefined ones.</p>
        <p>
          A translation module has been developed to convert the JSON encoding of the dataset exported
from Label Studio to other formats, including those of the considered tools for the ACOS task,
and the SemEval-2014 [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] and SemEval-2016 [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] formats.
        </p>
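        <p>As an illustration of such a translation step, the sketch below walks a Label Studio-style JSON export and collects the labeled spans per sentence. The field names (data/text, annotations, result, value, labels) follow Label Studio’s generic export format but are assumptions here; the real converter must match the custom template described above:
```python
def export_to_spans(tasks):
    """Collect labeled text spans from a Label Studio-style JSON export.

    Assumption: each task carries the review text under data/text and
    span annotations under annotations, result, value, as in Label
    Studio's generic export; a real converter must follow the custom
    template actually used for annotation.
    """
    rows = []
    for task in tasks:
        text = task["data"]["text"]
        spans = {}
        for annotation in task.get("annotations", []):
            for item in annotation.get("result", []):
                value = item.get("value", {})
                for label in value.get("labels", []):
                    spans.setdefault(label, []).append(value.get("text", ""))
        rows.append({"sentence": text, "spans": spans})
    return rows

sample = [{
    "data": {"text": "very practical to slip into my bag"},
    "annotations": [{"result": [
        {"value": {"labels": ["opinion"], "text": "very practical"}},
    ]}],
}]
print(export_to_spans(sample))
```
        </p>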
        <p>The dataset and further details about the annotation process cannot be released due to a
non-disclosure agreement.</p>
        <sec id="sec-2-5-1">
          <title>2https://github.com/heartexlabs/label-studio</title>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experimental evaluation</title>
      <p>
        In this section, we present the details of the experimental evaluation we performed on our
dataset using some tools that have been specifically built for the ACOS task. We selected
three tools that stem from significant studies in this field and whose source
code is publicly available online. All the selected tools leverage the fine-tuning of pre-trained
models, specifically T5 [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ] and BERT [9], as a crucial component of their functionality:
• Paraphrase modeling [10]: the model generates a natural-language sequence from the
input sentence; the generated sequence should contain all the desired sentiment elements, so that
the so-called “sentiment quads” (c, a, p, o) can be recovered from it. This approach aims
to fully leverage the semantics of the sentiment elements by generating them in natural-language
form. The pre-trained language model used is T5-base. This is the only
tool among those we have considered that does not support implicit opinion terms;
• Extract Classify-ACOS [11]: this tool first performs aspect-opinion co-extraction, then
predicts category-sentiment given the extracted aspect-opinion pairs. The tool uses the
BERT model with the AdamW optimizer3 [12], so the data is transformed into a format
suitable for it by delimiting each sentence with the CLS token;
• PyABSA [13, 14]: this tool is a variation of the original one, built for aspect-opinion pair
extraction. There is no documentation about quadruple extraction because this feature
is still experimental. The format of this tool was taken as a reference for transforming the
data exported from the annotation tool. Also in this case, T5-base is used as the
pre-trained model.
      </p>
      <p>We also performed additional experiments using PyABSA. In particular, (i) we utilized the
tool with a larger pre-trained model, T5-large, which comprises 770 million parameters; (ii) we
applied a similarity threshold between true labels and those predicted by the model for one
of the components, the opinion term (o); (iii) we evaluated the performance of PyABSA using
the T5-large model with the standard correctness criterion, without similarity, on some less
complex ABSA tasks, namely ACSA, E2E ABSA, ACSD and ASTE.
3The AdamW optimizer is a stochastic gradient descent method based on adaptive estimation of first-order and
second-order moments, with an added mechanism for weight decay.</p>
      <p>The second experiment is motivated by the fact that sentences in our domain often contain
implicit opinions, frequently composed of multiple words rather than single terms. We therefore
established a relaxed correctness criterion: a prediction is considered correct when it matches
the gold standard in terms of aspect, category, and polarity, and when the similarity between
the predicted opinion term and the real one is at least 70%. For computing string similarity
we used the Python class SequenceMatcher4, which is based on an extension of the Ratcliff
and Obershelp algorithm (“gestalt pattern matching”) [15] and compares pairs of sequences
by recursively finding the longest contiguous matching subsequences while excluding uninteresting
elements, with quadratic time complexity in the worst case. In this way, for instance, the prediction of the
opinion “super practical to slip into my bag” can be considered correct even if the real opinion
is “practical to slip into my bag”.</p>
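      <p>The relaxed criterion can be sketched directly with difflib (the helper names and the category label in the example are ours; the 0.7 threshold is the one used in the experiments):
```python
from difflib import SequenceMatcher

def opinions_similar(pred, gold, threshold=0.7):
    """ratio() returns 2*M/T, where M is the number of matched
    characters and T is the combined length of both strings."""
    return SequenceMatcher(None, pred, gold).ratio() >= threshold

def quad_correct_relaxed(pred, gold):
    """An (aspect, category, polarity, opinion) prediction is correct if
    the first three components match exactly and the opinion terms are
    at least 70% similar."""
    return pred[:3] == gold[:3] and opinions_similar(pred[3], gold[3])

pred = ("NULL", "practicality", "positive", "super practical to slip into my bag")
gold = ("NULL", "practicality", "positive", "practical to slip into my bag")
print(round(SequenceMatcher(None, pred[3], gold[3]).ratio(), 3))  # 0.906
print(quad_correct_relaxed(pred, gold))  # True
```
      </p>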
      <p>In Table 2, we show the tool settings we used for the experiments: the batch size,
the number of training examples used in each iteration; the learning rate, a
parameter controlling the step size at each iteration while moving towards the minimum of the
loss function; and the number of epochs, the number of complete passes through the training
dataset.</p>
      <sec id="sec-3-1">
        <title>Tool settings (Table 2)</title>
        <p>Tool | batch-size | learning rate | epochs
Paraphrase modeling | 16 | 3e-4 | 20
Extract Classify-ACOS | 32 (a, o), 16 (p), 8 (c) | 2e-5 (a, o), 3e-5 (p), (c) | 20
PyABSA | 16 | 5e-5 | 20</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.1. Results</title>
        <p>To measure the performance of the models on the data, we computed the metrics most commonly
used to evaluate these types of tasks: precision, the fraction of relevant retrieved
instances over all the retrieved instances; recall, the fraction of relevant retrieved instances over
all the relevant instances; and F1-score, the harmonic mean of precision and recall, calculated
as (2 · precision · recall)/(precision + recall). The results are shown in Table 3. Please note that,
with the exception of the last tool in the table, where we used the similarity criterion discussed
above, the prediction of a quadruple is considered to be correct if and only if it is equal to the
gold one in all its four components.</p>
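        <p>Under the exact-match criterion, these metrics can be sketched as follows (the sets here hold a single sentence’s quadruples for brevity; the actual evaluation aggregates the counts over the whole test set):
```python
def prf1(gold, pred):
    """Exact-match evaluation: a predicted quadruple counts as correct
    only if all four components equal those of a gold quadruple."""
    correct = len(set(pred).intersection(gold))
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {("pizza", "food", "positive", "delicious"),
        ("service", "service", "negative", "terrible")}
pred = {("pizza", "food", "positive", "delicious"),
        ("service", "service", "positive", "terrible")}  # wrong polarity
print(prf1(gold, pred))  # (0.5, 0.5, 0.5)
```
        </p>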
        <p>Among the tools with base pre-trained models (T5-base and BERT), the Paraphrase modeling
tool seems to be the overall best, but support for implicit opinion terms, which this tool lacks,
could be important for some application domains. The Extract Classify-ACOS tool seems to be
slightly better than Paraphrase modeling in terms of precision, but has a significantly lower
recall. The last tool we considered, PyABSA, is not the best in terms of performance.</p>
        <sec id="sec-3-3-1">
          <title>4https://docs.python.org/3/library/difflib.html</title>
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>Table 3: Precision, Recall and F1 on the ACOS task</title>
        <p>Tools compared in Table 3: Paraphrase modeling (T5-base), Extract Classify-ACOS (BERT), PyABSA (T5-base), PyABSA (T5-large), and PyABSA (T5-large with similarity).</p>
      </sec>
      <sec id="sec-3-5">
        <title>Discussion</title>
        <p>PyABSA, however, turned out to be very well designed, allowing us to customize it for further
experiments using a larger pre-trained model (T5-large) and employing a similarity criterion
for one of the components. By using the larger model, the precision increased from about 32% to
41% under the standard correctness criterion, and to 54% under the relaxed correctness criterion
based on similarity.</p>
        <p>The results of the experiments using PyABSA with the T5-large model and the standard
correctness criterion on some less complex ABSA tasks are reported in Table 4.</p>
      </sec>
      <sec id="sec-3-10">
        <title>Table 4: Less complex ABSA tasks and their predicted elements</title>
        <p>Task | Predicted elements
ACSA | c, p
E2E ABSA | a, p
ACSD | c, a, p
ASTE | a, p, o
ACOS | c, a, p, o</p>
        <p>From the obtained results, it is evident that the model used by PyABSA performs well
in predicting tuples, both in Aspect Category Sentiment Analysis (ACSA) and End-to-End
ABSA (E2E ABSA). Furthermore, the model demonstrates good performance in extracting
triples for Aspect Category Sentiment Detection (ACSD). However, in Aspect Sentiment Triplet
Extraction (ASTE) it performs less effectively than the ACOS model with opinion term similarity
set at 70%. This observation implies that, within the framework of this model and the
provided dataset, the primary limitation appears to be the accurate identification of opinion
terms. These terms, as previously discussed and as one might intuitively expect, are frequently
composed of multiple words, posing a significant challenge for the model to predict with
absolute precision.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion and future work</title>
      <p>We benchmarked three ACOS systems available in the literature by applying them to a different
domain, using a custom dataset we built. Additionally, we assessed the PyABSA tool’s
performance in handling ACOS subtasks to identify the critical elements in this process, which in this
application domain seem to be the identification of the “opinion terms”.</p>
      <p>In the future, we plan to experiment with additional ACOS tools and different similarity
measures. We would also like to expand the dataset and improve the annotation process.
Another direction for future research is comparing the effectiveness of ACOS tools that predict
all the sentiment components at once with other approaches that
combine the results of specialized tools on simpler tasks.</p>
      <p>One of the goals of this research is to develop a unified framework that allows the execution of
different ABSA tasks by running multiple tools on the same dataset. Adapters should be in charge
of translating data into the appropriate format. It should also be possible to define a variety of
experiments and to explore different scenarios through an automatic and controlled selection
of test and train data, by defining constraints on data categories and polarities. We envision an
integrated framework in which the predictions from these tools are used to automatically or
semi-automatically enhance and expand the training data, improving both the efficiency and
the overall quality of the sentiment analysis models.</p>
      <p>[8, cont.] Subjectivity, Sentiment, &amp; Social Media Analysis, WASSA@ACL 2023, Toronto, Canada, July 14, 2023, Association for Computational Linguistics, 2023, pp. 19–27.
[9] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, in: Proc. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Vol. 1, Association for Computational Linguistics, 2019, pp. 4171–4186.
[10] W. Zhang et al., Aspect sentiment quad prediction as paraphrase generation, in: Proc. 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Punta Cana, Dominican Republic, November 7-11, 2021, Association for Computational Linguistics, 2021, pp. 9209–9219. doi:10.18653/v1/2021.emnlp-main.726.
[11] H. Cai, R. Xia, J. Yu, Aspect-category-opinion-sentiment quadruple extraction with implicit aspects and opinions, in: Proc. 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, Vol. 1, August 1-6, 2021, Association for Computational Linguistics, 2021, pp. 340–350. doi:10.18653/v1/2021.acl-long.29.
[12] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, in: Y. Bengio, Y. LeCun (Eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
[13] H. Yang, K. Li, A modularized framework for reproducible aspect-based sentiment analysis, CoRR abs/2208.01368 (2022). doi:10.48550/arXiv.2208.01368.
[14] H. Yang, K. Li, PyABSA, 2023. URL: https://github.com/yangheng95/PyABSA.
[15] J. W. Ratcliff, D. Metzener, et al., Pattern matching: The gestalt approach, Dr. Dobb’s Journal 13 (1988) 46.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E.</given-names>
            <surname>Bassignana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Brunato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Polignano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramponi</surname>
          </string-name>
          , Preface to the
          <source>Seventh Workshop on Natural Language for Artificial Intelligence (NL4AI)</source>
          ,
          <source>in: Proceedings of the Seventh Workshop on Natural Language for Artificial Intelligence (NL4AI</source>
          <year>2023</year>
          )
          <article-title>co-located with 22th International Conference of the Italian Association for Artificial Intelligence (AI* IA</article-title>
          <year>2023</year>
          ),
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Lam</surname>
          </string-name>
          ,
          <article-title>A survey on aspect-based sentiment analysis: Tasks, methods, and challenges</article-title>
          ,
          <source>CoRR abs/2203</source>
          .01054 (
          <year>2022</year>
          ). doi:
          <volume>10</volume>
          .48550/arXiv. 2203.01054.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>M. P.</surname>
          </string-name>
          et al.,
          <article-title>Semeval-2014 task 4: Aspect based sentiment analysis</article-title>
          , in: P. Nakov, T. Zesch (Eds.),
          <source>Proc. 8th International Workshop on Semantic Evaluation, SemEval@COLING</source>
          <year>2014</year>
          , Dublin, Ireland,
          <source>August 23-24</source>
          ,
          <year>2014</year>
          , The Association for Computer Linguistics,
          <year>2014</year>
          , pp.
          <fpage>27</fpage>
          -
          <lpage>35</lpage>
          . doi:
          <volume>10</volume>
          .3115/v1/s14-
          <fpage>2004</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>M. P.</surname>
          </string-name>
          et al.,
          <article-title>Semeval-2016 task 5: Aspect based sentiment analysis</article-title>
          ,
          <source>in: S. B</source>
          . et al. (Ed.),
          <source>Proc. 10th International Workshop on Semantic Evaluation, SemEval@NAACL-HLT</source>
          <year>2016</year>
          , San Diego, CA, USA, June 16-17,
          <year>2016</year>
          , The Association for Computer Linguistics,
          <year>2016</year>
          , pp.
          <fpage>19</fpage>
          -
          <lpage>30</lpage>
          . doi:
          <volume>10</volume>
          .18653/v1/s16-
          <fpage>1002</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>M. M. Trusca</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Frasincar</surname>
          </string-name>
          ,
          <article-title>Survey on aspect detection for aspect-based sentiment analysis</article-title>
          ,
          <source>Artificial Intelligence Review</source>
          <volume>56</volume>
          (
          <year>2023</year>
          )
          <fpage>3797</fpage>
          -
          <lpage>3846</lpage>
          . URL: https://doi.org/10.1007/ s10462-022-10252-y. doi:
          <volume>10</volume>
          .1007/s10462-022-10252-y.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>G.</given-names>
            <surname>Brauwers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Frasincar</surname>
          </string-name>
          ,
          <article-title>A survey on aspect-based sentiment classification</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>55</volume>
          (
          <year>2023</year>
          )
          <volume>65</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>65</lpage>
          :
          <fpage>37</fpage>
          . doi:
          <volume>10</volume>
          .1145/3503044.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>C.</given-names>
            <surname>Raffel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Narang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Matena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Exploring the limits of transfer learning with a unified text-to-text transformer</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>21</volume>
          (
          <year>2020</year>
          )
          <volume>140</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>140</lpage>
          :
          <fpage>67</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>S. V.</surname>
          </string-name>
          et al.,
          <article-title>Instruction tuning for few-shot aspect-based sentiment analysis</article-title>
          , in: J.
          <string-name>
            <surname>Barnes</surname>
            ,
            <given-names>O. D.</given-names>
          </string-name>
          <string-name>
            <surname>Clercq</surname>
          </string-name>
          , R. Klinger (Eds.),
          <source>Proc. 13th Workshop on Computational Approaches to</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>