<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>SiDi-NLP-Team at IDPT2021: Irony Detection in Portuguese 2021</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Almeida Neto</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>onio Manoel dos Santos</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bruna Almeida</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Azevedo</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Camila de Araujo.</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nobrega</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fernando Antonio Asevedo</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Henrico Bertini</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Peinado</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luis Heriberto Osinaga</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bittencourt</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marciele de Menezes</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>da Silva</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nataly Leopoldina Patti</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cortes</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Priscila Osorio</string-name>
        </contrib>
      </contrib-group>
      <abstract>
        <p>This paper presents the submission of the SiDi-NLP team to IDPT 2021 - Irony Detection in Portuguese (IberLEF 2021). Irony detection is a challenging semantic task similar to other tasks such as Sentiment Analysis. Due to these similarities, we performed experiments using algorithms that achieved state-of-the-art results for similar semantic tasks in Brazilian Portuguese, with a linguistic feature representation and pre-trained BERT models applied to the two shared task datasets: the Tweet Dataset and the News Dataset. The pre-trained BERT models outperformed the other classifiers, achieving 1.00 accuracy and F1 on the Tweet Dataset, and 0.903 accuracy and 0.900 F1 on the News Dataset. We also discuss our results in light of the results obtained in the shared task.</p>
      </abstract>
      <kwd-group>
        <kwd>Irony</kwd>
        <kwd>Sarcasm</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The Irony Detection in Portuguese (IDPT 2021) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] shared task is co-located
with IberLEF 2021 [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and presents competitors with datasets for the identification
of ironic documents in two different domains: tweets and news. Some authors
describe the human ability to recognize ironic sentences as "relatively easy
way although not always" [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The main goal of the task is to assign a label
to each document: False when the whole document does not contain irony and
True otherwise.
      </p>
      <p>Due to the subjectivity of the task, there are several similarities
between Irony Detection and other Natural Language Processing tasks, such as
Sentiment Analysis and Hate Speech Detection. The main challenge in these
scenarios is that the irony clues usually relate to different pragmatic contexts,
such as the period in which the text was published, external information, and the
relation between writer and readers.</p>
      <p>Since using pragmatic features in machine learning algorithms usually
involves complex modeling, we treated the problem as a text
classification problem based on features from Sentiment Analysis
methods (Section 2.2). This idea was based on the hypothesis that ironic texts
share linguistic features with opinion texts, due to the nature of both tasks.
Furthermore, in order to compare the efficiency of these features against more
recent approaches for Natural Language Processing, we also performed
experiments with a pre-trained BERT (Bidirectional Encoder Representations from
Transformers) model built for the Portuguese language:
BERTimbau [13].</p>
      <p>We observed that our proposed linguistic features did not outperform
BERTimbau, at least on the test dataset shared by the
task organizers. It is important to note that, in the official report by the
organizers, the ranking of the participants rotates
between the two datasets. This may suggest a large difference between the
data distributions, or that the different approaches used each performed
well in distinct contexts.</p>
    </sec>
    <sec id="sec-2">
      <title>Experimental Setup</title>
      <sec id="sec-2-1">
        <title>Corpora</title>
        <p>The two corpora used in this work contain texts on different topics written in
Brazilian Portuguese and are publicly available.</p>
        <p>
          The first corpus contains 15,212 tweets extracted from a dataset used in an
irony and sarcasm detection work [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. The authors collected potentially ironic
tweets containing the hashtags "#ironia" or "#sarcasmo" posted between
August 10, 2014 and August 6, 2017. Non-ironic tweets were collected
from random tweets about economics, politics, and education that do not
contain the hashtags #ironia or #sarcasmo. Additionally, the authors included
tweets collected by de Freitas et al. [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] that were manually annotated by
Portuguese language experts. The final dataset has many more ironic sentences than
non-ironic sentences, as shown by the class distribution in Fig. 1. The dataset
was stripped of words and expressions that could serve as cues for the model and
interfere with learning, such as links and the "rt", "#ironia", and "#sarcastico" tags.
        </p>
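        <p>The cue removal described for the tweet corpus can be sketched as follows. This is a minimal illustration and not the procedure used by the dataset authors: the regular expressions and the function name strip_label_cues are assumptions, built only from the cues named in the text (links, "rt", and the irony hashtags).</p>
        <preformat>
```python
import re

# Hypothetical cue patterns, derived only from the cues named in the text.
CUE_PATTERN = re.compile(r"#(?:ironia|sarcasmo|sarcastico)\b|\brt\b", re.IGNORECASE)
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")


def strip_label_cues(tweet: str) -> str:
    """Remove links, retweet markers, and irony hashtags from one tweet."""
    without_urls = URL_PATTERN.sub(" ", tweet)
    without_cues = CUE_PATTERN.sub(" ", without_urls)
    # Collapse the whitespace left behind by the removals.
    return " ".join(without_cues.split())


print(strip_label_cues("RT Que dia lindo para ficar no trânsito #ironia http://t.co/abc"))
```
        </preformat>
        <p>Removing such cues matters because the hashtags were used to collect the ironic class, so leaving them in would let a classifier recover the label without modeling irony at all.</p>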
        <p>
          The second corpus contains 18,494 news articles extracted from the Sensacionalista,
The Piauí Herald, and Estadão websites. The articles were labeled according to
the source website: articles from Sensacionalista and The Piauí Herald are sarcastic
and, therefore, were labeled as ironic; articles from Estadão were labeled as
non-ironic [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. The distribution of classes is shown in Fig. 2.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Machine Learning Features</title>
        <p>
          There are several similarities between Irony Detection and Sentiment Analysis
due to the nature of the tasks. Following this intuition, we ran baseline
experiments on the datasets using a well-known text representation and pipelines that
were proposed in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
        </p>
        <p>
          The features presented in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] are the same ones we used for our ML methods,
since they let the classifiers observe several features that may indicate the
semantic orientation of the sentence. The representation does not decide the
classification by itself; it merely serves as input to the classifiers. We describe all the features as
follows:
- Bag-of-Words (BoW): a BoW representation of the data with absence or
presence encoded as 0 and 1, respectively;
- Presence of negations: in Sentiment Analysis, the presence of a negation
usually indicates the inversion of a polarity. We used a list of Brazilian
Portuguese negations such as "não" and "nunca" in order to keep this aspect in
our experiments, even though the feature may not be as important for Irony
as it is for Sentiment Analysis;
- Emoticons: an emoticon is a string of characters that together form a pictorial
representation carrying semantic relevance for the sentence. We used a list of
positive and negative emoticons to map the polarity of the token in the
sentence;
- Emojis: similarly to the emoticons, we also represented the emojis in the
documents. The main difference between them is that emojis are
encoded characters that smartphones and
browsers render as pictures. We used a resource with the polarity of
each emoji and included the polarity of each emoji in the document as a feature [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ];
- Sentiment Lexicon: we also provided the ML methods with the count of
positive and negative words in the sentence following SentiLex [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ];
- Part-of-Speech tagging: we also counted the number of nouns, adverbs,
verbs, and adjectives using the PoS tagger in NLPnet [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. This feature is especially
relevant for identifying adjectives, which are more frequent in opinion sentences
and less frequent in factual information.
        </p>
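        <p>As an illustration of how part of this representation could be assembled into a single vector, the sketch below combines a binary BoW with the negation, emoticon, and lexicon features. It is a simplified assumption and not the pipeline from [2]: the word lists are tiny placeholders rather than the actual resources, the function name extract_features is hypothetical, and the emoji and PoS features are omitted because they depend on external resources.</p>
        <preformat>
```python
# Tiny placeholder resources (assumptions); the real negation list,
# emoticon lists, and SentiLex entries are far larger.
NEGATIONS = {"não", "nunca"}
POSITIVE_EMOTICONS = {":)", ":-)", ":D"}
NEGATIVE_EMOTICONS = {":(", ":-("}
POSITIVE_WORDS = {"bom", "lindo"}
NEGATIVE_WORDS = {"ruim", "péssimo"}


def extract_features(text: str, vocabulary: list[str]) -> list[int]:
    """Build one feature vector: binary BoW followed by five extra features."""
    raw_tokens = text.split()
    # Lowercase and trim punctuation for word lookups; emoticons use raw tokens.
    words = [token.lower().strip(".,!?") for token in raw_tokens]
    present = set(words)
    # Binary Bag-of-Words: 1 if the vocabulary word occurs, 0 otherwise.
    bow = [1 if word in present else 0 for word in vocabulary]
    has_negation = 1 if present.intersection(NEGATIONS) else 0
    positive_emoticons = sum(token in POSITIVE_EMOTICONS for token in raw_tokens)
    negative_emoticons = sum(token in NEGATIVE_EMOTICONS for token in raw_tokens)
    positive_words = sum(word in POSITIVE_WORDS for word in words)
    negative_words = sum(word in NEGATIVE_WORDS for word in words)
    return bow + [has_negation, positive_emoticons, negative_emoticons,
                  positive_words, negative_words]


print(extract_features("Nunca vi um dia tão lindo :)", ["dia", "lindo", "chuva"]))
```
        </preformat>
        <p>Keeping all features in one flat vector means any classifier that accepts fixed-length numeric input can consume the same representation.</p>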
        <p>
          We used these features as input for ve ML classi ers { a Support
Vector Machine (SVM), a Logistic Regressor (LR), MultiLayer Perceptron (MLP),
Random Forest (RF) and Naive Bayes (NB). We have chosen these methods
considering [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] and the results obtained in Sentiment Analysis for Brazilian
Portuguese tweets. The parameters were not grid-searched and are the same as the
best t the original work presents [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
2.3
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>BERT and BERTimbau</title>
        <p>
          BERTimbau is a pre-trained BERT model trained on the Portuguese
language [13]. BERT [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] is a Transformer encoder architecture that learns
contextual relations between words in a text to generate a language model. BERT is
designed to pre-train deep bidirectional representations from unlabeled text by
jointly conditioning on both left and right context in all layers. This pre-trained
BERT model can be fine-tuned with just one additional output layer to create
specific models for a wide range of NLP tasks, without substantial architecture
modifications [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
        <p>To create a Portuguese version of BERT, the authors of BERTimbau used
data from brWaC [14], the largest open Portuguese corpus, which contains 2.68
billion tokens from 3.53 million documents (web pages). They trained
BERTimbau models in two sizes: Base (12 layers, 768 hidden dimensions, 12 attention
heads, and 110M parameters) and Large (24 layers, 1024 hidden dimensions, 16
attention heads, and 330M parameters) [13].</p>
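        <p>The parameter counts quoted for the two sizes can be approximated from the layer and hidden-dimension figures alone. The sketch below counts the weights and biases of a standard BERT encoder with a pooler; the vocabulary size of 30,000, the 512 positions, and the function name bert_parameter_count are assumptions for illustration and are not taken from the BERTimbau paper.</p>
        <preformat>
```python
def bert_parameter_count(layers: int, hidden: int, vocab_size: int,
                         max_positions: int = 512, type_vocab: int = 2) -> int:
    """Count weights and biases of a standard BERT encoder with pooler."""
    layer_norm = 2 * hidden  # scale and shift vectors
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden)  # query, key, value, output
    intermediate = 4 * hidden  # feed-forward width used by BERT
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden))
    per_layer = attention + feed_forward + 2 * layer_norm
    pooler = hidden * hidden + hidden
    return embeddings + layers * per_layer + pooler


ASSUMED_VOCAB_SIZE = 30_000  # assumption: a round figure, not from the paper

base = bert_parameter_count(layers=12, hidden=768, vocab_size=ASSUMED_VOCAB_SIZE)
large = bert_parameter_count(layers=24, hidden=1024, vocab_size=ASSUMED_VOCAB_SIZE)
print(f"Base: {base / 1e6:.1f}M, Large: {large / 1e6:.1f}M")
```
        </preformat>
        <p>Under these assumptions the totals land close to the quoted figures; the exact values depend on the vocabulary size and on whether the pooler is counted. The number of attention heads does not appear in the formula because the heads split the hidden dimension rather than adding parameters.</p>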
        <p>In this work, we fine-tuned the Base version of BERTimbau to classify ironic
sentences. The parameters were not grid-searched and are the same as in the
original work.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>Since we observed a large variation between the two datasets, and since the IDPT
organizers also requested separate submissions for each one, we
report the results for each dataset individually here.</p>
      <p>Table 1 and Table 2 present the values of Accuracy and F-score obtained by
each method in each dataset. To facilitate the comparison of the results, the best
F-score for each dataset was highlighted in bold.</p>
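      <p>For reference, the two reported metrics can be computed from gold and predicted labels as in the sketch below. It assumes a binary F-score with the ironic class (True) as the positive class; the text does not state which averaging was used, so this choice and the function name accuracy_and_f1 are assumptions.</p>
      <preformat>
```python
def accuracy_and_f1(gold: list[bool], predicted: list[bool]) -> tuple[float, float]:
    """Return (accuracy, F1 of the True class) for two parallel label lists."""
    pairs = list(zip(gold, predicted))
    if not pairs:
        raise ValueError("at least one labeled example is required")
    true_pos = sum(g and p for g, p in pairs)
    false_pos = sum((not g) and p for g, p in pairs)
    false_neg = sum(g and (not p) for g, p in pairs)
    accuracy = sum(g == p for g, p in pairs) / len(pairs)
    # F1 = 2TP / (2TP + FP + FN); defined as 0 when nothing is positive.
    denominator = 2 * true_pos + false_pos + false_neg
    f1 = 2 * true_pos / denominator if denominator else 0.0
    return accuracy, f1


gold_labels = [True, True, False, False, True]
predictions = [True, False, False, True, True]
print(accuracy_and_f1(gold_labels, predictions))
```
      </preformat>
      <p>Reporting both values is useful on imbalanced data such as the tweet corpus, where accuracy alone can look high for a model that favors the majority class.</p>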
      <p>Multilingual-BERT and BERTimbau stood out in both F-score and
Accuracy on both datasets. These methods also correctly predicted all test samples
from the Tweets dataset. Analyzing this dataset, we noticed that the
non-ironic examples had a limited vocabulary related to economic news, which
suggests that the models were biased by the training data. Since
BERTimbau did better at predicting the News dataset, it was chosen as the model
to perform the final predictions.</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>It is interesting to note that the teams that participated in IDPT 2021 did
not obtain equivalent results on the two datasets from the competition.
For instance, the submissions of the team "TeamBERT4Ever" have the highest
results on news, but they did not perform well on the tweet dataset. The
same behavior can be observed in our submission, in which our relative results
on news are better than those on tweets. Likewise, the models not based on deep
learning approaches reached better results on the news dataset
than on the tweet dataset.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgement</title>
      <p>We acknowledge SiDi for its support with the infrastructure for the experiments
and the work environment for the development of this work.</p>
      <p>13. Fábio Souza, Rodrigo Nogueira, and Roberto Lotufo. BERTimbau: Pretrained BERT
models for Brazilian Portuguese. In Ricardo Cerri and Ronaldo C. Prati, editors,
Intelligent Systems, pages 403-417, Cham, 2020. Springer International Publishing.</p>
      <p>14. Jorge A. Wagner Filho, Rodrigo Wilkens, Marco Idiart, and Aline Villavicencio.
The brWaC corpus: A new open resource for Brazilian Portuguese. In Proceedings
of the Eleventh International Conference on Language Resources and Evaluation
(LREC 2018), Miyazaki, Japan, May 2018. European Language Resources
Association (ELRA).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. Fabio Araujo da Silva.
          <article-title>Detecção de ironia e sarcasmo em língua portuguesa: uma abordagem utilizando deep learning</article-title>
          .
          <source>Bachelor's thesis</source>
          , Universidade Federal de Mato Grosso,
          <year>February 2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. Lucas V Avanço, Henrico B Brum, and MG Nunes.
          <article-title>Improving opinion classifiers by combining different methods and resources</article-title>
          .
          <source>XIII Encontro Nacional de Inteligência Artificial e Computacional (ENIAC)</source>
          , pages
          <fpage>25</fpage>
          -
          <lpage>36</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. Henrico Bertini Brum and Maria das Graças Volpe Nunes.
          <article-title>Semi-supervised sentiment annotation of large corpora</article-title>
          .
          <source>In International Conference on Computational Processing of the Portuguese Language</source>
          , pages
          <fpage>385</fpage>
          -
          <lpage>395</lpage>
          . Springer,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. Ulisses Brisolara Corrêa, Leonardo Pereira dos Santos, Leonardo Coelho, and Larissa A. de Freitas.
          <article-title>Overview of the IDPT Task on Irony Detection in Portuguese at IberLEF 2021</article-title>
          .
          <source>Procesamiento del Lenguaje Natural</source>
          ,
          <volume>67</volume>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. Larissa Astrogildo de Freitas, Aline Aver Vanin, Denise Nauderer Hogetop, Marco Nemetz Bochernitsan, and Renata Vieira.
          <article-title>Pathways for irony detection in tweets</article-title>
          .
          <source>In Proceedings of the 29th Annual ACM Symposium on Applied Computing</source>
          , SAC '14, pages
          <fpage>628</fpage>
          -
          <lpage>633</lpage>
          , New York, NY, USA,
          <year>2014</year>
          . Association for Computing Machinery.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova.
          <article-title>BERT: pre-training of deep bidirectional transformers for language understanding</article-title>
          .
          <source>CoRR</source>
          , abs/1810.04805,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. Erick R Fonseca, João Luís G Rosa, and Sandra Maria Aluísio.
          <article-title>Evaluating word embeddings and a revised corpus for part-of-speech tagging in Portuguese</article-title>
          .
          <source>Journal of the Brazilian Computer Society</source>
          ,
          <volume>21</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>14</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8. Irazú Hernández-Farías, José-Miguel Benedí, and Paolo Rosso.
          <article-title>Applying basic features from sentiment analysis for automatic irony detection</article-title>
          .
          <source>In Iberian Conference on Pattern Recognition and Image Analysis</source>
          , pages
          <fpage>337</fpage>
          -
          <lpage>344</lpage>
          . Springer,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9. Petra Kralj Novak, Jasmina Smailović, Borut Sluban, and Igor Mozetič.
          <article-title>Sentiment of emojis</article-title>
          .
          <source>PLoS ONE</source>
          ,
          <volume>10</volume>
          (
          <issue>12</issue>
          ):
          <fpage>e0144296</fpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10. Gabriel Schubert Marten and Larissa Astrogildo de Freitas.
          <article-title>The construction of a corpus for detecting irony and sarcasm in Portuguese</article-title>
          .
          <source>In Proceedings of XVII Encontro Nacional de Inteligência Artificial e Computacional (ENIAC-2020)</source>
          , pages
          <fpage>709</fpage>
          -
          <lpage>717</lpage>
          , Rio Grande, Brazil,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11. Manuel Montes, Paolo Rosso, Julio Gonzalo, Ezra Aragón, Rodrigo Agerri, Miguel Ángel Álvarez Carmona, Elena Álvarez Mellado, Jorge Carrillo de Albornoz, Luis Chiruzzo, Larissa Freitas, Helena Gómez Adorno, Yoan Gutiérrez, Salud María Jiménez Zafra, Salvador Lima, Flor Miriam Plaza de Arco, and Mariona Taulé.
          <source>In Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2021)</source>
          . CEUR Workshop Proceedings,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12. Mário J Silva, Paula Carvalho, and Luís Sarmento.
          <article-title>Building a sentiment lexicon for social judgement mining</article-title>
          .
          <source>In International Conference on Computational Processing of the Portuguese Language</source>
          , pages
          <fpage>218</fpage>
          -
          <lpage>228</lpage>
          . Springer,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>