<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>TeamUFPR at IDPT 2021: Equalizing a Strategy Using Machine Learning for Two Types of Data in Detecting Irony</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Federal University of Paraná - Curitiba</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Brazil</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>theinrich</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>fjoceschin}@inf.ufpr.br</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Santa Catarina State University - Joinville</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Brazil felipe.ramos@edu.udesc.br</string-name>
        </contrib>
      </contrib-group>
      <abstract>
        <p>This paper describes the participation of TeamUFPR in the Task on Irony Detection in Portuguese (IDPT 2021), framed within the Iberian Languages Evaluation Forum (IberLEF 2021). The task consists of creating a methodology for irony detection in Portuguese using two datasets, one containing news texts obtained from different sources and the other containing tweets collected from Twitter. Our proposal focused mainly on using a single approach for both datasets; three runs were submitted using different strategies to identify the impact of the models depending on the type of data. We evaluated a total of ten machine learning algorithms, with four feature extraction strategies exploring a variety of parameters. Three strategies were used in IDPT 2021, focusing on undersampling and lemmatization. Overall, the results were satisfactory, with the best performance achieved by the Multilayer Perceptron and Random Forest classifiers, and we were able to demonstrate a new approach to identifying irony in messages.</p>
      </abstract>
      <kwd-group>
        <kwd>Sentiment Analysis</kwd>
        <kwd>Natural Language Processing</kwd>
        <kwd>Machine Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Sentiment analysis focuses on extracting sentiment from texts found in sources such as news, social networks, or e-mails, in order to classify them as positive or negative [1], or to perform a more specific classification task, such as irony detection. Applications using Natural Language Processing (NLP) have become increasingly popular in recent years, with the widespread adoption of solutions in both academia and industry.</p>
      <p>NLP comprises techniques for representing texts or phrases (such as tweets or documents) and for analyzing and exploring new models of representation [6]. This type of approach evaluates language with the objective of building algorithms that understand this information as similarly as possible to a human being.</p>
      <p>Over the years, new strategies focused on machine learning (ML) were developed that could take advantage of increased computational power. Unsupervised algorithms have gained popularity in recent years, achieving adequate results when labeling large amounts of data.</p>
      <p>IDPT 2021 is the first IberLEF task devoted to irony detection in Portuguese. The competition provides competitors with two sets of data crawled from the web, one containing news and the other tweets [4]. IDPT is one of the tasks offered at IberLEF 2021, in the humor and irony track.</p>
      <p>Our proposal focuses on exploring machine learning techniques for the learning phase and recurring strategies for the preprocessing phase. Overall, our team explored a total of nine strategies in the preprocessing step, four in the feature extraction step, and ten algorithms in the learning step. The final approach consists of evaluating the average 10-fold cross-validation score of each combination, looking for the strategy that best fits the two datasets proposed for the competitors. The source code is available at https://github.com/h31nr1ch/TeamUFPR-IDPT2021.</p>
      <p>The experience of our team is varied, consisting of knowledge in machine learning applied to security and in protein structure prediction with metaheuristics. Our main motivation for participating in the competition was to take concepts we already knew and adapt them to sentiment analysis using NLP.</p>
      <p>The remainder of this paper is structured as follows: Section 2 describes the IDPT 2021 task. Section 3 presents the methodology used. Section 4 explains our evaluation and algorithm choices. Section 5 presents the related work; and Section 6 concludes the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>Task description</title>
      <p>The IDPT 2021 task at IberLEF focuses on irony detection in the Portuguese language [4]. The task aims to identify the presence of irony in two sets of data (news and tweets). The datasets proposed for competitors are shown in Table 1. The train set consists of 15.2k tweets and 18.4k news texts, which must be used to classify 600 messages (found in the test set), half representing each class.</p>
      <p>The competitors must provide an id and a label, which will be used to check the efficiency of their strategy. The results are then evaluated with the following metrics: Bacc (balanced accuracy), Accuracy, F1, Precision, and Recall. Each team was allowed to submit three runs for each dataset, for a total of six runs.</p>
    </sec>
    <sec id="sec-3">
      <title>Methodology</title>
      <p>In this section, we describe our methodology for the competition, divided into three parts: Section 3.1 describes the preprocessing stage; Section 3.2 the feature extraction process; and Section 3.3 our machine learning methodology.</p>
      <sec id="sec-3-1">
        <title>Preprocessing stage</title>
        <p>The first step consists of the preprocessing stage, focusing on cleaning up undesirable and irrelevant patterns. In total, nine preprocessing strategies were used: (1) removal of all accented characters; (2) fixing the encoding of texts that were not UTF-8; (3) removal of tags referring to users or entities; (4) removal of punctuation from the text; (5) removal of special characters; (6) removal of duplicate spacing; (7) conversion of the text to lowercase; (8) removal of numbers; and (9) removal of stop words (in this case we used the nltk list of stop words [7]).</p>
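        <p>The nine steps above can be sketched as a single cleaning function. This is a minimal illustration rather than the competition code: the regular expressions and the tiny stop word set are our own stand-ins for the full nltk Portuguese stop word list [7].</p>
        <preformat><![CDATA[
```python
import re
import string
import unicodedata

# Stand-in for nltk's Portuguese stop word list (step 9).
STOP_WORDS = {"de", "a", "o", "que", "e", "do", "da", "em", "um", "para"}

def preprocess(text: str) -> str:
    """Apply the nine cleaning steps of Section 3.1 to one text."""
    # (2) normalize encoding by round-tripping through UTF-8
    text = text.encode("utf-8", errors="ignore").decode("utf-8")
    # (1) strip accents: decompose, then drop combining marks
    text = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in text if not unicodedata.combining(c))
    # (3) remove user/entity tags such as @user
    text = re.sub(r"@\w+", " ", text)
    # (4) remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # (5) remove remaining special characters
    text = re.sub(r"[^0-9A-Za-z\s]", " ", text)
    # (7) lowercase
    text = text.lower()
    # (8) remove numbers
    text = re.sub(r"\d+", " ", text)
    # (9) drop stop words; (6) split/join collapses duplicate spacing
    return " ".join(t for t in text.split() if t not in STOP_WORDS)

print(preprocess("Olá @maria, QUE dia ótimo!!! 123"))  # → ola dia otimo
```
]]></preformat>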
        <p>These nine steps are responsible for eliminating features that can be problematic for the feature extraction algorithms and that can later harm the machine learning algorithms. After this process, two additional preprocessing cases were defined, one representation using lemmatization (with spaCy [10]) and one without; the focus here was on optimizing the results for the tweets set. Our interest in the tweets set is due to the wide variation of its phrases, along with the high number of slang terms used.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Feature extraction</title>
        <p>For the feature extraction strategies, we considered and evaluated four methods, of which only two were ultimately selected. The algorithms used were: CountVectorizer (token counts), TfidfVectorizer (TF-IDF), and HashingVectorizer (hashing trick), all from scikit-learn [9], as well as Word2Vec. The choice was due to familiarity and past experience with these extractors, given that they are widely used in the NLP literature, and we wanted to test their performance on irony detection tasks.</p>
        <p>The feature extraction step is responsible for converting the textual information into a format that is understandable by the machine learning algorithms. Each of the four algorithms produces a different type of output, according to its feature extraction strategy.</p>
        <p>The numbers of features tested for CountVectorizer, TfidfVectorizer, and HashingVectorizer were their default parameters (found in the scikit-learn documentation [9]), 10k, 20k, 30k, 40k, 50k, 100k, and 200k. The feature dimensions extracted from Word2Vec were 50, 100, 250, and 500.</p>
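        <p>As a minimal sketch of the three scikit-learn extractors (Word2Vec omitted), each fitted with a capped feature budget from the grid above; the toy corpus is illustrative only:</p>
        <preformat><![CDATA[
```python
from sklearn.feature_extraction.text import (
    CountVectorizer, HashingVectorizer, TfidfVectorizer)

corpus = ["um exemplo de texto", "outro texto de exemplo", "texto ironico"]

# Token counts, TF-IDF weights, and the hashing trick; the 10k budget is
# one point of the parameter grid described above.
for vec in (CountVectorizer(max_features=10_000),
            TfidfVectorizer(max_features=10_000),
            HashingVectorizer(n_features=10_000)):
    X = vec.fit_transform(corpus)
    print(type(vec).__name__, X.shape)
```
]]></preformat>
        <p>Note that the count- and TF-IDF-based extractors only grow the vocabulary up to the cap, while the HashingVectorizer always produces exactly n_features columns.</p>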
      </sec>
      <sec id="sec-3-3">
        <title>Machine learning</title>
        <p>With the train dataset, a total of ten algorithms were tested: (1) Random Forest (RF); (2) Multilayer Perceptron (MLP); (3) SGD; (4) Linear SVC; (5) SVC; (6) Decision Tree; (7) Perceptron; (8) k-Nearest Neighbors (KNN); (9) Multinomial Naive Bayes; and (10) Gaussian Naive Bayes.</p>
        <p>For each algorithm, we ran tests considering the preprocessing and feature extraction stages, using each configuration presented. After all these steps, we checked whether lemmatization could help detect irony by comparing classifiers trained with and without it.</p>
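        <p>The selection procedure (the average of a 10-fold cross-validation for each preprocessing/extractor/classifier combination) can be sketched as follows; the pipeline and toy data are illustrative assumptions, not the actual IDPT data:</p>
        <preformat><![CDATA[
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Toy stand-in data; the real runs use the IDPT train sets.
texts = ["bom dia", "que otimo", "pessimo servico", "nao gostei"] * 10
labels = [1, 1, 0, 0] * 10

# Keep the combination with the best mean 10-fold score.
for clf in (MLPClassifier(max_iter=500, random_state=0),
            RandomForestClassifier(random_state=0)):
    pipe = make_pipeline(TfidfVectorizer(), clf)
    scores = cross_val_score(pipe, texts, labels, cv=10)
    print(type(clf).__name__, scores.mean())
```
]]></preformat>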
        <p>At the end of all runs, the algorithms that presented the best results for the news dataset and the tweets dataset were the Multilayer Perceptron and Random Forest, respectively. Given the size of the full set of tests we ran, in this section we focus only on presenting the best scenarios of our approach. The complete set of results is available on GitHub along with the source code.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Evaluation</title>
      <p>In this section, we discuss the evaluation of our methodology using only the train dataset. After this process, we present the strategy used in the IDPT 2021 task. Our objective here is to choose the algorithm with the best average value for both sets (news and tweets), also checking which settings produce the best result on the respective data.</p>
      <p>Figures 1, 2, 3, and 4 present the evaluation on the news dataset. These tests split the training dataset 50/50 for train/test and use the Multilayer Perceptron. The approach using Word2Vec did not present an acceptable result (Figs. 2 and 4) in comparison with the other three approaches, which achieved results above 96% (Figs. 1 and 3). We believe this happens because the Word2Vec model was trained on a small dataset, and it requires a lot of data to achieve better results. The TfidfVectorizer was the feature extraction method that consistently presented the best results, so it was the method chosen for the news set. The difference made by lemmatization was quite small, but we decided not to use it, since the configuration without it presented the best average result.</p>
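      <p>One such news-set experiment, a 50/50 split with TF-IDF features and the Multilayer Perceptron, could be sketched as follows (toy stand-in data, illustrative only):</p>
      <preformat><![CDATA[
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Toy stand-in for the news train set.
texts = ["manchete ironica aqui", "relato factual simples"] * 20
labels = [1, 0] * 20

# 50/50 train/test split, as in the experiments above.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.5, stratify=labels, random_state=0)

vec = TfidfVectorizer()
clf = MLPClassifier(max_iter=500, random_state=0)
clf.fit(vec.fit_transform(X_tr), y_tr)
acc = clf.score(vec.transform(X_te), y_te)
print(acc)
```
]]></preformat>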
      <p>Figures 5, 6, 7, and 8 present the evaluation on the tweets dataset. These tests also split the training dataset 50/50 for train/test, and use the Random Forest classifier. Word2Vec presented the worst results overall in this scenario, even with the use of lemmatization, which had helped the news set in the same scenario. The results presented in Figs. 5 and 7 highlight the best performance without and with lemmatization using the HashingVectorizer. Taking that into account, we again decided not to apply lemmatization, which indicates that the inflected forms of a word might help to detect irony.</p>
      <p>Table 2 shows the size difference between the two classes in the news and tweets datasets. Because of the high class imbalance, an undersampling strategy was considered, with the goal of balancing the dataset and avoiding problems with the algorithms.</p>
      <p>[Fig.: News results for Word2Vec (lemmatization). Fig. 7: Tweets results for different text extraction methods (lemmatization). Fig. 8: Tweets results for Word2Vec (lemmatization).]</p>
      <p>The IDPT 2021 task allows teams to submit three runs. As we already knew the best algorithms from the results of Section 4, we focused on varying the sampling strategy:
1. No undersampling was used; the data went through only the preprocessing and feature extraction stages;
2. Random undersampling was used to approximate the sizes of the two classes; and
3. Random undersampling was used and a threshold of 0.9 was defined for the minority class. This strategy had the objective of diversifying the options for the tweets set.</p>
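      <p>Runs 2 and 3 above can be sketched as follows; the helper names and the exact threshold semantics (predict the minority class only when the classifier is at least 0.9 confident) are our illustrative assumptions:</p>
      <preformat><![CDATA[
```python
import random

def random_undersample(texts, labels, seed=0):
    """Run 2: randomly drop majority-class samples until both
    classes match the minority-class size."""
    rng = random.Random(seed)
    by_class = {}
    for t, y in zip(texts, labels):
        by_class.setdefault(y, []).append(t)
    n_min = min(len(v) for v in by_class.values())
    pairs = [(t, y) for y, items in by_class.items()
             for t in rng.sample(items, n_min)]
    rng.shuffle(pairs)
    return [t for t, _ in pairs], [y for _, y in pairs]

def predict_with_threshold(minority_proba, threshold=0.9):
    """Run 3: assign the minority class (1) only when the classifier's
    predicted probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in minority_proba]

X, y = random_undersample(["a"] * 8 + ["b"] * 2, [0] * 8 + [1] * 2)
print(sorted(y))                              # → [0, 0, 1, 1]
print(predict_with_threshold([0.95, 0.60]))   # → [1, 0]
```
]]></preformat>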
      <p>Random undersampling and the use of a threshold had an impact in our tests using just the train dataset, and we expected that the test dataset would be similar to the training dataset (as confirmed by the final results).</p>
      <sec id="sec-4-1">
        <title>Related work</title>
        <p>Four articles were used to guide our methodology. [5] presents a task performed at SemEval 2017 that had the goal of detecting sarcasm in sentences. The sentiment classification was made by a two-level classification system: the first phase used three strategies for preprocessing the data, while the second phase focused on identifying key factors of the sentences, such as affect, cognition, and sociolinguistics.</p>
        <p>[2] describes HaSpeeDe 2018, which consists of the detection of hate speech in Italian social media. Three tasks were offered to nine teams, which aimed to find the best strategy for identifying hateful speech. The document lists the general approaches used by each team and their results.</p>
        <p>Focusing on irony detection, [3] presents two tasks: identifying irony in sentences and identifying the types of irony. The competition received a total of seventeen submissions, which were evaluated by their results, approaches, algorithms, and features.</p>
        <p>Irony detection in Spanish is the focus of [8], which presents IroSvA, the first task for identifying irony in short messages in Spanish variants. Three subtasks were defined for irony detection: one focusing on tweets from Spain, another on Mexican tweets, and the last on Cuban news. A detailed set of the strategies used by the competitors is presented, along with metrics to help compare the results.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>In this paper, we described the participation of TeamUFPR in the IDPT 2021 Task on Irony Detection in Portuguese. The task consisted of creating a methodology for irony detection in Portuguese using two datasets, one containing news texts obtained from different sources and the other tweets collected from Twitter. Our proposal focused mainly on using a single approach for both datasets; three runs were submitted using different strategies to identify the impact of the models depending on the type of data.</p>
      <p>Overall, we identified that TF-IDF was the best feature extraction option for the news dataset, and the HashingVectorizer was the best option for the tweets dataset. The classifiers that presented the best results for the news and tweets datasets were the Multilayer Perceptron and Random Forest, respectively. Also, using random undersampling or ensemble classifiers (with and without a threshold on the classifier output) did not help us improve our classification results, which indicates that future work should focus on different strategies to address the imbalanced-data problem in irony detection, mainly in the tweets dataset. Finally, we also concluded that lemmatization is a step that should not be performed when detecting irony, indicating that the inflected forms of a word might help to detect it. For future work, we believe that adopting new feature extraction methods (such as BERT) and classifiers that consider imbalanced data, without using word lemmatization, is key to improving the classification performance of irony detection.</p>
      <p>Acknowledgments. This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES). The authors also thank the UFPR Computer Science department.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Boiy</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moens</surname>
            ,
            <given-names>M.F.:</given-names>
          </string-name>
          <article-title>A machine learning approach to sentiment analysis in multilingual web texts</article-title>
          .
          <source>Information retrieval 12</source>
          (
          <issue>5</issue>
          ),
          <fpage>526</fpage>
          -
          <lpage>558</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bosco</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Felice</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Poletto</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sanguinetti</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maurizio</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Overview of the evalita 2018 hate speech detection task</article-title>
          .
          <source>In: EVALITA 2018-Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian</source>
          . vol.
          <volume>2263</volume>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          . CEUR (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Cignarella</surname>
            ,
            <given-names>A.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frenda</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Basile</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bosco</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Patti</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , et al.:
          <article-title>Overview of the evalita 2018 task on irony detection in italian tweets (ironita)</article-title>
          .
          <source>In: Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA</source>
          <year>2018</year>
          ). vol.
          <volume>2263</volume>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          . CEUR-WS (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Correa</surname>
          </string-name>
          , U.B.,
          <string-name>
            <surname>dos Santos</surname>
            ,
            <given-names>L.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Coelho</surname>
          </string-name>
          , L.,
          <string-name>
            <surname>de Freitas</surname>
            ,
            <given-names>L.A.</given-names>
          </string-name>
          :
          <article-title>Overview of the IDPT Task on Irony Detection in Portuguese at IberLEF 2021</article-title>
          .
          <source>Procesamiento del Lenguaje Natural</source>
          , vol.
          <volume>67</volume>
          , (
          <year>2021</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Gupta</surname>
            ,
            <given-names>R.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Crystalnest at semeval-2017 task 4: Using sarcasm detection for enhancing sentiment classification and quantification</article-title>
          .
          <source>In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)</source>
          . pp.
          <fpage>626</fpage>
          -
          <lpage>633</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Liddy</surname>
          </string-name>
          , E.D.:
          <article-title>Natural language processing</article-title>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. nltk:
          <source>Natural language toolkit v3.6.2</source>
          (May
          <year>2021</year>
          ), http://www.nltk.org/
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Ortega-Bueno</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hernández Farías</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montes-y Gómez</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Medina Pagola</surname>
            ,
            <given-names>J.E.</given-names>
          </string-name>
          :
          <article-title>Overview of the task on irony detection in spanish variants</article-title>
          .
          <source>In: Proceedings of the Iberian languages evaluation forum (IberLEF</source>
          <year>2019</year>
          ),
          <article-title>co-located with 34th conference of the Spanish Society for natural language processing (SEPLN 2019)</article-title>
          .
          <article-title>CEUR-WS. org</article-title>
          . vol.
          <volume>2421</volume>
          , pp.
          <fpage>229</fpage>
          -
          <lpage>256</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Pedregosa</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Varoquaux</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gramfort</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Michel</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thirion</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grisel</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blondel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prettenhofer</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weiss</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dubourg</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vanderplas</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Passos</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cournapeau</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brucher</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perrot</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duchesnay</surname>
          </string-name>
          , E.:
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          ,
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10. spaCy: spaCy
          <source>v3.0</source>
          (May
          <year>2021</year>
          ), https://spacy.io/
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>