        Tweetaneuse @ AMI EVALITA2018: Character-based Models
        for the Automatic Misogyny Identification Task (Short Paper)
                                          Davide Buscaldi
                                      LIPN, Université Paris 13
                                         Sorbonne Paris Cité
                                  99, Avenue Jean-Baptiste Clément
                                     93430 Villetaneuse (France)
                                       buscaldi@lipn.fr

Abstract

English. This paper presents the participation of the RCLN team, with the Tweetaneuse system, in the AMI task at Evalita 2018. Our participation focused on the use of language-independent, character-based methods.

Italiano. This paper presents the participation of the RCLN team, with the Tweetaneuse system, in the AMI challenge at Evalita 2018. Our participation focused on the use of multilingual, character-based methods.

1 Introduction

The language used on social media, and especially on Twitter, is particularly noisy. The reasons are various; among them, the abuse of abbreviations induced by the limits on message length, and the use of different ways to refer to the same event or concept, a practice reinforced by the availability of hashtags (for instance, "World Cup in Russia", #WorldCup2018 and #WC18 all refer to the same event).

Recently, character-level neural network models have been developed to take these problems into account in tasks such as sentiment analysis (Zhang et al., 2017) or other classification tasks (Yang et al., 2016). Another advantage of these methods, apart from their robustness to the noisy text found in tweets, is that they are completely language-independent and do not need lexical information to carry out the classification task.

The Automatic Misogyny Identification (AMI) task at Evalita 2018 (Fersini et al., 2018) presented an interesting and novel challenge. Misogyny is a type of hate speech that specifically targets women in different ways. The language used in such messages is characterised by the use of profanities, specific hashtags, threats and other intimidating language. This task is an ideal test bed for character-based models, and Anzovino et al. (2018) already reported that character n-grams play an important role in the misogyny identification task.

We participated in the French sentiment analysis challenge DEFT 2018 (Paroubek et al., 2018) earlier this year with language-independent, character-based models, based both on neural networks and on classic machine learning algorithms. For our participation in AMI@Evalita2018, our objective was to verify whether the same models could be applied to this task while keeping a comparable accuracy.

The rest of the paper is structured as follows: in Section 2 we describe the two methods that were developed for the challenge; in Section 3 we present and discuss the obtained results; finally, in Section 4 we draw some conclusions about our experience and participation in the AMI challenge.

2 Methods

2.1 Locally-weighted Bag-of-Ngrams

This method is based on a Random Forest (RF) classifier (Breiman, 2001) with character n-gram features, scored on the basis of their relative position in the tweet. One of the first parameters to choose was the size of the n-grams to work with. Based on our previous experience, we chose to use all character n-grams (excluding spaces) of size 3 to 6, with a minimum frequency of 5 in the training corpus.

The weight of each n-gram n in tweet t is calculated as:

    s(n, t) = \begin{cases} 0 & \text{if } n \text{ does not occur in } t \\ \sum_{i=1}^{occ(n,t)} \left( 1 + \frac{pos(n_i)}{len(t)} \right) & \text{otherwise} \end{cases}
where occ(n, t) is the number of occurrences of n-gram n in t, pos(n_i) indicates the position of the first character of the i-th occurrence of n-gram n, and len(t) is the length of the tweet in characters. The hypothesis behind this positional scoring scheme is that the presence of some words (or symbols) at the end or at the beginning of a tweet may matter more than their mere presence anywhere in it. For instance, in some cases the conclusion is more important than the first part of the sentence, especially when people are evaluating different aspects of an item or have mixed feelings: "I liked the screen, but the battery duration is horrible."
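To make the feature construction concrete, the following is a minimal sketch of how the candidate n-grams and the positional weights s(n, t) could be computed. It assumes scikit-learn for the Random Forest; the helper names (candidate_ngrams, score, featurize) are ours, not those of the actual system, and computing positions on the space-stripped tweet is our reading of the "excluding spaces" choice.

    from collections import Counter

    def candidate_ngrams(tweets, n_min=3, n_max=6, min_freq=5):
        # Character n-grams of size 3 to 6, spaces excluded, kept only
        # if they occur at least 5 times in the training corpus.
        counts = Counter()
        for t in tweets:
            s = t.replace(" ", "")
            for n in range(n_min, n_max + 1):
                for i in range(len(s) - n + 1):
                    counts[s[i:i + n]] += 1
        return sorted(g for g, c in counts.items() if c >= min_freq)

    def score(ngram, tweet):
        # s(n, t): 0 if the n-gram is absent, otherwise the sum over its
        # occurrences of 1 + pos(n_i)/len(t), so occurrences near the end
        # of the tweet weigh slightly more than those near the beginning.
        s = tweet.replace(" ", "")
        total, pos = 0.0, s.find(ngram)
        while pos != -1:
            total += 1.0 + pos / len(s)
            pos = s.find(ngram, pos + 1)
        return total

    def featurize(tweets, vocab):
        return [[score(g, t) for g in vocab] for t in tweets]

    # Usage with the Random Forest classifier:
    # from sklearn.ensemble import RandomForestClassifier
    # vocab = candidate_ngrams(train_tweets)
    # clf = RandomForestClassifier().fit(featurize(train_tweets, vocab), train_labels)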
2.2 Char and Word-level bi-LSTM

This method was only tested before and after the participation, since we observed that it performed worse than the Random Forest method.

In this method we use recurrent neural networks to implement an LSTM classifier (Hochreiter and Schmidhuber, 1997), a model that is now widely used in Natural Language Processing. The classification is carried out in three steps.

First, the text is split on spaces. Every resulting text fragment is read as a character sequence, first from left to right and then from right to left, by two character-level recurrent networks. The vectors obtained after the training phase are summed to provide a character-based representation of the fragment (compositional representation). For a character sequence s = c_1 ... c_m, we compute at each position h_i = LSTM_o(h_{i-1}, e(c_i)) and h'_i = LSTM_o'(h'_{i+1}, e(c_i)), where e is the embedding function and LSTM indicates an LSTM recurrent node. The compositional representation of the fragment is then c(s) = h_m + h'_1.

Subsequently, the sequence of fragments (i.e., the sentence) is read, in turn, from left to right and vice versa by two other recurrent networks, this time at word level. These networks take as input the compositional representation obtained in the previous step, to which a vectorial representation of the fragment is concatenated. This vectorial representation is learned from the training corpus and is used only if the textual fragment has a frequency ≥ 10. For a sequence of textual fragments p = s_1 ... s_n, we calculate l_i = LSTM_m(l_{i-1}, c(s_i) + e(s_i)) and l'_i = LSTM_m'(l'_{i+1}, c(s_i) + e(s_i)), where c is the compositional representation introduced above and e the embedding function. The final states obtained after the bi-directional reading are added to represent the input sentence: r(p) = l_n + l'_1.

Finally, these vectors are used as input to a multi-layer perceptron which is responsible for the final classification: o(p) = σ(O × max(0, W × r(p) + b)), where σ is the softmax operator, W and O are matrices, and b is a vector. The output is interpreted as a probability distribution over the tweet categories.

The size of the character embeddings is 16, that of the text fragments 32; the input layer of the perceptron is of size 64 and the hidden layer of size 32. The output layer is of size 2 for subtask A and 6 for subtask B. We used the DyNet library (https://github.com/clab/dynet).
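For concreteness, here is a compact sketch of this architecture. The layer sizes follow the description above; everything else (the class and variable names, the use of PyTorch instead of the original DyNet implementation, and reserving index 0 for fragments below the frequency threshold) is our reconstruction, not the authors' code.

    import torch
    import torch.nn as nn

    class CharWordBiLSTM(nn.Module):
        def __init__(self, n_chars, n_frags, n_classes=2):
            super().__init__()
            self.char_emb = nn.Embedding(n_chars, 16)  # character embeddings, size 16
            self.frag_emb = nn.Embedding(n_frags, 32)  # fragment embeddings, size 32;
                                                       # index 0 for fragments with frequency < 10
            self.char_fw = nn.LSTM(16, 32, batch_first=True)  # left-to-right char reading
            self.char_bw = nn.LSTM(16, 32, batch_first=True)  # right-to-left char reading
            self.word_fw = nn.LSTM(64, 64, batch_first=True)  # left-to-right fragment reading
            self.word_bw = nn.LSTM(64, 64, batch_first=True)  # right-to-left fragment reading
            # o(p) = softmax(O x max(0, W x r(p) + b)); the softmax is folded into the loss
            self.mlp = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, n_classes))

        def compose(self, char_ids):
            # c(s) = h_m + h'_1: sum of the final states of the two char-level LSTMs
            e = self.char_emb(char_ids)                # (1, m, 16)
            _, (h_fw, _) = self.char_fw(e)
            _, (h_bw, _) = self.char_bw(torch.flip(e, dims=[1]))
            return (h_fw + h_bw).squeeze(0)            # (1, 32)

        def forward(self, fragments):
            # fragments: list of (char_ids, frag_id) pairs for one tweet,
            # with char_ids of shape (1, m) and frag_id of shape (1,)
            reps = [torch.cat([self.compose(c), self.frag_emb(f)], dim=-1)
                    for c, f in fragments]             # each (1, 64)
            seq = torch.stack(reps, dim=1)             # (1, n, 64)
            _, (l_fw, _) = self.word_fw(seq)
            _, (l_bw, _) = self.word_bw(torch.flip(seq, dims=[1]))
            r = (l_fw + l_bw).squeeze(0)               # r(p) = l_n + l'_1, size 64
            return self.mlp(r)                         # logits over 2 (subtask A) or 6 (subtask B) classes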
3 Results

We report in Table 1 the results on the development set for the two methods.

                                 Italian
    Subtask                      lw RF   bi-LSTM
    Misogyny identification      0.891     0.872
    Behaviour classification     0.692     0.770
                                 English
    Misogyny identification      0.821     0.757
    Behaviour classification     0.303     0.575

Table 1: Results on the dev set (macro F1). In both cases, the results were obtained on a random 90%-10% split of the dev dataset.
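The evaluation protocol behind these numbers can be sketched as follows; this assumes the featurize and candidate_ngrams helpers sketched in Section 2.1 and scikit-learn, and is an illustration rather than the authors' actual script.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import f1_score
    from sklearn.model_selection import train_test_split

    # X, y: positional n-gram features and labels for the dev data, e.g.
    # X = featurize(dev_tweets, vocab) with vocab = candidate_ngrams(dev_tweets)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=0)
    clf = RandomForestClassifier().fit(X_tr, y_tr)
    print(f1_score(y_te, clf.predict(X_te), average="macro"))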
From these results we could see that the locally-weighted n-grams model using the Random Forest was better at misogyny identification, while the bi-LSTM was more accurate on the misogynistic behaviour classification sub-task. However, a closer look at these results showed us that the bi-LSTM was assigning all instances but two to the majority class. Due to these problems, we decided to participate in the task with just the locally-weighted n-grams model.

The official results obtained by this model are detailed in Table 2. We do not report the derailing category, for which the system obtained 0 accuracy.

We also conducted a "post-mortem" test with the bi-LSTM model, for which we obtained the results reported in Table 3.

                        Overall
    Subtask                     Italian   English
    Misogyny identification       0.824     0.586
    Behaviour classification      0.473     0.165

           Per-class accuracy (sub-task B)
    discredit                     0.694     0.432
    dominance                     0.250     0.184
    sexual harassment             0.722     0.169
    stereotype                    0.699     0.040
    active                        0.816     0.541
    passive                       0.028     0.248

Table 2: Official results on the test set (macro F1) obtained by the locally-weighted bag-of-n-grams model.

    Subtask                     Italian   English
    Misogyny identification       0.821     0.626
    Behaviour classification      0.355     0.141

Table 3: Unofficial results on the test set (macro F1) obtained by the bi-LSTM character model.

As can be observed, the results confirmed those obtained on the dev set for the misogyny identification sub-task, and in any case the "deep" model performed worse overall than its more "classical" counterpart.

The results obtained by our system were in general underwhelming and below expectations, except for the discredit category, for which our system ranked 1st in Italian and 3rd in English. An analysis of the most relevant features according to information gain (Lee and Lee, 2006) showed that the 5 most informative n-grams are tta, utt, che, tan, utta for Italian and you, the, tch, itc, itch for English. They are clearly parts of swear words that can appear in different forms, or conjunctions like che that may signal linguistic phenomena such as emphasis (for instance, "che brutta!" - "what an ugly girl!"). On the other hand, another category for which some keywords seemed particularly important is dominance, but in that case the information gain obtained by sequences like stfu in English or zitt in Italian (both related to the meaning "shut up") was marginal. We suspect that the main problem may be related to the unbalanced training corpus, in which the discredit category is dominant, but without knowing whether the other participants adopted some balancing technique it is difficult to analyze our results.
4 Conclusions

Our participation in the AMI task at EVALITA 2018 was not as successful as we had hoped; in particular, our systems were not able to repeat the excellent results they obtained at the DEFT 2018 challenge, although that was a different task, the detection of messages related to public transportation in tweets. In particular, the bi-LSTM model underperformed and was outclassed by the simpler Random Forest model that uses locally-weighted n-grams as features. At the time of writing, we are not able to assess whether this was due to a misconfiguration of the neural network, to the nature of the data, or to the dataset. We hope that this participation and the comparison with the other systems will allow us to better understand where and why we failed, in view of future participations. The most positive point of our contribution is that the systems we proposed are completely language-independent: we did not make any adjustment to adapt systems that had participated in a French task to the Italian and English languages targeted in the AMI task.

Acknowledgments

We would like to thank the program "Investissements d'Avenir" overseen by the French National Research Agency, ANR-10-LABX-0083 (Labex EFL), for the support given to this work.

References

Maria Anzovino, Elisabetta Fersini, and Paolo Rosso. 2018. Automatic identification and classification of misogynistic language on Twitter. In Natural Language Processing and Information Systems - 23rd International Conference on Applications of Natural Language to Information Systems, NLDB 2018, Paris, France, June 13-15, 2018, Proceedings, pages 57–64.

Leo Breiman. 2001. Random forests. Machine Learning, 45(1):5–32.

Elisabetta Fersini, Debora Nozza, and Paolo Rosso. 2018. Overview of the EVALITA 2018 task on Automatic Misogyny Identification (AMI). In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of the 6th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA'18), Turin, Italy. CEUR.org.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Changki Lee and Gary Geunbae Lee. 2006. Information gain and divergence-based feature selection for machine learning-based text categorization. Information Processing & Management, 42(1):155–165.

Patrick Paroubek, Cyril Grouin, Patrice Bellot, Vincent Claveau, Iris Eshkol-Taravella, Amel Fraisse, Agata Jackiewicz, Jihen Karoui, Laura Monceaux, and Juan-Manuel Torres-Moreno. 2018. DEFT2018: recherche d'information et analyse de sentiments dans des tweets concernant les transports en Île-de-France. In 14ème atelier Défi Fouille de Texte 2018.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489.

Shiwei Zhang, Xiuzhen Zhang, and Jeffrey Chan. 2017. A word-character convolutional neural network for language-agnostic Twitter sentiment analysis. In Proceedings of the 22nd Australasian Document Computing Symposium, ADCS 2017, pages 12:1–12:4, New York, NY, USA. ACM.