<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <abstract>
<p>This paper presents the participation of the RCLN team with the Tweetaneuse system in the AMI task at Evalita 2018. Our participation focused on the use of language-independent, character-based methods.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
<p>The language used on social media, and especially
on Twitter, is particularly noisy. The reasons are
various: among them, the heavy use of abbreviations
induced by the limits on message length, and the
many different ways of referring to the same event
or concept, amplified by the availability of hashtags
(for instance, World Cup in Russia, #WorldCup2018
and #WC18 all refer to the same event).</p>
<p>
        Recently, character-level neural network
models have been developed to address these
problems for tasks such as sentiment analysis
        <xref ref-type="bibr" rid="ref8">(Zhang et al., 2017)</xref>
        or other classification tasks
        <xref ref-type="bibr" rid="ref7">(Yang et al., 2016)</xref>
        . Another advantage of these methods, apart
from their robustness to the noisy text found in
tweets, is that they are completely
language-independent and do not need lexical
information to carry out the classification task.
      </p>
<p>
        The Automatic Misogyny Identification (AMI) task at
Evalita 2018
        <xref ref-type="bibr" rid="ref1 ref3">(Fersini et al., 2018)</xref>
        presented an interesting and novel challenge.
Misogyny is a type of hate speech that specifically
targets women in different ways. The language used
in such messages is characterised by profanities,
specific hashtags, threats and other intimidating
language. This task is an ideal test bed
for character-based models, and
        <xref ref-type="bibr" rid="ref1">(Anzovino et al.,
2018)</xref>
        already reported that character n-grams play
an important role in the misogyny identification
task.
      </p>
<p>
        We participated in the French sentiment
analysis challenge DEFT 2018
        <xref ref-type="bibr" rid="ref6">(Paroubek et al.,
2018)</xref>
        earlier this year with language-independent,
character-based models, based both on neural
networks and on classic machine learning algorithms.
For our participation in AMI@Evalita2018, our
objective was to verify whether the same models
could be applied to this task while keeping
comparable accuracy.
      </p>
<p>The rest of the paper is structured as follows:
in Section 2 we describe the two methods that
were developed for the challenge; in Section 3 we
present and discuss the results obtained; and
finally, in Section 4, we draw some conclusions about
our experience and participation in the AMI
challenge.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Methods</title>
<p>
        Our first method, locally weighted n-grams, is
based on a Random Forest (RF) classifier
        <xref ref-type="bibr" rid="ref2">(Breiman, 2001)</xref>
        with character n-gram features, scored on the
basis of their relative position in the tweet. One
of the first parameters to choose was the size of the
n-grams to work with. Based on our previous
experience, we chose to use all the character n-grams
(excluding spaces) of size 3 to 6, with a minimum
frequency of 5 in the training corpus.
      </p>
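<p>As an illustration, the feature extraction described above can be sketched as follows. This is a minimal sketch with illustrative names (not the actual system code): all character n-grams of size 3 to 6 are collected over the space-free text, and only those occurring at least 5 times in the training corpus are kept.</p>

```python
from collections import Counter

def char_ngrams(text, n_min=3, n_max=6):
    """Yield all character n-grams of the space-free text."""
    chars = text.replace(" ", "")  # spaces are excluded
    for n in range(n_min, n_max + 1):
        for i in range(len(chars) - n + 1):
            yield chars[i:i + n]

def build_vocabulary(tweets, min_freq=5):
    """Keep only n-grams occurring at least min_freq times in the corpus."""
    counts = Counter(g for t in tweets for g in char_ngrams(t))
    return {g for g, c in counts.items() if c >= min_freq}
```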
<p>The weight of each n-gram n in tweet t is
calculated as: s(n, t) = 0 if n is absent from t, and
s(n, t) = Σ_{i=1}^{occ(n,t)} (1 + pos(n_i)/len(t))
otherwise, where occ(n, t) is the number of
occurrences of n-gram n in t, pos(n_i) indicates the
position of the first character of the i-th occurrence
of the n-gram n, and len(t) is the length of the tweet
in characters. The hypothesis behind this positional
scoring scheme is that the presence of some words
(or symbols) at the end or the beginning of a tweet
may be more important than the mere presence of the
symbol. For instance, in some cases the conclusion
is more important than the first part of the sentence,
especially when people are evaluating different
aspects of an item or have mixed feelings: I liked
the screen, but the battery duration is horrible.</p>
      <sec id="sec-2-1">
        <title>Char and Word-level bi-LSTM</title>
<p>This method was tested only before and after the
official participation, since we observed that it
performed worse than the Random Forest method.</p>
<p>
          In this method we use a recurrent neural
network to implement an LSTM classifier
          <xref ref-type="bibr" rid="ref4">(Hochreiter
and Schmidhuber, 1997)</xref>
          ; such networks are now widely used in
Natural Language Processing. The classification is
carried out in three steps:
        </p>
<p>First, the text is split on spaces. Every
resulting text fragment is read as a character sequence,
first from left to right, then from right to left, by
two recurrent NNs at the character level. The
vectors obtained after the training phase are summed
to provide a character-based representation of
the fragment (compositional representation). For
a character sequence s = c_1 … c_m, we compute
at each position h_i = LSTM_o(h_{i-1}, e(c_i)) and
h'_i = LSTM_o'(h'_{i+1}, e(c_i)), where e is the
embedding function and LSTM denotes an LSTM
recurrent cell. The fragment compositional
representation is then c(s) = h_m + h'_1.</p>
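<p>The compositional reading just described can be sketched as follows. This is a toy, pure-Python stand-in with random, untrained weights (the actual system uses DyNet), meant only to show the data flow that yields c(s) = h_m + h'_1:</p>

```python
import math, random

random.seed(0)
D = 8  # embedding/hidden size (illustrative; the paper uses 16 for characters)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class LSTMCell:
    """Minimal LSTM cell: one weight matrix per gate over [h ; x]."""
    def __init__(self, dim):
        self.dim = dim
        self.W = {g: [[random.uniform(-0.1, 0.1) for _ in range(2 * dim)]
                      for _ in range(dim)] for g in "ifog"}

    def step(self, h, c, x):
        z = h + x  # concatenation [h ; x]
        a = {g: [sum(w * v for w, v in zip(row, z)) for row in self.W[g]]
             for g in "ifog"}
        i = [sigmoid(v) for v in a["i"]]   # input gate
        f = [sigmoid(v) for v in a["f"]]   # forget gate
        o = [sigmoid(v) for v in a["o"]]   # output gate
        g = [math.tanh(v) for v in a["g"]] # candidate cell state
        c_new = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
        h_new = [ov * math.tanh(cv) for ov, cv in zip(o, c_new)]
        return h_new, c_new

def embed(ch):
    """Deterministic toy character embedding e(c)."""
    rnd = random.Random(ord(ch))
    return [rnd.uniform(-1, 1) for _ in range(D)]

def compositional(fragment, fwd, bwd):
    """c(s) = h_m + h'_1: sum of the two final bi-directional states."""
    h, c = [0.0] * D, [0.0] * D
    for ch in fragment:               # left-to-right reading
        h, c = fwd.step(h, c, embed(ch))
    hb, cb = [0.0] * D, [0.0] * D
    for ch in reversed(fragment):     # right-to-left reading
        hb, cb = bwd.step(hb, cb, embed(ch))
    return [a + b for a, b in zip(h, hb)]

fwd, bwd = LSTMCell(D), LSTMCell(D)
vec = compositional("brutta", fwd, bwd)  # a D-dimensional fragment vector
```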
<p>Subsequently, the sequence of fragments (i.e.,
the sentence) is read again from left to right and
vice versa by two other recurrent NNs at the word
level. These RNNs take as input the compositional
representation obtained in the previous step for each
fragment, combined with a vectorial representation
learned from the training corpus, which is used
only if the textual fragment has a frequency of at
least 10. For a sequence of textual fragments
p = s_1 … s_n, we calculate
l_i = LSTM_m(l_{i-1}, c(s_i) + e(s_i)) and
l'_i = LSTM_m'(l'_{i+1}, c(s_i) + e(s_i)), where c is
the compositional representation introduced above
and e is the embedding function. The final states
obtained after the bi-directional reading are added
to represent the input sentence: r(p) = l_n + l'_1.</p>
<p>Finally, these vectors are used as input to a
multi-layer perceptron which is responsible for the
final classification:
o(p) = softmax(O max(0, W r(p) + b)),
where W and O are matrices and b is a vector. The
output is interpreted as a probability distribution
over the tweet categories.</p>
<p>The size of the character embeddings is 16, that
of the text fragments 32; the input layer of the
perceptron is of size 64 and the hidden layer of size
32. The output layer is of size 2 for subtask A and 6
for subtask B. We used the DyNet library
(https://github.com/clab/dynet).</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3 Results</title>
      <p>We report in Table 1 the results on the
development set for the two methods.</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Results on the development set for the locally weighted n-grams Random Forest (lw RF) and the bi-LSTM.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Language</th><th>Subtask</th><th>lw RF</th><th>bi-LSTM</th></tr>
          </thead>
          <tbody>
            <tr><td>Italian</td><td>Misogyny identification</td><td>0.891</td><td>0.872</td></tr>
            <tr><td>Italian</td><td>Behaviour classification</td><td>0.692</td><td>0.770</td></tr>
            <tr><td>English</td><td>Misogyny identification</td><td>0.821</td><td>0.757</td></tr>
            <tr><td>English</td><td>Behaviour classification</td><td>0.303</td><td>0.575</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <sec id="sec-3-1">
        <title>Development and official results</title>
<p>From these results we could see that the
locally weighted n-grams model using a Random
Forest was better for the identification task, while
the bi-LSTM was more accurate for the misogynistic
behaviour classification sub-task. However, a closer
look at these results showed us that the bi-LSTM
was assigning all instances but two to the majority
class. Because of this problem, we decided to
participate in the task with the locally weighted
n-grams model only.</p>
<p>The official results obtained by this model are
detailed in Table 2. We do not consider the
derailing category, for which the system obtained
zero accuracy.</p>
<p>We also conducted a “post-mortem” test with
the bi-LSTM model, for which we obtained the
following results.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Detailed results</title>
        <table-wrap id="tab3">
          <caption>
            <p>Results per subtask for Italian and English.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Subtask</th><th>Italian</th><th>English</th></tr>
            </thead>
            <tbody>
              <tr><td>Misogyny identification</td><td>0.824</td><td>0.586</td></tr>
              <tr><td>Behaviour classification</td><td>0.473</td><td>0.165</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>Per-class accuracy (sub-task B): discredit 0.694,
dominance 0.250, sexual harassment 0.722, stereotype
0.699, active 0.816, passive 0.028.</p>
<p>As can be observed, the results confirmed
those obtained on the development set for the
misogyny identification sub-task, and in any case we
observed that the “deep” model performed overall
worse than its more “classical” counterpart.</p>
<p>
          The results obtained by our system were in
general underwhelming and below expectations,
except for the discredit category, for which our
system was ranked 1st and 3rd in Italian and
English respectively. An analysis of the most relevant
features according to information gain
          <xref ref-type="bibr" rid="ref5">(Lee and
Lee, 2006)</xref>
          showed that the 5 most informative
n-grams are tta, utt, che, tan, utta for Italian and you,
the, tch, itc, itch for English. They are clearly parts
of some swear words that can appear in different
forms, or conjunctions like che that may indicate
linguistic phenomena such as emphasis
(for instance, as in “che brutta!” - “what an ugly
girl!”). On the other hand, another category for
which some keywords seemed particularly
important is dominance, but in that case the
information gain obtained by sequences like stfu in
English or zitt in Italian (both related to the “shut
up” meaning) was marginal. We suspect that the main
problem may be related to the unbalanced
training corpus, in which the discredit category is
dominant, but without knowing whether the other
participants adopted some balancing technique it is
difficult to analyse our results.
        </p>
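<p>The feature ranking mentioned above can be sketched with a generic information-gain computation for binary (present/absent) n-gram features. The corpus and labels below are made-up toy data for illustration, not the AMI corpus:</p>

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_present, labels):
    """IG of a binary feature (n-gram present/absent) w.r.t. the labels."""
    gain = entropy(labels)
    for value in (True, False):
        subset = [y for f, y in zip(feature_present, labels) if f == value]
        if subset:
            gain -= len(subset) / len(labels) * entropy(subset)
    return gain

# Toy example: rank two candidate n-grams by information gain.
tweets = ["you bitch", "nice watch", "witch hunt", "hello there"]
labels = [1, 0, 1, 0]  # made-up binary labels for illustration
ig_itch = information_gain(["itch" in t for t in tweets], labels)
ig_tch = information_gain(["tch" in t for t in tweets], labels)
# "itch" separates the toy labels perfectly, so its gain is higher
```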
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Conclusions</title>
<p>Our participation in the AMI task at EVALITA
2018 was not as successful as we had hoped:
in particular, our systems were not able to
repeat the excellent results that they obtained at
the DEFT 2018 challenge, although on a different
task, the detection of messages related to
public transportation in tweets. The bi-LSTM
model underperformed and was outclassed
by the simpler Random Forest model that uses
locally weighted n-grams as features. At the time of
writing, we are not able to assess whether this was
due to a misconfiguration of the neural network or
to the nature of the data. We hope
that this participation and the comparison with the
other systems will allow us to better understand
where we failed, and why, in view of future
participations. The most positive point of our
contribution is that the systems we proposed are
completely language-independent: we did not
make any adjustment to adapt the systems that
participated in a French task to the Italian and
English data targeted in the AMI task.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>We would like to thank the program
“Investissements d’Avenir” overseen by the French National
Research Agency, ANR-10-LABX-0083 (Labex
EFL) for the support given to this work.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Maria</given-names>
            <surname>Anzovino</surname>
          </string-name>
          , Elisabetta Fersini, and
          <string-name>
            <given-names>Paolo</given-names>
            <surname>Rosso</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Automatic identification and classification of misogynistic language on twitter</article-title>
          .
          <source>In Natural Language Processing and Information Systems - 23rd International Conference on Applications of Natural Language to Information Systems, NLDB 2018</source>
          , Paris, France, June 13-15,
          <year>2018</year>
          , Proceedings, pages
          <fpage>57</fpage>
          -
          <lpage>64</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Leo</given-names>
            <surname>Breiman</surname>
          </string-name>
          .
          <year>2001</year>
          .
          <article-title>Random forests</article-title>
          .
          <source>Machine learning</source>
          ,
          <volume>45</volume>
          (
          <issue>1</issue>
          ):
          <fpage>5</fpage>
          -
          <lpage>32</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Elisabetta</given-names>
            <surname>Fersini</surname>
          </string-name>
          , Debora Nozza, and
          <string-name>
            <given-names>Paolo</given-names>
            <surname>Rosso</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Overview of the EVALITA 2018 task on Automatic Misogyny Identification (AMI)</article-title>
          .
          In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors,
          <source>Proceedings of the 6th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA'18)</source>
          , Turin, Italy. CEUR.org.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Sepp</given-names>
            <surname>Hochreiter</surname>
          </string-name>
          and Jürgen Schmidhuber.
          <year>1997</year>
          .
          <article-title>Long short-term memory</article-title>
          .
          <source>Neural computation</source>
          ,
          <volume>9</volume>
          (
          <issue>8</issue>
          ):
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Changki</given-names>
            <surname>Lee</surname>
          </string-name>
          and
          <string-name>
            <given-names>Gary Geunbae</given-names>
            <surname>Lee</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>Information gain and divergence-based feature selection for machine learning-based text categorization</article-title>
          .
          <source>Information processing &amp; management</source>
          ,
          <volume>42</volume>
          (
          <issue>1</issue>
          ):
          <fpage>155</fpage>
          -
          <lpage>165</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Patrick</given-names>
            <surname>Paroubek</surname>
          </string-name>
          , Cyril Grouin, Patrice Bellot, Vincent Claveau, Iris Eshkol-Taravella, Amel Fraisse, Agata Jackiewicz, Jihen Karoui, Laura Monceaux, and
          <string-name>
            <surname>Torres-Moreno</surname>
          </string-name>
          Juan-Manuel.
          <year>2018</year>
          .
          <article-title>Deft2018 : recherche d'information et analyse de sentiments dans des tweets concernant les transports en Île de France</article-title>
          . In 14ème atelier Défi Fouille de Texte
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Zichao</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Diyi</given-names>
            <surname>Yang</surname>
          </string-name>
          , Chris Dyer, Xiaodong He, Alex Smola, and
          <string-name>
            <given-names>Eduard</given-names>
            <surname>Hovy</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Hierarchical attention networks for document classification</article-title>
          .
          <source>In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , pages
          <fpage>1480</fpage>
          -
          <lpage>1489</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Shiwei</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Xiuzhen Zhang, and
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Chan</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>A word-character convolutional neural network for language-agnostic twitter sentiment analysis</article-title>
          .
          <source>In Proceedings of the 22Nd Australasian Document Computing Symposium, ADCS 2017</source>
          , pages
          <fpage>12</fpage>
          :
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          :
          <fpage>4</fpage>
          , New York, NY, USA. ACM.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>