=Paper=
{{Paper
|id=Vol-2263/paper036
|storemode=property
|title=Tweetaneuse @ AMI EVALITA2018: Character-based Models for the Automatic Misogyny Identification Task (Short Paper)
|pdfUrl=https://ceur-ws.org/Vol-2263/paper036.pdf
|volume=Vol-2263
|authors=Davide Buscaldi
|dblpUrl=https://dblp.org/rec/conf/evalita/Buscaldi18
}}
==Tweetaneuse @ AMI EVALITA2018: Character-based Models for the Automatic Misogyny Identification Task (Short Paper)==
Davide Buscaldi
LIPN, Université Paris 13
Sorbonne Paris Cité
99, Avenue Jean-Baptiste Clément
93430 Villetaneuse (France)
buscaldi@lipn.fr
Abstract

English. This paper presents the participation of the RCLN team, with the Tweetaneuse system, in the AMI task at Evalita 2018. Our participation focused on the use of language-independent, character-based methods.

Italiano. Quest'articolo presenta la partecipazione del team RCLN con il sistema Tweetaneuse al challenge AMI di Evalita 2018. La nostra partecipazione era orientata sull'utilizzo di metodi multilingue e basati sui caratteri.

1 Introduction

The language used on social media, and especially on Twitter, is particularly noisy. The reasons are various; among them are the abuse of abbreviations induced by the limit on message length, and the use of different ways to refer to the same event or concept, strengthened by the availability of hashtags (for instance, "World Cup in Russia", #WorldCup2018 and #WC18 all refer to the same event).

Recently, some character-level neural network models have been developed to take these problems into account for tasks such as sentiment analysis (Zhang et al., 2017) or other classification tasks (Yang et al., 2016). Another advantage of these methods, apart from their robustness to the noisy text found in tweets, is that they are completely language-independent and do not need lexical information to carry out the classification task.

The Automatic Misogyny Identification task at Evalita 2018 (Fersini et al., 2018) presented an interesting and novel challenge. Misogyny is a type of hate speech that specifically targets women in different ways. The language used in such messages is characterised by the use of profanities, specific hashtags, threats and other intimidating language. The task is an ideal test bed for character-based models, and Anzovino et al. (2018) already reported that character n-grams play an important role in misogyny identification.

We participated in the French sentiment analysis challenge DEFT 2018 (Paroubek et al., 2018) earlier this year with language-independent, character-based models, based both on neural networks and on classic machine learning algorithms. For our participation in AMI@Evalita2018, our objective was to verify whether the same models could be applied to this task while keeping comparable accuracy.

The rest of the paper is structured as follows: in Section 2 we describe the two methods developed for the challenge; in Section 3 we present and discuss the results obtained; finally, in Section 4 we draw some conclusions about our experience and participation in the AMI challenge.

2 Methods

2.1 Locally-weighted Bag-of-Ngrams

This method is based on a Random Forest (RF) classifier (Breiman, 2001) with character n-gram features, scored on the basis of their relative position in the tweet. One of the first parameters to choose was the size of the n-grams to work with. Based on our previous experience, we used all the character n-grams (excluding spaces) of sizes 3 to 6 with a minimum frequency of 5 in the training corpus.

The weight of each n-gram n in tweet t is calculated as:

$$s(n, t) = \begin{cases} 0 & \text{if } n \text{ is absent from } t \\ \sum_{i=1}^{occ(n,t)} \left(1 + \frac{pos(n_i)}{len(t)}\right) & \text{otherwise} \end{cases}$$
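As an illustration, the positional weighting above (where occ(n, t) is the number of occurrences of n-gram n in tweet t and pos(n_i) the position of its i-th occurrence) can be sketched in Python. The function and variable names below are ours, not part of the original system; corpus-level frequency filtering is omitted, and we take len(t) as the number of non-space characters:

```python
from collections import defaultdict

def ngram_scores(tweet, nmin=3, nmax=6):
    """Locally-weighted character n-gram scores s(n, t).

    Each occurrence of an n-gram contributes 1 + pos/len(t), so
    occurrences towards the end of the tweet weigh slightly more.
    """
    chars = tweet.replace(" ", "")   # n-grams exclude spaces
    length = len(chars)              # stands in for len(t)
    scores = defaultdict(float)
    for n in range(nmin, nmax + 1):
        for i in range(length - n + 1):
            scores[chars[i:i + n]] += 1.0 + i / length
    return dict(scores)              # absent n-grams get no entry (score 0)
```

Over a training corpus, the feature vocabulary would additionally be restricted to n-grams appearing at least 5 times, and the resulting score vectors fed to a Random Forest classifier (e.g., scikit-learn's RandomForestClassifier).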
where occ(n, t) is the number of occurrences of n-gram n in t, pos(n_i) is the position of the first character of the i-th occurrence of n, and len(t) is the length of the tweet in characters. The hypothesis behind this positional scoring scheme is that the presence of some words (or symbols) at the beginning or at the end of a tweet may matter more than the mere presence of the symbol. For instance, in some cases the conclusion is more important than the first part of the sentence, especially when people are evaluating different aspects of an item or have mixed feelings: "I liked the screen, but the battery duration is horrible."

2.2 Char and Word-level bi-LSTM

This method was only tested before and after the participation, since we observed that it performed worse than the Random Forest method.

In this method we use a recurrent neural network to implement an LSTM classifier (Hochreiter and Schmidhuber, 1997), a model now widely used in Natural Language Processing. The classification is carried out in three steps.

First, the text is split on spaces. Every resulting text fragment is read as a character sequence, first from left to right and then from right to left, by two recurrent NNs at the character level. The vectors obtained after the training phase are summed to provide a character-based representation of the fragment (compositional representation). For a character sequence s = c_1 ... c_m, we compute at each position h_i = LSTM_o(h_{i-1}, e(c_i)) and h'_i = LSTM_o'(h'_{i+1}, e(c_i)), where e is the embedding function and LSTM indicates an LSTM recurrent node. The fragment compositional representation is then c(s) = h_m + h'_1.

Subsequently, the sequence of fragments (i.e., the sentence) is read from left to right and vice versa by two other recurrent NNs at the word level. These RNNs take as input the compositional representation obtained in the previous step for each fragment, to which a vectorial representation is concatenated. This vectorial representation is obtained from the training corpus and is considered only if the textual fragment has a frequency ≥ 10. For a sequence of textual fragments p = s_1 ... s_n, we calculate l_i = LSTM_m(l_{i-1}, c(s_i) + e(s_i)) and l'_i = LSTM_m'(l'_{i+1}, c(s_i) + e(s_i)), where c is the compositional representation introduced above and e the embedding function. The final states obtained after the bi-directional reading are added and represent the input sentence: r(p) = l_n + l'_1.

Finally, these vectors are used as input to a multi-layer perceptron responsible for the final classification: o(p) = σ(O × max(0, W × r(p) + b)), where σ is the softmax operator, W and O are matrices, and b is a vector. The output is interpreted as a probability distribution over the tweets' categories.

The size of the character embeddings is 16, that of the text fragments 32; the input layer of the perceptron is of size 64 and the hidden layer of size 32. The output layer is of size 2 for subtask A and 6 for subtask B. We used the DyNet library (https://github.com/clab/dynet).

3 Results

We report in Table 1 the results on the development set for the two methods.

Subtask                     lw RF    bi-LSTM
Italian
  Misogyny identification   0.891    0.872
  Behaviour classification  0.692    0.770
English
  Misogyny identification   0.821    0.757
  Behaviour classification  0.303    0.575

Table 1: Results on the dev set (macro F1). In both cases, the results were obtained on a random 90%-10% split of the dev dataset.

From these results we could see that the locally-weighted n-grams model using Random Forest was better in the identification task, while the bi-LSTM was more accurate in the misogynistic behaviour classification sub-task. However, a closer look at these results showed us that the bi-LSTM was classifying all instances but two into the majority class. Due to these problems, we decided to participate in the task with just the locally-weighted n-grams model.

The official results obtained by this model are detailed in Table 2. We do not consider the derailing category, for which the system obtained 0 accuracy.

We also conducted a "post-mortem" test with the bi-LSTM model, for which we obtained the results reported in Table 3.
Overall
Subtask                     Italian  English
Misogyny identification     0.824    0.586
Behaviour classification    0.473    0.165

Per-class accuracy (sub-task B)
discredit                   0.694    0.432
dominance                   0.250    0.184
sexual harassment           0.722    0.169
stereotype                  0.699    0.040
active                      0.816    0.541
passive                     0.028    0.248

Table 2: Official results on the test set (macro F1) obtained by the locally-weighted bag-of-n-grams model.

Subtask                     Italian  English
Misogyny identification     0.821    0.626
Behaviour classification    0.355    0.141

Table 3: Unofficial results on the test set (macro F1) obtained by the bi-LSTM character model.

As can be observed, these results confirm those obtained on the dev set for the misogyny identification sub-task, and in any case we observed that the "deep" model performed overall worse than its more "classical" counterpart.

The results obtained by our system were in general underwhelming and below expectations, except for the discredit category, for which our system was ranked 1st in Italian and 3rd in English. An analysis of the most relevant features according to information gain (Lee and Lee, 2006) showed that the 5 most informative n-grams are tta, utt, che, tan, utta for Italian and you, the, tch, itc, itch for English. They are clearly parts of swear words that can appear in different forms, or conjunctions like che that may indicate linguistic phenomena such as emphasis (for instance, as in "che brutta!" - "what an ugly girl!"). Another category for which some keywords seemed particularly important is dominance, but in that case the information gain obtained by sequences like stfu in English or zitt in Italian (both related to the meaning "shut up") was marginal. We suspect that the main problem may be related to the unbalanced training corpus, in which the discredit category is dominant, but without knowing whether the other participants adopted some balancing technique it is difficult to analyze our results.

4 Conclusions

Our participation in the AMI task at EVALITA 2018 was not as successful as we had hoped; in particular, our systems were not able to repeat the excellent results they obtained at the DEFT 2018 challenge, although that was a different task, the detection of messages related to public transportation in tweets. Notably, the bi-LSTM model underperformed and was outclassed by a simpler Random Forest model that uses locally-weighted n-grams as features. At the time of writing, we are not able to assess whether this was due to a misconfiguration of the neural network, to the nature of the data, or to the dataset itself. We hope that this participation and the comparison with the other systems will allow us to better understand where and why we failed, in view of future participations. The most positive point of our contribution is that the systems we proposed are completely language-independent: we did not make any adjustment to adapt systems that had participated in a French task to the Italian and English data targeted in the AMI task.

Acknowledgments

We would like to thank the programme "Investissements d'Avenir", overseen by the French National Research Agency, ANR-10-LABX-0083 (Labex EFL), for the support given to this work.

References

Maria Anzovino, Elisabetta Fersini, and Paolo Rosso. 2018. Automatic identification and classification of misogynistic language on Twitter. In Natural Language Processing and Information Systems - 23rd International Conference on Applications of Natural Language to Information Systems, NLDB 2018, Paris, France, June 13-15, 2018, Proceedings, pages 57-64.

Leo Breiman. 2001. Random forests. Machine Learning, 45(1):5-32.

Elisabetta Fersini, Debora Nozza, and Paolo Rosso. 2018. Overview of the EVALITA 2018 task on automatic misogyny identification (AMI). In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of the 6th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA'18), Turin, Italy. CEUR.org.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735-1780.
Changki Lee and Gary Geunbae Lee. 2006. Information gain and divergence-based feature selection for machine learning-based text categorization. Information Processing & Management, 42(1):155-165.
Patrick Paroubek, Cyril Grouin, Patrice Bellot, Vincent Claveau, Iris Eshkol-Taravella, Amel Fraisse, Agata Jackiewicz, Jihen Karoui, Laura Monceaux, and Juan-Manuel Torres-Moreno. 2018. DEFT2018: recherche d'information et analyse de sentiments dans des tweets concernant les transports en Île-de-France. In 14ème atelier Défi Fouille de Texte 2018.
Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480-1489.
Shiwei Zhang, Xiuzhen Zhang, and Jeffrey Chan. 2017. A word-character convolutional neural network for language-agnostic Twitter sentiment analysis. In Proceedings of the 22nd Australasian Document Computing Symposium, ADCS 2017, pages 12:1-12:4, New York, NY, USA. ACM.