YNU OXZ @ HaSpeeDe 2 and AMI: XLM-RoBERTa with Ordered Neurons LSTM for the classification tasks at EVALITA 2020

Xiaozhi Ou, Hongling Li
Yunnan University, China
xiaozhiou88@gmail.com, honglingli66@126.com

Abstract

English. This paper describes the system that team YNU OXZ submitted to EVALITA 2020. We participate in the shared tasks on Automatic Misogyny Identification (AMI) and Hate Speech Detection (HaSpeeDe 2) at the 7th evaluation campaign EVALITA 2020. For HaSpeeDe 2, we participate in Task A - Hate Speech Detection and submitted two runs, one for the news headlines test set and one for the tweets test set. Our submitted runs are based on the pre-trained multilingual model XLM-RoBERTa, whose hidden states are fed into a Convolutional Neural Network with K-max Pooling (CNN + K-max Pooling); an Ordered Neurons LSTM (ON-LSTM) is then applied to this representation, and the result is passed to a linear decision function. Regarding the AMI shared task on the automatic identification of misogynous content in the Italian language, we participate in subtask A on Misogyny & Aggressive Behaviour Identification. Our system is similar to the one defined for HaSpeeDe 2 and is based on the pre-trained multilingual model XLM-RoBERTa, an Ordered Neurons LSTM (ON-LSTM), a Capsule Network, and a final classifier.

1 Introduction and Background

People use offensive content in their social media posts to degrade individuals, religions, or other organizations in many respects, so the identification of such social media posts is a necessity. A substantial amount of work has been done in languages like English; however, hate speech and offensive language identification in other languages is still an area worth exploring. The previous edition of EVALITA (Caselli et al., 2018) hosted the first Hate Speech (HS) detection in Social Media task for Italian (HaSpeeDe (Bosco et al., 2018)); the HaSpeeDe 2 (Hate Speech Detection) shared task (Sanguinetti et al., 2020) has been organized within EVALITA 2020 [1]. The ultimate goal of HaSpeeDe 2 is to take a step further in the state of the art of HS detection for Italian, while also exploring related side phenomena, the extent to which they can be distinguished from HS, and finally whether and how much automatic systems are able to draw such conclusions. AMI (Elisabetta Fersini, 2020) is the second shared task on misogyny identification at the 7th evaluation campaign EVALITA 2020 (Basile et al., 2020). Given the huge amount of user-generated content on the Web, and in particular on social media, the problem of detecting, and thereby possibly limiting, the diffusion of hate speech against women is rapidly becoming fundamental, especially because of the societal impact of the phenomenon; it is therefore very important to identify misogyny in social media.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

[1] http://www.evalita.it/2020/tasks

1.1 Hate Speech (HaSpeeDe 2)

In recent years, with the acceleration of information dissemination, the identification of hate speech and offensive language has become a crucial mission in multilingual sentiment analysis and has attracted the attention of a large number of industrial and academic researchers. From an NLP perspective, much attention has been paid to the topic of HS, together with all its possible facets and related phenomena such as offensive/abusive language, and its identification. This is shown by the proliferation, especially in the last few years, of contributions on this topic (e.g. Caselli et al. (2020), Jurgens et al. (2019), Fortuna et al. (2019)), corpora and lexica (e.g.
de Pelle and Moreira (2017), Sanguinetti et al. (2018), Bassignana et al. (2018)), dedicated workshops, and shared tasks within national (GermEval [2], HASOC [3], IberLEF [4]) and international (SemEval [5]) evaluation campaigns. Among them, GermEval 2018 is about offensive language recognition and aims to promote research on the recognition of offensive content in German-language microblogs. The best team's system trains three basic classifiers (a maximum entropy classifier and two random forest ensembles) on five disjoint feature sets and then uses the maximum entropy classifier for the final classification (Montani and Schüller, 2018). Among the SemEval-2019 shared tasks, HatEval concerns the multilingual detection of hate speech against immigrants and women on Twitter. The Fermi team, the best-performing team of HatEval, proposes an SVM model with an RBF kernel and uses sentence embeddings from Google's Universal Sentence Encoder as features (Indurthi et al., 2019). OffensEval is about the identification and classification of offensive language in social media; the NULI team is the best-performing team there, using BERT-base without default parameters (Liu et al., 2019). HASOC 2019 was proposed to identify hate speech and offensive content in Indo-European languages; its purpose is to develop powerful technologies capable of processing multilingual data and transfer learning methods that can exploit cross-lingual data. The best system is based on an Ordered Neurons LSTM (ON-LSTM) with an attention model and adopts a K-fold ensemble approach (Wang et al., 2019).

[2] https://projects.fzai.h-da.de/iggsa/germeval/
[3] https://hasocfire.github.io/hasoc/2020
[4] http://hitz.eus/sepln2019/
[5] http://alt.qcri.org/semeval2020/

1.2 Misogyny (AMI)

Unfortunately, nowadays more and more incidents of harassment against women occur, and misogynistic comments are found in social media, where misogynists hide behind the security of anonymity. Therefore, it is very important to identify misogyny in social media. Pamungkas et al. (2020) conducted extensive and in-depth research on online misogyny, developed a state-of-the-art model for detecting misogyny in social media, and explored the feasibility of detecting misogyny in a multilingual environment. Aiming at the TRAC-2 shared tasks of Aggression Identification and Misogynistic Aggression Identification, Samghabadi et al. (2020) propose an end-to-end neural model using attention on top of BERT that incorporates a multi-task learning paradigm to address both sub-tasks simultaneously. Arango et al. (2019) discussed the implications for current research and re-ran experiments, taking a closer look at model validation to give a more accurate picture of the current state-of-the-art methods. Recent investigations studied how the misogyny phenomenon takes place: for example, Farrell et al. (2019) investigate the flow of extreme language across seven online communities on Reddit, and Goenaga et al. (2018) address automatic misogyny identification using neural networks. Automatic misogyny identification in Twitter was first investigated by Anzovino et al. (2018).

2 Task and Data description

2.1 Task description

In this part, we describe the subtasks of HaSpeeDe 2 and AMI at EVALITA 2020 in which we participated. HaSpeeDe 2 introduces novelty in three main aspects (language variety and test of time, stereotypical communication, and syntactic realization of HS). We participated in Task A - Hate Speech Detection (Main Task), a binary classification task aimed at determining the presence or absence of hateful content in a text towards a given target (among immigrants, Muslims, or Roma people).

The AMI shared task concerns the automatic identification of misogynous content in Italian tweets. It is organized in two main subtasks, namely subtask A - Misogyny & Aggressive Behaviour Identification and subtask B - Unbiased Misogyny Identification. We participate in subtask A: the system must recognize whether a text is misogynous, and if it is, whether it also expresses an aggressive attitude.

2.2 Data description

The HaSpeeDe 2 task organizers provide a new HS training dataset (binary task) based on Twitter data, accompanied by a test set including both in-domain and out-of-domain data (tweets + news headlines), as well as data from different time periods. The new HaSpeeDe 2020 training set already contains the Twitter dataset of HaSpeeDe 2018. It contains a total of 6,839 tweets (label 0 means NOT HS, label 1 means HS), of which 2,766 are HS and 4,073 are NOT HS; the tweets test set contains 1,263 tweets, and the news headlines test set contains 500 items. In our experimental runs, the data we use for this task is the result of combining the Facebook dataset (training set + test set) of HaSpeeDe 2018 with the new training set of HaSpeeDe 2020, in order to analyze the influence of out-of-domain texts in the training set; the two together contain a total of 10,839 comments/tweets.

The AMI organizers provided a raw dataset
(5,000 tweets) as the training set for participants in subtask A. The raw dataset is a balanced dataset of tweets manually labeled at two levels:

• Misogynous: defines whether a tweet is misogynous or not. Label 0 means a not misogynous tweet, label 1 a misogynous tweet.

• Aggressiveness: denotes whether a misogynous tweet (misogynous = 1) is aggressive. Label 0 means a non-aggressive tweet, label 1 an aggressive tweet. Not misogynous tweets (misogynous = 0) are labeled 0 by default.

For the test set (1,000 tweets) for subtask A provided by the AMI organizers, only the annotations of the "misogynous" and "aggressiveness" fields in the raw dataset are considered.

Figure 1: 5-fold stratified sampling of the training set

As shown in Figure 1, we use stratified sampling (StratifiedKFold), i.e. StratifiedKFold cross-validation instead of ordinary k-fold cross-validation, to evaluate a classifier. The reason is that StratifiedKFold splits by stratified sampling, which ensures that the proportion of each category in the generated training and validation sets is consistent with the original training set, so that no distribution shift is introduced by the split. In the experiments, we used 5-fold stratified sampling. For the HaSpeeDe 2 training set (merged dataset), each fold consists of a randomly sampled training set (8,671 instances) and validation set (2,168 instances). For the AMI training set (raw dataset), each fold consists of a randomly sampled training set (4,000 instances) and validation set (1,000 instances).
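The stratified splitting described above can be sketched in plain Python. This is a minimal re-implementation of the idea behind scikit-learn's StratifiedKFold (in the experiments the library routine itself would be used); it deals each class's indices round-robin into k folds so every fold preserves the class proportions:

```python
from collections import defaultdict

def stratified_kfold(labels, k=5):
    """Yield (train_idx, val_idx) pairs whose label proportions
    mirror the full dataset, as StratifiedKFold does."""
    by_label = defaultdict(list)
    for i, y in enumerate(labels):
        by_label[y].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_label.values():
        for j, i in enumerate(idxs):
            folds[j % k].append(i)  # deal each class round-robin
    for f in range(k):
        val = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, val

# toy labels: 6 NOT HS (0) and 4 HS (1)
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
train, val = next(stratified_kfold(labels, k=2))
# each half keeps the 6:4 class ratio (3 zeros, 2 ones per fold)
```

With an ordinary (unshuffled) k-fold on the same labels, one fold could contain only the majority class; the stratified split rules that out by construction.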
3 Description of the system

In this part, we introduce our final submission systems. Figure 2 shows the overall framework of the system we submitted to HaSpeeDe 2 Task A. We use the pre-trained multilingual model XLM-RoBERTa. We observe the limitations of the BERT-style pooler output (PO) and obtain rich semantic information by extracting the hidden states (the last four hidden layers) of XLM-RoBERTa, which are used as input to a Convolutional Neural Network with K-max Pooling (CNN + K-max Pooling). Then, we feed the output of (CNN + K-max Pooling) into an Ordered Neurons LSTM (ON-LSTM). Finally, we concatenate PO and the output of the ON-LSTM and pass the result through a Linear layer and Softmax for the final classification.

Figure 2: System architecture diagram for Task A (HaSpeeDe 2)

Figure 3 shows the overall framework of the system we submitted to AMI subtask A. We again use the pre-trained multilingual model XLM-RoBERTa. We first get the pooler output (PO) and obtain rich semantic information by extracting the hidden states (the last four hidden layers) of XLM-RoBERTa, which are fed into an Ordered Neurons LSTM (ON-LSTM). Then, we feed the output of the ON-LSTM into a Capsule Network. Finally, we concatenate PO and the output of the Capsule Network and pass the result through a Linear layer and Softmax for the final classification.

Figure 3: System architecture diagram for subtask A (AMI)

3.1 XLM-RoBERTa and hidden layer states

Early work in the field of cross-lingual understanding has proved the effectiveness of multilingual masked language models (MLM), but models such as XLM (Lample and Conneau, 2019) and Multilingual BERT (Devlin et al., 2018) (pre-trained on Wikipedia) are still limited in learning useful representations of low-resource languages. XLM-RoBERTa (Conneau et al., 2020) shows that the performance of cross-lingual transfer tasks can be significantly improved by using a large-scale multilingual pre-trained model. It can be understood as a combination of XLM and RoBERTa, and it is trained on 2.5 TB of newly created, cleaned CommonCrawl data in 100 languages. Because the model in this task must make full use of the whole sentence content to extract useful semantic features, which may deepen the understanding of the sentence and reduce the impact of noise, we use XLM-RoBERTa in this work.

In the classification task, the original output of XLM-RoBERTa is obtained from the last hidden state of the model. However, this output usually does not summarize the semantic content of the input well. Recent studies have shown that abundant semantic information is learned by the top hidden layers of BERT (Jawahar et al., 2019), which we call the semantic layers; in our opinion, the same holds for XLM-RoBERTa. Therefore, in order to make the model obtain more abundant semantic features, we propose the system shown in Figure 2 for HaSpeeDe 2 Task A: firstly, we get PO; secondly, we extract the hidden states of the last four layers of XLM-RoBERTa and feed them into CNN and K-max Pooling; then the result is fed into the ON-LSTM. For AMI subtask A, we propose the system shown in Figure 3: firstly, we get PO; secondly, we extract the hidden states of the last four layers of XLM-RoBERTa and feed them into the ON-LSTM; then the result is fed into the Capsule Network.
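At the tensor level, selecting the last four hidden layers is simple index bookkeeping. The sketch below uses nested Python lists as stand-ins for tensors (with the transformers library one would instead enable output_hidden_states and slice the returned tuple of 13 states: the embedding output plus one state per layer of the base model):

```python
# Stand-in for the tuple of hidden states returned by a 12-layer
# encoder: embedding output + 12 layers = 13 entries, each of
# shape [seq_len][hidden]; values here just encode the layer index.
seq_len, hidden = 4, 8
hidden_states = [[[float(layer)] * hidden for _ in range(seq_len)]
                 for layer in range(13)]

# Keep the last four layers and stack them as channels for the
# CNN, giving a [4][seq_len][hidden] input.
last_four = hidden_states[-4:]

assert len(last_four) == 4
assert len(last_four[0]) == seq_len and len(last_four[0][0]) == hidden
```

In the real system the four [seq_len, hidden] tensors become the input channels of the 2D convolution described in Section 3.2.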
3.2 CNN and K-max Pooling

As shown in Figure 2, we feed the extracted hidden states of the last four layers of XLM-RoBERTa into the CNN and K-max Pooling, where convolution operations produce multiple feature maps. The specific operation is as follows: a sentence contains L words, each of which has dimension d after the embedding layer, and the representation of the sentence is formed by stacking the L word vectors into an L x d matrix. The convolutional layer contains several convolution kernels of size N x d, where N is the filter window size. The convolution operation applies a kernel to the word matrix to create a new feature:

C_l = f(w \cdot x_{l:l+N-1} + b)    (1)

where l indexes the l-th word, C_l is the resulting feature, w is the convolution kernel, b is a bias term, and f is a nonlinear function. After convolving the whole sentence, a feature map is obtained, which is a vector of size L + N - 1.

Another important idea of CNNs is pooling. The pooling layer usually follows the convolution layer; its purpose is to simplify the output of the convolutional layer and reduce the dimensionality of each filter's features to form the final feature. Here we use K-max Pooling, which takes the top-K values among all feature values while retaining their original order, thereby preserving some feature information for subsequent use. K-max Pooling can thus express the same type of feature multiple times, i.e. the intensity of a certain type of feature; in addition, because the relative order of the top-K values is preserved, it retains part of the position information. However, this position information is only the relative order between features, not absolute positions.
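K-max Pooling as described above (keep the top-K values, preserve their original order) can be sketched in a few lines of plain Python; in the actual system this operates on tensors, but the logic is the same:

```python
def k_max_pooling(feature_map, k):
    """Keep the k largest values of a 1-D feature map,
    preserving their original left-to-right order."""
    # indices of the top-k values, then restored to original order
    top_idx = sorted(range(len(feature_map)),
                     key=lambda i: feature_map[i], reverse=True)[:k]
    return [feature_map[i] for i in sorted(top_idx)]

# a feature map produced by one convolution kernel
fmap = [0.1, 0.9, 0.3, 0.7, 0.2, 0.8]
result = k_max_pooling(fmap, 3)  # -> [0.9, 0.7, 0.8] (order kept, not sorted)
```

Note that the output is not [0.9, 0.8, 0.7]: the values are the three largest, but they appear in their original positions' order, which is exactly the relative position information the text refers to.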
3.3 Ordered Neurons LSTM

For HaSpeeDe 2, as shown in Figure 2, we feed the output of CNN and K-max Pooling into the ON-LSTM. For AMI, as shown in Figure 3, we feed the extracted hidden states of the last four layers of XLM-RoBERTa into the ON-LSTM. ON-LSTM is a variant of the LSTM that sorts the neurons in a specific order, allowing a hierarchical (tree) structure to be integrated into the LSTM so as to express richer information. The gate and output structure of the ON-LSTM is still similar to the original LSTM; the difference is the update mechanism from \hat{c}_t to c_t. The formulas are as follows (Shen et al., 2018):

\tilde{f}_t = \overrightarrow{cs}(softmax(W_{\tilde{f}} x_t + U_{\tilde{f}} h_{t-1} + b_{\tilde{f}}))    (2)

\tilde{i}_t = \overleftarrow{cs}(softmax(W_{\tilde{i}} x_t + U_{\tilde{i}} h_{t-1} + b_{\tilde{i}}))    (3)

w_t = \tilde{f}_t \circ \tilde{i}_t    (4)

c_t = w_t \circ (f_t \circ c_{t-1} + i_t \circ \hat{c}_t) + (\tilde{f}_t - w_t) \circ c_{t-1} + (\tilde{i}_t - w_t) \circ \hat{c}_t    (5)

where \overrightarrow{cs} and \overleftarrow{cs} are cumsum() operations in the right and left directions, respectively, and the newly introduced \tilde{f}_t and \tilde{i}_t represent the master forget gate and master input gate. w_t represents their overlap, ideally a vector that is 1 on the intersection of the two gates and 0 elsewhere. In this way, high-level information persists over a considerable distance, while low-level information may be updated at each input step, thereby embedding the hierarchical structure through information grading.
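The master gates of Eqs. (2)-(3) can be illustrated in plain Python; this minimal sketch shows only the cumsum-of-softmax construction, which yields monotone gate vectors in [0, 1] (rising toward 1 for the master forget gate, falling from 1 for the master input gate), and their overlap from Eq. (4):

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cumsum_right(xs):
    # \overrightarrow{cs}: running sum left-to-right
    out, acc = [], 0.0
    for x in xs:
        acc += x
        out.append(acc)
    return out

def cumsum_left(xs):
    # \overleftarrow{cs}: running sum right-to-left
    return list(reversed(cumsum_right(list(reversed(xs)))))

logits = [2.0, 0.5, -1.0, -1.0]
f_master = cumsum_right(softmax(logits))  # monotone, ends at 1.0
i_master = cumsum_left(softmax(logits))   # monotone, starts at 1.0
w = [f * i for f, i in zip(f_master, i_master)]  # overlap (Eq. 4)
```

Because each gate is a cumulative sum of a distribution, the neurons are split into a "low" region (gate near 0) and a "high" region (gate near 1), which is what lets the cell state update low-level neurons frequently while protecting high-level ones.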
3.4 Capsule Network

As shown in Figure 3, we feed the output of the ON-LSTM into the Capsule Network. In deep learning models, spatial patterns aggregated at a lower level help to represent higher-level concepts. We use the Capsule Network (Sabour et al., 2017) to enhance the model's feature-extraction capability: spatially insensitive methods are inevitably limited by the rich structure of text (such as word positions, semantic information, and grammatical structure), which is difficult to encode effectively, and they lack expressive power for text. The Capsule Network mitigates this disadvantage by using neuron vectors instead of the individual scalar neurons of traditional neural networks, and by training the network with dynamic routing. The Capsule's parameter update algorithm is routing-by-agreement: a lower-level capsule prefers to send its output to higher-level capsules whose activity vectors have a large scalar product with the prediction coming from the lower-level capsule. The Capsule computation is as follows:

V_j = \frac{\|S_j\|^2}{1 + \|S_j\|^2} \frac{S_j}{\|S_j\|}    (6)

S_j = \sum_i C_{ij} \hat{u}_{j|i}, \quad \hat{u}_{j|i} = W_{ij} u_i    (7)

where V_j is the vector output of capsule j and S_j is its total input; the prediction vector \hat{u}_{j|i} is obtained by multiplying the output u_i of a capsule in the layer below by a weight matrix W_{ij}; and the C_{ij} are coupling coefficients determined by the iterative dynamic routing process.

The most fundamental difference between the Capsule Network and a traditional artificial neural network lies in the unit structure. For a traditional neural network, the computation of a neuron can be divided into three steps: 1. scalar weighting of the inputs; 2. summing the weighted inputs; 3. a scalar-to-scalar nonlinearity. For a Capsule, the computation is divided into four steps: 1. matrix multiplication of the input vectors; 2. scalar weighting of the input vectors; 3. summing the weighted vectors; 4. a vector-to-vector nonlinearity. The biggest resulting difference is the unit output: a traditional neural network outputs a scalar value, while a Capsule Network outputs a vector, which can contain more abundant features and is more interpretable.

3.5 Experiment setting

For XLM-RoBERTa, we use the XLM-RoBERTa-base [6] pre-trained model, which contains 12 layers. We use binary cross-entropy loss and the Adam optimizer with a learning rate of 5e-5. The batch size is set to 32 and the maximum sequence length to 80. We extract the hidden layer states of XLM-RoBERTa by setting output_hidden_states to true. The model is trained for 8 epochs with a dropout rate of 0.1.

For the Convolutional Neural Network, we use 2D convolution (nn.Conv2d [7]). The convolution kernel sizes are set to (3, 4, 5) and the number of convolution kernels is set to 256.

For the ON-LSTM, we set the hidden units to 128 and num_levels to 16.

For the Capsule Network, we set num_capsule to 10, dim_capsule to 16, and routings to 4.

[6] https://huggingface.co/xlm-roberta-base
[7] https://pytorch.org/docs/stable/generated/torch.nn.Conv2d
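The squash nonlinearity of Eq. (6), applied to each capsule vector (of dimension dim_capsule = 16 in the setting above), can be sketched in plain Python:

```python
import math

def squash(s):
    """Eq. (6): shrink a capsule's total input S_j to length < 1
    while preserving its direction."""
    norm_sq = sum(x * x for x in s)
    norm = math.sqrt(norm_sq)
    if norm == 0.0:
        return [0.0] * len(s)
    scale = norm_sq / (1.0 + norm_sq) / norm
    return [scale * x for x in s]

v = squash([3.0, 4.0])  # ||s|| = 5, so the output length is 25/26
length = math.sqrt(sum(x * x for x in v))
# long vectors are squashed toward (but below) 1; short ones toward 0
```

This is why the length of a capsule's output vector can be read as the probability that the entity it represents is present, while the vector's direction encodes the entity's properties.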
4 Results and Discussion

Task (metric) | Our Score | Best Score | Rank
HaSpeeDe 2, Tweets (Macro F1) | 0.7717 | 0.8088 | 8
HaSpeeDe 2, News (Macro F1) | 0.6922 | 0.7744 | 7
AMI subtask A (Average F1) | 0.7313 | 0.7406 | 3

Table 1: Classification results of our best runs on HaSpeeDe 2 Task A and AMI subtask A.

Table 1 reports the official results of our best runs on the two tasks in which we participated. For both tasks we submitted the results of two runs, and the two runs performed comparably. In the following subsections, the results obtained in each task are discussed.

4.1 HaSpeeDe 2 Task A

In our experiments, we find the limitations of PO for classifying hateful Italian text. In the classification task, the original output of BERT is PO; in the same way, we first take only PO as the output of XLM-RoBERTa. The results are shown in Table 2. We can see that the results are not good when only PO is used as the output of XLM-RoBERTa: we think that using only PO loses some effective semantic information, so deep and abundant semantic features are what this work needs. We therefore extract the hidden states of XLM-RoBERTa, and we also observe that the performance of the model improves as more semantic layers are used. Table 3 shows the performance of our model at different semantic layers, and Table 4 shows our results on the test sets.

News (XLM-RoBERTa with only PO, validation set of fold 1)
Category | P | R | F1 | Instances
Not Hate | 0.70 | 0.981 | 0.817 | 1355
Hate | 0.886 | 0.259 | 0.401 | 813
Macro F1 | 0.793 | 0.62 | 0.609 | 2168

Tweets (XLM-RoBERTa with only PO, validation set of fold 1)
Category | P | R | F1 | Instances
Not Hate | 0.805 | 0.569 | 0.667 | 1355
Hate | 0.659 | 0.858 | 0.745 | 813
Macro F1 | 0.723 | 0.713 | 0.706 | 2168

Table 2: Precision, Recall, F1 score, and number of instances for XLM-RoBERTa with only PO on HaSpeeDe 2 Task A (the validation set is the first fold of the 5-fold stratified cross-validation).

Hidden layers | HS-News Macro F1 | HS-Tweets Macro F1
The last layer | 0.623 | 0.725
The last two layers | 0.646 | 0.734
The last three layers | 0.66 | 0.749
The last four layers | 0.703 | 0.798

Table 3: Performance of our model with different numbers of hidden layers of XLM-RoBERTa (the validation set is the first fold of the 5-fold stratified cross-validation).

News | P | R | F1 | Macro F1
Not Hate | 0.7486 | 0.8965 | 0.8159 | 0.6922
Hate | 0.7203 | 0.4696 | 0.5685 |

Tweets | P | R | F1 | Macro F1
Not Hate | 0.8037 | 0.7285 | 0.7643 | 0.7717
Hate | 0.7448 | 0.8167 | 0.7791 |

Table 4: Macro F1 results on the test sets.

4.2 AMI subtask A

In this work, the task is similar to the one discussed in Section 4.1, and we again consider the influence of PO for identifying misogynous content. We conduct experiments on AMI subtask A based on the model defined for HaSpeeDe 2, and in order to improve performance we propose a new method based on that model. Table 5 shows the comparison between the CNN + K-max Pooling + ON-LSTM method and the ON-LSTM + Capsule method on the validation set, and Table 6 shows the results of our new model for AMI subtask A on the test set. Run 1 only extracts the last four hidden layer states of XLM-RoBERTa, feeds them into the ON-LSTM and then through the Capsule Network, and finally performs classification (without using PO). Run 2 concatenates the output of the Capsule Network with the obtained PO and feeds the result to the classifier for the final classification (using PO). We think that concatenating PO with the hidden-layer representation retains richer semantic information, and it indeed shows better results.

Method | Macro F1
CNN + K-max Pooling + ON-LSTM (HaSpeeDe 2 model) | 0.786
ON-LSTM + Capsule (AMI model) | 0.857

Table 5: Comparison between the CNN + K-max Pooling + ON-LSTM method and the ON-LSTM + Capsule method, both based on the XLM-RoBERTa model (the validation set is the first fold of the 5-fold stratified cross-validation).

System | Average F1
Run 1 (without using PO) | 0.7014
Run 2 (using PO) | 0.7313

Table 6: Results on the test set for AMI subtask A.

5 Conclusion

In our experiments, we find the limitation of only using the pooler output as XLM-RoBERTa's output. To obtain deeper and more abundant semantic features, we extract the hidden layer states of XLM-RoBERTa; the results show that obtaining more abundant semantic information by extracting the hidden states helps to improve the performance of XLM-RoBERTa. We also test the effect of using the external dataset (merged dataset) versus not using it (raw dataset). Our conclusion is that using data from the same social network for training and test is a necessary condition for good performance; in addition, adding data from different social networks can improve results.

References

Maria Anzovino, Elisabetta Fersini, and Paolo Rosso. 2018. Automatic identification and classification of misogynistic language on Twitter. In International Conference on Applications of Natural Language to Information Systems, pages 57-64. Springer.

Aymé Arango, Jorge Pérez, and Barbara Poblete. 2019. Hate speech detection is not as easy as you may think: A closer look at model validation. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 45-54.

Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro. 2020. EVALITA 2020: Overview of the 7th evaluation campaign of natural language processing and speech tools for Italian. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Elisa Bassignana, Valerio Basile, and Viviana Patti. 2018. Hurtlex: A multilingual lexicon of words to hurt. In 5th Italian Conference on Computational Linguistics, CLiC-it 2018, volume 2253, pages 1-6. CEUR-WS.

Cristina Bosco, Felice Dell'Orletta, Fabio Poletto, Manuela Sanguinetti, and Maurizio Tesconi. 2018. Overview of the EVALITA 2018 hate speech detection task. In EVALITA 2018 - Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian, volume 2263, pages 1-9. CEUR.
Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso. 2018. Sixth evaluation campaign of natural language processing and speech tools for Italian: Final workshop (EVALITA 2018). In EVALITA 2018. CEUR Workshop Proceedings (CEUR-WS.org).

Tommaso Caselli, Valerio Basile, Jelena Mitrović, Inga Kartoziya, and Michael Granitzer. 2020. I feel offended, don't be abusive! Implicit/explicit messages in offensive and abusive language. In Proceedings of The 12th Language Resources and Evaluation Conference, pages 6193-6202.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.

Rogers Prates de Pelle and Viviane P. Moreira. 2017. Offensive comments in the Brazilian web: A dataset and baseline results. In Anais do VI Brazilian Workshop on Social Network Analysis and Mining. SBC.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding.

Elisabetta Fersini, Debora Nozza, and Paolo Rosso. 2020. Overview of the EVALITA 2020 automatic misogyny identification (AMI) task. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2020), Online. CEUR.org.

Tracie Farrell, Miriam Fernandez, Jakub Novotny, and Harith Alani. 2019. Exploring misogyny across the manosphere in Reddit. In Proceedings of the 10th ACM Conference on Web Science, pages 87-96.

Paula Fortuna, João Rocha da Silva, Leo Wanner, Sérgio Nunes, et al. 2019. A hierarchically-labeled Portuguese hate speech dataset. In Proceedings of the Third Workshop on Abusive Language Online, pages 94-104.

Iakes Goenaga, Aitziber Atutxa, Koldo Gojenola, Arantza Casillas, Arantza Díaz de Ilarraza, Nerea Ezeiza, Maite Oronoz, Alicia Pérez, and Olatz Perez-de-Viñaspre. 2018. Automatic misogyny identification using neural networks. In IberEval@SEPLN, pages 249-254.

Vijayasaradhi Indurthi, Bakhtiyar Syed, Manish Shrivastava, Nikhil Chakravartula, Manish Gupta, and Vasudeva Varma. 2019. Fermi at SemEval-2019 task 5: Using sentence embeddings to identify hate speech against immigrants and women in Twitter. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 70-74.

Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.

David Jurgens, Eshwar Chandrasekharan, and Libby Hemphill. 2019. A just and comprehensive strategy for using NLP to address online abuse. arXiv preprint arXiv:1906.01738.

Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining.

Ping Liu, Wen Li, and Liang Zou. 2019. NULI at SemEval-2019 task 6: Transfer learning for offensive language detection using bidirectional transformers. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 87-91.

Joaquín Padilla Montani and Peter Schüller. 2018. TUWienKBS at GermEval 2018: German abusive tweet detection. In 14th Conference on Natural Language Processing KONVENS, volume 2018, page 45.

Endang Wahyu Pamungkas, Valerio Basile, and Viviana Patti. 2020. Misogyny detection in Twitter: A multilingual and cross-domain study. Information Processing & Management, 57(6):102360.

Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton. 2017. Dynamic routing between capsules.

Niloofar Safi Samghabadi, Parth Patwa, PYKL Srinivas, Prerana Mukherjee, Amitava Das, and Thamar Solorio. 2020. Aggression and misogyny detection using BERT: A multi-task approach. In Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying, pages 126-131.

Manuela Sanguinetti, Fabio Poletto, Cristina Bosco, Viviana Patti, and Marco Stranisci. 2018. An Italian Twitter corpus of hate speech against immigrants. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).

Manuela Sanguinetti, Gloria Comandini, Elisa Di Nuovo, Simona Frenda, Marco Stranisci, Cristina Bosco, Tommaso Caselli, Viviana Patti, and Irene Russo. 2020. HaSpeeDe 2@EVALITA2020: Overview of the EVALITA 2020 Hate Speech Detection Task. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Yikang Shen, Shawn Tan, Alessandro Sordoni, and Aaron Courville. 2018. Ordered neurons: Integrating tree structures into recurrent neural networks.

Bin Wang, Yunxia Ding, Shengyan Liu, and Xiaobing Zhou. 2019. YNU_wb at HASOC 2019: Ordered neurons LSTM with attention for identifying hate speech and offensive language. In FIRE (Working Notes), pages 191-198.