=Paper= {{Paper |id=Vol-2943/exist_paper11 |storemode=property |title=A Simple Voting Mechanism for Online Sexist Content Identification |pdfUrl=https://ceur-ws.org/Vol-2943/exist_paper11.pdf |volume=Vol-2943 |authors=Chao Feng |dblpUrl=https://dblp.org/rec/conf/sepln/Feng21 }} ==A Simple Voting Mechanism for Online Sexist Content Identification== https://ceur-ws.org/Vol-2943/exist_paper11.pdf
    A Simple Voting Mechanism for Online Sexist
              Content Identification

                           Chao Feng1[0000−0002−0672−1090]

          University of Zurich, Rämistrasse 71, CH-8006 Zürich, Switzerland
                                  chao.feng2@uzh.ch



        Abstract. This paper presents the participation of the MiniTrue team
        in the EXIST 2021 Challenge on the sexism detection task in social media
        for English and Spanish. Our approach combines language models with a
        simple voting mechanism for sexist label prediction. For this, three
        BERT-based models and a voting function are used. Experimental results
        show that our final model with the voting function achieves the best
        results among our four models, which means that the voting mechanism
        brings an extra benefit to our system. Moreover, we observe that our
        system is robust to data sources and languages.

        Keywords: Sexism detection · Social media · Contextual word embed-
        dings.




1     Introduction
    Equality, openness, and freedom, as the foundation and pillars of the Internet
spirit, should have made our cyberworld a better place. However, inequality, prej-
udice and bias against women remain deeply implanted in online social networks [14].
Sexist content in social networks affects women's lives both mentally and physically,
for instance in career equality, quality of life, and even mental health [7, 10].
Therefore, an automatic sexism identification system is urgently needed, one that
could help build a better cyberworld with equality, openness, and freedom.
    Currently, research on sexism detection is frequently related to misogyny
detection. Misogyny and sexism are usually considered synonyms, but there are
some nuances between the two terms. The definition of misogyny emphasizes
enmity, hatred and contempt towards women [6]. On the other hand, sexism
covers the expression of any manner of oppression or prejudice against
women [14]. Thus, sexism is a hypernym of misogyny.
    Traditionally, sexism detection technologies are assembled into pipelined
architectures. User- and relationship-based features (e.g. followers, friends,
number of tweets, interactions with other sexist users), as well as text-based
features (e.g. TF-IDF, character n-grams, token n-grams), are extracted from the
social network. Then, a number of traditional machine learning algorithms can
be applied to this problem [1, 4, 12, 15].
    IberLEF 2021, September 2021, Málaga, Spain.
    Copyright © 2021 for this paper by its authors. Use permitted under Creative
    Commons License Attribution 4.0 International (CC BY 4.0).
    However, recent approaches based on end-to-end neural networks offer a
promising alternative. Recurrent Neural Networks (RNNs) and Long Short-Term
Memory networks were commonly used for Natural Language Processing tasks in
earlier work; [2, 9] applied Bi-LSTM models to sexism and misogyny detection.
However, these approaches mainly used context-independent word embeddings
or character embeddings. Since pretrained transformer-based language models
were introduced to the NLP area, context-dependent embeddings have been
heavily used in different tasks. Researchers have adopted the BERT model for
the sexism detection task and achieved high scores [8, 11, 14].
    This work is done as part of the EXIST 2021 (sEXism Identification in Social
neTworks) shared task [10], a classification task for online social media
data in English and Spanish. We are interested in the first sub-task, called Sexism
Identification. This sub-task is a binary classification problem: we aim to com-
bine sentence-level embeddings learned by deep-learning models like BERT [5]
with a voting mechanism to create a sexism identification system and help
fight online sexism. To reduce the noise of social media as well as the
multilingual noise, we leverage various features of language models and imple-
ment our system in three separate stages:
1. In order to address the multilingual problem, we use three different pretrained
   language models to produce the sentence-level embeddings: the original
   English BERT [5], Multilingual BERT [13], and the Spanish pretrained
   language model BETO [3].
2. Three different models are built upon these different embeddings and infer
   the output label independently.
3. We use a simple voting mechanism to obtain the final label.
   This paper is organized as follows. We present the description of our method in
Section 2. Then we group the experimental results in Section 3 before discussing
the perspectives and concluding in Section 4.


2     System Architecture
   This section presents the architecture and the description of our system.
Three basic models and a vote-inference system are described respectively in
Subsection 2.1 and Subsection 2.2.

2.1   Basic Models
   We develop three independent basic models for label prediction by combining
language models with simple neural networks. As mentioned before, three lan-
guage models are used in our system: BERT, Multilingual BERT, and BETO.
Basic Model One: The first basic model connects the language models with
a feed-forward network. In order to obtain a multilingual representation of the
input text, this model uses two different pretrained language models, BERT
and Multilingual BERT. After tokenization, the text is fed into these two
language models respectively, generating two sentence embeddings. A
concatenation layer combines these two embeddings, and the concatenated
representation flows to a feed-forward network for the final inference I1 .
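
The inference step of Basic Model One can be sketched as follows. This is an
illustrative sketch only, not the trained system: the two 768-dimensional random
vectors stand in for the BERT and Multilingual BERT sentence embeddings, and
`feed_forward_inference`, with its random weights `W` and `b`, is a hypothetical
stand-in for the trained feed-forward network.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 768  # hidden size of BERT-base and Multilingual BERT

def feed_forward_inference(emb_en: np.ndarray, emb_multi: np.ndarray,
                           W: np.ndarray, b: float) -> int:
    """Concatenate the two sentence embeddings and run a one-layer
    feed-forward classifier with a sigmoid output, as in Basic Model One."""
    x = np.concatenate([emb_en, emb_multi])   # (1536,) multilingual representation
    logit = x @ W + b                         # scalar score
    prob = 1.0 / (1.0 + np.exp(-logit))       # sigmoid
    return int(prob >= 0.5)                   # binary label I1

# Stand-ins for the two language-model embeddings of one tweet
emb_en, emb_multi = rng.normal(size=HIDDEN), rng.normal(size=HIDDEN)
W, b = rng.normal(size=2 * HIDDEN) * 0.01, 0.0
i1 = feed_forward_inference(emb_en, emb_multi, W, b)
```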


Basic Model Two: Research indicates that the performance of the Multilingual
BERT model is not as good as that of language-specific models [3]. On this
account, we replace Multilingual BERT with the Spanish language model BETO.
As in the first model, we use a concatenation layer to combine the English
embedding with the Spanish embedding and feed the result into an inference
network to get the final output I2 .


Basic Model Three: To leverage the sequential information of the input text,
we use a Bi-directional LSTM network to capture consecutive features after
the language model. This time, we use only a single Multilingual BERT model
and obtain the sequential embeddings for each input token. After the
Bi-directional LSTM, we use a Sigmoid function to get the final inference I3 .
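
A minimal PyTorch sketch of this third model is given below. It is illustrative
only: the random tensor stands in for the token-level Multilingual BERT
embeddings, and the sequence length, LSTM size, and the `BiLstmHead` class name
are assumptions, not the trained configuration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
HIDDEN, LSTM_DIM, SEQ_LEN = 768, 128, 32  # illustrative sizes

class BiLstmHead(nn.Module):
    """Bi-directional LSTM over the token embeddings, followed by a
    sigmoid classifier on the final hidden states (as in Basic Model Three)."""
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(HIDDEN, LSTM_DIM, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * LSTM_DIM, 1)

    def forward(self, token_embs: torch.Tensor) -> torch.Tensor:
        _, (h_n, _) = self.lstm(token_embs)          # h_n: (2, batch, LSTM_DIM)
        h = torch.cat([h_n[0], h_n[1]], dim=-1)      # forward + backward states
        return torch.sigmoid(self.out(h)).squeeze(-1)  # probability of "sexist"

# Stand-in for the sequential Multilingual BERT embeddings of one tweet
token_embs = torch.randn(1, SEQ_LEN, HIDDEN)
prob = BiLstmHead()(token_embs)
i3 = int(prob.item() >= 0.5)  # binary label I3
```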


2.2   Voting Mechanism

    As mentioned before, I1 , I2 , and I3 are the three inference values predicted
by our basic models, and a simple voting mechanism is applied to compute the
final output. If two or more models draw the same inference label, that label
becomes our final prediction. The final output is predicted as:
                 f(x) := \begin{cases} 0 & \text{if } I_1 + I_2 + I_3 < 2 \\ 1 & \text{if } I_1 + I_2 + I_3 \geq 2 \end{cases}                (1)
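
Equation (1) amounts to a majority vote over the three binary predictions. As a
minimal sketch (assuming each inference value is a 0/1 integer, and with a
hypothetical function name):

```python
def majority_vote(i1: int, i2: int, i3: int) -> int:
    """Final label per Eq. (1): sexist (1) iff at least two basic models agree."""
    return 1 if i1 + i2 + i3 >= 2 else 0

# Example: two of three basic models predict "sexist"
label = majority_vote(1, 0, 1)  # -> 1
```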

    The architecture of our model is shown in Figure 1. As we can see, the input
text is fed into each basic model and then flows into each inference network
respectively. Finally, all of this information is combined in the voting function
for the prediction.


3     Experiments and Results

    In this section, we present the description of the datasets, the setup of our
experiments, and our final results. Experimental validation is conducted on the
EXIST 2021 training and test corpora. The datasets are described in Subsection
3.1, followed by the analysis of our final results in Subsection 3.2.
      Fig. 1. The final model combines three basic models with a voting mechanism


3.1     Data Description
    There are two sub-tasks in this shared task. The first one is Sexism Identi-
fication, a binary classification task in which the system needs to decide
whether a given social media text is sexist or not. The second sub-task is to
categorize sexist texts into different predefined sexism types. This paper
focuses on the first sub-task.
    Both the training and test datasets consist of English and Spanish text
data collected from social media platforms such as Twitter and Gab. Table
1 briefly describes the corpus. As we can see, there are 6977 tweets/gabs in the
training set and 4368 tweets/gabs in the test set, and the labels are balanced
in this sub-task.

3.2     Results and Discussion
    The evaluation of this sexism identification sub-task is mainly based on the
Accuracy score. Our results are gathered in Table 2, which contains the Accuracy
and F-Measure scores for the basic models and the final system. Our first ques-
tion is whether the voting function brings additional benefits to our system. As
we can see, our final system (with the simple voting mechanism) achieves the
best Accuracy and F-Measure among the four systems, providing around 3%
extra performance over the basic models.
    Unlike the training set, the test set contains text data from different
sources, namely Twitter and Gab. This raises the second question, which is
           Table 1. Sentence numbers for each data set for sub-task one.


               Data set Language Label      Number of tweets/gabs
               Training en       non-sexist 1800
                        en       sexist     1636
                        es       non-sexist 1800
                        es       sexist     1741
               Test     en       non-sexist 1050
                        en       sexist     1158
                        es       non-sexist 1037
                        es       sexist     1123


               Table 2. Evaluation Results: Accuracy and F-Measure


          Data set Model                               Accuracy F1
          Test     Basic Model One                     0.7370   0.7360
                   Basic Model Two                     0.7280   0.7275
                   Basic Model Three                   0.7287   0.7287
                   Final Model (with voting mechanism) 0.7553   0.7551




whether our final model is sensitive to the data source. We therefore list the
results by data source for each model. As shown in Table 3, even though the
training data comes from Twitter, the results on Gab are better than those on
Twitter in the test set, which shows the robustness of the language models.


              Table 3. Evaluation Results by Data Source: Accuracy


       Data set Data Source Model                               Accuracy
       Test     Twitter     Basic Model One                     0.7271
                            Basic Model Two                     0.7312
                            Basic Model Three                   0.7312
                            Final Model (with voting mechanism) 0.7504
                Gab         Basic Model One                     0.7709
                            Basic Model Two                     0.7169
                            Basic Model Three                   0.7200
                            Final Model (with voting mechanism) 0.7719




    The last question about our system is whether it is language sensitive. Table
4 lists the evaluation metrics of our systems by language. The results show
that the performance of our models is similar across the two languages, and it
seems that introducing the language-specific model, i.e. BETO, has not brought
extra benefits to our final system.


                Table 4. Evaluation Results by Language: Accuracy


         Data set Language Model                               Accuracy
         Test     en       Basic Model One                     0.7197
                           Basic Model Two                     0.7351
                           Basic Model Three                   0.7486
                           Final Model (with voting mechanism) 0.7559
                  es       Basic Model One                     0.7546
                           Basic Model Two                     0.7208
                           Basic Model Three                   0.7083
                           Final Model (with voting mechanism) 0.7546




4    Conclusions

    In this paper, we implement a classification system that combines lan-
guage models with a simple voting mechanism. Our system has been evaluated
on EXIST 2021 sub-task one. The evaluation results show that our system is
robust to data sources and languages, and that the voting function indeed brings
extra benefits to our final system. Our experiments show that pretrained
language models adapt well to social media texts, and that a simple voting
mechanism can effectively leverage the predictive ability of a multi-model
system.


References
1. Anzovino, M., Fersini, E., & Rosso, P. (2018). Automatic Identification and Classi-
   fication of Misogynistic Language on Twitter. NLDB.
2. Buscaldi, D. (2018). Tweetaneuse@ AMI EVALITA2018: Character-based Models
   for the Automatic Misogyny Identification Task. EVALITA Evaluation of NLP and
   Speech Tools for Italian, 12, 214.
3. Canete, J., Chaperon, G., Fuentes, R., & Pérez, J. (2020). Spanish pre-trained bert
   model and evaluation data. PML4DC at ICLR, 2020.
4. Canós, J. S. (2018). Misogyny Identification Through SVM at IberEval 2018. In
   IberEval@ SEPLN (pp. 229-233).
5. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training
   of deep bidirectional transformers for language understanding. arXiv preprint
   arXiv:1810.04805.
6. Fersini, E., Rosso, P., Anzovino, M. (2018). Overview of the Task on Automatic
   Misogyny Identification at IberEval 2018. IberEval@ SEPLN, 2150, 214-228.
7. Francisco Rodrı́guez-Sánchez, Jorge Carrillo-de-Albornoz, Laura Plaza, Julio Gon-
   zalo, Paolo Rosso, Miriam Comet, Trinidad Donoso. Overview of EXIST 2021: sEX-
   ism Identification in Social neTworks. Procesamiento del Lenguaje Natural, vol 67,
   septiembre 2021.
8. Fortuna, P., Soler-Company, J., Wanner, L. (2021). How well do hate speech, toxic-
   ity, abusive and offensive language classification models generalize across datasets?.
   Information Processing Management, 58 (3), 102524.
9. Goenaga, I., Atutxa, A., Gojenola, K., Casillas, A., de Ilarraza, A. D., Ezeiza, N., ...
   Perez-de-Viñaspre, O. (2018, September). Automatic Misogyny Identification Using
   Neural Networks. In IberEval@ SEPLN (pp. 249-254).
10. Manuel Montes, Paolo Rosso, Julio Gonzalo, Ezra Aragón, Rodrigo Agerri, Miguel
   Ángel Álvarez-Carmona, Elena Álvarez Mellado, Jorge Carrillo-de-Albornoz, Luis
   Chiruzzo, Larissa Freitas, Helena Gómez Adorno, Yoan Gutiérrez, Salud Marı́a
   Jiménez Zafra, Salvador Lima, Flor Miriam Plaza-de-Arco and Mariona Taulé (eds.):
   Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2021), CEUR
   Workshop Proceedings, 2021.
11. Pamungkas, E. W., Basile, V., Patti, V. (2020). Misogyny detection in twitter: a
   multilingual and cross-domain study. Information Processing Management, 57 (6),
   102360.
12. Pamungkas, E. W., Cignarella, A. T., Basile, V., Patti, V. (2018). 14-ExLab@
   UniTo for AMI at IberEval2018: Exploiting lexical knowledge for detecting misogyny
   in English and Spanish tweets. In 3rd Workshop on Evaluation of Human Language
   Technologies for Iberian Languages, IberEval 2018 (Vol. 2150, pp. 234-241). CEUR-
   WS.
13. Pires, T., Schlinger, E., & Garrette, D. (2019). How multilingual is multilingual
   bert?. arXiv preprint arXiv:1906.01502.
14. Rodrı́guez-Sánchez, F., Carrillo-de-Albornoz, J., & Plaza, L. (2020). Automatic
   Classification of Sexism in Social Networks: An Empirical Study on Twitter Data.
   IEEE Access, 8, 219563-219576.
15. Waseem, Z. (2016). Are you a racist or am i seeing things? annotator influence on
   hate speech detection on twitter. In Proceedings of the first workshop on NLP and
   computational social science (pp. 138-142).