=Paper=
{{Paper
|id=Vol-2150/AMI_paper8
|storemode=property
|title=AMI at IberEval2018 Automatic Misogyny Identification in Spanish and English Tweets
|pdfUrl=https://ceur-ws.org/Vol-2150/AMI_paper8.pdf
|volume=Vol-2150
|authors=Victor Nina-Alcocer
|dblpUrl=https://dblp.org/rec/conf/sepln/Nina-Alcocer18
}}
==AMI at IberEval2018 Automatic Misogyny Identification in Spanish and English Tweets==
<pdf width="1500px">https://ceur-ws.org/Vol-2150/AMI_paper8.pdf</pdf>
<pre>
              AMI at IberEval2018
    Automatic Misogyny Identification in Spanish
               and English Tweets

                                 Victor Nina-Alcocer

                 Department of Computer Systems and Computation
                     Universitat Politècnica de València, Spain
                               vicnial@inf.upv.es


        Abstract. In this paper we describe our submission to the Automatic
        Misogyny Identification in Spanish and English Tweets shared task
        organized at IberEval¹. This work proposes an approach based on weights
        of ngrams, word categories, structural information and lexical analysis
        to discover whether these components allow us to discriminate between
        misogynous and non-misogynous tweets and, for misogynous tweets, between
        their respective categories and targets. Moreover, we analyze the
        features created by these components to investigate their impact.


1     Introduction
AMI is the first shared task on automatic misogyny identification [2]. Its aim is to
identify cases of aggressiveness and hate speech towards women in social media [1].
Poland’s work [3] was the first attempt to manually classify misogynous
tweets. This shared task considers two subtasks, the second of which has two parts:
 – subtask1: Misogyny identification.
 – subtask2a: Misogynistic Behaviour.
 – subtask2b: Target Classification.
The aim of subtask1 is to identify whether a tweet is misogynous or not.
Subtask2a aims to identify the category a misogynous tweet belongs
to: discredit, dominance, sexual harassment, stereotype, or derailing. Finally,
subtask2b aims to identify whether a misogynous tweet is active or passive,
i.e. whether its target is an individual or generic (women in general). In this work,
each of these tasks is approached as a classification task. We use natural language
processing (NLP) and feature engineering to identify patterns, and machine
learning to learn classification models.

2     Approach
This section describes the main components of our approach. Generally, misogyny
can be expressed in writing or orally, in a subtle or explicit way, and
directly or indirectly addressed to someone. In order to investigate how people
may express misogyny in tweets, we propose an approach that allows us to
discover some aspects of how misogyny is expressed in the corpus provided
by the organizers. The approach takes into account several features that
we considered important, in order to understand whether they contribute to
recognizing misogynous content and its respective category.

¹ https://sites.google.com/view/ibereval-2018

Proceedings of the Third Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2018)

        Structure (str): Knowing how many words a tweet contains, whether
        most of those words are written in capital letters, or whether
        punctuation marks are used excessively could reveal important information.
        A tweet is composed of words, punctuation, mentions, URLs, etc. In this
        approach, we pay attention to these aspects to see whether they
        help to better discriminate between misogynous and non-misogynous tweets.
        A summary of these features is given below:
            – The number of symbols or punctuation marks (!’ ?,.”).
            – The number of words written in capital letters.
            – The number of words and characters, including stop words.
            – Mean of the numbers of words and characters.
            – The number of mentions, URLs, and hash-tags.
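
The structural features listed above can be sketched as follows. This is a minimal
illustration, not the author's implementation: the function name structural_features,
the regular expressions and the exact definition of each count (for example, what
counts as a "word") are assumptions made here for the example.

```python
import re
import string

# Hypothetical patterns; the paper does not specify how tweets were tokenized.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
WORD_RE = re.compile(r"[^\W\d_]+")  # runs of letters, Unicode-aware


def structural_features(tweet):
    """Return a dict of simple structural counts for one tweet."""
    urls = URL_RE.findall(tweet)
    no_urls = URL_RE.sub(" ", tweet)
    mentions = MENTION_RE.findall(no_urls)
    hashtags = HASHTAG_RE.findall(no_urls)
    # Remove mentions and hashtags so that they are not counted as
    # ordinary words or punctuation.
    text = HASHTAG_RE.sub(" ", MENTION_RE.sub(" ", no_urls))
    words = WORD_RE.findall(text)
    n_words = len(words)
    return {
        # string.punctuation is ASCII only; Spanish marks such as ¡ and ¿
        # would have to be added explicitly.
        "n_punct": sum(ch in string.punctuation for ch in text),
        "n_upper_words": sum(len(w) > 1 and w.isupper() for w in words),
        "n_words": n_words,
        "n_chars": len(tweet),
        "mean_word_len": sum(map(len, words)) / n_words if n_words else 0.0,
        "n_mentions": len(mentions),
        "n_urls": len(urls),
        "n_hashtags": len(hashtags),
    }
```

For the tweet "@user YOU are SO wrong!!! #fail http://t.co/x" this sketch counts four
words, two of them in capital letters, three punctuation marks, and one mention,
hashtag and URL each.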
            LIWC categories (lc): Another component that we consider important is
        the set of features obtained from Linguistic Inquiry and Word Count (LIWC)².
        We have only taken into account some categories related to misogynous emotions,
        such as anger, sexual, swear, positive, negative, etc. [4]. The idea behind this
        component is to calculate, for instance, the percentage of positive or negative
        emotion words, or whether a tweet has sexual content, as shown in Figure 1.


            Fig. 1. The content (words) of a tweet belongs to some category: death, anger, etc.
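
The idea of category percentages can be illustrated with a small sketch. LIWC itself
is a proprietary dictionary, so the tiny TOY_LEXICON below, its category names and
the function category_percentages are hypothetical stand-ins chosen only for this
example; they are not the LIWC resource or its API.

```python
import re
from collections import Counter

# Hypothetical toy lexicon; real LIWC categories are far larger.
TOY_LEXICON = {
    "anger": {"hate", "stupid", "kill"},
    "swear": {"damn", "hell"},
    "posemo": {"love", "nice", "good"},
    "negemo": {"hate", "stupid", "bad", "ugly"},
}


def category_percentages(tweet, lexicon=TOY_LEXICON):
    """Percentage of tokens in the tweet that fall in each category."""
    tokens = re.findall(r"[a-z']+", tweet.lower())
    counts = Counter()
    for tok in tokens:
        for category, words in lexicon.items():
            if tok in words:
                counts[category] += 1  # a token may belong to several categories
    total = len(tokens)
    return {
        category: (100.0 * counts[category] / total if total else 0.0)
        for category in lexicon
    }
```

Each tweet is thus mapped to one number per category, which can be appended to the
other feature groups.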


            Ngrams (ng): In this component, Term Frequency - Inverse Document
        Frequency schemes based on words (TFIDFW) or characters (TFIDFC) are used. For
        instance, with TFIDFW (see Table 1) the term bitch ranks first among the
        misogynous tweets but fourth among the non-misogynous ones. In other words,
        the uni-gram bitch has one weight in misogynous tweets and a different
        weight in non-misogynous ones. The same logic is followed for subtask2a
        and subtask2b, using the TFIDFW weights of their categories and targets respectively.

         ² https://www.receptiviti.ai/liwc-api-get-started


                                Table 1. Weight of uni-grams and bi-grams

                          uni-grams                                       bi-grams
               misogynous        non-misogynous           misogynous                non-misogynous
        N   term     weight      term    weight      term           weight      term          weight
        1   bitch    0.054913    rape    0.021782    stupid bitch   0.010204    stupid cunt   0.006159
        2   dick     0.027398    dick    0.019902    ass bitch      0.006658    son bitch     0.002429
        3   stupid   0.024436    cunt    0.019422    suck dick      0.004807    men rights    0.002079
        4   like     0.024388    bitch   0.018755
        5   woman    0.023752    hoe     0.017120

            Part of Speech (pos): The last component of our approach takes into
        account part-of-speech information, i.e. each word in a sentence is tagged
        with its appropriate part of speech: noun, adjective, verb, etc.
        Using this component we can identify some patterns; for instance,
        in our corpus some nouns are followed by repeated punctuation marks, e.g.
        bitch!!!!!!.


        3     Experiments and Results

        The organizers provided a training dataset of 3307 Spanish and 3251
        English tweets. Each tweet is labeled as misogynous (1) or
        non-misogynous (0), and both datasets are balanced. Regarding the type of misogyny
        and the target, each tweet is labeled with a category (discredit, dominance,
        sexual harassment, stereotype or derailing) and a target (active or passive).
        With respect to the category and target information, the corpus is unbalanced:
        the category is biased in favor of discredit (60%) and the target is biased in
        favor of active (almost 75%). Moreover, to evaluate our system, a test dataset
        with 831 unlabeled tweets in Spanish and 726 in English was provided.

            For the experiments, we employed a set of feature combinations to
        feed several classifiers: Support Vector Machine (SVM), Multi-layer
        Perceptron (MLP) and Multinomial Naive Bayes (MultinomialNB, MNB).
        SVM and 10-fold cross-validation were used. SVM was chosen because
        it performed well enough with thousands of features, and cross-validation
        helps us to guard against over-fitting in all the experiments.
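
The cross-validation splits can be sketched as below. The paper only states that
10 folds were used; stratifying the folds by label, the shuffling seed and the
function name stratified_kfold are assumptions made for this illustration.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k=10, seed=0):
    """Yield (train_indices, test_indices) for k stratified folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the indices of each class round-robin so that every fold
        # keeps roughly the class proportions of the full dataset.
        for position, index in enumerate(indices):
            folds[position % k].append(index)

    for held_out in range(k):
        test = sorted(folds[held_out])
        train = sorted(
            index
            for fold_id in range(k)
            if fold_id != held_out
            for index in folds[fold_id]
        )
        yield train, test
```

Each fold serves once as the held-out set while a classifier is trained on the
other nine, and the reported score is typically the mean over the ten folds.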

           Firstly, the main goal was to tackle the classification of misogynous tweets in
        Spanish, in order to apply the best-performing approach to the remaining subtasks
        in English or Spanish.


                             Table 2. Configuration of the main experiments

                      Name    Set up
                      ap1     TFIDFW + TFIDFC + BOW + BOC
                      ap2     SVD30(TFIDFW) + SVD30(TFIDFC) + BOW
                      ap3     MNB(PREDICTED) + SVD20(TFIDFW) + str + lc
                      ap4     BOW + str + lc + ng + pos
                      ap5     TFIDFW + str + lc + ng + pos


            Table 2 shows how the experiments were set up. Approaches ap1 and ap2
        aimed to find out whether features created by TFIDFW, TFIDFC, bag of
        word-grams (BOW) or bag of char-grams (BOC) are useful. ap1 uses the whole
        group of features (thousands of them) created by TFIDFW or TFIDFC, while
        ap2 keeps the 60 best features obtained with truncated singular value
        decomposition (SVD), 30 from TFIDFW and 30 from TFIDFC, and combines them
        with BOW. These approaches were interesting, but with none of the classifiers
        (MLP, SVM, MNB) did we obtain results above our baselines. ap3 tries to reduce
        the number of features: firstly, we classified each tweet using MNB and used the
        predicted class probabilities as features (2); additionally, we kept the best
        20 features using SVD on TFIDFW; and lastly, we added the features str (5) and
        lc (10). Unfortunately, with these 37 features we did not achieve results above
        our baselines in subtask1, subtask2a or subtask2b.
            We now analyze the results obtained with the approach
        proposed in Section 2. ap4 and ap5 follow the same logic, but ap5 obtains better
        results than ap4 because it uses TFIDFW. Tables 3 and 4 show the best values
        that we achieved. run4 in Table 3 uses TFIDFW plus structure, LIWC categories
        and weights of ngrams (unigrams + bigrams + trigrams) as features, and we obtained
        an accuracy of 0.782 applying a linear SVM on subtask1. For
        subtask2a, we added part of speech as a feature and obtained an F1-macro of 0.370.


                           Table 3. Results with ap5 on English training tweets

                                                  subtask1    subtask2a            subtask2b
        run    features (subtask1)                Accuracy    change   F1-macro    F1-macro
        run1   SVM on TFIDFW +str+lc              0.733       +pos     0.299       0.721
        run2   SVM on TFIDFW +str+lc+ng(u)        0.781       +pos     0.302       0.762
        run3   SVM on TFIDFW +str+lc+ng(u+b)      0.781       +pos     0.343       0.763
        run4   SVM on TFIDFW +str+lc+ng(u+b+t)    0.782       +pos     0.370       0.764


                           Table 4. Results with ap5 on Spanish training tweets

                                              subtask1    subtask2a            subtask2b
        run    features (subtask1)            Accuracy    change   F1-macro    change       F1-macro
        run1   SVM on TFIDFW +str+lc          0.804                0.472       -str         0.781
        run2   SVM on TFIDFW +str+lc+ng(b)    0.860       -lc      0.503       -str-ng(b)   0.780


           Looking at Table 4, we observe that in run2 we obtained an F1-macro
        of 0.780 in subtask2b using only lc as an additional feature (str and ng(bigram)
        removed). Likewise, using only str and ng(bigram) (lc removed), we obtained
        an F1-macro of 0.503 on subtask2a.


        3.1    Official ranking

        We did not expect good results in English (see Table 5), but we obtained scores
        slightly above the average F1-macro baseline (0.3374) in subtask2a and
        subtask2b (see run3 and run4), whereas in subtask1 we were below the accuracy
        baseline (0.7837). These results may be due to a poor combination of our features.


                  Table 5. Official results for English subtask1, subtask2a and subtask2b

                                 subtask1                         subtask2ab
                     Rank   Run                 Accuracy   Rank   Average F1-macro
                     16     Our approach.run2   0.7809     17     0.336433966
                     17     Our approach.run3   0.7809     14     0.33914113
                     18     Our approach.run4   0.7809     13     0.339590051
                     26     Our approach.run1   0.7094     23     0.316368399


        Table 6 shows the better results we obtained in Spanish (among the first five
        teams). However, we think that classifying misogynous tweets in this corpus was
        quite difficult, because the accuracy of the teams was approximately 80%.
        Similarly, in subtask2a and subtask2b, most teams were
        not far from the baseline.


                  Table 6. Official results for Spanish subtask1, subtask2a and subtask2b

                                subtask1                            subtask2ab
                    Rank   Run                 Accuracy      Rank   Average Macro F1
                    9      Our approach.run1   0.805054152   8      0.42722476
                    20     Our approach.run2   0.76654633    13     0.41174962
                    22     Our approach.run3   0.65944645    21     0.27271983



        4     Conclusions

        In this work, we proposed an approach that takes into account four aspects:
        weights of ngrams, LIWC categories, structural information and lexical analysis.
        We observed that each aspect contributes in some way to the different subtasks.
        Moreover, we noticed that all four aspects contributed to obtaining better
        accuracy and F1-macro on the corpus of English tweets, whereas only the first
        three aspects were useful for the Spanish tweets.
        As future work, it would be interesting to apply techniques for handling
        unbalanced datasets and to explore other features. Moreover, we plan to use
        deep learning to see what performance this technique could achieve.


        References
        1. Bailey, M.: Haters: Harassment, Abuse, and Violence Online by Bailey Poland.
           Signs: Journal of Women in Culture and Society 43(2), 495–497 (2018).
           https://doi.org/10.1086/693771
        2. Fersini, E., Anzovino, M., Rosso, P.: Overview of the Task on Automatic
           Misogyny Identification at IberEval. In: Proceedings of the Third Workshop on
           Evaluation of Human Language Technologies for Iberian Languages (IberEval 2018),
           co-located with 34th Conference of the Spanish Society for Natural Language
           Processing (SEPLN 2018). CEUR Workshop Proceedings. Seville, Spain, September 18,
           2018, CEUR-WS.org
        3. Hewitt, S., Tiropanis, T., Bokhove, C.: The Problem of Identifying Misogynist
           Language on Twitter (and Other Online Social Spaces). In: Proceedings of
           the 8th ACM Conference on Web Science. pp. 333–335. WebSci ’16 (2016),
           http://doi.acm.org/10.1145/2908131.2908183
        4. Poland, B.: Haters: Harassment, Abuse, and Violence Online. University of Nebraska
           Press (2016), http://www.jstor.org/stable/j.ctt1fq9wdp



</pre>