=Paper=
{{Paper
|id=Vol-2006/paper038
|storemode=property
|title=Mining Offensive Language on Social Media
|pdfUrl=https://ceur-ws.org/Vol-2006/paper038.pdf
|volume=Vol-2006
|authors=Serena Pelosi,Alessandro Maisto,Pierluigi Vitale,Simonetta Vietri
|dblpUrl=https://dblp.org/rec/conf/clic-it/PelosiMVV17
}}
==Mining Offensive Language on Social Media==
Mining Offensive Language on Social Media

Alessandro Maisto, Serena Pelosi, Simonetta Vietri, Pierluigi Vitale
University of Salerno, Department of Political, Social and Communication Science
Via Giovanni Paolo II, 132
{amaisto,spelosi,vietri,pvitale}@unisa.it

Abstract

English. The present research deals with the automatic annotation and classification of vulgar and offensive speech on social media. In this paper we test the effectiveness of the computational treatment of the taboo contents shared on the web; the output is a corpus of 31,749 Facebook comments which has been automatically annotated through a lexicon-based method for the automatic identification and classification of taboo expressions.

Italiano. This research addresses the automatic annotation and classification of the vulgar and offensive contents expressed on social media. The aim of our work is to test the effectiveness of the computational treatment of the taboo contents shared on the web. The output we provide is a corpus of 31,749 user-generated Facebook comments, automatically annotated through a lexicon-based method for the identification and classification of taboo expressions.

1 Introduction

Flaming, trolling, harassment, cyberbullying, cyberstalking and cyberthreats are all terms used to refer to vulgar and offensive contents shared on the web. The shapes can be different and the focus can be on various topics, such as physical appearance, ethnicity, sexuality, social acceptance and so forth.

Although taboo language is generally considered to be the strongest clue of harassment on the web, it must be clarified that the presence of bad words in posts does not necessarily indicate the presence of offensive behaviors. The words collected in vulgar lexicons are, in some cases, neutral or even positive. Moreover, profanity can be used with comical or satirical purposes, and bad words are often just the expression of strong emotions (Yin et al., 2009).

In this paper, we propose a system for the automatic treatment of vulgar and offensive utterances in Italian. The strength of our method is that lexical items are not considered in isolation. Instead, we recognize the power of the local context of the words, which can modulate the meaning of words, phrases and sentences.

Section 2 briefly illustrates the state-of-the-art contributions on offensive language modeling. Next, Section 3 describes the Italian lexical and grammatical resources for the automatic detection of taboo language in Italian. Then, Section 4 explains how we tested our method and resources on a Facebook corpus and describes the results of the automatic annotation of taboo expressions. Finally, Section 5 reports the future works that will enhance our research.

2 State of the Art on the Computational Treatment of Offensive Language

As anticipated, taboo words are basically considered a strong clue of online hate speech (Chen et al., 2012; Reynolds et al., 2011; Xu and Zhu, 2010; Yin et al., 2009; Mahmud et al., 2008). Nevertheless, methods that simply match offensive words stored in blacklists are clearly not able to reach high levels of accuracy. Consistent with this idea, in recent years many studies on offensive cyberbullying and flame detection have integrated the context of bad words into their methods and tools. Chen et al. (2012) exploited a Lexical Syntactic Feature (LSF) architecture to detect offensive content and identify potentially offensive users in social media. Xu and Zhu (2010) proposed a sentence-level semantic filtering approach that combined grammatical relations with offensive words. Insulting phrases and derogatory comparisons of human beings with insulting items or animals were the clues used by Mahmud et al. (2008). Razavi et al. (2010) proposed an automatic flame detection method based on a variety of statistical models and rule-based patterns; among the flame topics they identified, there are attacks and abuses that embarrass the readers. Xiang et al. (2012) learned topic models from a dataset of tweets through the Latent Dirichlet Allocation (LDA) algorithm. Waseem and Hovy (2016) and Kwok and Wang (2013) focused on racist and sexist slurs on Twitter; Waseem and Hovy (2016) made reference to hate speech expressed without any derogatory term, while Kwok and Wang (2013) focused on the relation between the tweet content and the identity of the user, on the basis of which a post is considered to be racist or not. Badjatiya et al. (2017) also used Twitter in order to investigate the application of deep neural network architectures.

3 Lexical and Grammatical Resources

In this section we describe the Italian lexical database and the grammatical rules which have been used as indicators for the automatic identification of taboo language.
The items of the lexicon are labeled through the use of the following three main categories:

• Trait, which specifies whether the taboo expression is addressed to other users, to events or to things;

• Type, which verifies whether an expression is offensive, whether it represents a threat, or whether it is just rudeness;

• Semantic Field, which specifies the taboo domain (namely sex, sexism, aesthetics, behavior, homophobia, racism, scatology).

Such tags have been collected and classified by a team of four annotators (one linguist and three Italian native speakers), who annotated the linguistic resources with an agreement of 92%. Taboo words which were impossible to classify under a defined semantic field have been annotated with the residual category "N.C.". Our taboo lexicon is composed of:

• Simple Words, which include nouns, adjectives, verbs and adverbs collected from the Sentiment Lexicon SentIta (Pelosi, 2015) and manually evaluated with reference to the categories described above;

• Multiword Expressions (MWE), which are nouns automatically annotated through the integrated use of the simple words list and ad hoc regular expressions (see Section 3.2);

• Idiomatic Structures, which are verb + frozen complement constructions collected from Vietri (2014) and manually annotated on the grounds of the hate speech tags.

This choice is due to the fact that in colloquial and informal situations a taboo expression can work simply as an intensifier, also in positive sentences (e.g. it's fucking nice!). This is why the semantic orientation of words must be modulated, case by case, when they occur in the context of (semi-)frozen structures. Concrete examples are idiomatic structures that involve concrete nouns indicating body parts (with a vulgar meaning) as fixed constituents (e.g. essere culo e camicia, "to be thick as thieves").

3.1 Simple Vulgar Words

Our project is grounded on a collection of 342 taboo simple words that include the following grammatical categories: nouns, verbs, adjectives, adverbs and exclamations. Nouns count 242 entries, among which 216 are simple words (e.g. cozza, "mussel", addressed to ugly women) and 26 are monorematic compounds (e.g. rompiballe, "pain in the ass"). Verbs count 72 entries, among which 27 are verbs indicating bodily predicates that involve acts of violence (e.g. violentare, "to rape") and 21 are pro-complementary and pronominal verbs (e.g. incazzarsi, "to get mad"). Adjectives count 16 entries (e.g. cazzuto, "die-hard"), adverbs 4 entries (e.g. incazzosamente, "grumpily") and exclamations 8 entries (e.g. vaffanculo, "fuck off").
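The three-category labeling scheme described above can be pictured as a small data structure. The following is a minimal sketch: the class name, attribute names and sample labels are our own illustrative assumptions, not the authors' actual resource format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TabooEntry:
    """One lexicon item with the paper's three labels (names are hypothetical)."""
    lemma: str             # e.g. "cozza"
    pos: str               # noun, verb, adjective, adverb, exclamation
    trait: Optional[str]   # target: user, event or thing
    type_: Optional[str]   # offense, threat or rudeness
    field: Optional[str]   # sex, sexism, aesthetics, behavior, homophobia,
                           # racism, scatology, or the residual "N.C."

# Hypothetical sample entries, labeled for illustration only.
LEXICON = {
    "cozza": TabooEntry("cozza", "noun", "user", "offense", "aesthetics"),
    "cazzo": TabooEntry("cazzo", "noun", None, None, "N.C."),
}

def lookup(token: str) -> Optional[TabooEntry]:
    """Return the lexicon entry for a token, if any."""
    return LEXICON.get(token.lower())

print(lookup("Cazzo").field)  # "N.C.": ambiguous out of context
```

The residual "N.C." value is the hook for Section 3.2, where the multiword context supplies the missing field label.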
3.2 Taboo Multiword Structures

The simple words listed in our database, especially the ones with an uncertain semantic orientation (see "N.C." in Figure 1), can be part of frozen or semi-frozen expressions that make clear, for each occurrence, the actual meaning of the words in context.

Idioms are particularly interesting in a work on online harassment, because they are open to wordplay and trolling. Indeed, a higher-than-expected presence of idiomatic structures must be reported in our corpus. Nevertheless, their syntactic flexibility and lexical variation make them very difficult to locate automatically, compared with other multiword expressions. A very typical Italian example is cazzo, "dick", with its more or less vulgar stylistic and regional variants (e.g. minchia, pirla, cavolo "cabbage", cacchio "dang", mazza "stick", tubo "pipe", corno "horn", etc.).

The context systematically gives the word under examination a clear connotation. Examples are (negative) adverbial and adjectival expressions (e.g. a cazzo, "fucked up"); (emphatic) exclamations and interrogative forms (e.g. che cazzo, "what the hell"); intensifications of negations (e.g. non V un cazzo, "don't V shit").
Multiword Expressions. By Multiword Expressions we mean sequences of simple words separated by blanks, characterized by semantic atomicity, restriction of distribution, shared and established use, and lack of ambiguity. In this research, we automatically located and annotated MWEs through the combined use of the taboo simple words, which trigger the recognition, and a set of regular expressions (based on part-of-speech patterns) which locate the MWEs (e.g. culo rotto, "lucky", from the simple noun culo and the pattern NA). Other MWEs are those related to idioms (see the next paragraph, e.g. rottura di palle, "nuisance"). The patterns used to identify the taboo MWEs are summarized below:

• Taboo Noun + Preposition + Noun (NPN)

• Noun + Preposition + Taboo Noun (NPN)

• Taboo Noun + Adjective (NA)

• Noun + Taboo Adjective (NA)

• Adjective + Taboo Noun (AN)

• Taboo Adjective + Noun (AN)

Idiomatic Expressions. Among the possible idiomatic structures, the present research focuses on those idioms (a verb and at least one frozen complement) which have vulgar body-part nouns as frozen complement. The lexical resources used in this research are composed of 52 items that include 28 ordinary verb structures (e.g. girare le palle, "to bust the balls") and 23 support verb idioms (e.g. avere culo, "to be lucky"). The classes to which they belong (Vietri, 2014) are various and can be in systematic correlation, as happens with girare le palle / avere le palle girate, "to bust the balls / to have the balls busted". The idioms under examination can also be related to derived nominals in -tore, -trice, -ura, -ata (e.g. rottura di palle, "pain in the arse") and/or to VC compounds (verb + fixed constituent, e.g. rompipalle, "ball-buster"). These compounds occur in the corpus as both simple words and multiword units.

The automatic recognition of taboo idioms, similarly to that of MWEs, starts from the nouns indicating vulgar body parts and proceeds with another lexical anchor associated to the idiom in the lexical resources (e.g. girare le palle, "to piss off", is annotated in the corpus when the tool locates at the same time palle and girare within a maximum distance of three word forms). This procedure streamlines the automatic recognition of the idioms, guaranteeing high levels of recall in spite of the large variety of syntactic transformations that the frozen structure can go through (causative constructions, infinitive forms preceded by da, dislocation and modification, among others (Vietri, 2014)).
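The two recognition strategies above, POS-pattern matching for MWEs and anchor-pair matching within a three-word window for idioms, can be sketched as follows. This is a simplified illustration, not the authors' actual grammars: input is assumed to be already tokenized and POS-tagged, the word lists are tiny stand-ins for the real lexicon, and a real system would also need lemmatization to catch inflected verb forms.

```python
# Tiny illustrative stand-ins for the taboo lexicon and idiom resources.
TABOO_NOUNS = {"culo", "palle", "cazzo"}

# Idioms stored as (body-part anchor, second lexical anchor), as in the
# paper's example girare le palle: both anchors must co-occur closely.
IDIOMS = {("palle", "girare"): "girare le palle"}

def match_mwes(tagged):
    """Find NA-pattern MWEs: a taboo noun immediately followed by an adjective.

    `tagged` is a list of (word, pos) pairs; only one of the six patterns
    from the paper is shown here for brevity.
    """
    hits = []
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        if t1 == "N" and w1 in TABOO_NOUNS and t2 == "A":
            hits.append(f"{w1} {w2}")
    return hits

def match_idioms(tokens, max_dist=3):
    """Annotate an idiom when both anchors occur within max_dist word forms."""
    hits = []
    forms = [t.lower() for t in tokens]
    for (noun, verb), idiom in IDIOMS.items():
        for i, tok in enumerate(forms):
            if tok == noun:
                window = forms[max(0, i - max_dist): i + max_dist + 1]
                if verb in window:
                    hits.append(idiom)
    return hits

print(match_mwes([("culo", "N"), ("rotto", "A")]))       # ['culo rotto']
print(match_idioms(["fai", "girare", "le", "palle"]))    # ['girare le palle']
```

The window-based match deliberately ignores the syntactic material between the two anchors, which is what makes the procedure robust to dislocation, modification and the other transformations listed above, at the cost of some precision.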
4 Experiment and Evaluation

The linguistic resources described so far have been tested on a large corpus of User-Generated Content scraped from Facebook. We chose an Italian Facebook page called Sesso Droga e Pastorizia, which became popular for its explicit and offensive contents. The page was shut down on 10/03/2017 for violating the social network's policy; therefore, the page's administrators had created a set of connected pages in order to continue the activity in case of temporary or definitive closure. For our experimentation, posts and comments have been extracted from the three pages corresponding to the following indices: sessodrogapastorizia1, sessodrogapastorizia3 and sessodrogapastoriziariserva.

The corpus includes 31,749 comments published between 28 March 2017 and 13 April 2017 by over 20 thousand users, replying to 122 statuses. By applying our dictionaries and grammars to the corpus, we extracted 2,797 taboo expressions with a Recall of 97% and a Precision of 83%.¹

¹ The Recall has been evaluated on the entire corpus of over 31,000 comments, while the Precision has been calculated on the 2,700+ extracted sentences.

Figure 1: Extracted words occurrence, types and fields.

Figure 1 is a bubble chart which illustrates the distribution of the Semantic Fields and Types of the extracted words. The fields are listed on the horizontal axis; the vertical axis and the size of the circles together describe the frequency of the extracted items; finally, the colors of the circles represent the words' Types.

As far as MWEs are concerned, we extracted 134 idioms; moreover, 597 MWEs have been annotated as NPN structures, 213 as NA and 175 as AN.
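The evaluation measures above follow the standard definitions, with the footnote's twist that the two are computed over different sets (recall against the whole corpus, precision over the extracted sentences). A minimal sketch, with confusion counts that are hypothetical and chosen only for illustration:

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Standard measures for the extraction task:
    precision = TP / (TP + FP), computed on the extracted sentences;
    recall    = TP / (TP + FN), computed against the whole corpus.
    """
    return tp / (tp + fp), tp / (tp + fn)

# Hypothetical counts for illustration only, not the experiment's real data.
p, r = precision_recall(tp=83, fp=17, fn=3)
print(round(p, 2), round(r, 2))  # 0.83 0.97
```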
Among the most frequent MWEs we mention some items which were already listed in the dictionaries (e.g. 9 occurrences of testa di cazzo, "dickhead"; 7 occurrences of pezzo di merda, "piece of shit"). New vulgar structures belonging to various fields have also been automatically located through our strategy (e.g. 51 occurrences of cazzo duro, "hard-on", from the sex field; 3 occurrences of gran troia, "total slut", from the sexism field; 2 occurrences of busta di piscio, "box of piss", from the scatology field).

The extracted patterns underline the relevance of the local context in the disambiguation of some words which, as simple words, had been classified N.C. because of their ambiguity out of context. An example is cazzo which, alone, did not receive any field or type label, but which as part of a MWE clearly belongs to defined categories: cazzo duro belongs to the sex field; cazzo di + Noun, "this fucking + N", is a generic offense (e.g. cazzo di pagina, "this fucking page"); and cazzo di + Taboo Noun represents an intensification of the expressed offensive term (e.g. cazzo di zingaro, "this fucking gypsy").
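The context-driven relabeling just described can be sketched as a lookup that prefers the label of a matching MWE over the "N.C." label of its head word. Both tables below are tiny illustrative stand-ins for the real resources, and the field assigned to the second MWE is our own assumption.

```python
# Hypothetical resource fragments: a simple word labeled "N.C." in isolation,
# and MWEs carrying definite Semantic Field labels (labels assumed here).
SIMPLE_FIELD = {"cazzo": "N.C."}
MWE_FIELD = {"cazzo duro": "sex", "cazzo di zingaro": "racism"}

def field_of(tokens: list[str]) -> str:
    """Prefer the field of the longest matching MWE over the simple-word label."""
    text = " ".join(tokens).lower()
    for mwe, field in sorted(MWE_FIELD.items(), key=lambda kv: -len(kv[0])):
        if mwe in text:
            return field
    return SIMPLE_FIELD.get(tokens[0].lower(), "N.C.")

print(field_of(["cazzo"]))          # "N.C.": no label in isolation
print(field_of(["cazzo", "duro"]))  # "sex": disambiguated by the MWE
```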
5 Conclusion

In this paper we described an experiment on the detection and classification of the offenses, threats and insults shared through User-Generated Content. As a matter of fact, in May 2016 the European Commission, together with companies like Facebook, Twitter, YouTube and Microsoft, underlined the relevance of these topics by presenting a code of conduct² which aimed to constrain the virality of illegal online violence and hate speech, with a special focus on utterances fomenting racism, xenophobia and terrorist contents. The negative impact of such practices is not limited to individuals, but strongly affects the freedom of expression and the democratic discourse on the Web.

² http://ec.europa.eu/justice/fundamental-rights/files/hate_speech_code_of_conduct_en.pdf

Our research focused on a particular Facebook page, which became famous in Italy for the number of times it has been shut down due to its disturbing content. More than 31,000 users' comments downloaded from this page have been automatically annotated according to a dataset of taboo expressions, in the form of simple words and multiword expressions. This operation has led to a hate-speech-annotated corpus which distinguishes eight harassment semantic fields, four types of insult and four hate targets (traits). The evaluation of the experiment's performance confirmed the hypothesis that the local context of words represents an essential feature for effective hate speech mining on the web.

In future works we will test the interaction of the taboo items located in the corpus with Italian Contextual Valence Shifters (Maisto and Pelosi, 2014), in order to verify whether the sentence context of the insult indicators affects the semantic orientation of the items from an Opinion Mining point of view. Furthermore, it would be interesting to verify the efficacy of our resources and our method on different domains, Political Communication among others. Finally, since the automatic extraction in this paper has been performed on a very polarized corpus, future analyses will focus on testing the reliability of this research on more neutral collections of texts.

References

Pinkesh Badjatiya, Shashank Gupta, Manish Gupta, and Vasudeva Varma. 2017. Deep learning for hate speech detection in tweets. In Proceedings of the 26th International Conference on World Wide Web Companion, pages 759–760. International World Wide Web Conferences Steering Committee.

Ying Chen, Yilu Zhou, Sencun Zhu, and Heng Xu. 2012. Detecting offensive language in social media to protect adolescent online safety. In Privacy, Security, Risk and Trust (PASSAT), 2012 International Conference on and 2012 International Conference on Social Computing (SocialCom), pages 71–80. IEEE.

Irene Kwok and Yuzhou Wang. 2013. Locate the hate: Detecting tweets against blacks. In AAAI.

Altaf Mahmud, Kazi Zubair Ahmed, and Mumit Khan. 2008. Detecting flames and insults in text. In Proceedings of the Sixth International Conference on Natural Language Processing. BRAC University.

Alessandro Maisto and Serena Pelosi. 2014. A lexicon-based approach to sentiment analysis. The Italian module for NooJ. In Proceedings of the International NooJ 2014 Conference, University of Sassari, Italy. Cambridge Scholars Publishing.

Serena Pelosi. 2015. SentIta and Doxa: Italian databases and tools for sentiment analysis purposes. In Proceedings of the Second Italian Conference on Computational Linguistics CLiC-it 2015, pages 226–231. Accademia University Press.

Amir Razavi, Diana Inkpen, Sasha Uritsky, and Stan Matwin. 2010. Offensive language detection using multi-level classification. Advances in Artificial Intelligence, pages 16–27.

Kelly Reynolds, April Kontostathis, and Lynne Edwards. 2011. Using machine learning to detect cyberbullying. In Machine Learning and Applications and Workshops (ICMLA), 2011 10th International Conference on, volume 2, pages 241–244. IEEE.

Simonetta Vietri. 2014. Idiomatic Constructions in Italian: A Lexicon-Grammar Approach, volume 31. John Benjamins Publishing Company.

Zeerak Waseem and Dirk Hovy. 2016. Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter. In SRW@HLT-NAACL, pages 88–93.

Guang Xiang, Bin Fan, Ling Wang, Jason Hong, and Carolyn Rose. 2012. Detecting offensive tweets via topical feature discovery over a large scale Twitter corpus. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, pages 1980–1984. ACM.

Zhi Xu and Sencun Zhu. 2010. Filtering offensive language in online communities using grammatical relations. In Proceedings of the Seventh Annual Collaboration, Electronic Messaging, Anti-Abuse and Spam Conference.

Dawei Yin, Zhenzhen Xue, Liangjie Hong, Brian D. Davison, April Kontostathis, and Lynne Edwards. 2009. Detection of harassment on Web 2.0. In Proceedings of the Content Analysis in the WEB, volume 2, pages 1–7.