=Paper=
{{Paper
|id=Vol-2006/paper038
|storemode=property
|title=Mining Offensive Language on Social Media
|pdfUrl=https://ceur-ws.org/Vol-2006/paper038.pdf
|volume=Vol-2006
|authors=Serena Pelosi,Alessandro Maisto,Pierluigi Vitale,Simonetta Vietri
|dblpUrl=https://dblp.org/rec/conf/clic-it/PelosiMVV17
}}
==Mining Offensive Language on Social Media==
Mining Offensive Language on Social Media

Alessandro Maisto, Serena Pelosi, Simonetta Vietri, Pierluigi Vitale
University of Salerno, Department of Political, Social and Communication Science
Via Giovanni Paolo II, 132
{amaisto,spelosi,vietri,pvitale}@unisa.it

Abstract

English. The present research deals with the automatic annotation and classification of vulgar and offensive speech on social media. In this paper we test the effectiveness of the computational treatment of the taboo contents shared on the web; the output is a corpus of 31,749 Facebook comments which has been automatically annotated through a lexicon-based method for the automatic identification and classification of taboo expressions.

Italiano. This research addresses the automatic annotation and classification of the vulgar and offensive contents expressed on social media. The aim of our work is to test the effectiveness of the computational treatment of the taboo contents shared on the web. The output we provide is a corpus of 31,749 user-generated Facebook comments, automatically annotated through a lexicon-based method for the identification and classification of taboo expressions.

1 Introduction

Flaming, trolling, harassment, cyberbullying, cyberstalking and cyberthreats are all terms used to refer to vulgar and offensive contents shared on the web. The shapes can be different and the focus can be on various topics, such as physical appearance, ethnicity, sexuality, social acceptance and so forth.

Although taboo language is generally considered to be the strongest clue of harassment on the web, it must be clarified that the presence of bad words in posts does not necessarily indicate the presence of offensive behaviors. The words collected in vulgar lexicons are, in some cases, neutral or even positive. Moreover, profanity can be used with comical or satirical purposes, and bad words are often just the expression of strong emotions (Yin et al., 2009).

In this paper, we propose a system for the automatic treatment of vulgar and offensive utterances in Italian. The strength of our method is that lexical items are not considered in isolation. Instead, we recognize the power of the local context of the words, which can modulate the meaning of words, phrases and sentences.

Section 2 briefly illustrates the state-of-the-art contributions on offensive language modeling. Next, Section 3 describes the Italian lexical and grammatical resources for the automatic detection of taboo language in Italian. Then, Section 4 explains how we tested our method and resources on a Facebook corpus and describes the results of the automatic annotation of taboo expressions. Finally, Section 5 reports the future works that will enhance our research.

2 State of the Art on the Computational Treatment of Offensive Language

As anticipated, taboo words are basically considered a strong clue of online hate speech (Chen et al., 2012; Reynolds et al., 2011; Xu and Zhu, 2010; Yin et al., 2009; Mahmud et al., 2008). Nevertheless, methods that simply match offensive words stored in blacklists are clearly not able to reach high levels of accuracy. Consistent with this idea, in recent years many studies on offensive cyberbullying and flame detection have integrated the context of bad words into their methods and tools. Chen et al. (2012) exploited a Lexical Syntactic Feature (LSF) architecture to detect offensive content and identify potentially offensive users in social media. Xu and Zhu (2010) proposed a sentence-level semantic filtering approach that combined grammatical relations with offensive words. Insulting phrases and derogatory comparisons of human beings with insulting items or animals were the clues used by Mahmud et al. (2008). Razavi et al. (2010) proposed an automatic flame detection method based on a variety of statistical models and rule-based patterns; among the flame topics they identified, there are attacks and abuses that embarrass the readers. Xiang et al. (2012) learned topic models from a dataset of tweets through the Latent Dirichlet Allocation (LDA) algorithm. Waseem and Hovy (2016) and Kwok and Wang (2013) focused on racist and sexist slurs on Twitter; Waseem and Hovy (2016) made reference to hate speech expressed without any derogatory term, while Kwok and Wang (2013) focused on the relation between the tweet content and the identity of the user, on the basis of which a post is considered to be racist or not. Badjatiya et al. (2017) also used Twitter in order to investigate the application of deep neural network architectures.

3 Lexical and Grammatical Resources

In this section we describe the Italian lexical database and the grammatical rules which have been used as indicators for the automatic identification of taboo language.
The items of the lexicon are labeled through the use of the following three main categories:

• Trait, which specifies whether the taboo expression is addressed to other users, to events or to things;

• Type, which verifies whether an expression is offensive, whether it represents a threat, or whether it is just rudeness;

• Semantic Field, which specifies the taboo domain (namely sex, sexism, aesthetics, behavior, homophobia, racism, scatology).

Such tags have been collected and classified by a team of four annotators (one linguist and three Italian native speakers), who annotated the linguistic resources with an agreement of 92%. Taboo words which were impossible to classify under a defined semantic field have been annotated with the residual category "N.C.". Our taboo lexicon is composed of:

• Simple Words, which include nouns, adjectives, verbs and adverbs collected from the Sentiment Lexicon SentIta (Pelosi, 2015) and manually evaluated with reference to the categories described above;

• Multiword Expressions (MWE), which are nouns automatically annotated through the integrated use of the simple words list and ad hoc regular expressions (see Section 3.2);

• Idiomatic Structures, which are verb + frozen complement constructions collected from Vietri (2014) and manually annotated on the grounds of the hate speech tags.

This choice is due to the fact that in colloquial and informal situations a taboo expression can work simply as an intensifier, also in positive sentences (e.g. it's fucking nice!). This is why the semantic orientation of words must be modulated, case by case, when they occur in the context of (semi-)frozen structures. Concrete examples are idiomatic structures that involve concrete nouns indicating body parts (with a vulgar meaning) as fixed constituents (e.g. essere culo e camicia, "to be thick as thieves").

3.1 Simple Vulgar Words

Our project is grounded on a collection of 342 taboo simple words that include the following grammatical categories: nouns, verbs, adjectives, adverbs and exclamations. Nouns count 242 entries, among which 216 are simple words (e.g. cozza, "mussel", addressed to ugly women) and 26 are monorematic compounds (e.g. rompiballe, "pain in the ass"). Verbs count 72 entries, among which 27 are verbs indicating bodily predicates that involve acts of violence (e.g. violentare, "to rape") and 21 are pro-complementary and pronominal verbs (e.g. incazzarsi, "to get mad"). Adjectives count 16 entries (e.g. cazzuto, "die-hard"), adverbs 4 entries (e.g. incazzosamente, "grumpily") and exclamations 8 entries (e.g. vaffanculo, "fuck off").
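The three-category labeling scheme described above can be pictured as a small data structure. The following is a minimal sketch: the class name, attribute names and sample labels are our own illustrative assumptions, not the authors' actual resource format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TabooEntry:
    """One lexicon item with the paper's three labels (names are hypothetical)."""
    lemma: str             # e.g. "cozza"
    pos: str               # noun, verb, adjective, adverb, exclamation
    trait: Optional[str]   # target: user, event or thing
    type_: Optional[str]   # offense, threat or rudeness
    field: Optional[str]   # sex, sexism, aesthetics, behavior, homophobia,
                           # racism, scatology, or the residual "N.C."

# Hypothetical sample entries, labeled for illustration only.
LEXICON = {
    "cozza": TabooEntry("cozza", "noun", "user", "offense", "aesthetics"),
    "cazzo": TabooEntry("cazzo", "noun", None, None, "N.C."),
}

def lookup(token: str) -> Optional[TabooEntry]:
    """Return the lexicon entry for a token, if any."""
    return LEXICON.get(token.lower())

print(lookup("Cazzo").field)  # "N.C.": ambiguous out of context
```

The residual "N.C." value is the hook for Section 3.2, where the multiword context supplies the missing field label.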
3.2 Taboo Multiword Structures

The simple words listed in our database, especially the ones with an uncertain semantic orientation (see "N.C." in Figure 1), can be part of frozen or semi-frozen expressions that make clear, for each occurrence, the actual meaning of the words in context.

Idioms are particularly interesting in a work on online harassment, because they are open to wordplay and trolling. Indeed, a higher-than-expected presence of idiomatic structures must be reported in our corpus. Nevertheless, their syntactic flexibility and lexical variation make them very difficult to locate automatically, compared with other multiword expressions. A very typical Italian example is cazzo, "dick", with its more or less vulgar stylistic and regional variants (e.g. minchia, pirla, cavolo "cabbage", cacchio "dang", mazza "stick", tubo "pipe", corno "horn", etc.).

The context systematically gives the word under examination a clear connotation. Examples are (negative) adverbial and adjectival expressions (e.g. a cazzo, "fucked up"); (emphatic) exclamations and interrogative forms (e.g. che cazzo, "what the hell"); intensifications of negations (e.g. non V un cazzo, "don't V shit").
Multiword Expressions. By Multiword Expressions we mean sequences of simple words separated by blanks, characterized by semantic atomicity, restriction of distribution, shared and established use, and lack of ambiguity. In this research, we automatically located and annotated MWEs through the combined use of the taboo simple words, which trigger the recognition, and a set of regular expressions (based on part-of-speech patterns) which locate the MWEs (e.g. culo rotto, "lucky", from the simple noun culo and the pattern NA). Other MWEs are those related to idioms (see the next paragraph, e.g. rottura di palle, "nuisance"). The patterns used to identify the taboo MWEs are summarized below:

• Taboo Noun + Preposition + Noun (NPN)

• Noun + Preposition + Taboo Noun (NPN)

• Taboo Noun + Adjective (NA)

• Noun + Taboo Adjective (NA)

• Adjective + Taboo Noun (AN)

• Taboo Adjective + Noun (AN)

Idiomatic Expressions. Among the possible idiomatic structures, the present research focuses on those idioms (a verb and at least one frozen complement) which have vulgar body-part nouns as frozen complement. The lexical resources used in this research are composed of 52 items that include 28 ordinary verb structures (e.g. girare le palle, "to bust the balls") and 23 support verb idioms (e.g. avere culo, "to be lucky"). The classes to which they belong (Vietri, 2014) are various and can be in systematic correlation, as happens with girare le palle / avere le palle girate, "to bust the balls / to have the balls busted". The idioms under examination can also be related to derived nominals in -tore, -trice, -ura, -ata (e.g. rottura di palle, "pain in the arse") and/or to VC compounds (verb + fixed constituent, e.g. rompipalle, "ball-buster"). These compounds occur in the corpus as both simple words and multiword units.

The automatic recognition of taboo idioms, similarly to that of MWEs, starts from the nouns indicating vulgar body parts and proceeds with another lexical anchor associated to the idiom in the lexical resources (e.g. girare le palle, "to piss off", is annotated in the corpus when the tool locates at the same time palle and girare within a maximum distance of three word forms). This procedure streamlines the automatic recognition of the idioms, guaranteeing high levels of recall in spite of the large variety of syntactic transformations that the frozen structure can go through (causative constructions, infinitive forms preceded by da, dislocation and modification, among others (Vietri, 2014)).
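The two recognition strategies above, POS-pattern matching for MWEs and anchor-pair matching within a three-word window for idioms, can be sketched as follows. This is a simplified illustration, not the authors' actual grammars: input is assumed to be already tokenized and POS-tagged, the word lists are tiny stand-ins for the real lexicon, and a real system would also need lemmatization to catch inflected verb forms.

```python
# Tiny illustrative stand-ins for the taboo lexicon and idiom resources.
TABOO_NOUNS = {"culo", "palle", "cazzo"}

# Idioms stored as (body-part anchor, second lexical anchor), as in the
# paper's example girare le palle: both anchors must co-occur closely.
IDIOMS = {("palle", "girare"): "girare le palle"}

def match_mwes(tagged):
    """Find NA-pattern MWEs: a taboo noun immediately followed by an adjective.

    `tagged` is a list of (word, pos) pairs; only one of the six patterns
    from the paper is shown here for brevity.
    """
    hits = []
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        if t1 == "N" and w1 in TABOO_NOUNS and t2 == "A":
            hits.append(f"{w1} {w2}")
    return hits

def match_idioms(tokens, max_dist=3):
    """Annotate an idiom when both anchors occur within max_dist word forms."""
    hits = []
    forms = [t.lower() for t in tokens]
    for (noun, verb), idiom in IDIOMS.items():
        for i, tok in enumerate(forms):
            if tok == noun:
                window = forms[max(0, i - max_dist): i + max_dist + 1]
                if verb in window:
                    hits.append(idiom)
    return hits

print(match_mwes([("culo", "N"), ("rotto", "A")]))       # ['culo rotto']
print(match_idioms(["fai", "girare", "le", "palle"]))    # ['girare le palle']
```

The window-based match deliberately ignores the syntactic material between the two anchors, which is what makes the procedure robust to dislocation, modification and the other transformations listed above, at the cost of some precision.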
4 Experiment and Evaluation

The linguistic resources described so far have been tested on a large corpus of User-Generated Content scraped from Facebook. We chose an Italian Facebook page called Sesso Droga e Pastorizia, which became popular for its explicit and offensive contents. The page was shut down on 10/03/2017 for violating the social network's policy; therefore, the page's administrators had created a set of connected pages in order to continue the activity in case of temporary or definitive closure. For our experimentation, posts and comments have been extracted from the three pages corresponding to the following indices: sessodrogapastorizia1, sessodrogapastorizia3 and sessodrogapastoriziariserva.

The corpus includes 31,749 comments published between 28 March 2017 and 13 April 2017 by over 20 thousand users, replying to 122 statuses. By applying our dictionaries and grammars to the corpus, we extracted 2,797 taboo expressions with a Recall of 97% and a Precision of 83%.¹

¹ The Recall has been evaluated on the entire corpus of over 31,000 comments, while the Precision has been calculated on the 2,700+ extracted sentences.

Figure 1: Extracted words occurrence, types and fields.

Figure 1 is a bubble chart which illustrates the distribution of the Semantic Fields and Types of the extracted words. The fields are listed on the horizontal axis; the vertical axis and the size of the circles together describe the frequency of the extracted items; finally, the colors of the circles represent the words' Types.

As far as MWEs are concerned, we extracted 134 idioms; moreover, 597 MWEs have been annotated as NPN structures, 213 as NA and 175 as AN.
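The evaluation measures above follow the standard definitions, with the footnote's twist that the two are computed over different sets (recall against the whole corpus, precision over the extracted sentences). A minimal sketch, with confusion counts that are hypothetical and chosen only for illustration:

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Standard measures for the extraction task:
    precision = TP / (TP + FP), computed on the extracted sentences;
    recall    = TP / (TP + FN), computed against the whole corpus.
    """
    return tp / (tp + fp), tp / (tp + fn)

# Hypothetical counts for illustration only, not the experiment's real data.
p, r = precision_recall(tp=83, fp=17, fn=3)
print(round(p, 2), round(r, 2))  # 0.83 0.97
```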
Among the most frequent MWEs we mention some items which were already listed in the dictionaries (e.g. 9 occurrences of testa di cazzo, "dickhead"; 7 occurrences of pezzo di merda, "piece of shit"). New vulgar structures belonging to various fields have also been automatically located through our strategy (e.g. 51 occurrences of cazzo duro, "hard-on", from the sex field; 3 occurrences of gran troia, "total slut", from the sexism field; 2 occurrences of busta di piscio, "box of piss", from the scatology field).

The extracted patterns underline the relevance of the local context in the disambiguation of some words which, as simple words, had been classified N.C. because of their ambiguity out of context. An example is cazzo which, alone, did not receive any field or type label, but which as part of a MWE clearly belongs to defined categories: cazzo duro belongs to the sex field; cazzo di + Noun, "this fucking + N", is a generic offense (e.g. cazzo di pagina, "this fucking page"); and cazzo di + Taboo Noun represents an intensification of the expressed offensive term (e.g. cazzo di zingaro, "this fucking gypsy").
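The context-driven relabeling just described can be sketched as a lookup that prefers the label of a matching MWE over the "N.C." label of its head word. Both tables below are tiny illustrative stand-ins for the real resources, and the field assigned to the second MWE is our own assumption.

```python
# Hypothetical resource fragments: a simple word labeled "N.C." in isolation,
# and MWEs carrying definite Semantic Field labels (labels assumed here).
SIMPLE_FIELD = {"cazzo": "N.C."}
MWE_FIELD = {"cazzo duro": "sex", "cazzo di zingaro": "racism"}

def field_of(tokens: list[str]) -> str:
    """Prefer the field of the longest matching MWE over the simple-word label."""
    text = " ".join(tokens).lower()
    for mwe, field in sorted(MWE_FIELD.items(), key=lambda kv: -len(kv[0])):
        if mwe in text:
            return field
    return SIMPLE_FIELD.get(tokens[0].lower(), "N.C.")

print(field_of(["cazzo"]))          # "N.C.": no label in isolation
print(field_of(["cazzo", "duro"]))  # "sex": disambiguated by the MWE
```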
5 Conclusion

In this paper we described an experiment on the detection and classification of the offenses, threats and insults shared through User-Generated Content. As a matter of fact, in May 2016 the European Commission, together with companies like Facebook, Twitter, YouTube and Microsoft, underlined the relevance of these topics by presenting a code of conduct² which aimed to constrain the virality of illegal online violence and hate speech, with a special focus on utterances fomenting racism, xenophobia and terrorist contents. The negative impact of such practices is not limited to individuals, but strongly affects the freedom of expression and the democratic discourse on the Web.

² http://ec.europa.eu/justice/fundamental-rights/files/hate_speech_code_of_conduct_en.pdf

Our research focused on a particular Facebook page, which became famous in Italy for the number of times it has been shut down due to its disturbing content. More than 31,000 users' comments downloaded from this page have been automatically annotated according to a dataset of taboo expressions, in the form of simple words and multiword expressions. This operation has led to a hate-speech-annotated corpus which distinguishes eight harassment semantic fields, four types of insult and four hate targets (traits). The evaluation of the experiment's performance confirmed the hypothesis that the local context of words represents an essential feature for effective hate speech mining on the web.

In future works we will test the interaction of the taboo items located in the corpus with Italian Contextual Valence Shifters (Maisto and Pelosi, 2014), in order to verify whether the sentence context of the insult indicators affects the semantic orientation of the items from an Opinion Mining point of view. Furthermore, it would be interesting to verify the efficacy of our resources and our method on different domains, Political Communication among others. Finally, since the automatic extraction in this paper has been performed on a very polarized corpus, future analyses will focus on testing the reliability of this research on more neutral collections of texts.

References

Pinkesh Badjatiya, Shashank Gupta, Manish Gupta, and Vasudeva Varma. 2017. Deep learning for hate speech detection in tweets. In Proceedings of the 26th International Conference on World Wide Web Companion, pages 759–760. International World Wide Web Conferences Steering Committee.

Ying Chen, Yilu Zhou, Sencun Zhu, and Heng Xu. 2012. Detecting offensive language in social media to protect adolescent online safety. In Privacy, Security, Risk and Trust (PASSAT), 2012 International Conference on and 2012 International Conference on Social Computing (SocialCom), pages 71–80. IEEE.

Irene Kwok and Yuzhou Wang. 2013. Locate the hate: Detecting tweets against blacks. In AAAI.

Altaf Mahmud, Kazi Zubair Ahmed, and Mumit Khan. 2008. Detecting flames and insults in text. In Proceedings of the Sixth International Conference on Natural Language Processing. BRAC University.

Alessandro Maisto and Serena Pelosi. 2014. A lexicon-based approach to sentiment analysis. The Italian module for NooJ. In Proceedings of the International NooJ 2014 Conference, University of Sassari, Italy. Cambridge Scholars Publishing.

Serena Pelosi. 2015. SentIta and Doxa: Italian databases and tools for sentiment analysis purposes. In Proceedings of the Second Italian Conference on Computational Linguistics CLiC-it 2015, pages 226–231. Accademia University Press.

Amir Razavi, Diana Inkpen, Sasha Uritsky, and Stan Matwin. 2010. Offensive language detection using multi-level classification. Advances in Artificial Intelligence, pages 16–27.

Kelly Reynolds, April Kontostathis, and Lynne Edwards. 2011. Using machine learning to detect cyberbullying. In Machine Learning and Applications and Workshops (ICMLA), 2011 10th International Conference on, volume 2, pages 241–244. IEEE.

Simonetta Vietri. 2014. Idiomatic Constructions in Italian: A Lexicon-Grammar Approach, volume 31. John Benjamins Publishing Company.

Zeerak Waseem and Dirk Hovy. 2016. Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter. In SRW@HLT-NAACL, pages 88–93.

Guang Xiang, Bin Fan, Ling Wang, Jason Hong, and Carolyn Rose. 2012. Detecting offensive tweets via topical feature discovery over a large scale Twitter corpus. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, pages 1980–1984. ACM.

Zhi Xu and Sencun Zhu. 2010. Filtering offensive language in online communities using grammatical relations. In Proceedings of the Seventh Annual Collaboration, Electronic Messaging, Anti-Abuse and Spam Conference.

Dawei Yin, Zhenzhen Xue, Liangjie Hong, Brian D. Davison, April Kontostathis, and Lynne Edwards. 2009. Detection of harassment on Web 2.0. In Proceedings of the Content Analysis in the WEB, volume 2, pages 1–7.