
A Social Monitor for Detecting Inappropriate Behavior

José Alberto Mesa Murgado

Flor Miriam Plaza-del-Arco

Pilar López-Úbeda

M. Teresa Martín-Valdivia

Departamento de Informática, CEATIC, Universidad de Jaén, España


In this paper we present a prototype of a social monitor aimed at detecting inappropriate behavior in social networks by applying Machine Learning (ML) solutions. The monitor retrieves textual data from two of the main current social networks (YouTube and Twitter) and analyzes them using ML-based systems trained on different scenarios, in particular those that have emerged in recent years in the Natural Language Processing (NLP) community and that are related to social and mental health issues, such as the detection of offensive language, cyberbullying and fake news, and the identification of mental disorders.

The emergence of Web 2.0 has completely changed the way people communicate and interact. The most popular social networks, such as YouTube, Twitter or WhatsApp, have more than 2,000 million registered users. Every second, on average, over 6,000 tweets are published on Twitter, which amounts to over 500 million tweets per day¹.

Given the large amount of data accessible via the Web, different studies in the field of Natural Language Processing (NLP) have emerged seeking to offer solutions to society in different areas, such as psychology, to identify people's moods (Plaza del Arco et al., 2020) and mental disorders (López-Úbeda et al., 2019; López-Úbeda et al., 2021b); marketing, to analyze product reviews to boost sales (Rambocas and Pacheco, 2018); and sociology, to detect hate speech (Plaza-del-Arco et al., 2021) or constructive news (López-Úbeda et al., 2021a).

¹ https://www.internetlivestats.com/twitter-statistics/

On the one hand, many current events go viral on social media every day, usually related to political issues, celebrities, video games, fashion or diseases. On the other hand, although we find many studies applying ML solutions to analyze these data, tools that integrate such systems for use in real scenarios are scarce.

In this paper we present a prototype of a social monitor aimed at detecting inappropriate behavior through the following functionalities: i) collecting data from two of the main social networks, Twitter and YouTube; ii) integrating ML systems trained on different tasks; and iii) analyzing the collected data using NLP solutions. With this prototype, ML-based systems can be applied to real scenarios that test their performance. It should be noted that the monitor is especially aimed at integrating NLP solutions that have arisen over the last few years within the scientific community to tackle social challenges in social networks, such as the detection of cyberbullying and fake news and the identification of mental disorders.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

The rest of the paper is structured as follows. Section 2 describes the tool, detailing its backend and frontend development. Section 3 explains the real scenarios where the tool will be applied. Finally, Section 4 presents conclusions and future work.

2 System Description

The system is a social monitor designed to retrieve posts from two of the main current social networks: Twitter and YouTube. The purpose of collecting such posts is to analyze them using models trained on well-known tasks of the NLP community, such as text classification and Named Entity Recognition.

This tool is implemented using Python's FastAPI² framework to connect users, the database and the official social network APIs through POST and GET HTTP requests. This framework was chosen for its advantages: high performance, detailed documentation, close alignment with API development standards such as JSON Schema and, finally, ease of learning and development.

The architecture of this tool is displayed in Figure 1. The first step consists of extracting posts from different social networks, specifically Twitter and YouTube. Subsequently, these posts are stored in a database (step 2). Finally, in order to analyze the information extracted from these social networks, posts are evaluated using pre-trained ML systems (step 3).
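The three steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Post` fields, the in-memory store (standing in for the database), the `fake_retrieve` function (standing in for the social network APIs) and the keyword rule (standing in for a pre-trained ML system) are all hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Post:
    post_id: str
    text: str
    network: str  # "twitter" or "youtube"


@dataclass
class InMemoryStore:
    """Hypothetical stand-in for the database used in step 2."""
    posts: Dict[str, Post] = field(default_factory=dict)

    def save(self, post: Post) -> None:
        # Saving the same identifier twice overwrites, so no duplicates are kept.
        self.posts[post.post_id] = post

    def all_posts(self) -> List[Post]:
        return list(self.posts.values())


def run_pipeline(
    retrieve: Callable[[], List[Post]],
    store: InMemoryStore,
    classify: Callable[[str], str],
) -> Dict[str, str]:
    """Step 1: retrieve posts. Step 2: store them. Step 3: label each stored post."""
    for post in retrieve():
        store.save(post)
    return {post.post_id: classify(post.text) for post in store.all_posts()}


def fake_retrieve() -> List[Post]:
    # Placeholder for calls to the social network APIs.
    return [
        Post("t1", "have a nice day", "twitter"),
        Post("y1", "you are an idiot", "youtube"),
    ]


def keyword_classifier(text: str) -> str:
    # Toy rule standing in for a pre-trained ML model.
    return "offensive" if "idiot" in text.lower() else "not_offensive"


if __name__ == "__main__":
    print(run_pipeline(fake_retrieve, InMemoryStore(), keyword_classifier))
```

Keeping retrieval, storage and classification behind separate callables mirrors the separation between the three steps, so any one of them could be swapped without touching the others.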

2.1 Backend

This side of the application is responsible for extracting information from different social networks (Section 2.1.1), storing data (Section 2.1.2) and integrating different NLP models that have been previously trained on a specific corpus for a specific task (Section 2.1.3).

2.1.1 Data Retrieval

Twitter posts and threads. To retrieve posts from Twitter, we use its official API³. It limits the number of tweets that can be requested to 25,000 per month under the standard and academic licenses⁴ (the enterprise license raises this limit to 5 million tweets per month). The social monitor can retrieve tweets that are:

1. Posted on a user's timeline, together with the responses to those tweets.
2. Tagged with a given hashtag.
3. Posted near a given location, defined in terms of latitude and longitude.

Specifically, we look for tweets written in Spanish, although this restriction could be changed to fit other requirements. Tweets retrieved from this request are stored in our Elasticsearch index.
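As an illustration, the three retrieval modes and the Spanish-language restriction could be expressed as a search query string. The paper does not state which endpoints or operators the monitor uses, so the sketch below is an assumption: the operator names (`from:`, `to:`, `lang:`, `point_radius:`) follow the Twitter API v2 search syntax as we understand it and should be checked against the official documentation, and `build_tweet_query` is a hypothetical helper.

```python
from typing import Optional, Tuple


def build_tweet_query(
    username: Optional[str] = None,
    hashtag: Optional[str] = None,
    location: Optional[Tuple[float, float]] = None,  # (latitude, longitude)
    radius_km: int = 10,
    lang: str = "es",
) -> str:
    """Build a search query for exactly one of the three retrieval modes."""
    chosen = sum(x is not None for x in (username, hashtag, location))
    if chosen != 1:
        raise ValueError("choose exactly one of username, hashtag or location")

    if username is not None:
        # Tweets written by the user plus replies addressed to the user.
        clause = f"(from:{username} OR to:{username})"
    elif hashtag is not None:
        clause = "#" + hashtag.lstrip("#")
    else:
        lat, lon = location
        # The operator is assumed to take longitude first, then latitude.
        clause = f"point_radius:[{lon} {lat} {radius_km}km]"

    # The language restriction defaults to Spanish but can be changed.
    return f"{clause} lang:{lang}"
```

For example, `build_tweet_query(hashtag="salud")` yields `#salud lang:es`, and passing `lang="en"` would adapt the same request to English tweets.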

YouTube video comments. To retrieve comments from YouTube videos, Google provides an official API⁵. To request data from this API we use Google's official Python library⁶ along with our credentials. The number of elements that can be retrieved is limited to 10,000 per day. Comments on YouTube videos can be retrieved according to:

1. A channel or user identifier; the API searches among its latest video releases.
2. A hashtag or term present in the video's title or in its description.
3. A location, in terms of latitude and longitude, around which to look for nearby videos.

³ https://developer.twitter.com/en/docs/twitter-api
⁴ https://developer.twitter.com/en/docs/twitter-api/rate-limits
⁵ https://developers.google.com/youtube/v3
⁶ https://github.com/googleapis/google-api-python-client
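A hedged sketch of how the three YouTube search modes might be turned into request parameters follows. The helper functions are hypothetical and the paper does not show its own request code; the parameter names (`channelId`, `q`, `location`, `locationRadius`, `maxResults`, `videoId`) are those of the YouTube Data API v3 `search.list` and `commentThreads.list` methods as we understand them, and the caps of 50 and 100 results per call are assumptions to verify against the API reference.

```python
from typing import Dict, Optional, Tuple


def build_video_search_params(
    max_videos: int,
    channel_id: Optional[str] = None,
    term: Optional[str] = None,
    location: Optional[Tuple[float, float]] = None,  # (latitude, longitude)
    radius: str = "10km",
) -> Dict[str, object]:
    """Parameters for a video search using exactly one of the three modes."""
    if sum(x is not None for x in (channel_id, term, location)) != 1:
        raise ValueError("choose exactly one of channel_id, term or location")

    params: Dict[str, object] = {
        "part": "snippet",
        "type": "video",
        "maxResults": min(max_videos, 50),  # assumed per-call cap
    }
    if channel_id is not None:
        params["channelId"] = channel_id
        params["order"] = "date"  # latest releases first
    elif term is not None:
        params["q"] = term
    else:
        lat, lon = location
        params["location"] = f"{lat},{lon}"
        params["locationRadius"] = radius
    return params


def build_comment_params(video_id: str, max_comments: int) -> Dict[str, object]:
    """Parameters for retrieving the comment threads of a single video."""
    return {
        "part": "snippet",
        "videoId": video_id,
        "maxResults": min(max_comments, 100),  # assumed per-call cap
        "textFormat": "plainText",
    }
```

With Google's Python client, dictionaries like these could presumably be passed as keyword arguments to `search().list(...)` and `commentThreads().list(...)`. Retrieving more comments than one call allows would require following the pagination tokens, which this sketch leaves out.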

Besides these parameters, requests must declare how many videos the search for comments should cover and the maximum number of comments to retrieve per video. The YouTube Data API does not filter comments by language, so this has to be done by a third-party tool such as the spaCy library for Python⁷.

2.1.2 Data Storage

The Elasticsearch⁸ search engine is used to save the posts because it allows us to store, search and analyze vast volumes of structured and unstructured data with fast response times.

Every post extracted from Twitter or YouTube is stored in our Elasticsearch index with the following data:

• Post identifier: the identifier associated with the user's post.
• Thread identifier, in case the item is a response to another post.
• Textual comment written by the user.
• Date on which the post was published.
• Name of the social network from which the post was retrieved.

2.1.3 Integrating Machine Learning Models

The social monitor also allows the integration of different ML-based models to analyze and evaluate the information stored in Elasticsearch. These systems are previously trained on different NLP tasks.

For this purpose, the user can select, for example, the model to be used and the information to be analyzed.
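One way to support this kind of selection is a registry that maps model names to prediction functions. The sketch below is an assumption about how such integration could look, not the tool's actual code: the registry, the `analyze` function and the dictionary keys (`post_id`, `thread_id`, `text`, `date`, `network`, mirroring the stored fields of Section 2.1.2) are hypothetical, and the keyword rule merely stands in for a trained classifier.

```python
from typing import Callable, Dict, Iterable, List

# Each stored post is assumed to be a dict with the fields of Section 2.1.2.
Post = Dict[str, str]
Model = Callable[[str], str]

MODEL_REGISTRY: Dict[str, Model] = {}


def register_model(name: str) -> Callable[[Model], Model]:
    """Decorator that makes a prediction function selectable by name."""
    def decorator(model: Model) -> Model:
        MODEL_REGISTRY[name] = model
        return model
    return decorator


@register_model("offensive_language")
def toy_offensive_model(text: str) -> str:
    # Placeholder rule; a real entry would wrap a trained SVM or BERT classifier.
    return "offensive" if "idiot" in text.lower() else "not_offensive"


def analyze(posts: Iterable[Post], model_name: str, network: str = "") -> List[Post]:
    """Apply the selected model to the selected posts and attach its label."""
    model = MODEL_REGISTRY[model_name]  # raises KeyError for an unknown model
    results = []
    for post in posts:
        if network and post["network"] != network:
            continue  # skip posts from networks the user did not select
        results.append({**post, "label": model(post["text"])})
    return results
```

Adding a new task then amounts to registering one more function, which leaves the retrieval and storage code untouched.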

⁷ https://spacy.io/
⁸ https://www.elastic.co/es/elasticsearch/

Moreover, the tool offers high flexibility and adaptability, as it allows the incorporation of different systems trained with several methods for different tasks and languages. Specifically, this monitor relies on traditional ML systems such as Support Vector Machines (SVM) and on state-of-the-art methods based on Transformer models such as BERT.

2.2 Frontend

An interactive visual interface is generated automatically using Swagger UI⁹. The functionalities offered by the tool are described below:

1. Collect posts from the different social networks (Twitter and YouTube) and store them in the Elasticsearch database.
2. Retrieve previously stored data to produce a graphical representation of the information.
3. Analyze the stored comments using the NLP models integrated into the tool.

Since the data extracted from the different social networks are stored in a database, we have integrated different ML-based systems into the tool in order to evaluate each post.

So far the system has been trained to detect inappropriate behavior on social media and address two challenging tasks for the NLP community. The first is anorexia detection, which identifies whether a post contains information related to this eating disorder. The second is offensive language identification, which recognizes whether a tweet contains hurtful, derogatory or obscene terms. These models are based on ML systems including SVM (Noble, 2006) and state-of-the-art Transformer models such as BERT (Devlin et al., 2018). Figure 3 shows an example of the monitor's output after analyzing a tweet with the trained offensive language detection model.

⁹ https://swagger.io/tools/swagger-ui/

4 Conclusions and Future Work

In this paper we present a prototype of a social monitor aimed at detecting inappropriate behavior on social media, specifically Twitter and YouTube. For this purpose, the tool includes a database where the user can store posts from the social networks in order to analyze them later with different previously trained NLP systems. Specifically, the tool integrates two systems based on document classification to identify anorexia and offensive language, although it is implemented in a way that allows its use on other tasks.

For future work, we plan to incorporate visual analytics using Kibana¹⁰ to display statistics associated with the stored data. In addition, we will integrate trained NLP models related to different topics of interest for the scientific community using data retrieved from social networks. In this way, the same database will be useful for different purposes.

Acknowledgements

This work has been partially supported by a grant from the Fondo Europeo de Desarrollo Regional (FEDER), the LIVING-LANG project [RTI2018-094653-B-C21], and the Ministry of Science, Innovation and Universities (scholarship [FPI-PRE2019-089310]) of the Spanish Government.

References

Devlin, J., M.-W. Chang, K. Lee, and K. Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

López-Úbeda, P., F. M. Plaza-del-Arco, M. C. Díaz-Galiano, and M. T. Martín-Valdivia. 2021a. Necos: An annotated corpus to identify constructive news comments in Spanish. Procesamiento del Lenguaje Natural, 66:41-51.

López-Úbeda, P., F. M. Plaza-del-Arco, M. C. Díaz-Galiano, and M. T. Martín-Valdivia. 2021b. How successful is transfer learning for detecting anorexia on social media? Applied Sciences, 11(4):1838.

López-Úbeda, P., F. M. Plaza del Arco, M. C. Díaz-Galiano, L. A. Ureña-López, and M. T. Martín-Valdivia. 2019. Detecting anorexia in Spanish tweets. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), pages 655-663.

Noble, W. S. 2006. What is a support vector machine? Nature Biotechnology, 24(12):1565-1567.

Plaza-del-Arco, F. M., M. D. Molina-González, L. A. Ureña-López, and M. T. Martín-Valdivia. 2021. Comparing pre-trained language models for Spanish hate speech detection. Expert Systems with Applications, 166:114120.

Plaza del Arco, F. M., C. Strapparava, L. A. Ureña López, and M. Martin. 2020. EmoEvent: A multilingual emotion corpus based on different events. In Proceedings of the 12th Language Resources and Evaluation Conference. European Language Resources Association.

Rambocas, M. and B. G. Pacheco. 2018. Online sentiment analysis in marketing research: a review. Journal of Research in Interactive Marketing.