=Paper= {{Paper |id=Vol-2968/paper11 |storemode=property |title=A Social Monitor for Detecting Inappropriate Behavior |pdfUrl=https://ceur-ws.org/Vol-2968/paper11.pdf |volume=Vol-2968 |authors=José Alberto Mesa Murgado,Flor Miriam Plaza del Arco,Pilar López-Úbeda,Marı́a Teresa Martı́n-Valdivia |dblpUrl=https://dblp.org/rec/conf/sepln/Mesa-MurgadoALM21 }} ==A Social Monitor for Detecting Inappropriate Behavior== https://ceur-ws.org/Vol-2968/paper11.pdf
       A Social Monitor for Detecting Inappropriate Behavior
     Monitor Social para Detectar Comportamientos Inapropiados
           José Alberto Mesa Murgado, Flor Miriam Plaza-del-Arco,
                 Pilar López-Úbeda, M. Teresa Martín-Valdivia
          Departamento de Informática, CEATIC, Universidad de Jaén, España
                  {jmurgado, fmplaza, plubeda, maite}@ujaen.es

      Abstract: In this paper we present a prototype of a social monitor aimed at
      detecting inappropriate behavior in social networks by applying Machine Learning
      (ML) solutions. The monitor retrieves textual data from two of the main current
      social networks (YouTube and Twitter) and analyzes it using ML-based systems
      trained on different scenarios. In particular, we target scenarios that have
      emerged in recent years in the Natural Language Processing (NLP) community and
      that relate to social and mental health issues, such as the detection of offensive
      language, cyberbullying and fake news, and the identification of mental disorders.
      Keywords: Social Media Mining, Machine Learning, Natural Language Processing.
      Resumen: En este artículo presentamos un prototipo de monitor social destinado
      a detectar comportamientos inapropiados en las redes sociales, aplicando soluciones
      de aprendizaje automático. Este monitor es capaz de recuperar datos textuales de
      dos de las principales redes sociales hoy en día (YouTube y Twitter) y analizar esta
      información haciendo uso de sistemas basados en aprendizaje automático entrenados
      en diferentes escenarios. En particular, aquellos que han surgido en los últimos años
      en la comunidad del Procesamiento del Lenguaje Natural (PLN) y que están
      relacionados con importantes retos sociales, como por ejemplo la detección de lenguaje
      ofensivo, ciberacoso, noticias falsas y la identificación de trastornos mentales.
      Palabras clave: Minería de datos en redes sociales, Aprendizaje Automático,
      Procesamiento del Lenguaje Natural.


1   Introduction and Motivation
The emergence of the Web 2.0 has completely changed the way people communicate
and interact. The most popular social networks, such as YouTube, Twitter or
WhatsApp, have more than 2,000 million registered users. Every second, on
average, over 6,000 tweets are published on Twitter, which results in over
500 million tweets per day¹.
    Given the large amount of data accessible via the Web, different studies in
the field of Natural Language Processing (NLP) have emerged seeking to offer
solutions to society in different areas, such as psychology, to identify
people's moods (Plaza del Arco et al., 2020) or mental disorders (López-Úbeda
et al., 2019; López-Úbeda et al., 2021b); marketing, to analyze product reviews
to boost sales (Rambocas and Pacheco, 2018); and sociology, to detect hate
speech (Plaza-del Arco et al., 2021) or constructive news (López-Úbeda et al.,
2021a).
    On the one hand, every day many current events go viral on social media,
usually related to political issues, celebrities, video games, fashion or
diseases. On the other hand, although many studies apply ML solutions to
analyze this data, tools that integrate these systems so that they can be used
in real scenarios are scarce.
    In this paper we present a prototype of a social monitor aimed at detecting
inappropriate behavior through the following functionalities: i) collecting
data from two of the main social networks, Twitter and YouTube; ii) integrating
ML systems trained on different tasks; and iii) analyzing the collected data
using NLP solutions. With this prototype, ML-based systems can be applied to
real scenarios that test their performance. The monitor is especially aimed at
integrating NLP solutions that have arisen over the last few years within the
scientific community to tackle social challenges in social networks, such as
the detection of cyberbullying and fake news and the identification of mental
disorders.
    The rest of the paper is structured as follows: Section 2 describes the
tool, detailing its backend and frontend development. The real scenarios where
the tool will be implemented are explained in Section 3. Finally, Section 4
presents conclusions and future work.

   ¹ https://www.internetlivestats.com/twitter-statistics/

Copyright © 2021 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).

Figure 1: Social Monitor Architecture.
2       System Description
The system is a social monitor designed to retrieve posts from two of the
current main social networks: Twitter and YouTube. The purpose of collecting
such posts is to analyze them using models trained on well-known tasks of the
NLP community, such as text classification and Named Entity Recognition.
    This tool is implemented using Python's FastAPI² framework to establish the
connection between users, the database and the official social network APIs
through POST and GET HTTP requests. This framework has been chosen for its
advantages: high performance, detailed documentation, close alignment with API
development standards such as JSON schemas and, finally, the ease of learning
it and developing with it.
    The architecture of this tool is displayed in Figure 1. The first step
consists of extracting posts from different social networks, specifically
Twitter and YouTube. Subsequently, these posts are stored in a database
(step 2). Finally, in order to analyze the information extracted from these
social networks, posts are evaluated using pre-trained ML systems (step 3).

2.1       Backend
This side of the application is responsible for extracting information from
different social networks (Section 2.1.1), storing data (Section 2.1.2) and
integrating different NLP models that have been previously trained on a
specific corpus for a specific task (Section 2.1.3).

2.1.1 Data Retrieval
Twitter posts and threads. To retrieve posts from Twitter, we use its official
API³. It limits the number of tweets that can be requested to 25,000 per month
under the standard and academic license⁴ (the enterprise license extends this
limit to 5 million tweets per month). The social monitor can retrieve tweets
by:

 1. A user's timeline, together with the responses associated with those
    tweets.
 2. A given hashtag contained in the tweets.
 3. A given location, defined in terms of latitude and longitude, to look for
    nearby tweets.

    Specifically, we look for tweets written in Spanish, although this
restriction could be changed to fit other requirements. Tweets retrieved from
this request are stored in our Elasticsearch index.

YouTube video comments. To retrieve comments from YouTube videos, Google
provides an official API⁵. To request data from this API we use Google's
official Python library⁶ along with our credentials. The number of elements to
retrieve is restricted

   ² https://fastapi.tiangolo.com/
   ³ https://developer.twitter.com/en/docs/twitter-api
   ⁴ https://developer.twitter.com/en/docs/twitter-api/rate-limits
   ⁵ https://developers.google.com/youtube/v3
   ⁶ https://github.com/googleapis/google-api-python-client
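
The three Twitter retrieval modes of Section 2.1.1 (user timeline, hashtag, location), the Spanish-language default and the monthly cap of 25,000 tweets could be modelled roughly as below. This is a minimal sketch in plain Python and not the monitor's actual code: `TweetQuery`, `build_query` and `fits_monthly_quota` are hypothetical names, and the dictionary returned is an internal description of a request, not the parameter set of the real Twitter API.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

MONTHLY_TWEET_LIMIT = 25_000  # standard/academic cap quoted in Section 2.1.1


@dataclass(frozen=True)
class TweetQuery:
    """Hypothetical description of one retrieval request."""
    user: Optional[str] = None                      # mode 1: user timeline and replies
    hashtag: Optional[str] = None                   # mode 2: hashtag
    location: Optional[Tuple[float, float]] = None  # mode 3: (latitude, longitude)
    lang: str = "es"                                # Spanish by default, configurable


def build_query(query: TweetQuery) -> dict:
    """Validate a request and return a plain dict describing it."""
    modes = {
        "user": query.user,
        "hashtag": query.hashtag,
        "location": query.location,
    }
    chosen = [name for name, value in modes.items() if value is not None]
    if len(chosen) != 1:
        raise ValueError("exactly one of user, hashtag or location is required")
    if query.location is not None:
        lat, lon = query.location
        if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
            raise ValueError("latitude/longitude out of range")
    mode = chosen[0]
    return {"mode": mode, "value": modes[mode], "lang": query.lang}


def fits_monthly_quota(already_used: int, requested: int,
                       limit: int = MONTHLY_TWEET_LIMIT) -> bool:
    """Return True if the request stays within the monthly cap."""
    return already_used + requested <= limit


if __name__ == "__main__":
    print(build_query(TweetQuery(hashtag="#salud")))
    print(fits_monthly_quota(already_used=24_900, requested=200))
```

Requiring exactly one mode per request keeps each stored batch traceable to a single retrieval criterion; a real client would translate the resulting dictionary into whatever parameters the official API expects.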




to 10,000 per day. Comments on YouTube videos can be retrieved by:

 1. Channel or user identifier: the API searches the latest video releases of
    that channel or user.
 2. Hashtag or term present either in the video's title or in its associated
    description.
 3. Location, in terms of latitude and longitude, around which to look for
    nearby videos.

    Besides these parameters, requests must declare on how many videos the
search for comments should be performed and how many comments, at most, are to
be retrieved per video. The YouTube Data API does not filter comments by
language, so this must be done by a third-party tool such as the spaCy library
for Python⁷.

2.1.2 Data Storage
The Elasticsearch⁸ search engine is used to save the posts because it allows us
to store, search and analyze vast volumes of structured and unstructured data
with a fast response time.
    Every post extracted from Twitter or YouTube is stored in our Elasticsearch
index with the following data:

  • Post identifier: the identifier associated with the user's post.
  • Thread identifier, in case the item is a response to another post.
  • Textual comment written by the user.
  • Date on which the post was published.
  • Name of the social network from which the post was retrieved.

2.1.3   Integrating Machine Learning Models
The social monitor also allows the integration of different ML-based models to
analyze and evaluate the information stored in Elasticsearch. These systems are
previously trained on different NLP tasks.
    For this purpose, the user can select, for example, the model to be used
and the information to be analyzed.
    Moreover, the tool offers high flexibility and adaptability, as it allows
the incorporation of different systems trained with several methods for
different tasks and languages. Specifically, this monitor relies on traditional
ML systems such as Support Vector Machine (SVM) and on state-of-the-art methods
based on Transformer models such as BERT.

2.2       Frontend
A visual and interactive interface has been generated automatically using
Swagger UI⁹. The functionalities offered by the tool are described below:

 1. Collect posts from the different social networks (Twitter and YouTube) and
    store them in the Elasticsearch database.
 2. Retrieve previously stored data to produce a graphical representation of
    the information.
 3. Analyze the stored comments using the NLP models integrated into the tool.

    Figure 2 provides a screenshot of the frontend of the social monitor
illustrating the above functionalities.

Figure 2: Social monitor frontend.

3       Real Scenarios
Since the data extracted from the different social networks are stored in a
database, we have integrated different ML-based systems into the tool in order
to evaluate each post.
    So far the system has been trained to detect inappropriate behavior on
social media

   ⁷ https://spacy.io/
   ⁸ https://www.elastic.co/es/elasticsearch/
   ⁹ https://swagger.io/tools/swagger-ui/
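
The stored fields listed in Section 2.1.2 and the model selection described in Section 2.1.3 could be sketched as follows. This is an illustrative, standard-library-only sketch under stated assumptions: the `Post` class, the `MODELS` registry and `analyze` are hypothetical names, the field names follow the keys of the output example in Figure 3, and the keyword rule only stands in for a trained classifier (the monitor itself relies on models such as SVM or BERT).

```python
from dataclasses import asdict, dataclass
from typing import Callable, Dict, Iterable, List, Optional


@dataclass
class Post:
    """One stored post; keys mirror the output example in Figure 3."""
    id_post: str               # identifier of the user post
    id_thread: Optional[str]   # set when the post replies to another post
    comment: str               # text written by the user
    time: str                  # publication date (ISO 8601 string)
    source: str                # "Twitter" or "YouTube"


def toy_offensiveness(text: str) -> str:
    """Placeholder for a trained classifier: flags a tiny word list."""
    flagged = {"tonto", "idiota"}  # illustrative only
    words = {w.strip(".,;:!?¡¿").lower() for w in text.split()}
    return "TRUE" if words & flagged else "FALSE"


# Registry mapping an output field name to the model producing its label.
MODELS: Dict[str, Callable[[str], str]] = {
    "offensiveness": toy_offensiveness,
}


def analyze(posts: Iterable[Post], model_name: str,
            source: Optional[str] = None) -> List[dict]:
    """Apply the selected model to the selected posts."""
    model = MODELS[model_name]  # raises KeyError if the model is not registered
    results = []
    for post in posts:
        if source is not None and post.source != source:
            continue  # the user restricted the analysis to one network
        record = {model_name: model(post.comment)}
        record.update(asdict(post))
        results.append(record)
    return results


if __name__ == "__main__":
    stored = [
        Post("1254", "1234",
             "Como @user puede ser tan tonto, es que no me lo explico",
             "2021-01-22T12:24:18", "Twitter"),
    ]
    print(analyze(stored, "offensiveness", source="Twitter"))
```

Keeping the models behind a name-to-callable registry is one simple way to let a new task be added without touching the storage or retrieval code, which matches the flexibility the tool aims for.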




and address two challenging tasks for the NLP community. The first one is
anorexia detection, that is, identifying whether a post contains information
related to this eating disorder. The second is offensive language
identification, that is, recognizing whether a tweet contains hurtful,
derogatory or obscene terms. These models are based on ML systems, including
SVM (Noble, 2006) and state-of-the-art Transformer models such as BERT (Devlin
et al., 2018). Figure 3 shows an example of the monitor's output after
analyzing a tweet with the trained offensive language detection model.

     "offensiveness": "TRUE",
     "id_thread": "1234",
     "id_post": "1254",
     "comment": "Como @user puede ser tan tonto, es que no me lo explico"
                (How can @user be so dumb? I don't get it),
     "time": "2021-01-22T12:24:18",
     "source": "Twitter"

Figure 3: Example of the social monitor output after analyzing a retrieved
tweet.

4        Conclusions and Future Work
In this paper we present a prototype of a social monitor aimed at detecting
inappropriate behavior in social media, specifically Twitter and YouTube. For
this purpose, the tool includes a database where the user can store posts from
the social networks and later analyze them with different previously trained
NLP systems. Specifically, the tool integrates two systems based on document
classification to identify anorexia and offensive language, although it is
implemented in a way that allows its use on other tasks.
    For future work, we plan to incorporate visual analytics using Kibana¹⁰ to
display statistics associated with the stored data. In addition, we will
integrate trained NLP models related to different topics of interest for the
scientific community, using data retrieved from social networks. In this way,
the same database will be useful for different purposes.

Acknowledgements
This work has been partially supported by a grant from Fondo Europeo de
Desarrollo Regional (FEDER), LIVING-LANG project [RTI2018-094653-B-C21], and
the Ministry of Science, Innovation and Universities (scholarship
[FPI-PRE2019-089310]) from the Spanish Government.

References
Devlin, J., M.-W. Chang, K. Lee, and K. Toutanova. 2018. BERT: Pre-training
  of deep bidirectional transformers for language understanding. arXiv
  preprint arXiv:1810.04805.

López-Úbeda, P., F. M. Plaza-del Arco, M. C. Díaz-Galiano, and M. T.
  Martín-Valdivia. 2021a. NECOS: An annotated corpus to identify constructive
  news comments in Spanish. Procesamiento del Lenguaje Natural, 66:41–51.

López-Úbeda, P., F. M. Plaza-del Arco, M. C. Díaz-Galiano, and M. T.
  Martín-Valdivia. 2021b. How successful is transfer learning for detecting
  anorexia on social media? Applied Sciences, 11(4):1838.

López-Úbeda, P., F. M. Plaza del Arco, M. C. Díaz-Galiano, L. A. Ureña-López,
  and M. T. Martín-Valdivia. 2019. Detecting anorexia in Spanish tweets. In
  Proceedings of the International Conference on Recent Advances in Natural
  Language Processing (RANLP 2019), pages 655–663.

Noble, W. S. 2006. What is a support vector machine? Nature Biotechnology,
  24(12):1565–1567.

Plaza-del Arco, F. M., M. D. Molina-González, and M. T. Martín-Valdivia.
  2021. Comparing pre-trained language models for Spanish hate speech
  detection. Expert Systems with Applications, 166:114120.

Plaza del Arco, F. M., C. Strapparava, L. A. Urena Lopez, and M. Martin. 2020.
  EmoEvent: A multilingual emotion corpus based on different events. In
  Proceedings of the 12th Language Resources and Evaluation Conference.
  European Language Resources Association.

Rambocas, M. and B. G. Pacheco. 2018. Online sentiment analysis in marketing
  research: a review. Journal of Research in Interactive Marketing.

   ¹⁰ https://www.elastic.co/es/kibana



