<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Social Monitor for Detecting Inappropriate Behavior</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>José Alberto Mesa Murgado</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Flor Miriam Plaza-del-Arco</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pilar López-Úbeda</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>M. Teresa Martín-Valdivia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Departamento de Informática, CEATIC, Universidad de Jaén</institution>
          ,
          <addr-line>España</addr-line>
        </aff>
      </contrib-group>
      <fpage>41</fpage>
      <lpage>44</lpage>
      <abstract>
        <p>In this paper we present a prototype of a social monitor aimed at detecting inappropriate behavior in social networks by applying Machine Learning (ML) solutions. The monitor retrieves textual data from two of the current main social networks (YouTube and Twitter) and analyzes them using ML-based systems trained on different scenarios, in particular those that have emerged in recent years in the Natural Language Processing (NLP) community and that are related to social and mental health issues, such as the detection of offensive language, cyberbullying and fake news, and the identification of mental disorders.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <label>1</label>
      <title>Introduction</title>
      <p>The emergence of Web 2.0 has completely changed the way people communicate and interact. The most popular social networks, such as YouTube, Twitter or WhatsApp, have more than 2,000 million registered users. Every second, on average, over 6,000 tweets are published on Twitter, which results in over 500 million tweets per day<sup>1</sup>.</p>
      <p><sup>1</sup> https://www.internetlivestats.com/twitter-statistics/</p>
      <p>
        Given the large amount of data accessible via the Web, different studies in the field of Natural Language Processing (NLP) have emerged seeking to offer solutions to society in different areas: psychology, to identify people's moods (Plaza del Arco et al., 2020) and mental disorders
        <xref ref-type="bibr" rid="ref4">(López-Úbeda et al., 2019)</xref>
        <xref ref-type="bibr" rid="ref2 ref3">(López-Úbeda et al., 2021b)</xref>;
        marketing, to analyze product reviews to boost sales (Rambocas and Pacheco, 2018); and sociology, to detect hate speech (Plaza-del Arco et al., 2021) or constructive news (López-Úbeda et al., 2021a).
      </p>
    </sec>
    <sec id="sec-2">
      <p>On the one hand, many current events go viral on social media every day, usually related to political issues, celebrities, video games, fashion or diseases. On the other hand, although we find many studies applying ML solutions to analyze this data, tools that integrate these systems for use in real scenarios are scarce.</p>
      <p>In this paper we present a prototype of a social monitor aimed at detecting inappropriate behavior through the following functionalities: i) collecting data from two of the main social networks, Twitter and YouTube; ii) integrating ML systems trained on different tasks; and iii) analyzing the collected data using NLP solutions. With this prototype, ML-based systems can be applied to real scenarios that test their performance. The monitor is especially aimed at integrating NLP solutions that have arisen over the last few years within the scientific community to tackle social challenges in social networks, such as the detection of cyberbullying and fake news and the identification of mental disorders.</p>
      <p>Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>The rest of the paper is structured as follows. Section 2 describes the tool and details its backend and frontend development. The real scenarios where the tool will be implemented are explained in Section 3. Finally, Section 4 presents conclusions and future work.</p>
      <sec id="sec-2-1">
        <label>2</label>
        <title>System Description</title>
        <p>The system is a social monitor designed to retrieve posts from two of the current main social networks: Twitter and YouTube. The purpose of collecting such posts is to analyze them using models trained on well-known tasks of the NLP community, such as text classification and Named Entity Recognition.</p>
        <p>The tool is implemented using Python's FastAPI framework, which establishes the connection between users, the database and the official social network APIs through POST and GET HTTP requests. This framework was chosen for its advantages: high performance, detailed documentation, close alignment with API development standards such as JSON Schema, and, finally, ease of learning and development.</p>
        <p>The architecture of the tool is displayed in Figure 1. The first step consists of extracting posts from different social networks, specifically Twitter and YouTube. Subsequently, these posts are stored in a database (step 2). Finally, in order to analyze the information extracted from these social networks, the posts are evaluated using pre-trained ML systems (step 3).</p>
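        <p>The three steps of the architecture can be sketched as a single function. This is only an illustration under stated assumptions: the names run_pipeline, fake_fetch and toy_classifier are hypothetical, a plain list stands in for the database, and a keyword rule stands in for a pre-trained ML system.</p>
<preformat>
```python
from typing import Callable

# Hypothetical record shape: {"id": "t1", "network": "twitter", "text": "..."}


def run_pipeline(
    fetch: Callable[[], list],
    database: list,
    classify: Callable[[str], str],
) -> list:
    """Run the three steps in order and return (post id, label) pairs."""
    posts = fetch()           # step 1: extract posts from a social network
    database.extend(posts)    # step 2: store them (a list stands in for the database)
    # step 3: evaluate every stored post with the supplied model
    return [(post["id"], classify(post["text"])) for post in database]


def fake_fetch() -> list:
    """Return two hard-coded posts instead of calling a real API."""
    return [
        {"id": "t1", "network": "twitter", "text": "Buenos dias a todos"},
        {"id": "y1", "network": "youtube", "text": "Cuanto odio en este video"},
    ]


def toy_classifier(text: str) -> str:
    """Placeholder for a pre-trained model: flags one hard-coded word."""
    return "flagged" if "odio" in text.lower() else "ok"


database: list = []
results = run_pipeline(fake_fetch, database, toy_classifier)
print(results)
```
</preformat>
        <p>Passing the fetcher and the classifier as arguments keeps the sketch independent of any particular social network or model.</p>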
        <sec id="sec-2-1-1">
          <label>2.1</label>
          <title>Backend</title>
          <p>This side of the application is responsible for extracting information from different social networks (Section 2.1.1), storing data (Section 2.1.2) and integrating different NLP models that have been previously trained on a specific corpus for a specific task (Section 2.1.3).</p>
          <p>2.1.1 Data Retrieval</p>
          <p>Twitter posts and threads. To retrieve posts from Twitter, we use its official API<sup>3</sup>. It limits the number of tweets that can be requested to 25,000 per month under the standard and academic licenses<sup>4</sup> (the enterprise license extends this limit to 5 million tweets per month). The social monitor can retrieve tweets that are:</p>
          <list list-type="order">
            <list-item><p>Part of a user's timeline, together with the responses to those tweets.</p></list-item>
            <list-item><p>Tagged with a given hashtag.</p></list-item>
            <list-item><p>Posted near a given location, defined in terms of latitude and longitude.</p></list-item>
          </list>
          <p>Specifically, we look for tweets written in Spanish, although this restriction could be changed to fit other requirements. Tweets retrieved from these requests are stored in our Elasticsearch index.</p>
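          <p>As an illustration of the three retrieval modes and the default language restriction, a request could be validated before any call to the Twitter API is made. The class TweetRequest and all of its field names are assumptions made for this sketch; they are not taken from the tool or from the Twitter API.</p>
<preformat>
```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TweetRequest:
    """Hypothetical request model; every field name here is illustrative."""

    user: Optional[str] = None         # mode 1: a user's timeline and its replies
    hashtag: Optional[str] = None      # mode 2: tweets tagged with a hashtag
    latitude: Optional[float] = None   # mode 3: tweets near a location
    longitude: Optional[float] = None
    language: str = "es"               # Spanish by default, but configurable

    def mode(self) -> str:
        """Validate the request and return the retrieval mode it selects."""
        if (self.latitude is None) != (self.longitude is None):
            raise ValueError("latitude and longitude must be given together")
        has_location = self.latitude is not None
        if has_location and (abs(self.latitude) > 90 or abs(self.longitude) > 180):
            raise ValueError("coordinates out of range")
        flags = {
            "user": self.user is not None,
            "hashtag": self.hashtag is not None,
            "location": has_location,
        }
        chosen = [name for name, present in flags.items() if present]
        if len(chosen) != 1:
            raise ValueError("exactly one of user, hashtag or location is required")
        return chosen[0]


print(TweetRequest(hashtag="#ejemplo").mode())
```
</preformat>
          <p>Rejecting ambiguous requests early avoids spending part of the monthly tweet quota on a query that mixes several criteria.</p>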
          <p>YouTube video comments. To retrieve comments from YouTube videos, Google provides an official API<sup>5</sup>. To request data from this API we use Google's official Python library<sup>6</sup> along with our credentials. The number of elements that can be retrieved is restricted to 10,000 per day. Comments on YouTube videos can be retrieved according to:</p>
          <list list-type="order">
            <list-item><p>A channel or user identifier to focus on, in which case the API searches its latest video releases.</p></list-item>
            <list-item><p>A hashtag or term present in either the video's title or its associated description.</p></list-item>
            <list-item><p>A location, in terms of latitude and longitude, from which to look for nearby videos.</p></list-item>
          </list>
          <p><sup>3</sup> https://developer.twitter.com/en/docs/twitter-api</p>
          <p><sup>4</sup> https://developer.twitter.com/en/docs/twitter-api/rate-limits</p>
          <p><sup>5</sup> https://developers.google.com/youtube/v3</p>
          <p><sup>6</sup> https://github.com/googleapis/google-api-python-client</p>
          <p>Besides these parameters, requests must declare how many videos the comment search should cover and the maximum number of comments to retrieve per video. The YouTube Data API does not filter comments by language, so this must be done by a third-party tool such as the spaCy library for Python<sup>7</sup>.</p>
          <p>2.1.2 Data Storage</p>
          <p>The Elasticsearch<sup>8</sup> search engine is used to save the posts because it allows us to store, search and analyze vast volumes of structured and unstructured data with a fast response time.</p>
          <p>Every post extracted from Twitter or YouTube is stored in our Elasticsearch index with the following data:</p>
          <list list-type="bullet">
            <list-item><p>Post identifier: the identifier associated with the user's post.</p></list-item>
            <list-item><p>Thread identifier, in case the item is a response to another post.</p></list-item>
            <list-item><p>Textual comment written by the user.</p></list-item>
            <list-item><p>Date on which the post was published.</p></list-item>
            <list-item><p>Name of the social network from which the post was retrieved.</p></list-item>
          </list>
          <p>2.1.3 Integrating Machine Learning Models</p>
          <p>The social monitor also allows the integration of different ML-based models to analyze and evaluate the information stored in Elasticsearch. These systems have previously been trained on different NLP tasks.</p>
          <p>For this purpose, the user can select, for example, the model to be used and the information to be analyzed.</p>
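          <p>One possible way to picture this selection step is a small registry of models applied to stored records. The sketch below is an assumption, not the tool's implementation: StoredPost mirrors the five stored fields under invented names, MODELS and analyze are hypothetical, and a keyword rule replaces a trained classifier.</p>
<preformat>
```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StoredPost:
    """One stored record with the five stored fields (names are assumed)."""

    post_id: str
    thread_id: Optional[str]  # set only when the post answers another post
    text: str
    date: str                 # publication date, kept as an ISO string here
    network: str              # "twitter" or "youtube"


# Hypothetical registry: model name -> function that labels one text.
MODELS: dict[str, Callable[[str], str]] = {}


def register(name: str):
    """Decorator that adds a classifier to the registry under the given name."""
    def decorator(func: Callable[[str], str]) -> Callable[[str], str]:
        MODELS[name] = func
        return func
    return decorator


@register("offensive-language")
def toy_offensive_model(text: str) -> str:
    """Keyword stand-in for a trained SVM or BERT classifier."""
    return "offensive" if "idiota" in text.lower() else "not offensive"


def analyze(posts: list[StoredPost], model_name: str,
            network: Optional[str] = None) -> dict[str, str]:
    """Apply the selected model to the selected posts; return id -> label."""
    model = MODELS[model_name]
    selected = [p for p in posts if network is None or p.network == network]
    return {p.post_id: model(p.text) for p in selected}


posts = [
    StoredPost("t1", None, "Eres un idiota", "2021-03-01", "twitter"),
    StoredPost("y1", "y0", "Gran video, gracias", "2021-03-02", "youtube"),
]
print(analyze(posts, "offensive-language", network="twitter"))
```
</preformat>
          <p>With a registry of this kind, adding a model for a new task or language only requires registering one more function.</p>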
          <p>Moreover, the tool offers high flexibility and adaptability, as it allows the incorporation of different systems trained with several methods for different tasks and languages. Specifically, the monitor relies on traditional ML systems such as Support Vector Machines (SVM) and on state-of-the-art methods based on Transformer models such as BERT.</p>
          <p><sup>7</sup> https://spacy.io/</p>
          <p><sup>8</sup> https://www.elastic.co/es/elasticsearch/</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <sec id="sec-4-1">
        <label>2.2</label>
        <title>Frontend</title>
        <p>An automatic visual and interactive interface has been generated using Swagger UI<sup>9</sup>. The functionalities offered by the tool are described below:</p>
        <list list-type="order">
          <list-item><p>Collect posts from the different social networks (Twitter and YouTube) and store them in the Elasticsearch database.</p></list-item>
          <list-item><p>Retrieve previously stored data to produce a graphical representation of the information.</p></list-item>
          <list-item><p>Analyze the stored comments using the NLP models integrated into the tool.</p></list-item>
        </list>
        <p>Since the data extracted from the different social networks are stored in a database, we have integrated different ML-based systems into the tool in order to evaluate each post.</p>
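        <p>For the second functionality, stored data has to be aggregated before it can be plotted. A minimal sketch, assuming records that carry a network name and a publication date (the function posts_per_day and the sample records are hypothetical), could count posts per network and day:</p>
<preformat>
```python
from collections import Counter

# Hypothetical stored records; only the fields needed for the chart are shown.
stored_posts = [
    {"network": "twitter", "date": "2021-03-01"},
    {"network": "twitter", "date": "2021-03-01"},
    {"network": "youtube", "date": "2021-03-01"},
    {"network": "twitter", "date": "2021-03-02"},
]


def posts_per_day(posts: list[dict]) -> dict[str, list[tuple[str, int]]]:
    """Group posts by network and return one sorted (date, count) series each."""
    counts = Counter((p["network"], p["date"]) for p in posts)
    series: dict[str, list[tuple[str, int]]] = {}
    for (network, date), total in sorted(counts.items()):
        series.setdefault(network, []).append((date, total))
    return series


series = posts_per_day(stored_posts)
print(series)
```
</preformat>
        <p>Each resulting series can be handed to any plotting library as one line or bar group per social network.</p>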
        <p>
          So far the system has been trained to detect inappropriate behavior on social media and to address two challenging tasks for the NLP community. The first is anorexia detection: identifying whether a post contains information related to this eating disorder. The second is offensive language identification: recognizing whether a tweet contains hurtful, derogatory or obscene terms. These models are based on ML systems including SVM (Noble, 2006) and state-of-the-art Transformer models such as BERT
          <xref ref-type="bibr" rid="ref1">(Devlin et al., 2018)</xref>.
          Figure 3 shows an example of the monitor's output after analyzing a tweet with the trained offensive language detection model.
        </p>
        <p><sup>9</sup> https://swagger.io/tools/swagger-ui/</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <label>4</label>
      <title>Conclusions and Future Work</title>
      <p>In this paper we present a prototype of a social monitor aimed at detecting inappropriate behavior on two social media platforms: Twitter and YouTube. For this purpose, the tool includes a database where the user can store posts from the social networks and later analyze them with different previously trained NLP systems. Specifically, the tool integrates two systems based on document classification to identify anorexia and offensive language, although it is implemented in a way that allows its use for other tasks.</p>
      <p>For future work, we plan to incorporate visual analytics using Kibana to display statistics associated with the stored data. In addition, we will integrate trained NLP models related to different topics of interest for the scientific community using data retrieved from social networks. In this way, the same database will be useful for different purposes.</p>
      <sec id="sec-5-1">
        <title>Acknowledgements</title>
        <p>This work has been partially supported by a grant from Fondo Europeo de Desarrollo Regional (FEDER), the LIVING-LANG project [RTI2018-094653-B-C21], and the Ministry of Science, Innovation and Universities (scholarship [FPI-PRE2019-089310]) of the Spanish Government.</p>
        <p>Noble, W. S. 2006. What is a support vector machine? Nature Biotechnology, 24(12):1565-1567.</p>
        <p>Plaza-del Arco, F. M., M. D. Molina-González, and M. T. Martín-Valdivia. 2021. Comparing pre-trained language models for Spanish hate speech detection. Expert Systems with Applications, 166:114120.</p>
        <p>Plaza del Arco, F. M., C. Strapparava, L. A. Ureña López, and M. Martín. 2020. EmoEvent: A multilingual emotion corpus based on different events. In Proceedings of the 12th Language Resources and Evaluation Conference. European Language Resources Association.</p>
        <p>Rambocas, M. and B. G. Pacheco. 2018. Online sentiment analysis in marketing research: a review. Journal of Research in Interactive Marketing.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name><surname>Devlin</surname>, <given-names>J.</given-names></string-name>,
          <string-name><given-names>M.-W.</given-names> <surname>Chang</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Lee</surname></string-name>, and
          <string-name><given-names>K.</given-names> <surname>Toutanova</surname></string-name>.
          <year>2018</year>.
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>.
          arXiv preprint arXiv:1810.04805.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name><surname>López-Úbeda</surname>, <given-names>P.</given-names></string-name>,
          <string-name><given-names>F. M.</given-names> <surname>Plaza-del Arco</surname></string-name>,
          <string-name><given-names>M. C.</given-names> <surname>Díaz-Galiano</surname></string-name>, and
          <string-name><given-names>M. T.</given-names> <surname>Martín-Valdivia</surname></string-name>.
          <year>2021</year>a.
          <article-title>NECOS: An annotated corpus to identify constructive news comments in Spanish</article-title>.
          <source>Procesamiento del Lenguaje Natural</source>,
          <volume>66</volume>:<fpage>41</fpage>-<lpage>51</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name><surname>López-Úbeda</surname>, <given-names>P.</given-names></string-name>,
          <string-name><given-names>F. M.</given-names> <surname>Plaza-del Arco</surname></string-name>,
          <string-name><given-names>M. C.</given-names> <surname>Díaz-Galiano</surname></string-name>, and
          <string-name><given-names>M. T.</given-names> <surname>Martín-Valdivia</surname></string-name>.
          <year>2021</year>b.
          <article-title>How successful is transfer learning for detecting anorexia on social media?</article-title>
          <source>Applied Sciences</source>,
          <volume>11</volume>(<issue>4</issue>):<fpage>1838</fpage>.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name><surname>López-Úbeda</surname>, <given-names>P.</given-names></string-name>,
          <string-name><given-names>F. M.</given-names> <surname>Plaza del Arco</surname></string-name>,
          <string-name><given-names>M. C.</given-names> <surname>Díaz-Galiano</surname></string-name>,
          <string-name><given-names>L. A.</given-names> <surname>Ureña-López</surname></string-name>, and
          <string-name><given-names>M. T.</given-names> <surname>Martín-Valdivia</surname></string-name>.
          <year>2019</year>.
          <article-title>Detecting anorexia in Spanish tweets</article-title>.
          <source>In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)</source>,
          pages <fpage>655</fpage>-<lpage>663</lpage>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>