<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Using Natural Language Processing in an Instant Messaging Environment for User Analysis</article-title>
      </title-group>
      <contrib-group>
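        <contrib contrib-type="author">
          <name>
            <surname>Stasytis</surname>
            <given-names>Lukas</given-names>
          </name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Informatics, Kaunas Technology University</institution>
          <addr-line>Studentų g. 50, Kaunas 51368</addr-line>
        </aff>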
      </contrib-group>
      <fpage>147</fpage>
      <lpage>151</lpage>
      <abstract>
        <p>With recent advances in machine learning technology and a resurgence of Instant Messaging (IM) software, it has become possible to incorporate natural language processing (NLP) solutions into IM servers for user personality profiling and monitoring. This paper presents a novel use-case for NLP in a rapidly expanding data-generating environment, instant messaging application servers, to gauge the emotional profiles of internet users over time and to respond appropriately without the need for any human interaction on the side of the monitor. IBM Watson's Personality Insights API is examined as a case-study NLP system for analyzing user data, and the IM software Discord as a user-data-generating and user-monitoring environment, using a 300,000-message sample. Results show clear and consistent differences in user personality profiles, suggesting that the IM space is a promising environment for further user analytics based on NLP.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>Copyright held by the author(s).</p>
      <p>
        A key component of current IM applications such as Discord
and Slack, as well as IRCs, is the inclusion of bots. Bots
are non-human users within IM servers that run client scripts
and interact with the IM server's application programming
interface (API). These bots are highly modular, incur no
additional fees for being present in a server and are, above all,
user-friendly and universal. Using Discord as a case-study, a
simple bot can be started by installing the appropriate
programming language library (options exist for every popular
programming language), registering the bot's client
identification number in the developer interface of the official
Discord website and, finally, using the provided bot token
to communicate with the Discord API. A server owner can then
add the bot using a simple invitation link, and the developer
only has to run the script for the bot to connect to all the
servers it has been added to. At this point, the bot can parse
every message written in the servers it is in and, depending on
the permissions granted by the server administrators, respond
appropriately. Generally, these bots are used for simple
services, such as server administration through the chat
interface, acting as chatterbots, or letting users interact with
the World Wide Web through API calls using only the server's
text interface. These bots, while seemingly a novelty, are
highly popular, as can be seen from the official pages of some
of the more popular bots and the number of servers they are
present in. One example is the music bot Himebot, currently
running on over 66,000 different Discord servers [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>The most popular bots are present in hundreds of thousands
of servers at once. Each server can house thousands of users,
each writing messages that can be used as datasets for analysis.
The case-study explored further in this paper uses a bot on a
single Discord server with 3,000 users that generated 300,000
messages in just under a month.</p>
      <p>
        An important consideration is the legality of using these bots
for logging messages. Taking Discord as the example, its current
terms of service state the following with regard to developers:
“Developers using our SDK or API will have access to their end users
information, including message content, message metadata, and
voice metadata. Developers must use such information only to
provide the SDK/API functionality within their applications
and/or services.” [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] The proposal of using bots for user profiling falls in line
with this, provided that users have the option to view their own
profiles and the process is kept transparent.
Additionally, according to Section 2.4 of the Developer Terms
of Service, the bot functionality would have to be limited
to non-commercial, non-advertisement-based use-cases, which
is consistent with the proposals presented later in this paper.
      </p>
      <p>Lastly, given the bots' modularity, they can act as
simple interfaces between IM applications and analytics servers.
Bots written in many programming languages and housed in many
different servers across a wide array of IMs and IRCs could all
interface with the same central server that analyzes all user
messages. This allows the service to scale widely while being
easily incorporated into existing bots as a new feature.</p>
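      <p>The idea of many bots feeding one central analytics server can be sketched as a set of small adapters that convert platform-specific messages into one common record. The following is a minimal illustration only: the record fields, the function names and the dictionary layout of the chat event are assumptions made for this example and are not part of the case-study implementation. The IRC adapter relies only on the standard form of a raw PRIVMSG line.</p>

```python
import json


def from_irc_line(line, server, timestamp):
    """Convert a raw IRC PRIVMSG line into the common record format."""
    prefix, command, target, text = line.split(" ", 3)
    if command != "PRIVMSG":
        raise ValueError("not a PRIVMSG line")
    return {
        "platform": "irc",
        "server": server,
        "channel": target,
        # The prefix has the form ":nick!user@host"; keep only the nick.
        "author": prefix.lstrip(":").split("!", 1)[0],
        "time": timestamp,
        "content": text[1:] if text.startswith(":") else text,
    }


def from_chat_event(event, platform):
    """Convert an already-parsed chat event (hypothetical dict layout) into the same format."""
    return {
        "platform": platform,
        "server": event["server"],
        "channel": event["channel"],
        "author": event["author"],
        "time": event["time"],
        "content": event["content"],
    }


def to_payload(record):
    """Serialize a record as the JSON body a bot would send to the central server."""
    return json.dumps(record, sort_keys=True)


irc_record = from_irc_line(
    ":larry!larry@example.org PRIVMSG #general :hello from IRC",
    server="irc.example.org",
    timestamp="2018-02-09T18:30:00",
)
im_record = from_chat_event(
    {"server": "example-server", "channel": "general", "author": "jenny",
     "time": "2018-02-09T18:31:00", "content": "hello from an IM server"},
    platform="discord",
)
payload = to_payload(irc_record)
```

      <p>Because both adapters emit the same fields, the central server would only need to understand one record layout, regardless of which IM or IRC network a message came from.</p>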
    </sec>
    <sec id="sec-2">
      <title>IV. USER MESSAGE ANALYSIS MODEL</title>
      <p>With the steadily growing userbase of IM services,
the potential for Big Data analytics projects grows as well.
The abundant public messages exchanged every day on these
services can be used in natural language processing-based
solutions for both training and inference of machine
learning implementations. The developed models can be used to
provide powerful modelling of users' emotional and personality
traits. A proposed methodology for user message analysis
is to:</p>
      <list list-type="simple">
        <list-item>
          <label>(a)</label>
          <p>Group the messages sent on the monitored servers based on time, servers, channels or rooms within those servers and message authors.</p>
        </list-item>
        <list-item>
          <label>(b)</label>
          <p>Input the grouped datasets to NLP tools, such as IBM Watson's Personality Insights API, to obtain personality insight tags for the datasets.</p>
        </list-item>
        <list-item>
          <label>(c)</label>
          <p>Use a different metric, one not used for grouping in step (a), to plot the tagged messages and analyze personality trends.</p>
        </list-item>
        <list-item>
          <label>(d)</label>
          <p>Look for abnormalities or noteworthy trends within the plotted datasets and react appropriately, either by contacting the users in question directly via the bot or by sending warning signals to social services that might be able to respond appropriately themselves.</p>
        </list-item>
        <list-item>
          <label>(e)</label>
          <p>Provide open access to the generated datasets for users within their own message scope, or wider if the message authors consent, in order to comply with the terms of service of the applications and provide transparency for the users.</p>
        </list-item>
      </list>
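      <p>The grouping in step (a) and the choice of a second, unused metric in step (c) can be illustrated with a short sketch. It assumes that each logged message is a dictionary with "author", "channel", "timestamp" and "content" keys; these names and the sample messages are hypothetical and serve only to show the idea. The call to the NLP engine in step (b) is deliberately left out.</p>

```python
from collections import defaultdict
from datetime import datetime


def group_messages(messages, key):
    """Step (a): concatenate the texts of all messages sharing the same value of `key`."""
    groups = defaultdict(list)
    for msg in messages:
        groups[msg[key]].append(msg["content"])
    return {value: " ".join(texts) for value, texts in groups.items()}


def group_by_author_and_day(messages):
    """Steps (a) and (c): group by author, then split each author's messages by calendar day."""
    groups = defaultdict(lambda: defaultdict(list))
    for msg in messages:
        day = datetime.fromisoformat(msg["timestamp"]).date().isoformat()
        groups[msg["author"]][day].append(msg["content"])
    return {
        author: {day: " ".join(texts) for day, texts in days.items()}
        for author, days in groups.items()
    }


# Hypothetical sample messages, not taken from the case-study server.
messages = [
    {"author": "larry", "channel": "general", "timestamp": "2018-02-08T10:00:00", "content": "good morning"},
    {"author": "jenny", "channel": "general", "timestamp": "2018-02-08T10:05:00", "content": "hi"},
    {"author": "larry", "channel": "music", "timestamp": "2018-02-09T21:30:00", "content": "any song tips?"},
]
by_author = group_messages(messages, "author")
by_author_day = group_by_author_and_day(messages)
```

      <p>Each text block in the result is what would be submitted to the NLP engine, so the per-day blocks of one author yield one personality estimate per day that can then be plotted against time.</p>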
      <p>The proposed model makes it possible to look for trends in user or
community personalities and emotions over time or
across different environments (servers and channels). Different
applications can thus be created, making use of the
various combinations of results achievable.</p>
      <p>
        Much work has already been done in the field of social
network analysis, some of it with practical case-studies that
produced highly encouraging results [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. IM applications can
be a potential new input environment for this field of
research.
      </p>
    </sec>
    <sec id="sec-3">
      <title>V. USE-CASES</title>
      <p>The proposed user analysis model, given data reliable enough
for analysis engine inference, can be a promising tool for
user personality profiling, monitoring and the implementation of
autonomous systems that respond to shifts in user personalities
and emotional ranges. Some of the proposed use-cases for the
model are:</p>
      <list list-type="simple">
        <list-item>
          <label>(a)</label>
          <p>Emotional monitoring for early spotting of depression, stress and suicidal tendencies, with automated responses using the same bot system.</p>
        </list-item>
        <list-item>
          <label>(b)</label>
          <p>General community personality analysis for detecting groups of individuals who fall into set personality groups that a corporate or individual entity might be seeking.</p>
        </list-item>
        <list-item>
          <label>(c)</label>
          <p>Analysis of user retention in a text message communication environment.</p>
        </list-item>
        <list-item>
          <label>(d)</label>
          <p>User personality assessment without the need for personalized testing in the form of quizzes, allowing for a less biased result.</p>
        </list-item>
      </list>
    </sec>
    <sec id="sec-4">
      <title>VI. RESULTS</title>
      <p>To test the capabilities of current NLP
systems in an IM environment, a case-study Discord server
housing 3,000 users was monitored from January 25th,
2018 to February 25th of the same year. Over the course of the
month, a bot tracked every message written within the
server and saved it to a logging file alongside extra message
metadata.</p>
      <p>We start by extracting every message that each user
posted within the server and saving the message contents,
the author’s identification tag, the time and the name of the channel
in which the message was written to a logging file for further
processing.</p>
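      <p>A logging step of this kind could look roughly as follows. This is a sketch under assumptions: the JSON Lines file format, the function names and the sample author tags are chosen for illustration and are not taken from the case-study bot, which is not described at this level of detail.</p>

```python
import json
import os
import tempfile


def log_message(path, content, author_tag, timestamp, channel):
    """Append one message and its metadata to a JSON Lines log file."""
    entry = {"content": content, "author": author_tag, "time": timestamp, "channel": channel}
    with open(path, "a", encoding="utf-8") as log_file:
        log_file.write(json.dumps(entry, ensure_ascii=False) + "\n")


def read_log(path):
    """Load every logged message back for further processing."""
    with open(path, encoding="utf-8") as log_file:
        return [json.loads(line) for line in log_file if line.strip()]


# Hypothetical messages written to a temporary log file.
log_path = os.path.join(tempfile.mkdtemp(), "messages.jsonl")
log_message(log_path, "hello", "Larry#0001", "2018-01-25T12:00:00", "general")
log_message(log_path, "hi there", "Jenny#0002", "2018-01-25T12:01:00", "general")
entries = read_log(log_path)
```

      <p>Writing one self-contained record per line keeps the log append-only, so the bot can keep logging while an earlier part of the file is already being analyzed.</p>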
      <p>In our case study, after obtaining a sufficient amount of user
information, the messages are grouped by author,
a few of the most active authors are selected and their
messages are grouped by date, a parameter not used in the initial
message grouping. For this example, we take two of
the most active members on the server, along with the server average,
and compare their Big 5 personality assessments and three ‘needs’ as
output by the IBM NLP engine, Watson’s Personality Insights.</p>
      <p>A trend in user personality traits analyzed by Watson
emerges. A high degree of openness and neuroticism is present
in all users, whilst traits like agreeableness, conscientiousness
and extraversion are vastly underrepresented.</p>
      <p>Next we take our two example users and compare their
results in relation to the server average. This allows us to find
users that are anomalous within the target group and prevent
biased results that might emerge from the format in which the
messages are sent. Figures 3 and 4 represent these users.</p>
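      <p>Expressing a user's result relative to the server average amounts to a per-trait ratio. The sketch below assumes that each profile is a dictionary of percentile scores between 0 and 1 for the Big 5 traits; the function names and all numbers are hypothetical and are not values from the case study.</p>

```python
TRAITS = ["openness", "conscientiousness", "extraversion", "agreeableness", "neuroticism"]


def server_average(profiles):
    """Mean percentile per trait over all user profiles."""
    return {t: sum(p[t] for p in profiles.values()) / len(profiles) for t in TRAITS}


def relative_profile(profile, average):
    """Ratio of a user's percentile to the server average for each trait (1.0 means average)."""
    return {t: profile[t] / average[t] for t in TRAITS}


# Hypothetical percentile scores, not values from the case study.
profiles = {
    "larry": {"openness": 0.75, "conscientiousness": 0.25, "extraversion": 0.125,
              "agreeableness": 0.125, "neuroticism": 0.5},
    "jenny": {"openness": 0.25, "conscientiousness": 0.75, "extraversion": 0.375,
              "agreeableness": 0.125, "neuroticism": 0.5},
}
average = server_average(profiles)
larry_relative = relative_profile(profiles["larry"], average)
```

      <p>A ratio well above or below 1.0 marks a trait on which the user differs from the community, independently of any bias that the message format introduces for all users alike. The sketch assumes that no trait has a server average of zero.</p>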
      <p>For this particular case-study, we take two of the 50 most
active members on the server, each providing, on average, a
few hundred messages per day, and send their grouped
messages to IBM's Watson Personality Insights engine. We
also look at the server’s average results.</p>
      <p>Fig. 3. Larry’s Big 5 personality insight output</p>
      <p>A focal point is that, whilst two users can have
considerably different representations of the common five traits,
the average over the course of the month remains relatively
stable, with specific spikes that are precisely the
type of anomaly such a system would be looking for.
This can, however, be a product of overfitted datasets.
Careful message analysis would be required to conclude how
accurate Watson’s estimation really is in this environment.</p>
      <p>Lastly, we look at some of the “needs” of the same users
as presented by the Watson engine. To focus on
potential cases of depression, we consider only three
parameters: ‘love’, ‘closeness’ and ‘stability’.</p>
      <p>The results are much less spread out than with the Big 5
analysis; however, outliers can still be seen once each user’s
analysis is presented relative to the average. The most notable
is the spike around February 9th, where the need for love rose
for both users, to as much as six times the norm for Jenny, along
with an increase in Larry’s need for closeness.</p>
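      <p>Spikes of this kind can be flagged automatically by comparing a user's daily value with the server average for the same day. The following sketch is an illustration only: the threshold, the function name and the daily values are assumptions, chosen so that one day reaches six times the average.</p>

```python
def find_spikes(user_series, average_series, threshold=2.0):
    """Return (date, ratio) pairs where the user's value is at least `threshold` times the average."""
    spikes = []
    for date in sorted(user_series):
        average = average_series.get(date)
        if not average:  # skip days with a missing or zero server average
            continue
        ratio = user_series[date] / average
        if ratio >= threshold:
            spikes.append((date, ratio))
    return spikes


# Hypothetical daily "love" need values, not values from the case study.
jenny_love = {"2018-02-08": 0.0625, "2018-02-09": 0.375, "2018-02-10": 0.0625}
server_love = {"2018-02-08": 0.0625, "2018-02-09": 0.0625, "2018-02-10": 0.0625}
spikes = find_spikes(jenny_love, server_love, threshold=3.0)
```

      <p>In a deployed system, the flagged dates would be the points at which the bot could contact the user or raise a warning, as proposed in the analysis model.</p>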
      <p>
        The results share a common trend: individual users
receive consistently distinct personality assessments from Watson,
while showing a level of variation over time that may
closely match their mood and personality changes.
This aligns with recent studies showing
that there is valuable insight to be gained from analyzing and
modelling emotions, sentiments and personality traits [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
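      <p>Lastly, Table 1, at the end of the manuscript, shows the
entirety of the server’s average needs output by the Watson
engine. Some additional trends that can be found are that the
needs for curiosity and ideal are much more represented than
the rest and that the percentile of each need remains consistent.</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Server average needs output by the Watson engine</p>
        </caption>
        <table>
          <tbody>
            <tr>
              <td>closeness</td>
              <td>0.15</td>
              <td>0.33</td>
              <td>0.29</td>
              <td>0.28</td>
              <td>0.15</td>
              <td>0.11</td>
              <td>0.20</td>
              <td>0.23</td>
              <td>0.15</td>
              <td>0.27</td>
              <td>0.22</td>
            </tr>
            <tr>
              <td>curiosity</td>
              <td>0.60</td>
              <td>0.72</td>
              <td>0.72</td>
              <td>0.69</td>
              <td>0.72</td>
              <td>0.73</td>
              <td>0.70</td>
              <td>0.74</td>
              <td>0.83</td>
              <td>0.65</td>
              <td>0.71</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>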
    </sec>
    <sec id="sec-5">
      <title>VII. CONCLUSIONS</title>
      <p>In this paper, the motivation for implementing IM server
tracking bots for user personality and emotional analysis
was reviewed and a methodology for implementing such
a system was outlined. Additionally, a case study applying
the proposed methodology to a 3,000-user server
over a period of one month was explored, and example datasets
of some of the users were presented, using the IBM Watson
Personality Insights engine for inference on the messages.</p>
      <p>The results show that active members on IM servers can
provide plenty of data to extract and analyze, with over
300,000 messages being generated per month on the case
study server and at least 100 messages being sent by active
users every day. Furthermore, there is a clear and even
distribution in the average engine output, with a very high
representation of openness (80th percentile), a moderate level
of conscientiousness (30th to 40th percentile) and a very low
representation of extraversion, agreeableness and neuroticism
(all under the 10th percentile). The same underrepresentation
could be seen in the ‘need for consideration, love and
stability’ parameters.</p>
      <p>Additionally, graphs of users relative to the
server average can be used to find anomalous behaviour. The results,
showing clear distinctions between individual users, suggest
that the engine has an analytical accuracy in the IM space
worthy of further consideration and scientific analysis, giving
weight to the proposed use-cases in Section V.</p>
      <p>For future work, an automated system for predicting future
changes in user emotional range will be looked at, using the
generated emotional range results as inputs to a recurrent
neural network.</p>
    </sec>
    <sec id="sec-6">
      <title>ACKNOWLEDGEMENTS</title>
      <p>I would like to express my deepest thanks to my mentor
Kazimieras Bagdonas for suggestions in implementing the
system and providing guidance in the writing of the paper
itself.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Alexander</surname>
          </string-name>
          .
          <article-title>As Discord nears 100 million users, safety concerns are heard</article-title>
          .
          <source>Technical report</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ducharme</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Vincent</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Jauvin</surname>
          </string-name>
          .
          <article-title>A neural probabilistic language model</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>3</volume>
          (Feb):
          <fpage>1137</fpage>
          -
          <lpage>1155</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Collobert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Weston</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bottou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Karlen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kavukcuoglu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Kuksa</surname>
          </string-name>
          .
          <article-title>Natural language processing (almost) from scratch</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>12</volume>
          (Aug):
          <fpage>2493</fpage>
          -
          <lpage>2537</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
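          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Damaševičius</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Sidekerskienė</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Woźniak</surname>
          </string-name>
          .
          <article-title>IMF mode demixing in EMD for jitter analysis</article-title>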
          .
          <source>Journal of Computational Science</source>
          ,
          <volume>22</volume>
          :
          <fpage>240</fpage>
          -
          <lpage>252</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5] Discord.
          <source>Section 2.4 - End User Data</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>High</surname>
          </string-name>
          .
          <article-title>The era of cognitive systems: An inside look at IBM Watson and how it works</article-title>
          .
          <source>IBM Corporation</source>
          , Redbooks,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R.</given-names>
            <surname>Lambiotte</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Kosinski</surname>
          </string-name>
          .
          <article-title>Tracking the digital footprints of personality</article-title>
          .
          <source>Proceedings of the IEEE</source>
          ,
          <volume>102</volume>
          (
          <issue>12</issue>
          ):
          <fpage>1934</fpage>
          -
          <lpage>1939</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>I. C. I. Ltd.</surname>
          </string-name>
          <article-title>Threat report: Messaging applications: The new dark web</article-title>
          .
          <source>Technical report</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Mostafa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Crick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Calderon</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Oatley</surname>
          </string-name>
          .
          <article-title>Incorporating emotion and personality-based analysis in user-centered modelling</article-title>
          .
          <source>In International Conference on Innovative Techniques and Applications of Artificial Intelligence</source>
          , pages
          <fpage>383</fpage>
          -
          <lpage>389</lpage>
          . Springer,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>G.</given-names>
            <surname>Oatley</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Crick</surname>
          </string-name>
          .
          <article-title>Measuring UK crime gangs: a social network problem</article-title>
          .
          <source>Social Network Analysis and Mining</source>
          ,
          <volume>5</volume>
          (
          <issue>1</issue>
          ):
          <fpage>33</fpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>G.</given-names>
            <surname>Oatley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Crick</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Mostafa</surname>
          </string-name>
          .
          <article-title>Digital footprints: envisaging and analysing online behaviour</article-title>
          .
          <source>In Proceedings of AISB Symposium</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] H. official website.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Slack</surname>
          </string-name>
          .
          <article-title>Where work happens around the world</article-title>
          .
          <source>Technical report</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Venckauskas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Karpavicius</surname>
          </string-name>
          ,
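          <string-name>
            <given-names>R.</given-names>
            <surname>Damaševičius</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Marcinkevičius</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kapočiūtė-Dzikienė</surname>
          </string-name>
          , and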
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          .
          <article-title>Open class authorship attribution of Lithuanian internet comments using one-class classifier</article-title>
          .
          <source>In Computer Science and Information Systems (FedCSIS)</source>
          ,
          <source>2017 Federated Conference on</source>
          , pages
          <fpage>373</fpage>
          -
          <lpage>382</lpage>
          . IEEE,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M.</given-names>
            <surname>Wróbel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. T.</given-names>
            <surname>Starczewski</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          .
          <article-title>Handwriting recognition with extraction of letter fragments</article-title>
          .
          <source>In International Conference on Artificial Intelligence and Soft Computing</source>
          , pages
          <fpage>183</fpage>
          -
          <lpage>192</lpage>
          . Springer,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>