<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Modeling of online social networks for automated monitoring system</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yu.B. Savva</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yu.V. Davydova</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Orel State University</institution>
          ,
          <addr-line>95, Komsomol'skaya, 302026, Orel</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <fpage>76</fpage>
      <lpage>79</lpage>
      <abstract>
        <p>Monitoring using keywords is a necessary step in detecting illegal user behavior, such as drug use and extremist propaganda, in online social networks. Analysis of text posts is difficult because users employ jargon and make mistakes in their communications. This paper presents a model of online social networks for an automated monitoring system. The model focuses not on communications between users but on text posts. Features of Russian text posts are described, and the problem of text post obfuscation by users involved in illicit fields of activity is discussed.</p>
      </abstract>
      <kwd-group>
        <kwd>online social networks</kwd>
        <kwd>monitoring</kwd>
        <kwd>text analysis</kwd>
        <kwd>information retrieval</kwd>
        <kwd>fuzzy search</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
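      <p>Data Science / Yu.B. Savva, Yu.V. Davydova</p>
      <p><inline-formula><tex-math>M = \{m_1, m_2, \dots, m_{\mathit{mc}}\}</tex-math></inline-formula> is the set of posts of OSNs, where <inline-formula><tex-math>\mathit{mc}</tex-math></inline-formula> is the number of posts. Every post <inline-formula><tex-math>m_j</tex-math></inline-formula> can be described by the following components:</p>
      <p>– <inline-formula><tex-math>\mathit{text}_j</tex-math></inline-formula> is the text of the post <inline-formula><tex-math>m_j</tex-math></inline-formula>, <inline-formula><tex-math>\mathit{text}_j = \{w_{j1}, w_{j2}, \dots, w_{jg}\}</tex-math></inline-formula>, where <inline-formula><tex-math>w_{ji}</tex-math></inline-formula> is the <italic>i</italic>-th word in the text;
– <inline-formula><tex-math>\mathit{type}_h \in \mathit{Type}</tex-math></inline-formula>, where <inline-formula><tex-math>\mathit{Type} = \{\mathit{type}_1, \mathit{type}_2, \mathit{type}_3\}</tex-math></inline-formula> is the set of post types and <inline-formula><tex-math>\mathit{type}_1</tex-math></inline-formula> is an original post (which means that the user ...);
– <inline-formula><tex-math>\mathit{parent}_j</tex-math></inline-formula> is a user's or community's identifier. If the current post is a repost or a comment, then <inline-formula><tex-math>\mathit{parent}_j</tex-math></inline-formula> contains the identifier of the author who posted the original message.</p>
      <p><inline-formula><tex-math>P = \{p_1, p_2, \dots, p_{\mathit{ic}}\}</tex-math></inline-formula> is the set of pages of OSNs. The number of pages equals the number of user and community identifiers, since every page belongs to a user or a community. A page is defined as follows:</p>
      <disp-formula><tex-math>p_k = \langle i_k, \mathit{tt}_q, c_k, M_k = \{m_{kz} \mid z = 1..x\} \rangle, \quad i_k \in I, \quad x \le \mathit{mc}, \quad q = 1..2,</tex-math></disp-formula>
      <disp-formula><tex-math><![CDATA[c_k = \begin{cases} \emptyset, & \mathit{tt}_k = \mathit{tt}_1, \\ \{ i_n \mid i_n \in I,\ n = 1..\mathit{ic} \}, & \mathit{tt}_k = \mathit{tt}_2, \end{cases}]]></tex-math></disp-formula>
      <p>where: – <inline-formula><tex-math>i_k</tex-math></inline-formula> is the identifier of the user or community that owns the current page <inline-formula><tex-math>p_k</tex-math></inline-formula>;
– <inline-formula><tex-math>\mathit{tt}_k \in \mathit{TT}</tex-math></inline-formula>, where <inline-formula><tex-math>\mathit{TT} = \{\mathit{tt}_1, \mathit{tt}_2\}</tex-math></inline-formula> is the set of page types: <inline-formula><tex-math>\mathit{tt}_1</tex-math></inline-formula> is a personal page and <inline-formula><tex-math>\mathit{tt}_2</tex-math></inline-formula> is a community page;
– <inline-formula><tex-math>c_k</tex-math></inline-formula> is a set of user identifiers. If the current page <inline-formula><tex-math>p_k</tex-math></inline-formula> is a personal page, then <inline-formula><tex-math>c_k</tex-math></inline-formula> is an empty set, as the page belongs to one user. If <inline-formula><tex-math>p_k</tex-math></inline-formula> is a community page, then <inline-formula><tex-math>c_k</tex-math></inline-formula> holds the identifiers of the users who own or manage the community (this can be a single user, in which case <inline-formula><tex-math>c_k</tex-math></inline-formula> holds one element);
– <inline-formula><tex-math>M_k</tex-math></inline-formula> is the group of posts published on the page <inline-formula><tex-math>p_k</tex-math></inline-formula>. It can be empty, <inline-formula><tex-math>M_k = \emptyset</tex-math></inline-formula>, which means that the OSNs page does not contain any posts at the moment.</p>
      <p>
        A set of keywords <inline-formula><tex-math>L = \{l_1, l_2, \dots, l_{\mathit{lc}}\}</tex-math></inline-formula> is given, where <inline-formula><tex-math>\mathit{lc}</tex-math></inline-formula> is the number of keywords. Every keyword is represented by its
grammatical information, semantic information and word forms (according to the inflection rules of the Russian language): <inline-formula><tex-math>l_s = \langle \mathit{GR}_s, \mathit{SM}_s, \mathit{WF}_s \rangle</tex-math></inline-formula>.
This keyword storage model was described in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. This work focuses on the word forms of keywords. They are defined
as a language <inline-formula><tex-math>\mathit{WF}</tex-math></inline-formula> over the alphabet <inline-formula><tex-math>A</tex-math></inline-formula>, <inline-formula><tex-math>\mathit{WF} \subseteq A^{*}</tex-math></inline-formula>.
      </p>
      <p>The goal of the automated monitoring system of OSNs is to find the set of pages <inline-formula><tex-math>\mathit{PF} \subseteq P</tex-math></inline-formula> that contain the required number of
keywords; such pages are indicators of potential illegal actions by their owners. Conceptually it can be presented as
follows:</p>
      <disp-formula><tex-math>\mathit{PF} = \Bigl\{ \langle i_k, \mathit{tt}_q, c_k, M_k = \{m_{kz} \mid z = 1..x\} \rangle \Bigm| \sum_{z=1}^{x} \sum_{q=1}^{\mathit{lc}} f(m_{kz}, l_q) \ge \theta \Bigr\}, \quad k = 1..\mathit{ic},</tex-math></disp-formula>
      <p>where: – <inline-formula><tex-math>f(m_{kz}, l_q)</tex-math></inline-formula> is a function defined as <inline-formula><tex-math>f(m_{kz}, l_q) = y(\mathit{text}_{kz}, \mathit{WF}_q)</tex-math></inline-formula>;
– <inline-formula><tex-math>\theta</tex-math></inline-formula> is the threshold of presence of keywords in the text posts of the current user. It can be set by the decision maker.</p>
      <p><inline-formula><tex-math>y(\mathit{text}_{kz}, \mathit{WF}_q)</tex-math></inline-formula> is the function of fuzzy search matching. Conceptually it can be presented as follows:</p>
      <disp-formula><tex-math>y(\mathit{text}_{kz}, \mathit{WF}_q) = \sum_{i=1}^{g} \sum_{j=1}^{r} d(w_{zki}, \mathit{wf}_{qj}), \quad d(w_{zki}, \mathit{wf}_{qj}) \le \varepsilon,</tex-math></disp-formula>
      <p>where: – <inline-formula><tex-math>d(w_{zki}, \mathit{wf}_{qj})</tex-math></inline-formula> is a distance measure that shows the similarity between the two words <inline-formula><tex-math>w_{zki}</tex-math></inline-formula> and <inline-formula><tex-math>\mathit{wf}_{qj}</tex-math></inline-formula>. The initial states are
<inline-formula><tex-math>d(0, \mathit{wf}_{qj}) = |\mathit{wf}_{qj}|</tex-math></inline-formula> and <inline-formula><tex-math>d(w_{zki}, 0) = |w_{zki}|</tex-math></inline-formula>;
– <inline-formula><tex-math>\varepsilon</tex-math></inline-formula> is the threshold of the distance measure; it can be set by the decision maker. The smaller the value of the distance measure,
the higher the similarity between the words, which means that the current word in the text post is, with high
probability, a keyword written with mistakes. By choosing the threshold of the distance measure, the decision maker controls the levels of precision and
recall of information retrieval. The lower the threshold, the more precise the search is, but some relevant text posts will then be
lost and recall will be lower.</p>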
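      <p>The model and the selection of flagged pages can be illustrated with a minimal Python sketch. It rests on several assumptions that are not stated in the text: the distance measure is taken to be the Levenshtein edit distance (one measure consistent with the initial states given above), the matching function is read as a count of word and word-form pairs whose distance does not exceed the threshold, and the names <monospace>Post</monospace>, <monospace>Page</monospace>, <monospace>eps</monospace> and <monospace>theta</monospace> are hypothetical.</p>
      <preformat><![CDATA[
```python
from dataclasses import dataclass, field


@dataclass
class Post:
    text: list[str]   # words of the post
    post_type: str    # one of the three post types (labels are hypothetical)
    parent: str       # identifier of the author of the original message


@dataclass
class Page:
    owner_id: str                                    # user or community identifier
    page_type: str                                   # "personal" or "community"
    managers: set[str] = field(default_factory=set)  # empty for personal pages
    posts: list[Post] = field(default_factory=list)  # may be empty


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (assumed); d("", b) = len(b), d(a, "") = len(a)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def fuzzy_matches(words, word_forms, eps):
    """Count (word, word form) pairs whose distance does not exceed eps."""
    return sum(1 for w in words for wf in word_forms
               if edit_distance(w.lower(), wf.lower()) <= eps)


def flagged_pages(pages, keywords, eps, theta):
    """Return the pages whose total match count over all posts reaches theta."""
    result = []
    for page in pages:
        total = sum(fuzzy_matches(post.text, forms, eps)
                    for post in page.posts
                    for forms in keywords.values())
        if total >= theta:
            result.append(page)
    return result
```
]]></preformat>
      <p>In this sketch, raising <monospace>eps</monospace> admits more misspelled variants (higher recall, lower precision), while raising <monospace>theta</monospace> requires more keyword occurrences before a page is reported.</p>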
      <p>Analysis of Russian text posts from OSNs revealed that users write in an informal style and
often neglect language rules.</p>
    </sec>
    <sec id="sec-2">
      <title>3. Features of Russian text posts of OSNs users</title>
      <p>Text posts in OSNs have the following features:
- conversational style of writing, with slang, jargon and abbreviations;
- short average post length with weak formal syntactic relations;
- use of smileys and various special symbols;
- intentional and unintentional garbling of words, including spelling errors and typos;
- borrowings from English, such as “4u” (“for you”).</p>
      <p>These features characterize modern informal communication, which combines a high speed of information exchange with
additional expressiveness. Thus, text posts of OSNs users can be considered unstructured sequences of letters, symbols and images
combined with each other, and text analysis should take this into account. Communications in illegal fields of
activity have additional features. To avoid detection by law-enforcement agencies, people use jargon. The main
difficulty of text analysis in this case is the high degree of homonymy: words from the common vocabulary may be
organized into collocations and thus acquire new semantics. New jargon appears constantly. The most striking
example is jargon in the field of illicit traffic of narcotic drugs, where new substances appear rather quickly. It should also be
considered that OSNs users involved in illegal activities obfuscate their posts with the same aim of preventing detection.</p>
    </sec>
    <sec id="sec-3">
      <title>4. Problem of text posts obfuscation and methods dealing with it</title>
      <p>
        The first mention of an obfuscation method appeared in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The authors suggested confusing program code by adding extra variables
and constructions in order to prevent algorithm analysis and deter reverse engineering. Obfuscation can also be used to
optimize code. An analysis of obfuscation methods for computer programs is given in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], and deobfuscation methods are presented in
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Later, obfuscation was applied to spam emails and spam messages on various web sites. In this case obfuscation
lets a message pass through content filtering: obfuscated words cannot be found by exact matching between the words of the message
and the words of a dictionary that contains indicators of spam messages.
      </p>
      <p>
        Text posts in OSNs can be obfuscated by users involved in illegal fields of activity, for instance terrorist and extremist
propaganda or illicit drug sales. In this case, as discussed in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], users obfuscate their posts to prevent effective linguistic
analysis of texts and to avoid detection of their destructive actions and influence on other OSNs users. The following methods
are generally used for text obfuscation:
- intentional garbling of words, including spelling errors, typos and wrong word boundaries (space insertions and deletions);
- substitution of letters by digits or symbols that look like the substituted letters;
- insertion of extra meaningless symbols;
- use of transliteration.
      </p>
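      <p>A naive baseline for reversing two of the listed methods, letter substitution and symbol insertion, can be sketched as follows. The look-alike table is hypothetical and uses Latin letters only for readability; a table for Russian posts would have to be built separately. The sketch does not handle wrong word boundaries or transliteration, and a single table cannot resolve ambiguous substitutions, which illustrates why one obfuscated word may have many candidate readings.</p>
      <preformat><![CDATA[
```python
# Hypothetical look-alike table (assumption): each digit or symbol maps to
# exactly one letter, which real obfuscation does not guarantee.
LOOKALIKES = str.maketrans({
    "0": "o", "1": "l", "3": "e", "4": "a", "5": "s", "@": "a", "$": "s",
})


def normalize_token(token: str) -> str:
    """Undo look-alike substitutions, then drop inserted non-letter symbols."""
    mapped = token.lower().translate(LOOKALIKES)
    # str.isalpha() keeps letters of any alphabet, including Cyrillic.
    return "".join(ch for ch in mapped if ch.isalpha())
```
]]></preformat>
      <p>For example, under this table <monospace>normalize_token("k-e.y_w0rd")</monospace> yields <monospace>keyword</monospace>; the result can then be passed to exact or fuzzy keyword matching.</p>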
      <p>
        Thus, deobfuscation of text posts is a topical and difficult issue. Solving it by computer is not a trivial task, as there are
many ways to obfuscate even a single word. Consequently, methods such as spell checking, deleting non-alphabetic symbols and
constructing variants by possible substitutions are not very effective. Applying a Hidden Markov Model to the task of spam email
deobfuscation showed good results [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Statistical models can also be useful, for instance a model based on Bayes' rule or an
n-gram model [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ].
      </p>
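      <p>As an illustration of the n-gram idea, the sketch below ranks candidate readings of an obfuscated word with a character bigram model. It is not the Hidden Markov Model of the cited works; the add-one smoothing, the boundary markers and all function names are assumptions made for the example, and in practice the model would be trained on a large word list rather than a handful of words.</p>
      <preformat><![CDATA[
```python
import math
from collections import Counter


def train_bigrams(words):
    """Count character bigrams; '^' and '$' mark the word boundaries."""
    counts, context = Counter(), Counter()
    for word in words:
        padded = "^" + word + "$"
        for a, b in zip(padded, padded[1:]):
            counts[a, b] += 1
            context[a] += 1
    return counts, context


def log_prob(word, counts, context, alphabet_size):
    """Log-probability of a word under the bigram model, add-one smoothed."""
    padded = "^" + word + "$"
    return sum(
        math.log((counts[a, b] + 1) / (context[a] + alphabet_size))
        for a, b in zip(padded, padded[1:])
    )


def best_candidate(candidates, counts, context, alphabet_size):
    """Pick the candidate reading that the model finds most word-like."""
    return max(candidates,
               key=lambda c: log_prob(c, counts, context, alphabet_size))
```
]]></preformat>
      <p>Candidates would come from a generator of possible substitutions; the model only orders them, so an implausible letter sequence scores lower than a reading that resembles the training vocabulary.</p>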
    </sec>
    <sec id="sec-4">
      <title>5. Using model of OSNs in automated monitoring system</title>
      <p>The automated monitoring system includes the following main subsystems:
- data collection;
- fuzzy text search, which includes a linguistic knowledge base, a keywords database, and algorithmic search and deobfuscation
modules;
- results processing and report generation modules;
- a database of text posts and a database of the search index.</p>
      <p>According to the model, the data collection subsystem gathers identifiers and text posts with additional attributes such as the type of
message and the time and date of posting. This information is stored in the database of text posts. The decision maker can specify the settings
of the OSNs crawl strategy.</p>
      <p>The fuzzy text search subsystem takes information from the database of text posts and implements the goal of the automated
monitoring system: it tries to detect illegal behavior using the linguistic knowledge base and the keywords database. At the first stage
tokenization is performed and the text is deobfuscated if required. The second stage is fuzzy text search using keywords. The
linguistic knowledge base makes information retrieval less sensitive to mistakes; it contains
information about inflection paradigms and models of mistakes and typos. The keywords database stores grammatical and semantic
information and the word forms of keyword lexemes. If a text post contains the threshold number of keywords, it is indexed
and sent to the database of the search index. Data collection and fuzzy text search run in parallel.
The report generation modules show the user of the monitoring
system different slices of the results, such as topic distribution and the age and location distribution of OSNs users, among others. Results are grouped according to
the threshold of the similarity distance measure.</p>
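      <p>The two stages of the fuzzy text search subsystem can be sketched as a small pipeline. This is only an illustration under stated assumptions: the similarity ratio of Python's standard <monospace>difflib</monospace> module stands in for the distance measure of the model, a plain dictionary stands in for the database of the search index, and the function names, the <monospace>cutoff</monospace> value and the <monospace>deobfuscate</monospace> hook are hypothetical.</p>
      <preformat><![CDATA[
```python
import re
from difflib import get_close_matches


def tokenize(text):
    """Split a post into lowercase tokens made of letters and digits."""
    return re.findall(r"[^\W_]+", text.lower())


def find_keywords(tokens, keyword_forms, cutoff=0.8):
    """Return the keyword behind every token that resembles a word form."""
    form_to_keyword = {form: keyword
                       for keyword, forms in keyword_forms.items()
                       for form in forms}
    hits = []
    for token in tokens:
        close = get_close_matches(token, list(form_to_keyword),
                                  n=1, cutoff=cutoff)
        if close:
            hits.append(form_to_keyword[close[0]])
    return hits


def process_post(post_id, text, keyword_forms, search_index,
                 min_hits=1, deobfuscate=lambda t: t):
    """Stage 1: deobfuscate and tokenize. Stage 2: fuzzy keyword search.

    The post is added to the index only if it reaches the threshold.
    """
    tokens = tokenize(deobfuscate(text))
    hits = find_keywords(tokens, keyword_forms, cutoff=0.8)
    if len(hits) >= min_hits:
        search_index[post_id] = hits
    return hits
```
]]></preformat>
      <p>Because each post is processed independently of the others, such a function can run while the data collection subsystem is still gathering posts.</p>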
    </sec>
    <sec id="sec-5">
      <title>6. Results and discussions</title>
      <p>
        At present the automated monitoring system is intended for detecting drug use propaganda and illicit drug sales in
OSNs [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], though the system can be used in other fields, depending on the keywords database. A linguistic database of keywords
used in the field of illicit traffic of narcotic drugs and psychotropic substances was developed [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. It stores not only
word forms but also the semantics of jargon. A deobfuscation method using a Hidden Markov Model was developed [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]; an example of
the algorithm is presented at 0
      </p>
      <p>
        A corpus of text posts is gathered from the OSN Vkontakte. Algorithms of fuzzy search using keywords from the developed
linguistic database, and models for the linguistic knowledge base, are currently being developed. Features of the algorithms and default values for
the distance measure should be tested on the text corpus and corrected if needed, since they are empirical data:
natural language is difficult to formalize [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ].
      </p>
    </sec>
    <sec id="sec-6">
      <title>7. Conclusion</title>
      <p>To provide information and psychological security for users, monitoring of online social networks needs to be organized.
The monitoring process faces many difficulties, such as short messages in OSNs, informal communication using jargon and text post
obfuscation. To detect users’ illicit activities and destructive influence, effective text analysis and keyword search should be
organized. Thus, in OSNs modeling the emphasis should be on text posts; the corresponding model was presented in this paper. The main
subsystems of the automated monitoring system that use aspects of the model were described. The results of the work done were
discussed and directions of future work were given.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Davydova</surname>
            <given-names>YuV</given-names>
          </string-name>
          .
          <article-title>To the issue of need for automation of threats search process in virtual social networks and communities</article-title>
          .
          <source>Actual problems in modern science in XXI century: proceedings of the 6th international scientific-practical conference</source>
          . Makhachkala: “Aprobaciya” Publisher
          ,
          <year>2014</year>
          ;
          <fpage>25</fpage>
          -
          <lpage>26</lpage>
          . (in Russian)
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <article-title>The top 25 social media monitoring tools</article-title>
          . URL: http://keyhole.co/blog/the-top-25-social-media-monitoring-tools/ (19.01.2017).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Gubanov</surname>
            <given-names>DA</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Novikov</surname>
            <given-names>DA</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chhartishvili</surname>
            <given-names>AG</given-names>
          </string-name>
          .
          <article-title>Online social networks: models of information influence, control and confrontation</article-title>
          . Moscow: “Fizmatlit” Publisher,
          <year>2010</year>
          ; 228 p. (in Russian)
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Savva</surname>
            <given-names>YuB</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Davydova</surname>
            <given-names>YuV</given-names>
          </string-name>
          .
          <article-title>Linguistic database for monitoring system of online social networks in providing information and psychological security</article-title>
          .
          <source>European integration: justice, freedom and security: proceedings of VII scientific and professional conference with international participation: in 3 volumes</source>
          . Belgrade: “Criminalistic-Police Academy” Publisher,
          <year>2016</year>
          ;
          <volume>1</volume>
          :
          <fpage>145</fpage>
          -
          <lpage>154</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Diffie</surname>
            <given-names>W</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hellman</surname>
            <given-names>M</given-names>
          </string-name>
          .
          <article-title>New directions in cryptography</article-title>
          .
          <source>IEEE Transactions on Information Theory</source>
          <year>1976</year>
          ; IT-22(
          <issue>6</issue>
          ):
          <fpage>644</fpage>
          -
          <lpage>654</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Korobejnikov</surname>
            <given-names>AG</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kutuzov</surname>
            <given-names>IM</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kolesnikov</surname>
            <given-names>PYu</given-names>
          </string-name>
          .
          <article-title>Analysis of obfuscation methods</article-title>
          .
          <source>Cybernetics and programming</source>
          <year>2012</year>
          ;
          <volume>1</volume>
          :
          <fpage>31</fpage>
          -
          <lpage>37</lpage>
          . (in Russian)
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Kasperski</surname>
            <given-names>K</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rokko</surname>
            <given-names>E</given-names>
          </string-name>
          .
          <article-title>The art of disassembling</article-title>
          . SPb: BHV-Peterburg,
          <year>2008</year>
          ; 892 p. (in Russian)
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Savva</surname>
            <given-names>YuB</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eryomenko</surname>
            <given-names>VT</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Davydova</surname>
            <given-names>YuV</given-names>
          </string-name>
          .
          <article-title>About the problem of the linguistic analysis of the slang in the problem of the automated search of threats of spread of drug addiction on virtual social networks</article-title>
          .
          <source>Information systems and Technologies</source>
          <year>2015</year>
          ;
          <volume>6</volume>
          (
          <issue>92</issue>
          ):
          <fpage>68</fpage>
          -
          <lpage>75</lpage>
          . (in Russian)
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Lee</surname>
            <given-names>H</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ng</surname>
            <given-names>AY</given-names>
          </string-name>
          .
          <article-title>Spam Deobfuscation using Hidden Markov Model</article-title>
          .
          <source>Proceedings of the Second Conference on Email and Anti-Spam</source>
          ,
          <year>2005</year>
          . URL: http://ai.stanford.edu/~ang/papers/ceas05-spamdeobfuscation.pdf (11.07.2016).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Ingersoll</surname>
            <given-names>GS</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morton</surname>
            <given-names>TS</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Farris</surname>
            <given-names>AL</given-names>
          </string-name>
          .
          <article-title>Taming text. How to find, organize and manipulate it</article-title>
          . NY: Manning Publications Co.,
          <year>2013</year>
          ; 320 p.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Manning</surname>
            <given-names>CD</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raghavan</surname>
            <given-names>P</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schutze</surname>
            <given-names>H</given-names>
          </string-name>
          .
          <article-title>Introduction to information retrieval</article-title>
          . Cambridge: Cambridge University Press,
          <year>2008</year>
          ; 496 p.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Savva</surname>
            <given-names>YuB</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eryomenko</surname>
            <given-names>VT</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Davydova</surname>
            <given-names>YuV</given-names>
          </string-name>
          .
          <article-title>Design of information system identification of persons which participate illicit in field of narcotic drugs and psychotropic substances in the virtual social networks using the database jargon</article-title>
          .
          <source>Information systems and Technologies</source>
          <year>2016</year>
          ;
          <volume>1</volume>
          (
          <issue>93</issue>
          ):
          <fpage>68</fpage>
          -
          <lpage>75</lpage>
          . (in Russian)
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Savva</surname>
            <given-names>YuB</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Davydova</surname>
            <given-names>YuV</given-names>
          </string-name>
          .
          <article-title>Certificate of state registration database no. 2016620197. Jargon in the field of illicit traffic of narcotic drugs and psychotropic substances</article-title>
          . Registered 10 February
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Nikol'skaya</surname>
            <given-names>AN</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Savva</surname>
            <given-names>YuB</given-names>
          </string-name>
          .
          <article-title>About the problem of opening of obfuscated Russian-language texts of participants of online social networks</article-title>
          .
          <source>Information systems and Technologies</source>
          <year>2016</year>
          ;
          <volume>6</volume>
          (
          <issue>98</issue>
          ):
          <fpage>44</fpage>
          -
          <lpage>55</lpage>
          . (in Russian)
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>