<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.18287/1613-0073-2016-1638-851-856</article-id>
      <title-group>
        <article-title>CLASSIFICATION OF TEXT DATA FROM THE SOCIAL NETWORK TWITTER</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>I.A. Rytsarev</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>A.V. Blagov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Samara National Research University</institution>
          ,
          <addr-line>Samara</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <volume>1638</volume>
      <fpage>851</fpage>
      <lpage>856</lpage>
      <abstract>
        <p>Social networks play an important role in the modern world, and identifying the popular topics discussed on them is an important task. This article deals with the collection of data from the social network Twitter and the subsequent clustering and classification of the collected data.</p>
      </abstract>
      <kwd-group>
        <kwd>bigdata</kwd>
        <kwd>data processing</kwd>
        <kwd>data analysis</kwd>
        <kwd>clustering</kwd>
        <kwd>classification</kwd>
        <kwd>tfidf</kwd>
        <kwd>latent dirichlet allocation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Big data refers to extra-large volumes of data in information technology: data sets whose
size is beyond the capabilities of typical databases (DB) for collecting, storing,
managing and analyzing information [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. There are many approaches, tools
and methods for processing such extra-large volumes of structured and unstructured
data [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4">1-4</xref>
        ].
      </p>
      <p>The concept of big data means working with information of vast volume and
varied composition, which is updated very frequently and located in different sources, in order to
increase the efficiency of existing processes and to enable new ones.</p>
      <p>
        At the moment, social networks are at the peak of their popularity: millions of people
already use Facebook and Twitter. Many companies need to analyze the data
collected from social networks to assess the attitude of users towards their products [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
Analysis of this kind is also used in solving security issues [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Having collected
and clustered text data from a social network, it is possible to identify the main
themes and events discussed by its users in different cities and countries.
      </p>
    </sec>
    <sec id="sec-2">
      <title>The clustering of text information based on the frequency analysis</title>
      <p>
        Clustering (or cluster analysis) is the task of partitioning a set of objects into groups
called clusters. Objects within each group must be "similar", while objects from different
groups should differ as much as possible; for this, a certain similarity measure has to be
defined. The main difference between clustering and classification is that in clustering the list of
groups is not defined in advance but is determined during the run of the algorithm.
The main goal of clustering is the discovery of existing structures [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
One of the main methods of frequency analysis is to count the number of occurrences
of each word in a document. Based on the information received, one can create a
so-called "tag cloud": a visual representation of the weights of the words in the document
[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
However, with such processing the output shows that
articles, prepositions and other auxiliary parts of speech have the largest number of
occurrences. Therefore, for a fairer evaluation of the significance of a word, it is
necessary to use measures that not only count the number of occurrences of the
word in the document but also take into account the number of occurrences of the
word in the other documents. An example of such a measure is TF-IDF. In this
measure, the weight of a word is proportional to the frequency of the word in
the document and inversely proportional to the frequency of the word in the
other documents of the collection [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Thus, the TF-IDF measure is the product of two
factors:
tfidf t, d , D  tf t, d  idf t, D, (1)
where tf t, d   ni and idf t, D  log D
      </p>
      <p>
        k nk di  ti 
After determination by the TF-IDF measure, clustering can be performed by various
algorithms, such as k-means, for example [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
2
      </p>
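      <p>As a minimal illustration of measure (1), the TF-IDF weight can be computed directly from its definition. The example tweets and the whitespace tokenization below are invented for demonstration; they are not the data or the code of the tool described later in the article.</p>

```python
import math
from collections import Counter

def tf(term, doc):
    # tf(t, d): relative frequency of the term among all words of the document
    counts = Counter(doc)
    return counts[term] / sum(counts.values())

def idf(term, docs):
    # idf(t, D): log of the collection size over the number of documents containing the term
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / containing)

def tfidf(term, doc, docs):
    # Measure (1): the product of the two factors
    return tf(term, doc) * idf(term, docs)

# Invented example "tweets", already tokenized by whitespace.
docs = [
    "the match in samara was great".split(),
    "great weather in samara today".split(),
    "the new cafe opened today".split(),
]
```

      <p>A word such as "match", which occurs in only one document, receives a higher weight than a common word such as "the", which occurs in several; the resulting weight vectors can then be fed to a clustering algorithm such as k-means.</p>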
    </sec>
    <sec id="sec-3">
      <title>Classification of the text information based on the approach with machine learning</title>
      <p>Also, there is another approach to solving the classification problem: classifying the
information by means of machine learning.</p>
      <p>Machine learning is the process through which a machine (computer) becomes able to show
behavior that was not explicitly programmed into it. There are two types of learning:
inductive and deductive.</p>
      <p>
        In the work of researchers engaged in cluster analysis of text information in
various search engines, the inductive measure Word2vec is frequently used [
        <xref ref-type="bibr" rid="ref10 ref11">10-11</xref>
        ].
The principle of the measure is to find connections between a word and its context,
according to the assumption that words that occur in similar contexts tend to mean
similar things, i.e., to be semantically close. More formally, the task is: to maximize the
cosine proximity between the vectors of words (the scalar product of the vectors) that
appear next to each other, and to minimize the cosine proximity between the vectors
of words that do not appear next to each other. "Next to each other" in this case
means "in close contexts".
      </p>
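      <p>The cosine proximity used in this objective can be sketched in a few lines. The three-dimensional vectors below are toy values chosen for illustration; real word2vec embeddings have hundreds of dimensions and are learned from text.</p>

```python
import math

def cosine(u, v):
    # Cosine proximity: scalar product of the vectors divided by the product of their lengths
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "embeddings": two city names placed close together, a fruit far away.
vec = {
    "paris": [0.9, 0.1, 0.2],
    "berlin": [0.8, 0.2, 0.3],
    "banana": [0.1, 0.9, 0.1],
}
```

      <p>Maximizing this quantity for words that appear in close contexts, and minimizing it for words that do not, is exactly the training objective described above.</p>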
      <p>
        Word2vec analyzes the contexts in which words are used and concludes whether or not the words are
close in meaning. Since word2vec draws such conclusions from large amounts
of text, the conclusions are quite adequate. The algorithms on which word2vec is based
are described in detail in [
        <xref ref-type="bibr" rid="ref12 ref13">12-13</xref>
        ].
      </p>
      <p>The examples of vector distances obtained by word2vec are in Table 1.</p>
      <sec id="sec-3-1">
        <title>The word The vector distance</title>
        <p>Paris 0.978443
Spain 0.665923
Belgium 0.665923</p>
      </sec>
      <sec id="sec-3-2">
        <title>Netherlands 0.652428</title>
        <p>Italy 0.633130
Portugal 0.577154</p>
        <p>Russia 0.571507</p>
        <p>
          Germany 0.563291
One type of a deductive approach can be considered is the Latent Dirichlet Allocation
(LDA). This generative model that allows to explain the results of observations with
the help of implicit groups, that allows to receive an explanation of why some of the
pars of data are similar. Typically, when using this approach, you identify a limited
number of topics and further states that each document is a mixture of a small number
of topics [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
        </p>
        <p>For more detailed analysis, it is best to combine different approaches and techniques
depending on the amount of processed data.
3</p>
      </sec>
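      <p>The generative view of LDA, in which each document is a mixture of a small number of topics, can be sketched as follows. The two topics and their word probabilities are invented placeholders; in LDA these word distributions and the topic weights would be learned from the corpus rather than fixed by hand.</p>

```python
import random

random.seed(42)

def dirichlet(alpha):
    # Sample topic weights from a Dirichlet distribution by normalizing Gamma draws.
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

# Hypothetical topics over a tiny vocabulary.
topics = {
    "sport": {"match": 0.5, "team": 0.4, "city": 0.1},
    "weather": {"rain": 0.5, "cold": 0.4, "city": 0.1},
}
topic_names = list(topics)

def generate_document(n_words, alpha=(0.5, 0.5)):
    # Draw per-document topic weights, then draw each word from a topic
    # chosen according to those weights.
    weights = dirichlet(alpha)
    words = []
    for _ in range(n_words):
        name = random.choices(topic_names, weights=weights)[0]
        vocab = topics[name]
        words.append(random.choices(list(vocab), list(vocab.values()))[0])
    return weights, words

weights, doc = generate_document(10)
```

      <p>Fitting LDA inverts this process: given only the generated words, it recovers the topic-word distributions and the per-document topic mixtures.</p>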
    </sec>
    <sec id="sec-4">
      <title>The method of collecting information</title>
      <p>To study the performance of the TF-IDF algorithm, a software tool that collects data
directly from the Twitter servers has been developed. The implementation is based on
the open Twitter API 2.0 interface. Tweets from the Samara region were taken
as the object of the study; for this, the selection criterion for messages was a geolocation set
to the Samara region (including all settlements of the region). All tweets collected in
this way need to be clustered using the TF-IDF metric for short messages up to 140
characters in length and the k-means algorithm.</p>
      <p>To carry out the data collection, a request containing a consumer
key and a consumer secret key was sent to the Twitter server. In reply we receive oauth.accessToken and
oauth.accessTokenSecret, which give us the ability to retrieve data from the servers.
The second step is to send the query-request, in response to which a set of tweets
is returned.</p>
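      <p>The two-step scheme above can be sketched with the standard library alone. The endpoint path, parameter names and token value below are illustrative placeholders, not the exact strings used by the developed tool, and the request is only constructed here, not sent.</p>

```python
import urllib.parse
import urllib.request

# Placeholder for the access token obtained in the first step
# of the OAuth exchange (oauth.accessToken in the text).
ACCESS_TOKEN = "access-token-placeholder"

def build_query_request(query, geocode):
    # Second step: a query-request asking the server for tweets that match
    # a text query inside a geolocation circle (latitude, longitude, radius).
    params = urllib.parse.urlencode({"q": query, "geocode": geocode, "count": 100})
    return urllib.request.Request(
        "https://api.twitter.com/search/tweets.json?" + params,
        headers={"Authorization": "Bearer " + ACCESS_TOKEN},
    )

req = build_query_request("samara", "53.24,50.22,50km")
```

      <p>The geocode parameter mirrors the selection criterion of the study: only messages geolocated within the given circle around the Samara region would be returned.</p>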
      <p>Next, the third step is counting the TF-IDF metric values for each message.</p>
    </sec>
    <sec id="sec-5">
      <title>Results</title>
      <p>The data was collected for 24 hours using a query-request that characterizes the
Samara region. As the result, over 6000 messages were collected. By applying the
TF-IDF metric and the k-means algorithm, 22 clusters were obtained. For example,
one of the obtained clusters (Figure 2) shows that the messages are similar in
meaning, but among them there are messages with a "foreign" theme.
Apparently, this not quite accurate result was obtained due to the fact that the studied
tweets have a 140-character limit. For this reason, for more accurate clustering and
further classification it is necessary to modernize the TF-IDF measure by introducing
additional weighting coefficients corresponding to the number of symbols (words) in the
message.</p>
      <p>Moreover, the high density of the clusters (Figure 3) shows the need for a revision of the
metric.
The TF-IDF metric works for short messages with only relative accuracy. For this reason it is
necessary to optimize this metric: adding weighting coefficients, coefficients
associated with hashtags, and a normalization coefficient associated with
the length of the message and the number of words in it, etc.</p>
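      <p>One possible form of the normalization coefficient mentioned here, shown purely as an illustration of the idea rather than as the final modified metric, is to damp the weights of words coming from very short messages:</p>

```python
def normalized_weight(tfidf_value, n_words, target_length=10):
    # Hypothetical normalization: scale the raw TF-IDF weight by a coefficient
    # that grows with the number of words in the message, capped at 1.
    coeff = min(1.0, n_words / target_length)
    return tfidf_value * coeff

# The same raw weight contributes less when it comes from a 3-word tweet
# than from a 12-word tweet.
w_short = normalized_weight(0.8, 3)
w_long = normalized_weight(0.8, 12)
assert w_long > w_short
```

      <p>The target length and the linear form of the coefficient are assumptions for illustration; the article leaves the exact choice of coefficients for future work.</p>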
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>Issues related to the clustering and further classification of text data are relevant in relation
to the enormous spread of social networks and online services worldwide.
The approaches and techniques presented in the article are planned for testing on text data
collected from the Russian segment of the Twitter social network. The collection of the necessary
data is being performed by means of the developed software system, based on
time zones and geolocation. It is planned to develop the subject in the direction of
deriving and optimizing parallel clustering algorithms.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Dean</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghemawat</surname>
            <given-names>S.</given-names>
          </string-name>
          <article-title>MapReduce: simplified data processing on large clusters</article-title>
          .
          <source>Communications of the ACM</source>
          ,
          <year>2008</year>
          ;
          <volume>51</volume>
          :
          <fpage>107</fpage>
          -
          <lpage>113</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Vossen</surname>
            <given-names>G</given-names>
          </string-name>
          .
          <article-title>Big Data as the new enabler in business and other intelligence</article-title>
          .
          <source>Vietnam Journal of Computer Science</source>
          ,
          <year>2014</year>
          ;
          <volume>1</volume>
          :
          <fpage>3</fpage>
          -
          <lpage>14</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Tamhane</surname>
            <given-names>DS</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sayyad</surname>
            <given-names>SN</given-names>
          </string-name>
          .
          <article-title>Big Data Analysis Using Hace Theorem</article-title>
          .
          <source>International Journal of Advanced Research in Computer Engineering &amp; Technology (IJARCET)</source>
          ,
          <year>2015</year>
          ;
          <volume>4</volume>
          :
          <fpage>18</fpage>
          -
          <lpage>23</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Kazanskiy</surname>
            <given-names>NL</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Protsenko</surname>
            <given-names>VI</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Serafimovich</surname>
            <given-names>PG</given-names>
          </string-name>
          .
          <article-title>Comparison of system performance for streaming data analysis in image processing tasks by sliding window</article-title>
          .
          <source>Computer Optics</source>
          ,
          <year>2014</year>
          ;
          <volume>38</volume>
          (
          <issue>4</issue>
          ):
          <fpage>804</fpage>
          -
          <lpage>810</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Tan</surname>
            <given-names>W</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blake</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Saleh</surname>
            <given-names>I</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dustdar</surname>
            <given-names>S</given-names>
          </string-name>
          .
          <article-title>Social-network-sourced big data analytics</article-title>
          .
          <source>IEEE Internet Computing</source>
          ,
          <year>2013</year>
          ;
          <volume>5</volume>
          :
          <fpage>62</fpage>
          -
          <lpage>69</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <article-title>How “Big Data” helps to improve security</article-title>
          . URL: http://www.computerra.ru/108760/security-n-big-data/.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <article-title>Data Mining tasks. Classification and clusterization</article-title>
          [In Russian]. URL: http://www.intuit.ru/studies/courses/6/6/lecture/166.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Blagov</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rytcarev</surname>
            <given-names>I</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Strelkov</surname>
            <given-names>K</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khotilin</surname>
            <given-names>M</given-names>
          </string-name>
          .
          <article-title>Big Data Instruments for Social Media Analysis</article-title>
          .
          <source>Proceedings of the 5th International Workshop on Computer Science and Engineering</source>
          ,
          <year>2015</year>
          ;
          <fpage>179</fpage>
          -
          <lpage>184</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Ramos</surname>
            <given-names>J</given-names>
          </string-name>
          .
          <article-title>Using tf-idf to determine word relevance in document queries</article-title>
          . URL: https://www.cs.rutgers.edu/~mlittman/courses/ml03/iCML03/papers/ramos.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Wang</surname>
            <given-names>H</given-names>
          </string-name>
          .
          <article-title>Introduction to Word2vec and its application to find predominant word senses</article-title>
          . URL: http://compling.hss.ntu.edu.sg/courses/hg7017/pdf/word2vec%20and%20its%20application%20to%20wsd.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Yu</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dredze</surname>
            <given-names>M.</given-names>
          </string-name>
          <article-title>Improving lexical embeddings with semantic knowledge</article-title>
          .
          <source>Association for Computational Linguistics (ACL)</source>
          ,
          <year>2013</year>
          ;
          <volume>2</volume>
          :
          <fpage>545</fpage>
          -
          <lpage>550</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Mikolov</surname>
            <given-names>T</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            <given-names>K</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            <given-names>G</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            <given-names>J</given-names>
          </string-name>
          .
          <article-title>Efficient Estimation of Word Representations in Vector Space</article-title>
          . URL: http://arxiv.org/pdf/1301.3781.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Mikolov</surname>
            <given-names>T</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            <given-names>I</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            <given-names>K</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            <given-names>G</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            <given-names>J</given-names>
          </string-name>
          .
          <article-title>Distributed Representations of Words and Phrases and their Compositionality</article-title>
          .
          <source>Advances in neural information processing systems</source>
          ,
          <year>2013</year>
          ;
          <fpage>3111</fpage>
          -
          <lpage>3119</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>MacQueen</surname>
            <given-names>J</given-names>
          </string-name>
          .
          <article-title>Some Methods for Classification and Analysis of Multivariate Observations</article-title>
          .
          <source>In Proc. of the 5th Berkeley Symposium on Mathematical Statistics and Probability</source>
          ,
          <year>1967</year>
          ;
          <fpage>281</fpage>
          -
          <lpage>297</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Blei</surname>
            <given-names>DM</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ng</surname>
            <given-names>AY</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jordan</surname>
            <given-names>MI</given-names>
          </string-name>
          .
          <article-title>Latent dirichlet allocation</article-title>
          .
          <source>The Journal of machine Learning research</source>
          ,
          <year>2003</year>
          ;
          <volume>3</volume>
          :
          <fpage>993</fpage>
          -
          <lpage>1022</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>