=Paper=
{{Paper
|id=Vol-2667/paper43
|storemode=property
|title=Approaches to sentiment analysis of the social network text data 
|pdfUrl=https://ceur-ws.org/Vol-2667/paper43.pdf
|volume=Vol-2667
|authors=Vadim Moshkin,Nadezhda Yarushkina,Ilya Andreev
}}
==Approaches to sentiment analysis of the social network text data ==
<pdf width="1500px">https://ceur-ws.org/Vol-2667/paper43.pdf</pdf>
<pre>
        Approaches to sentiment analysis of the social
                     network text data
          Vadim Moshkin                                      Nadezhda Yarushkina                                   Ilya Andreev
 Ulyanovsk State Technical University                 Ulyanovsk State Technical University              Ulyanovsk State Technical University
         Ulyanovsk, Russia                                    Ulyanovsk, Russia                                 Ulyanovsk, Russia
        v.moshkin@ulstu.ru                                       jng@ulstu.ru                                  ia.andreev@ulstu.ru

    Abstract—The article provides an overview of the most                         the presence of speech and spelling errors.
modern approaches to sentiment analysis of text data. The
features of using machine learning approaches and dictionary-                     the use of smileys and emoji to give the message a certain
based methods are also described. In addition, the description                     emotional coloring.
of sentiment dictionaries and the most popular software for
                                                                                 In this article we will consider the use of various existing
sentiment analysis of data are given. An original approach was
also proposed for sentiment analysis of text data using the
                                                                             algorithms for assessing the sentiment of social network texts
integration of machine learning methods with the Word2Vec                    within the framework of the developed software system for
data vectorization algorithm. Also presented is the architecture             Opinion Mining. The article proposes an original approach
of the developed system for Opinion Mining data of social                    for analyzing the emotional coloring of text data using the
networks. At the end of the article, experiments are presented               integration of machine learning methods with the Word2Vec
to evaluate text reviews using the data from the IMDB portal                 algorithm.
as an example, confirming the proposed approach.
                                                                             II.THE EXISTING METHODS AND SOFTWARE FOR SENTIMENT
   Keywords—sentiment analysis, word2vec, Opinion Mining,                                       ANALYSIS OF TEXT DATA
machine learning                                                                   There are two main groups of methods for the
                         I.INTRODUCTION                                      automatic sentiment analysis of text data:
    Currently, the Internet and social networks are the main                       A. Statistical methods
source of information about the interests of a client, making                    These methods are based on a machine learning
it possible to prepare and proactively offer a new product or                classifier. The classifier is first trained on pre-labeled
service [1]. This problem is addressed by Opinion Mining.                    texts and then uses the resulting model to analyze new
Opinion Mining for data from social networks involves two                    documents. The algorithm consists of the following
tasks:                                                                       steps:
     • morphological analysis to identify the entities that will                  • a collection of documents is gathered for training
       be evaluated;                                                                the classifier;
     • analysis of the sentiment of expressions related to each                   • each document is converted into a feature vector;
       entity.
                                                                                  • the correct sentiment type is indicated for each
    By analyzing the sentiment of users’ text messages, the                         document;
researcher can draw conclusions about:
                                                                                  • a classification algorithm and a method for
     • users’ emotional evaluation of various events and                            training the classifier are selected;
       objects;
                                                                                  • the resulting model is used to determine the
     • individual user preferences;                                                 sentiment of documents in a new collection.
     • some features of the users’ character [2].                                     The disadvantage of such methods is the need for a
    Sentiment analysis is a branch of text mining concerned                  large amount of training data.
with automatically extracting subjective opinions from text.                         The statistical approach widely uses support vector
Sentiment analysis determines not so much the content of                     machines (SVM) [3], Bayesian models [4], various types
the text as its tonality.                                                    of regression [5], the Word2Vec and Doc2Vec methods [6],
                                                                             CRF [7], and convolutional and recurrent neural networks
    Automatic analysis of the tonality of the text is based on               [8][9].
the technologies of linguistic interpretation of emotions,
machine learning, extracting emotional meaning from
                                                                                 Word2Vec. The Word2Vec method is based on the
information, etc.
                                                                             vector representation of words and the determination of the
     The technology of sentiment analysis has become                         semantic proximity of lexical units based on their
especially relevant with the development of Web 2.0, as a                    distribution in collections of texts on specific topics.
tool for monitoring the views of millions of Web users.
                                                                                 A large set of texts is fed into Word2Vec. A specialized
    However, text data in social networks have the following                 vocabulary is built from this set of texts while the model is
features:                                                                    trained on it. At the second stage, the vocabulary is turned
                                                                             into a set of vector representations of words. This
     • the use of both complete and incomplete sentences.                    representation is based on the contextual proximity of a
                                                                             given word: if the


Copyright © 2020 for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)
Data Science

words are found in the text side by side often enough, then                   The disadvantage of this method is a significant amount
there is a semantic connection between them, and therefore,               of labor because the method requires the creation of many
in the vector representation, these words will have close                 rules.
coordinates.
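
    As an illustration of what "close coordinates" means, the proximity of two
word vectors is commonly measured by cosine similarity. The following is a
minimal sketch only: the three-dimensional vectors and the helper names
(cosine_similarity, most_similar) are hypothetical and are not taken from the
described system; real Word2Vec models use far more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # similarity is undefined for a zero vector; use 0 by convention
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings, invented for this example.
vectors = {
    "good": [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "table": [0.0, 0.1, 0.9],
}


def most_similar(word, vectors):
    """Return the other word whose vector is closest to that of `word`."""
    return max(
        (w for w in vectors if w != word),
        key=lambda w: cosine_similarity(vectors[word], vectors[w]),
    )
```

    With these toy vectors, most_similar("good", vectors) selects "great",
because the two vectors point in almost the same direction.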
                                                                               A mixed method is also sometimes used [14-16].
    For this algorithm, two training methods were developed:
CBOW and Skip-gram. Schemes of these methods are                              C. Dictionaries and thesauri
presented in Fig. 1. The first method (CBOW) predicts the                    There are a number of thesauri labeled with regard to the
current word from the words that surround it in the training              emotional component. These dictionaries are necessary for
text. The second method (Skip-gram) works the other way                   computer programs when analyzing the tonality of the text.
round: it predicts the surrounding words from the current                     WordNet-Affect is a semantic thesaurus in which
word. The result is the ability to calculate the "semantic                concepts related to emotions are represented using words that
distance" for each pair of words.                                         have an emotional component. WordNet-Affect also uses
                                                                          additional emotional labels to separate synsets according to
                                                                          their emotional valency. To do this, four additional
                                                                          emotional labels are defined:
                                                                               • positive;
                                                                               • negative;
                                                                               • ambiguous;
                                                                               • neutral.
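
    The difference between the CBOW and Skip-gram training schemes can be
illustrated by the training examples that each scheme builds from a sentence.
The sketch below is an assumption-based illustration, not the Word2Vec
implementation itself; the function names and the window size are hypothetical.

```python
def cbow_examples(tokens, window=2):
    """CBOW view: each example is (list of context words, target word)."""
    examples = []
    for i, target in enumerate(tokens):
        # Up to `window` words on each side of the target form its context.
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            examples.append((context, target))
    return examples


def skipgram_examples(tokens, window=2):
    """Skip-gram view: each example is (current word, one context word)."""
    examples = []
    for context, target in cbow_examples(tokens, window):
        examples.extend((target, c) for c in context)
    return examples


tokens = ["the", "film", "was", "great"]
cbow = cbow_examples(tokens, window=1)
skipgram = skipgram_examples(tokens, window=1)
```

    For the word "film" with a window of one word, CBOW forms a single example
that maps the context ["the", "was"] to "film", whereas Skip-gram forms two
examples, ("film", "the") and ("film", "was").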
                                                                              SentiWordNet [17] is a lexical semantic thesaurus. The
                                                                          first version of SentiWordNet was developed in 2006. This
Fig. 1. CBOW and Skip-gram training methods.                              thesaurus appeared as a result of automatic annotation of
                                                                          each set of synonyms in accordance with its degree of
    Doc2Vec. The Doc2Vec method comprises two variants:                       positivity, negativity and objectivity.
distributed memory (DM) and distributed bag of words
(DBOW). The DM variant predicts a word from the known                         SenticNet is another semantic thesaurus for working with
preceding words and a paragraph vector. The paragraph                     sets of emotional concepts. SenticNet is used to build
vector stays fixed and captures the word order even though                intelligent applications that analyze the emotional
the context window moves through the text. DBOW                           component of text. The main purpose of SenticNet is to
predicts random groups of words in a paragraph based only                 simplify the machine recognition of conceptual and
on the paragraph vector.                                                  emotional information conveyed in natural language.
                                                                          Compared with SenticNet, other lexical thesauri such as
    A serious disadvantage of this method is the complexity               SentiWordNet and WordNet-Affect link words and
of the analysis of the training sample, which is why it is                emotional concepts only at the syntactic level and do not
extremely difficult to continuously update the model when                 reveal the semantic component.
new training data is received.
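
    The general scheme of the statistical methods in Section II.A (collect
labeled documents, build features, train a classifier, apply it to new
documents) can be sketched with a small multinomial Naive Bayes classifier,
one of the Bayesian models mentioned above. This is a minimal sketch under
stated assumptions: the four training documents, the labels "pos"/"neg" and
the function names are invented for illustration and are not the authors'
data or code.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(documents, labels):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(documents, labels):
        word_counts[label].update(tokens)
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocabulary


def predict(model, tokens):
    """Return the class with the highest log-probability for `tokens`."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # class prior
        for w in tokens:
            if w in vocabulary:  # words never seen in training are ignored
                # Laplace (add-one) smoothing avoids zero probabilities.
                score += math.log(
                    (word_counts[label][w] + 1) / (total_words + len(vocabulary))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical pre-labeled training collection.
train_docs = [["great", "film"], ["wonderful", "acting"],
              ["boring", "plot"], ["terrible", "film"]]
train_labels = ["pos", "pos", "neg", "neg"]
model = train_naive_bayes(train_docs, train_labels)
```

    Calling predict(model, ["great", "acting"]) returns "pos" and
predict(model, ["boring", "film"]) returns "neg" for this toy collection. A
real system would need a far larger labeled corpus, which is the disadvantage
noted above.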
    B. Methods based on dictionaries
    The method using dictionaries is based on the search for
emotive vocabulary (lexical tonality) in the text according to
pre-compiled tonal dictionaries and rules using linguistic
analysis.
   These methods can use rule lists that are substituted into
regular expressions and special rules for connecting tonal
vocabulary within sentences [10].
    Dictionary terms must carry a weight that corresponds to
the subject area of the document in order to classify the
document with high accuracy. Each time a marker word is
found, its emotion is taken into account. The result of the
algorithm is the average emotional color of the text [11-12].
The following algorithm is usually used:
     • assign the sentiment score from the dictionary to each
       word in the text;
     • calculate the overall sentiment score of the entire text
       by adding the sentiment scores of the individual words
       [13].                                                               Fig. 2. Developed clustering algorithm.
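
    The two steps of the dictionary-based algorithm listed above can be
sketched as follows. The lexicon, its weights and the function name are
hypothetical and serve only as an illustration; a real sentiment dictionary
would be much larger and weighted for the subject area.

```python
import re

# Hypothetical sentiment dictionary: word -> sentiment score.
SENTIMENT_LEXICON = {"good": 1.0, "great": 2.0, "bad": -1.0, "terrible": -2.0}


def score_text(text, lexicon=SENTIMENT_LEXICON):
    """Return (total score, average score per marker word) for `text`."""
    words = re.findall(r"[a-z']+", text.lower())
    # Step 1: assign the dictionary score to each word found in the lexicon.
    scores = [lexicon[w] for w in words if w in lexicon]
    # Step 2: add the scores of the individual words.
    total = sum(scores)
    average = total / len(scores) if scores else 0.0
    return total, average


total, average = score_text("Great plot, but terrible acting and bad music")
```

    Here the marker words "great", "terrible" and "bad" give a total of -1.0,
so the text is rated as slightly negative. Words absent from the dictionary do
not affect the result.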


VI International Conference on "Information Technology and Nanotechnology" (ITNT-2020)                                             199

     D. Existing sentiment analysis software                              vectors were used to identify words most accurately. The
    A number of libraries and software tools currently exist              resulting model is saved as a file.
for sentiment analysis of text data.                                          3) The next stage is the clustering of the word vectors
    Chorus is a service for determining the emotional                     using the K-Means method in order to group synonyms
coloring of email. The service was created by a startup                   and similar words. The number of clusters should be such
from Australia. Chorus is intended for customer support                   that there are on average 5 words per cluster, which gives
services:                                                                 the most accurate result.
     • recommends the next message to process;
                                                                              After all the significant words have been split into
     • indicates a message that needs an urgent response;                 clusters, the data must be prepared for machine learning. A
                                                                          two-dimensional array is created for each file as follows:
     • indicates where a client can be retained after the
       response.                                                               • the number of rows is equal to the number of text
                                                                                 messages in the file;
   The disadvantage is that only emails can be analyzed. The
service is no longer supported.                                                • the number of columns is equal to the number of
    Sentiment Analysis with Python NLTK Text Classification                      clusters.
[18] is a demo showing the capabilities of NLTK. It classifies                These data will be important in determining the
the emotional coloring as positive, negative or neutral. An               emotional coloring.
API with restrictions and the option to buy premium access
is also offered. The demo is a form for manual checking
with a limit on the number of characters.
    SentiStrength [19] is a library for analyzing emotional
coloring. The algorithm searches for the maximum tonality
value in the text on each scale (i.e., it finds the word with
the maximum negative rating and the word with the
maximum positive rating) [20]. As a result, a dual score                  Fig. 3. Random Forest sentiment analysis of the text.
(positive and negative) from 1 to 5 is given. Ternary and
single-score output options are also available. The library               The Random Forest model (Fig. 3) was used for machine
is paid. It can be tried out on the project website.                      learning. The random forest method is currently one of the
                                                                          most popular and effective methods for solving machine
   Tone Analyzer [21] is a service from IBM based on IBM                  learning problems, such as classification and regression. It
Watson. This service uses linguistic analysis to detect                   trains not a single decision tree with its own weights, but
emotional and linguistic connotations in written text.                    many decision trees [23].
Use cases for the analyzer include social listening, improving               Prediction and the calculation of the algorithm's accuracy
the quality of customer service and integration with chat                 are performed using the trained model.
bots. This service is paid and supports only English.
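
    The idea of combining many decision trees, and the accuracy calculation
mentioned in Section III, can be illustrated as follows. This sketch shows
only the voting and evaluation steps: the three "trees" are hand-written
hypothetical rules over invented feature names, whereas in a real random
forest each tree is trained on a random sample of the data and features.

```python
from collections import Counter


def forest_predict(trees, features):
    """Majority vote over the predictions of the individual trees."""
    votes = Counter(tree(features) for tree in trees)
    return votes.most_common(1)[0][0]


def accuracy(predicted, expected):
    """Share of predictions that match the expected labels."""
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)


# Hypothetical stand-ins for trained decision trees.
trees = [
    lambda f: "pos" if f["positive_cluster"] > f["negative_cluster"] else "neg",
    lambda f: "pos" if f["positive_cluster"] >= 2 else "neg",
    lambda f: "neg" if f["negative_cluster"] >= 1 else "pos",
]

samples = [
    {"positive_cluster": 3, "negative_cluster": 1},
    {"positive_cluster": 0, "negative_cluster": 2},
    {"positive_cluster": 1, "negative_cluster": 0},
]
predictions = [forest_predict(trees, s) for s in samples]
```

    For the first sample two of the three trees vote "pos", so the ensemble
answers "pos" even though one tree disagrees. The accuracy function then
compares such predictions with the known labels of a held-out set.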
                                                                          IV.SOFTWARE ARCHITECTURE FOR OPINION MINING SOCIAL
III.SENTIMENT ANALYSIS USING MACHINE LEARNING AND                                               MEDIA
                    WORD2VEC.
                                                                              A module for assessing the tonality of texts in the
    The proposed method of text sentiment analysis combines               information system for Opinion Mining (Fig. 4) was
the clustering of word vectors with a Random Forest classifier.           developed to evaluate the effectiveness of the proposed
    The developed algorithm is shown schematically in                     algorithms [24-26].
Fig. 2.
     1) Text data pre-processing is carried out at the first
stage. HTML markup, any non-alphabetic characters, and
stop words are removed from the text. Stop words are
phrases and words that carry no semantic load and
make it difficult for search engines to index a page. Then
all remaining words are converted to lowercase.
     2) At the second stage, the text from these files (test
and training), presented as a list of significant
words, is processed using the Word2Vec tool.
     Word2Vec is an open source tool provided by Google for
calculating distances between words [22]. Word2Vec creates a
special model that includes a dictionary of words with their
vector representations.
   Synonyms and similar words can be identified by the
similarity of their vectors. 300-dimensional
                                                                          Fig. 4. The architectural scheme of the software system for Opinion Mining
                                                                          social media.
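
    The pre-processing stage and the two-dimensional feature array described
in Section III can be sketched as follows. Several points are assumptions made
for illustration: the stop-word list is a tiny placeholder, lowercasing is
applied before the stop-word filter for simplicity, the word-to-cluster
mapping stands in for the output of K-Means, and each cell is assumed to hold
the number of words of the message that fall into the cluster (the text does
not specify the cell contents).

```python
import re
from html import unescape

# Tiny illustrative stop-word list; a real list would be much longer.
STOP_WORDS = {"the", "a", "an", "is", "was", "and", "of", "it", "this"}


def preprocess(text):
    """Strip HTML tags and non-alphabetic characters, lowercase, drop stop words."""
    text = unescape(re.sub(r"<[^>]+>", " ", text))  # remove HTML tags
    text = re.sub(r"[^A-Za-z]+", " ", text)         # keep alphabetic characters only
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def suggested_cluster_count(vocabulary_size, words_per_cluster=5):
    """Number of clusters giving about `words_per_cluster` words per cluster."""
    return max(1, vocabulary_size // words_per_cluster)


def cluster_features(messages, word_to_cluster, n_clusters):
    """Build one row per message and one column per cluster (assumed: counts)."""
    rows = []
    for tokens in messages:
        row = [0] * n_clusters
        for w in tokens:
            cluster = word_to_cluster.get(w)
            if cluster is not None:  # words without a cluster are skipped
                row[cluster] += 1
        rows.append(row)
    return rows


# Hypothetical clustering result: synonyms share a cluster index.
word_to_cluster = {"film": 0, "movie": 0, "great": 1, "awful": 2}
messages = [["film", "great"], ["awful", "movie", "awful"]]
features = cluster_features(messages, word_to_cluster, n_clusters=3)
```

    Each row of features describes one message and can be passed to the
classifier together with the known sentiment label of that message.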



                                                                                              TABLE I. RESULTS OF EXPERIMENTS
    This information system solves the following tasks:                   Accuracy   Vector dimension   Number of clusters   Min. word occurrences
     • extracts data from various social networks (Facebook,              0.789      300                3956                 60
       Ok, VKontakte, Instagram, Twitter, etc.);                          0.776      200                2301                 100
                                                                          0.772      400                2301                 100
     • conducts preprocessing of the extracted data;                      0.771      300                1573                 60

     • matches (compares) user profile data
       from different social networks;                                                                   CONCLUSION
     • translates the extracted data into an internal format for              Thus, in this paper, an approach to the analysis of the text
       storing knowledge;                                                 data of social networks was proposed. This approach is based
                                                                          on the integration of the Word2Vec vectorization algorithm
     • conducts semantic analysis of data using subject                   and the k-means clustering algorithm, with a random forest
       ontologies to simplify the search;                                 algorithm used to train the classifier. This approach was
     • conducts sentiment analysis of the extracted data                  implemented in the Opinion Mining analysis system.
       using the developed algorithm.                                        Experiments were conducted to evaluate the effectiveness
    The developed software system for Opinion Mining has a                of this algorithm when analyzing user reviews from the
service architecture and supports the REST architectural                  IMDB portal. The experiments showed that the best result
style. The ElasticSearch library is used to extract and                   was obtained using 300-dimensional vectors, a minimum
preprocess data. MongoDB is used to store a large set of                  number of word repetitions of 60, and a number of
data. The Cypher query language is used to search the graph               clusters calculated so that each cluster had an average of
database Neo4j [27-28].                                                   5 words, i.e. 3956 clusters.
                                                                              In future work, we plan to hybridize this approach using
                   V.EXPERIMENT RESULTS.
                                                                          well-known sentiment ontologies and dictionaries to take
   Experiments were conducted to determine the accuracy                   into account the peculiarities of word usage and language.
of estimating the emotional coloring of text data using the
random forest method.                                                                              ACKNOWLEDGMENT
    Test data is a data set from the IMDB site that contains                 This work was supported by the Russian Federal Property
100,000 detailed film reviews (positive and negative). 1,500              Fund. Projects No. 18-47-730035 and 18-47-732007.
reviews were taken separately to verify accuracy. The
                                                                                                       REFERENCES
maximum accuracy is 79% because some reviews do not                       [1]  O. Shipilov and A. Belyaev, “Analysis of the emotional color of
contain emotional coloring, but are only a retelling of the                    messages in the social network twitter,” Science Questions, vol. 3,
plots of films, which lowered the accuracy of the program.                     pp. 91-98, 2016.
                                                                          [2] D. Vlasov, “Description of the information image of a social
    Depending on the parameters used, the running time of the                  network user, taking into account its psychological characteristics,”
algorithm ranged from 40 to 55 minutes. The experiments                        International Journal of Open Information Technologies, vol. 6, no.
identified the optimal values of the algorithm's parameters:                   4, 2018.
the dimension of the vectors, the number of clusters                      [3] M.S. Sabuj, Z. Afrin and K.M.A. Hasan, “Opinion Mining Using
                                                                               Vector Machine for Web Based Diverse Data,” Pattern Recognition
and the minimum number of occurrences of a word in the                         and Machine Intelligence. Lecture Notes in Computer Science, vol.
reviews for it to be considered significant.                                   10597, pp. 673-678, 2017.
   The results of the experiments are presented in Table 1                [4] L.P. Dinu and I. Iuga, “The Best Feature of the Set,” Computational
                                                                               Linguistics and Intelligent Text Processing. Lecture Notes in
and Fig.5.                                                                     Computer Science, vol 7181, pp. 556-567, 2012.
    The best result was obtained with 300-dimensional                     [5] I. Chetviorkin and N. Loukachevitch, “Sentiment Analysis Track at
vectors, a minimum number of word repetitions equal                            ROMIP-2012,” Computational linguistics and intellectual
                                                                               technologies: Sat scientific articles, vol. 2, pp. 40-50, 2013.
to 60, and the number of clusters chosen so that each
                                                                          [6] Q. Chen and M. Sokolova, “Word2Vec and Doc2Vec in
cluster had an average of 5 words, i.e. 3956 clusters.                         Unsupervised Sentiment Analysis of Clinical Discharge
                                                                               Summaries,” CoRR abs 1805.00352, 2018.
                                                                          [7] A. Antonova and A. Soloviev, “Using the conditional random
                                                                               fields method for processing texts in Russian,” Computational
                                                                               linguistics and intellectual technologies: Sat scientific articles, vol.
                                                                               12, no. 19, pp. 27-44, 2013.
                                                                          [8] A. Maas, R. Daly, P. Pham, D. Huang, A. Ng and C. Potts,
                                                                               “Learning word vectors for sentiment analysis,” The International
                                                                               Language       Technologies.       International      Association    for
                                                                               Computational Linguistics, vol. 1, pp. 142-150, 2011.
                                                                          [9] Yu.V. Vizilter, V.S. Gorbatsevich and S.Y. Zheltov, “Structure-
                                                                               functional analysis and synthesis of deep convolutional neural
                                                                               networks,” Computer Optics, vol. 43, no. 5, pp. 886-900, 2019. DOI:
                                                                               10.18287/2412-6179-2019-43-5-886-900.
                                                                          [10] H. Saif, “Contextual semantics for sentiment analysis of Twitter,”
                                                                               Information Processing & Management, vol. 52, no. 1, pp. 5-19,
                                                                               2016.
                                                                          [11] A. Pak and P. Paroubek, “Twitter as a Corpus for Sentiment
                                                                               Analysis and Opinion Mining,” LREC, 2010.
Fig. 5. Sentiment analysis using machine learning and Word2Vec.



[12] A. Tarasova, “Synergy of interrogative and exclamation marks in         [22] Watson Tone Analyzer [Online]. URL: https://www.ibm.com/cloud/
     network texts (on the material of Tatar, Russian and English                 watson-tone-analyzer.
     languages),” Bulletin of Vyatka State University, vol. 4, 2015.         [23] Introduction to Word Embedding and Word2Vec [Online]. URL:
[13] S. Ionova, “Emotiveness of a Text as a Linguistic Problem,” Diss ....        https://towardsdatascience.com/introduction-to-word-embedding-
     Cand. filol. Sciences, 1998.                                                 and-word2vec-652d0c2060fa.
[14] B. Pang, L. Lee and S. Vaithyanathan, “Thumbs up? Sentiment             [24] J. Žižka, F. Dařena and A. Svoboda, “Random Forest,” 2019. DOI:
     Classification using Machine Learning Techniques,” pp. 79-86, 2002.          10.1201/9780429469275-8.
[15] P. Turney, “Thumbs Up or Thumbs Down? Semantic Orientation              [25] N. Yarushkina, A. Filippov, M. Grigoricheva and V. Moshkin, “The
     Applied to Unsupervised Classification of Reviews,” Proceedings of           Method for Improving the Quality of Information Retrieval Based on
     the Association for Computational Linguistics, pp. 417-424, 2002.            Linguistic Analysis of Search Query,” Artificial Intelligence and
[16] I.A. Rycarev, D.V. Kirsh and A.V. Kupriyanov, “Clustering of                 Soft Computing. Lecture Notes in Computer Science, vol. 11509, pp.
     media content from social networks using BigData technology,”                474-485, 2019.
     Computer Optics, vol. 42, no. 5, pp. 921-927, 2018. DOI: 10.18287/      [26] A. Pazelskaya and A. Soloviev, “Method for determining emotions
     2412-6179-2018-42-5-921-927.                                                 in texts in Russian,” Computational linguistics and intellectual
[17] V. Moshkin, N. Yarushkina and I. Andreev, "The Sentiment                     technologies: Sat scientific articles, vol. 11, no. 18, pp. 510-523,
     Analysis of Unstructured Social Network Data Using the Extended              2011.
     Ontology SentiWordNet," 12th International Conference on                [27] A. Filippov, V. Moshkin and N. Yarushkina, “Development of the
     Developments in eSystems Engineering (DeSE), Kazan, Russia, pp.              Social Media Analysis,” Recent Research in Control Engineering
     576-580, 2019.                                                               and Decision Making. Studies in Systems, Decision and Control, vol.
[18] Natural Language Processing APIs and Python NLTK Demos                       199, pp. 421-432, 2019.
     [Online]. URL: https://text-processing.com/demo/sentiment/.             [28] N. Yarushkina, A. Filippov, V. Moshkin, G. Guskov and A.
[19] A. Esuli and F. Sebastiani, “SENTIWORDNET: A Publicly Available              Romanov, “Intelligent Instrumentation for Opinion Mining in Social
     Lexical Resource for Opinion Mining,” pp. 417-422, 2006.                     Media,” Proceedings of the II International Scientific and Practical
                                                                                  Conference Fuzzy Technologies in the Industry, Ulyanovsk, Russia,
[20] SentiStrength [Online]. URL: http://sentistrength.wlv.ac.uk/.
                                                                                  pp. 50-55, 2018.
[21] I. Menshikov and A. Kudryavtsev, “Review of systems for analysis
     of tonality of a text in Russian,” Young scientist, no. 12, pp. 140-
     143, 2012.



</pre>