=Paper= {{Paper |id=Vol-2870/paper23 |storemode=property |title=Usage of Sentiment Analysis to Tracking Public Opinion |pdfUrl=https://ceur-ws.org/Vol-2870/paper23.pdf |volume=Vol-2870 |authors=Zoia Kochuieva,Natalia Borysova,Karina Melnyk,Dina Huliieva |dblpUrl=https://dblp.org/rec/conf/colins/KochuievaBMH21 }} ==Usage of Sentiment Analysis to Tracking Public Opinion== https://ceur-ws.org/Vol-2870/paper23.pdf
Usage of Sentiment Analysis to Tracking Public Opinion
Zoia Kochuieva, Natalia Borysova, Karina Melnyk and Dina Huliieva
National Technical University “Kharkiv Polytechnic Institute”, Kirpichova, 2, Kharkiv, 61002, Ukraine


                Abstract
                This study examines the problems of public opinion analysis. It presents the description,
                use cases and efficiency estimation of software for sentiment analysis of public opinion.
                The relevance of sentiment analysis as one of the important tasks of computational
                linguistics is substantiated. An overview of the existing classical methods of sentiment
                analysis and of some software applications that solve this problem is given. The business
                process model of public opinion analysis is presented as a BPMN diagram. The principles of
                operation of the developed classifier, which uses the lexicon-based method, are described.
                The model for determining the tonality of the news is presented as an activity diagram.
                The efficiency of the developed lexicon-based classifier has been evaluated with standard
                metrics (Recall, Precision). The results have been compared with the values of the same
                metrics obtained with the Naïve Bayesian Classifier and the Recurrent Neural Network Cmeans
                Classifier. Recall and Precision have been calculated for two cases: the sentiment analyzer
                used a dictionary of affective words without slang words and with slang words. The numerical
                studies show that the efficiency of the sentiment analyzer increases by 5-6% when a
                dictionary with slang words is used.

                Keywords
                Sentiment analysis, sentiment analysis methods, lexicon-based sentiment analysis, sentiment
                analysis software, automated analysis of public opinion, classifier efficiency estimation

1. Introduction
    The problem of public opinion analysis today falls within the interests of many professionals,
including marketers, sociologists, political scientists and many others. Public opinion is a form of mass
consciousness that reflects the attitude (hidden or overt) of different groups of people to the events
and processes of society that affect their interests and needs. Public opinion is expressed publicly and
affects the functioning of society and the political system. At the same time, public opinion is a set of
many individual opinions on a specific issue that concerns a group of people. The structure of public
opinion includes mass moods, emotions and feelings, as well as evaluations and judgments. In addition,
public opinion gives a government an idea of the interests of the population and of attitudes to
innovations, events and statements of officials, politicians and public figures, and it serves as a
mechanism for presenting the most acute and significant problems for citizens. People now express their
opinions on the Internet, and the number of statements grows every day. Manual analysis of these
opinions is not feasible, because public opinion can change quickly. So there is an urgent need to
automate the process of public opinion analysis. Opinion mining is a research domain dealing with
automatic methods of detection and extraction of opinions and sentiments presented in text [1]. This
study focuses on sentiment analysis, which determines the emotional attitude of the author of a statement
to an entity (a product, a service, a person, an organization, an event) and/or its properties, features,
parts, etc.

COLINS-2021: 5th International Conference on Computational Linguistics and Intelligent Systems, April 22–23, 2021, Kharkiv, Ukraine
EMAIL: aliseiko@gmail.com (Z. Kochuieva); borysova.n.v@gmail.com (N. Borysova); karina.v.melnyk@gmail.com (K. Melnyk);
dgulieva@ukr.net (D. Huliieva)
ORCID: 0000-0002-4300-3370 (Z. Kochuieva); 0000-0002-8834-2536 (N. Borysova); 0000-0001-9642-5414 (K. Melnyk);
0000-0001-8310-745X (D. Huliieva)
             ©️ 2021 Copyright for this paper by its authors.
             Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
             CEUR Workshop Proceedings (CEUR-WS.org)
2. An overview of existing methods and tools of sentiment analysis
   This section considers the classic methods and some of the software applications for sentiment
analysis that currently exist.

2.1.    A synopsis of methods of sentiment analysis
    All methods of automated sentiment analysis can be divided into the following groups:
    1. Rule-based methods.
    2. Lexicon-based methods.
    3. Supervised machine learning methods.
    4. Unsupervised machine learning methods.
    5. Hybrid methods.
    Rule-based methods use sets of rules that experts identify by analyzing texts of the subject
area. The information system (IS) determines the tone of the texts based on these rules. To obtain high
classifier accuracy, it is necessary to write a large number of rules, which is a long and
labor-intensive process. In addition, the rules describe only a specific domain, so changing the
domain requires rewriting the rules. However, with a good rule base this approach is the most accurate,
because rule-based algorithms are closely related to word semantics. These methods also give
good results in the classification of structured or weakly structured texts, such as texts of scientific
articles or other grammatically correct texts without spelling errors. However, rule-based methods
depend heavily on the language of the texts, i.e. they are not universal [2].
    Lexicon-based methods use affective lexicons to analyze texts. A tonal dictionary is a list of words,
each with a tonality (positive, negative, neutral) and a weight coefficient (for example, from -5 to
5, or from -10 to 10). The IS analyzes a text, finds the words that are present in the dictionary, and
calculates the overall tone of the whole text from the weights of these words. There are many ways to
calculate the tone of the text, for instance, the arithmetic mean of the weights. However, these methods
are not universal, because they depend on the language of the texts, as well as on the domain area (each
domain area needs its own dictionary) [3].
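
    The arithmetic-mean variant mentioned above can be sketched as follows. This is a minimal illustration, not the implementation used in this study: the toy lexicon, its weights in the range [-5, 5], the whitespace tokenization and the function names `mean_tone` and `label` are all assumptions made for the example.

```python
# Hypothetical toy lexicon (word -> weight in [-5, 5]); not taken from the paper.
TOY_LEXICON = {"good": 3, "great": 5, "bad": -3, "terrible": -5}


def mean_tone(text, lexicon=TOY_LEXICON):
    """Arithmetic mean of the weights of the lexicon words found in the text."""
    tokens = text.lower().split()  # naive whitespace tokenization (assumption)
    weights = [lexicon[t] for t in tokens if t in lexicon]
    if not weights:  # no affective words found: treat the text as neutral
        return 0.0
    return sum(weights) / len(weights)


def label(score):
    """Map a numeric tone to one of the three tonality classes."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(label(mean_tone("great service but bad food")))
```

    In this sketch, words that are absent from the lexicon are simply ignored, which is why a domain-specific dictionary matters so much for this group of methods.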
    Supervised machine learning methods train the classifier on a training sample (a text corpus).
This set consists of labeled texts divided into classes. Based on this sample, the classifier or IS can
determine the tonality of new, previously unseen texts. The most widely used methods of this group in
sentiment analysis are the naive Bayesian classifier and the support vector machine algorithm. Supervised
machine learning methods give good results; the accuracy of the algorithms can exceed 90%. The main
difficulty of using these methods is creating the training sample for the classifier, because the quality
of the text corpus influences the effectiveness of the classifier [4].
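
    To illustrate how a labeled training sample is used, the following minimal sketch trains a multinomial naive Bayesian classifier with add-one (Laplace) smoothing on a tiny hand-made sample. The sample, the labels and the function names `train_nb` and `predict_nb` are assumptions for illustration only; a real corpus would be far larger.

```python
import math
from collections import Counter, defaultdict


def train_nb(samples):
    """samples: list of (tokens, label) pairs. Returns the counts the model needs."""
    class_counts = Counter()            # number of training texts per class
    word_counts = defaultdict(Counter)  # per-class word frequencies
    vocab = set()
    for tokens, label in samples:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(model, tokens):
    """Return the class with the highest log-posterior for the given tokens."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for t in tokens:
            if t in vocab:  # words never seen in training are ignored
                # add-one smoothing avoids zero probabilities
                score += math.log((word_counts[label][t] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train_nb([
    (["good", "great"], "pos"),
    (["great", "fun"], "pos"),
    (["bad", "boring"], "neg"),
    (["bad", "awful"], "neg"),
])
print(predict_nb(model, ["great", "fun"]))
```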
    Unsupervised machine learning methods train the algorithm on a training sample (a corpus) of
unlabeled texts that are not divided into classes. The largest weights are assigned to the words that
occur most frequently in a given text but are present in only a limited number of texts of the whole set.
One of the methods most used in practice is the K-means algorithm. However, this group of methods is not
frequently used for determining the tonality of texts because of its lower accuracy in comparison
with supervised machine learning methods [4].
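
    The weighting described above (high weight for words that are frequent in one text but rare across the set) resembles the well-known TF-IDF scheme. The sketch below computes such weights under that interpretation; the source does not name TF-IDF explicitly, so both the choice of formula and the function name `tf_idf` are assumptions.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # document frequency: in how many documents each term occurs
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # term frequency in this document times inverse document frequency
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = [["good", "good", "film"], ["bad", "film"], ["film", "plot"]]
for w in tf_idf(docs):
    print(w)
```

    A word such as "film", which occurs in every text of this toy set, receives zero weight, while "good" is weighted highly in the first text. Vectors of such weights could then be passed to a clustering algorithm such as K-means.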
    Hybrid methods combine methods of different groups. They make it possible to use the advantages of
the selected methods and to compensate for their disadvantages. An example of such hybridization is a
method of sentiment analysis that takes into account the syntactic structure of the text and the
relationships between words in a sentence. The classifier relies on the text structures that are used to
express a person's emotional attitude toward an object. For this purpose, a decision tree and the
lexicon-based method are used together. The dictionaries can contain positive words, negative words and
inverter words. Inverter words are words that can change the polarity of the whole sentence. The nodes of
the tree are the words of the sentence. The value of a higher node is calculated from the values of the
lower nodes, the ability of the word to invert the tonality, and the tonality of the word from the
dictionary. If the IS ignores the sentence structure, it can produce a wrong classification result. For
example, the attitude to a news item can be classified as negative because the message contains two
negative words and one positive word, while the content of the message is actually neutral [5].
2.2.    Existing software for sentiment analysis
    In addition to the existing methods of sentiment analysis, some sentiment analysis software has
been analyzed in this work. This software is based on different approaches to the problem and is
designed for use in different conditions. Each product has a number of advantages and
disadvantages. To identify the best software solutions for sentiment analysis, independent
organizations and experts publish reviews with TOP-10 or TOP-12 lists of sentiment analysis tools based
on surveys of a large number of users. Such lists sometimes differ, but some tools appear in
all reviews. The analysis of existing software below therefore covers these tools.
    According to its developers, Rosette Sentiment Analyzer has a machine learning model that was trained
on tweets and reviews to detect strong positive and negative sentiment in documents. It also uses
entity extraction to identify the set of products in a customer review in which the customer mentions two
or more products. Rosette has sentiment analysis and entity extraction models for six languages; however,
users can add new languages by training Rosette. Rosette Text Analytics is the company that owns
Rosette Sentiment Analyzer. It offers several price plans: Analytics, Full Stack and Enterprise.
All of these plans include the sentiment analysis feature. The Analytics Plan has three subplans:
Starter for $100 per month, Medium for $400 per month and Large for $1,000 per month. The Full
Stack Plan has two subplans: Small for $500 per month and Medium for $1,350 per month. The pricing
for the Enterprise Plan is available upon request [6].
    Social Searcher is a free social media search engine. It can be used in two ways: first, for
real-time searching in social networks (such as Twitter, Facebook, YouTube, Instagram, Flickr, Vimeo,
etc.), and second, for monitoring social media. Social Searcher reports the following information about
posts: sentiment, type of content and language. Its query syntax supports phrase search and operators.
Social media monitoring can be done through the Social Searcher API, which collects information about
brand mentions and provides access to it. This information can be sorted by date or popularity, filtered
by social network, sentiment or content type, searched within chosen posts, exported to CSV format, etc.
Users' data is stored as long as their subscription is valid. There are two types of Social Searcher
users: Free and Premium [7, 8]. Social Searcher can be used for free with 100 searches per day and
2 email alerts. It has three price plans: Basic for 3,49 € per month with 200 searches per day, 3 email
alerts, 3 monitorings, 3000 posts per month and all mentions on the web; Standard for 8,49 € per month
with 400 searches per day, 5 email alerts, 5 monitorings, 20000 posts per month and all mentions on the
web; Professional for 19,49 € per month with 800 searches per day, 10 email alerts, 10 monitorings,
100000 posts per month and all mentions on the web. At the time of writing, the site also carries a
special offer, “Start Standard plan 14-day free trial” [9].
    Repustate’s multilingual sentiment analysis API uses a combination of machine learning methods
to identify sentiment insights in messages from all possible communication channels and in users’ data.
Repustate performs natural language processing for sentiment analysis in five steps:
    Step 1: POS-tagging.
    Step 2: Lemmatization.
    Step 3: Determining the prior polarity and calculating its intensity.
    Step 4: Detecting negations, amplifiers and other grammatical constructs.
    Step 5: Applying machine learning.
    Repustate offers two price plans. The Standard plan costs $299 per month and provides English
language processing only, document sentiment only, standard document volume, basic support by email and
a cloud API. The Custom plan is available upon request and provides processing of all 23 supported
languages, document, topic and aspect sentiment analysis, expanded document volume, premium support by
phone and email, cloud API or on-premise deployment, customized machine-learned models, named entity
recognition, data retrieval (news, social, blogs), a sentiment analysis dashboard, video/audio/image
content retrieval and enterprise semantic search [10].
    According to its website, Social Mention is a social media platform for searching and collecting
user-generated content from the web. Social Mention monitors more than 100 social media properties.
It provides search, analysis and daily alerts for social media, as well as third-party APIs
and applications. Developers can interact with the Social Mention website through a dedicated API
[11, 12]. Social Mention reports results by four characteristics: strength, sentiment, passion and
reach. Strength is the likelihood that a certain brand is mentioned in social networks during the last
24 hours. Sentiment is the ratio of all positive mentions to all negative mentions. Passion is the
likelihood that the same people mention the brand repeatedly. Reach is a measure of the range of
influence: the ratio of the number of brand mentions by unique authors to the total number of
mentions [13]. The API is free for users who make fewer than 100 requests per day. Commercial use of
Social Mention requires contacting the developers [12].
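
    As a small worked example of two of these characteristics, the sketch below computes the sentiment and reach ratios exactly as they are worded above. It does not call the Social Mention API; the function names and the input data are hypothetical, and reach is interpreted here as the number of unique authors divided by the total number of mentions, which is an assumption.

```python
def sentiment_ratio(positive_mentions, negative_mentions):
    """Ratio of positive to negative mentions, as described in the text."""
    if negative_mentions == 0:
        # assumption: with no negative mentions the ratio is unbounded (or 0 if there are no mentions)
        return float("inf") if positive_mentions else 0.0
    return positive_mentions / negative_mentions


def reach(mention_authors):
    """Share of unique authors among all mentions (one list entry per mention)."""
    if not mention_authors:
        return 0.0
    return len(set(mention_authors)) / len(mention_authors)


print(sentiment_ratio(30, 10))              # 30 positive vs 10 negative mentions
print(reach(["ann", "bob", "ann", "eve"]))  # 3 unique authors in 4 mentions
```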
    MeaningCloud’s Sentiment Analysis API is a tool for detailed, attribute-level, aspect-based
multilingual sentiment analysis of different texts. It separates texts into three classes: positive,
negative and neutral. Aspect-based analysis means that the polarity value of the whole text is calculated
from the polarity values of all sentences of the text and the relationships between them. The API can be
useful for extracting facts and opinions, identifying irony, finding polarity disagreement, etc. The API
can work with users’ own sentiment dictionaries and sentiment models [14].
Customers can use MeaningCloud’s Sentiment Analysis API for free and make 20000 requests per
month with free support and SaaS deployment. There are also four paid plans: the Start-Up Plan for
$99 per month with 120000 requests per month, standard support and SaaS deployment; the Professional
Plan for $399 per month with 700000 requests per month; the Business Plan for $999 per month with
4200000 requests per month; and the Enterprise Plan with a custom monthly price, a custom number of
requests per month, premium support, and SaaS and on-premises deployment [15].
    The sentiment analysis software of such global IT giants as IBM, Microsoft and Google is also
considered.
    IBM Watson Natural Language Understanding (NLU) detects insights in structured
and unstructured data. The NLU simplifies text analysis by extracting metadata from content, including
concepts, keywords, categories, entities, semantic roles and relations. The NLU is well suited to
recognizing emotions and sentiments, because it returns emotion and sentiment both for the
whole text and for keywords in the text, which allows deeper analysis. The IBM Watson NLU uses Watson
Knowledge Studio to understand texts in nine languages. The NLU also has a conversation feature
that makes it possible to build and deploy chatbots and virtual agents across different communication
channels. It provides the infrastructure for adapting to individual use cases and thus gives users the
support they need [16]. The page [17] provides the pricing information and a link to a
pricing calculator. It is worth noting that IBM Watson can also be used for free.
    Microsoft Azure Cognitive Service Text Analytics API provides advanced processing of
unstructured natural language texts. The API has four main features: Sentiment Analysis (and Opinion
Mining), Key Phrase Extraction, Language Detection and Named Entity Recognition. The API uses
classification methods for Sentiment Analysis. The sentiment score is a number between 0 and 1. If
the score is close to 1, the text is positive. If the score is close to 0, the text is negative. English,
French, Spanish and Portuguese are supported, and 11 additional languages are in preview. For Key Phrase
Extraction, the API uses techniques from Microsoft Office’s sophisticated Natural Language Processing
toolkit. English, German, Spanish and Japanese are supported. Key phrases are used for
topic detection. The API can detect the language of a text for 120 languages. The language detection
score is a number between 0 and 1. If the score is close to 1, the language is detected with close to
100% certainty [18]. Text Analytics can be purchased in tiers [19]. The Free Plan allows 5000 free
transactions per month with three of the four main features (all except Named Entity Recognition). The
Standard Plan has the same features as the Free Plan, but the number of analyzed text records is larger
and the price for processing them depends on the quantity. The S0-S4 plans have all four main features,
including Named Entity Recognition, and cost from $ 74,71 per month to $ 4999,99 per month.
    Google Cloud Natural Language API uncovers the structure and meaning of text by using machine
learning models behind a REST API. It can be used to find mentions of people, places, events, etc.
in texts and documents. It makes it possible to understand the sentiment about a brand and/or product on
social media and to analyze customer conversations held in a call center or a messenger. It extracts
useful insights on product reception or user experience from customer conversations in email, chat or
social media. It filters inappropriate content and classifies documents by topic; builds relationship
graphs of entities extracted from news or Wikipedia articles; and extracts tokens and sentences and then
identifies parts of speech to create dependency parse trees for each sentence. The Google Cloud Natural
Language API supports 11 languages [20]. API usage is billed on the following principle: pay only for the
features you use [21]. The Free Plan allows free use of all features for 5000 units. A text that contains
fewer than 1,000 Unicode characters is counted as one “unit”. Prices in the other plans depend on the
number of units and on the features used: features differ in price, and the more units are processed, the
cheaper each unit is.
   The analysis has shown that all of the considered software products are multifunctional, but only two
of them support the Ukrainian language. Other products allow loading a custom model for sentiment
analysis and/or a custom dictionary of sentiment words, but this service is paid, or the developers have
set restrictions on the use of custom models and dictionaries. Thus, the development of a dedicated
sentiment analyzer of public opinion for Ukrainian-language texts is an urgent task.

3. The model of the sentiment-analyzer of public opinion
    In this study, it is proposed to determine the tonality of the news using the
lexicon-based method. The main idea of this method is to use tonal dictionaries in which each word
has a certain weight coefficient or several weight coefficients. The overall tonality of
the whole text is calculated from the weight coefficients of the words from the dictionary. A dictionary
of word tonality has been compiled for the developed sentiment analyzer. The tonality of the text is
calculated according to the methodology proposed in [22]. The research [22] demonstrates
how to determine the tonality of the news for English-language texts. However, its authors pointed out
that their methodology can be used for other languages given an appropriate tonality
dictionary. Thus, this study applies the proposed methodology to Ukrainian-language texts.
    Let $N$ be the set of news items whose tone is to be determined from the
comments on them. Denote by $s_i^N$ the tonality of the $i$-th news item, $i \in N$.
    Let $W$ be the set of words and collocations of the tonality dictionary, so that $w_j$ ($w_j \in W$)
is the $j$-th word of this dictionary. Each word has its own tonality; denote by $s_j^W$ the tonality of
the word $w_j$ from the dictionary $W$. Tonality takes values in the range $[-100; 100]$, where
negative values correspond to negative tonality and positive values to positive tonality.
If the word $w_j$ occurs in the text of a comment with a negation, its tonality is recalculated by
formula (1). The efficiency of formula (1) is shown in [22]:
$$
s_j^{W'} =
\begin{cases}
\max\left(\dfrac{s_j^W + 100}{2},\ 10\right), & s_j^W < 0,\\[6pt]
\min\left(\dfrac{s_j^W - 100}{2},\ -10\right), & s_j^W \ge 0.
\end{cases}
\tag{1}
$$
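
    Formula (1) can be checked on a few values with the following minimal sketch. The function name `negate_tonality` is an illustrative assumption; the arithmetic follows formula (1) directly.

```python
def negate_tonality(s):
    """Tonality of a negated dictionary word by formula (1); s lies in [-100, 100]."""
    if s < 0:
        # a negated negative word becomes positive, but never weaker than +10
        return max((s + 100) / 2, 10)
    # a negated non-negative word becomes negative, but never weaker than -10
    return min((s - 100) / 2, -10)


for s in (-80, -20, 0, 60, 90):
    print(s, "->", negate_tonality(s))
```

    For example, a mildly negative word with tonality -20 becomes clearly positive (40) under negation, while a strongly positive word with tonality 90 becomes only mildly negative (-10).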
    Denote by $I$ the set of intensifier words, for example: дуже ("very"), трохи ("a little"),
доволі ("quite"), etc. Words with positive intensification belong to the subset $I_P \subset I$, and
words with negative intensification belong to the subset $I_N \subset I$.
    Let $K$ be the set of comments on all news items from the set $N$; then $K_i$ is the subset of
comments on the $i$-th news item, $i \in N$, $K_i \subset K$. Denote by $s_{ki}^C$ ($i \in N$,
$K_i \subset K$) the tonality of the $k$-th comment on the $i$-th news item. Comments can be written by
different groups of people who speak publicly. There are three
categories of comments: the opinion of the media (the opinions of authors of articles in various online
publications about particular news); the opinion of the people (the opinions of ordinary citizens about
the news); the opinion of experts (the opinions of people who are experts in the domain related to the
given news). Let $K_i^c \subset K_i$ be the subset of comments of the $c$-th category:
$$
\bigcup_{c=\overline{1,3}} K_i^c = K_i .
$$
   Then $s_i^{N_c}$ is the tonality of the $i$-th news item in the $c$-th category. In this paper, it is
proposed to determine $s_i^{N_c}$ and $s_i^N$ by formulas (2) and (3), respectively:
$$
s_i^{N_c} = \frac{1}{r_c} \sum_{k,i} s_{ki}^C, \quad k \in K_i^c,\ i \in N, \tag{2}
$$
$$
s_i^N = \frac{1}{3} \sum_{c=\overline{1,3}} s_i^{N_c}, \tag{3}
$$
where $r_c$ is the cardinality of the set $K_i^c$.
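
    A minimal sketch of the aggregation in formulas (2) and (3) is given below. The category keys ("media", "people", "experts"), the function names and the treatment of an empty category as neutral are assumptions for illustration; the formulas themselves do not define the case $r_c = 0$.

```python
CATEGORIES = ("media", "people", "experts")  # hypothetical keys for the three categories


def category_tone(comment_tones):
    """Formula (2): mean tonality of the comments of one category."""
    if not comment_tones:
        return 0.0  # assumption: a category without comments counts as neutral
    return sum(comment_tones) / len(comment_tones)


def news_tone(tones_by_category):
    """Formula (3): mean of the three per-category tonalities."""
    return sum(category_tone(tones_by_category.get(c, [])) for c in CATEGORIES) / 3


tones = {"media": [20, 40], "people": [-10, -30, -50], "experts": [60]}
print(news_tone(tones))
```

    Averaging per category first means that each of the three groups contributes equally to the tonality of the news item, regardless of how many comments it wrote.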
    To determine the tone of the comment $s_{ki}^C$, it is necessary to find $W_k$ ($W_k \subset W$),
the set of words of the $k$-th comment that are elements of the set $W$, and, if they exist for the
$k$-th comment, the sets of its intensifier words $I_{Pk}$ ($I_{Pk} \subset I_P$) and $I_{Nk}$
($I_{Nk} \subset I_N$). The cardinalities of these sets are $|I_{Pk}| = q_{Pk}$ and $|I_{Nk}| = q_{Nk}$,
respectively. If all selected words have the same tonality, for example positive, then the whole comment
is considered positive. In this case, some methods for determining the tonality propose to find $A_P$,
the arithmetic mean of all positive words of the $k$-th comment, by formula (4), or $A_N$, the arithmetic
mean of the negative words, by formula (5):
$$
A_P = \frac{1}{p} \sum_{j} s_j^W, \quad j \in W_k, \tag{4}
$$
$$
A_N = \frac{1}{n} \sum_{j} s_j^W, \quad j \in W_k, \tag{5}
$$
where $p$ and $n$ are the numbers of positive and negative words in the $k$-th comment, respectively.
Thus, the tonality of the comment $s_{ki}^C$ is defined as follows:
$$
s_{ki}^C =
\begin{cases}
A_P, & \forall\, s_j^W \ge 0,\ j \in W_k,\\
A_N, & \forall\, s_j^W < 0,\ j \in W_k.
\end{cases}
\tag{6}
$$
    The paper [22] shows empirically that estimating the tonality of a sentence or text by the
arithmetic mean is inaccurate. The authors of that methodology propose their own way of determining the
tonality of a comment sentence.
    Consider a model for determining the tonality of news based on the calculation of the tonality of the
set of comments on this news, using the model from [22].
    Let us introduce additional variables: $X_P$ and $X_N$ are the overall positive and negative
sentiment in the $k$-th comment, respectively; $E_P$ and $E_N$ are the overall positive and negative
evidence in the $k$-th comment, respectively.
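$$
X_P = \min\left\{ \frac{A_P}{2 - \lg(3.5p + q_{Pk})},\ 100 \right\}, \tag{7}
$$
$$
X_N = \max\left\{ \frac{A_N}{2 - \lg(3.5n + q_{Nk})},\ -100 \right\}, \tag{8}
$$
$$
E_P = \min\left\{ \frac{A_P}{2 - \lg(3.5p)},\ 1 \right\}, \tag{9}
$$
$$
E_N = \max\left\{ \frac{A_N}{2 - \lg(3.5n)},\ -1 \right\}. \tag{10}
$$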
   These variables are needed to determine the tonality of a particular comment (Fig. 1). The process
of estimating the tonality of news is shown as an activity diagram that uses the activity element
“Defining the tone of the k-th comment” with the parameters $X_P$, $X_N$, $E_P$ and $E_N$. Thus, the
model for determining the tonality of the news has been presented in the form of a diagram.
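
   The four variables from formulas (7)-(10) can be computed as in the sketch below, which also evaluates $A_P$ and $A_N$ by formulas (4) and (5). The function names and the guard cases are assumptions that the formulas themselves do not cover: a polarity with no words yields 0, and a non-positive denominator is treated as saturation at the cap. Following formulas (1) and (6), words with tonality 0 are counted on the positive side. The decision logic that combines these variables into the final comment tone is defined by the activity diagram and is not reproduced here.

```python
import math


def _ratio(mean_value, log_argument):
    """mean_value / (2 - lg(log_argument)); the guards below are assumptions."""
    if mean_value == 0 or log_argument <= 0:
        return 0.0  # assumption: nothing of this polarity to score
    denominator = 2 - math.log10(log_argument)
    if denominator <= 0:
        # assumption: saturate; the min/max in comment_features clips to the cap
        return math.copysign(math.inf, mean_value)
    return mean_value / denominator


def comment_features(word_tones, q_pos=0, q_neg=0):
    """Return X_P, X_N, E_P, E_N for one comment.

    word_tones: tonalities of the dictionary words found in the comment;
    q_pos, q_neg: numbers of positive and negative intensifiers in it.
    """
    positive = [s for s in word_tones if s >= 0]
    negative = [s for s in word_tones if s < 0]
    p, n = len(positive), len(negative)
    a_p = sum(positive) / p if p else 0.0  # formula (4)
    a_n = sum(negative) / n if n else 0.0  # formula (5)
    return {
        "X_P": min(_ratio(a_p, 3.5 * p + q_pos), 100),   # formula (7)
        "X_N": max(_ratio(a_n, 3.5 * n + q_neg), -100),  # formula (8)
        "E_P": min(_ratio(a_p, 3.5 * p), 1),             # formula (9)
        "E_N": max(_ratio(a_n, 3.5 * n), -1),            # formula (10)
    }


print(comment_features([40, 60, -20], q_pos=1))
```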
   Let us consider the process of tracking public opinion with sentiment analysis in more
detail. To develop an effective classifier for a specific domain, it is necessary to create a model of
this process. There are many techniques and CASE tools for modelling business processes. This research
proposes to use Business Process Modeling Notation (BPMN) to formalize the process of tracking
public opinion. Fig. 2 presents the business process model of this process in the form of a
BPMN diagram. To start working with the sentiment analyzer, the user has to add a news item. Then
the administrator or another user has to add comments on this news item, each with a defined category.
[Figure 1: activity diagram. Steps: “Identification of the input data of i-th news”; “Define a set
K_i^c for appropriate category of comment”; activity “Defining the tone of the k-th comment” (parameters:
the positive and negative sentiment, the positive and negative evidence), which starts with
“Identification of the input data of k-th comment” and sets s_ki^C to X or to 0; “Have all comments of
appropriate category of i-th news been considered?”; “Define s_i^{N_c} by (2)”; “Have all categories of
news been considered?”; “Define s_i^N by (3)”.]

Figure 1: The model for determining the tone of the news




Figure 2: The BPMN-diagram of the given business process

   The next step is the tokenization and lemmatization of each comment. This stage of the analysis makes
it possible to compare the words found with the words available in the sentiment dictionary. If a word is
in the dictionary, its weight coefficient is used in the calculations. The tonality of each comment is
then calculated according to the proposed model for determining the tonality of the news. The next step
is the calculation of the tonality of the comments of each category. The purpose of this step is to
determine the attitude of different public opinion leaders to each news item. Finally, the classifier
estimates the overall tonality of each news item, which represents the general tonality of public opinion
about the news. To assess the efficiency of the sentiment analyzer with the standard
metrics Recall and Precision, experts of the considered domain area have to assess the tonality of all
comments. Detailed information about this process is presented in Section 4 of this article.
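
   For reference, Precision and Recall for one tonality class can be computed from expert labels and classifier output as in the generic sketch below. The label strings, the sample data and the function name `precision_recall` are assumptions; the exact evaluation protocol of this study is the one described in Section 4.

```python
def precision_recall(expert_labels, predicted_labels, target):
    """Precision and Recall of the classifier for the class `target`."""
    pairs = list(zip(expert_labels, predicted_labels))
    tp = sum(1 for e, p in pairs if e == target and p == target)  # true positives
    fp = sum(1 for e, p in pairs if e != target and p == target)  # false positives
    fn = sum(1 for e, p in pairs if e == target and p != target)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


expert = ["pos", "pos", "neg", "neu", "pos"]     # tonalities set by experts
predicted = ["pos", "neg", "neg", "pos", "pos"]  # tonalities set by the analyzer
print(precision_recall(expert, predicted, "pos"))
```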
    Let us consider the functional and non-functional requirements for the sentiment analyzer. There are
three user roles: administrator, user and expert. The administrator can add and delete news,
comments, comment categories and user accounts, as well as view the tonality of comments and news
and the results of evaluating the efficiency of the sentiment analyzer. The user can add news and
comments on them, and view all the sentiment assessment results and the efficiency assessment
results. The expert can manually set their own assessment of the sentiment of comments and
view the results of assessing the efficiency of the program.
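The role model described above could be encoded, for instance, as a simple permission table. The sketch below is only an illustration: the action names and the `is_allowed` helper are hypothetical, and the paper does not state how access control is implemented.

```python
from enum import Enum, auto


class Role(Enum):
    ADMINISTRATOR = auto()
    USER = auto()
    EXPERT = auto()


# Hypothetical action names derived from the textual description of the roles.
PERMISSIONS = {
    Role.ADMINISTRATOR: frozenset({
        "add_news", "delete_news",
        "add_comment", "delete_comment",
        "add_category", "delete_category",
        "add_user", "delete_user",
        "view_tonality", "view_efficiency",
    }),
    Role.USER: frozenset({
        "add_news", "add_comment",
        "view_tonality", "view_efficiency",
    }),
    Role.EXPERT: frozenset({
        "mark_tonality", "view_efficiency",
    }),
}


def is_allowed(role, action):
    """Return True if the given role may perform the given action."""
    return action in PERMISSIONS[role]
```

For example, `is_allowed(Role.EXPERT, "mark_tonality")` is true, while `is_allowed(Role.USER, "delete_news")` is false.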
    Non-functional requirements include the following: an intuitive and user-friendly interface,
reliability of data transfer and storage, usability, and high performance. The full functionality of the
developed sentiment analyzer for different categories of users is presented in the form of a use-case
diagram in Fig. 3.


[Use-case diagram. Actors: Administrator, User, Expert. Use cases: management of news (adding,
deleting), of news comments (adding, deleting), of comment categories (adding, deleting) and of users
(adding, deleting); viewing tonality evaluation results (by comment, by comment category related to
a news item, by news item); marking the tonality of comments manually; viewing the results of
classification efficiency estimation.]

Figure 3: The functional possibilities of the sentiment analyzer
    All data and results produced by the sentiment analyzer are stored in a database. The logical
structure of the database shows the relationships between the objects, or entities, of the domain. Every
business rule governing the work of the sentiment analyzer should serve as a basis for the database model.
There are many different database models, but the Entity-Relationship Model is
the most widely used one. Let us consider the database model for the sentiment analyzer (Fig. 4).




Figure 4: The model of the database

   The database model consists of the following entities:
   – the entity «Dictionary» represents the list of sentiment words with their weight coefficients;
   – the entity «News» describes all news added by all users, with their tonality expressed both in
words and as a numerical value;
   – the entity «Comment» represents all comments on all news, with their categories and tonality
expressed both in words and as a numerical value;
   – the entity «Category» describes all comment categories;
   – the entity «Evaluation_Range» represents all tonality values expressed in words;
   – the associative entity «Comment_Dictionary» describes the set of words from a comment that
are present in the dictionary;
   – the associative entity «News_Category» describes the tonality evaluation, expressed as a
numerical value, for each comment category of each news item.
   A description of these entities and their attributes is presented in Table 1.

Table 1
Description of database model
 Entity               Attribute                  Attribute description
 Dictionary           id                         Word’s id
                      word                       Word
                      word_value                 Word’s weight coefficient
 News                 id_news                    News’ id
                      title                      News’ title
                      content                    News’ content
                      Evaluation_Rangeid_mark    News’ general tonality expressed by words
                      news_value                 News’ general tonality expressed by numerical value
 Comment              id_comment                 Comment’s id
                      name                       Comment’s content
                      Categoryid                 Comment’s category
                      Newsid_news                Comment related news id
                      Evaluation_Rangeid_mark    Comment’s tonality expressed by words
                      comment_value              Comment’s tonality expressed by numerical value
 Category             id                         Category id
                      name                       Category name
 Evaluation_Range     id_mark                    Tonality evaluation id
                      name_mark                  Tonality evaluation expressed by words
 Comment_Dictionary   Dictionaryid               Word’s id
                      Commentid_comment          Comment’s id
 News_Category        Newsid_News                News’ id
                      Categoryid                 Category id
                      news_value_category        The numerical value of tonality evaluation
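
To make the structure in Table 1 concrete, the entities can be written as a relational schema. The following is a sketch using Python's standard `sqlite3` module; the paper does not name the database engine, so the choice of SQLite, the column types, the key constraints and the `create_database` helper are all assumptions, while the table and attribute names follow Table 1.

```python
import sqlite3

# Table and attribute names follow Table 1; types and constraints are assumed.
SCHEMA = """
CREATE TABLE Dictionary (
    id INTEGER PRIMARY KEY,
    word TEXT NOT NULL UNIQUE,
    word_value REAL NOT NULL
);
CREATE TABLE Evaluation_Range (
    id_mark INTEGER PRIMARY KEY,
    name_mark TEXT NOT NULL
);
CREATE TABLE Category (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE News (
    id_news INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    content TEXT NOT NULL,
    Evaluation_Rangeid_mark INTEGER REFERENCES Evaluation_Range(id_mark),
    news_value REAL
);
CREATE TABLE "Comment" (
    id_comment INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    Categoryid INTEGER NOT NULL REFERENCES Category(id),
    Newsid_news INTEGER NOT NULL REFERENCES News(id_news),
    Evaluation_Rangeid_mark INTEGER REFERENCES Evaluation_Range(id_mark),
    comment_value REAL
);
CREATE TABLE Comment_Dictionary (
    Dictionaryid INTEGER NOT NULL REFERENCES Dictionary(id),
    Commentid_comment INTEGER NOT NULL REFERENCES "Comment"(id_comment),
    PRIMARY KEY (Dictionaryid, Commentid_comment)
);
CREATE TABLE News_Category (
    Newsid_News INTEGER NOT NULL REFERENCES News(id_news),
    Categoryid INTEGER NOT NULL REFERENCES Category(id),
    news_value_category REAL,
    PRIMARY KEY (Newsid_News, Categoryid)
);
"""


def create_database(path=":memory:"):
    """Create the schema in a new SQLite database and return the connection."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # enforce the references
    conn.executescript(SCHEMA)
    return conn
```

The tonality columns are left nullable in this sketch because a news item or comment exists before the analyzer has evaluated it.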

4. The efficiency estimation of the developed analyzer
    As mentioned above, the standard metrics Precision and Recall have been used to evaluate the
efficiency of the sentiment analyzer. To calculate these metrics, it is necessary to find the following
indicators for each class:
   – true positive (TP) – the number of answers that were expected and were returned by the analyzer;
   – false positive (FP) – the number of answers that were not expected, but were mistakenly returned
by the analyzer;
   – false negative (FN) – the number of answers that were expected, but were not returned by the
analyzer;
   – true negative (TN) – the number of answers that were not expected and were not returned by the
analyzer.
    Table 2 presents examples of the assessments made by the sentiment analyzer and by the experts for
several comments of different categories related to one news item, and whether the two results match.

Table 2
The results of assessments of tonality of comments made by expert and sentiment-analyzer
  №               Evaluation                      Expert                      Match
               of the program                  assessment                     status
   1               negative                     negative                         +
   2               positive                      positive                        +
   3               positive                     negative                         -
   4               negative                     negative                         +
   5               negative                     negative                         +
   6               negative                     negative                         +
   7               positive                   very positive                      -
   8               positive                      positive                        +
   9               negative                     negative                         +
  10               positive                      positive                        +
  11               negative                      positive                        -
  12               negative                     negative                         +
  13               positive                      positive                        +
  14            very positive                    very positive                           +
  15               positive                         neutral                              -
  16              negative                         negative                              +
  17            very negative                    very negative                           +
  18            very negative                    very negative                           +
  19               positive                         positive                             +
  20               positive                         positive                             +
  21            very positive                       positive                             -
  22               positive                         positive                             +
  23               positive                         positive                             +
  24               positive                         positive                             +
  25               positive                         positive                             +
  26               positive                        negative                              -
  27            very positive                    very positive                           +
  28               positive                         positive                             +
  29               positive                         positive                             +
  30               positive                         positive                             +

    Precision is the proportion of relevant responses among all responses issued by the sentiment
analyzer, where TP and FP are the numbers of true positive and false positive answers:
$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (11)$$
    Recall is the proportion of relevant responses returned by the sentiment analyzer among all
relevant responses, where TP and FN are the numbers of true positive and false negative answers:
$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (12)$$
    Table 3 presents the evaluation by the sentiment analyzer of the same comment texts that are
presented in Table 2, but with the results distributed over the relevant classes. A value of 1 in
Table 3 indicates the class assigned to the text by the sentiment analyzer. If there is a letter in
parentheses next to the 1, the sentiment analyzer made a mistake and assigned the text to this class
incorrectly. The letter in parentheses indicates the correct class for this text as chosen by the
expert.

Table 3
The results of distribution the texts on classes by the sentiment analyzer
  №                                             Text evaluation
        Very negative (nn)      Negative (n)       Neutral (N)      Positive (p)      Very positive (pp)
  1                                   1
  2                                                                      1
  3                                                                     1(n)
  4                                   1
  5                                   1
  6                                   1
  7                                                                    1(pp)
  8                                                                      1
  9                                   1
  10                                                                     1
  11                                1(p)
  12                                  1
  13                                                                     1
  14                                                                                           1
  15                                                                    1(N)
  16                                  1
  17             1
  18             1
  19                                                                      1
  20                                                                      1
  21                                                                                        1(p)
  22                                                                     1
  23                                                                     1
  24                                                                     1
  25                                                                     1
  26                                                                    1(n)
  27                                                                                          1
  28                                                                      1
  29                                                                      1
  30                                                                      1

   According to the calculations by (11)-(12), the following metric values have been obtained:
Precision = 0.861; Recall = 0.849. Within individual classes, Precision ranges from 0.807 to 0.921
and Recall from 0.775 to 0.930. These results indicate that the sentiment analyzer works adequately
both in general and within each tonality class.
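As an illustration of how (11)-(12) are applied per class, the sketch below derives the TP, FP and FN counts from the 30 sample comments of Table 2, using the class codes of Table 3. The function name is hypothetical. The sample is only an excerpt, so its values are not expected to reproduce the overall figures reported above, and the paper does not state how the per-class values were averaged.

```python
from collections import Counter

# (analyzer class, expert class) for comments 1-30 of Table 2.
# Codes: nn = very negative, n = negative, N = neutral, p = positive, pp = very positive.
PAIRS = [
    ("n", "n"), ("p", "p"), ("p", "n"), ("n", "n"), ("n", "n"),
    ("n", "n"), ("p", "pp"), ("p", "p"), ("n", "n"), ("p", "p"),
    ("n", "p"), ("n", "n"), ("p", "p"), ("pp", "pp"), ("p", "N"),
    ("n", "n"), ("nn", "nn"), ("nn", "nn"), ("p", "p"), ("p", "p"),
    ("pp", "p"), ("p", "p"), ("p", "p"), ("p", "p"), ("p", "p"),
    ("p", "n"), ("pp", "pp"), ("p", "p"), ("p", "p"), ("p", "p"),
]


def per_class_metrics(pairs):
    """Return {class: (precision, recall)}; a metric is None when undefined."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for predicted, expected in pairs:
        if predicted == expected:
            tp[predicted] += 1
        else:
            fp[predicted] += 1  # analyzer returned a class the expert did not choose
            fn[expected] += 1   # analyzer missed the class the expert chose
    metrics = {}
    for label in sorted(set(tp) | set(fp) | set(fn)):
        returned = tp[label] + fp[label]  # denominator of (11)
        relevant = tp[label] + fn[label]  # denominator of (12)
        precision = tp[label] / returned if returned else None
        recall = tp[label] / relevant if relevant else None
        metrics[label] = (precision, recall)
    return metrics


if __name__ == "__main__":
    for label, (precision, recall) in per_class_metrics(PAIRS).items():
        print(label, precision, recall)
```

In this sample the analyzer never assigns the neutral class, so Precision for that class is undefined and is reported as `None` rather than as zero.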
   The obtained Precision and Recall values of the lexicon-based classifier have been compared with
the values of the same metrics for two other classifiers: the Naïve Bayesian Classifier and the RNN
Cmeans Classifier, which is based on a Recurrent Neural Network. The results for the Naïve Bayesian
Classifier and the RNN Cmeans Classifier are taken from [23]. All results of the classifier efficiency
evaluation are presented in Table 4.

Table 4
Efficiency estimation of classification results
      Metrics       Lexicon-based Classifier       Naïve Bayesian Classifier     RNN Cmeans Classifier
     Precision                0.861                         0.869                      0.878
      Recall                  0.849                         0.853                      0.870

    The authors of [23] describe an experiment in which additional training of the classifiers on the
Slang corpus, in which the tonality of slang words has been marked up, increased the efficiency
of the classifiers by 10-11%. Based on this information, we decided to supplement our dictionary
with slang words from the dictionary [24] and to re-examine the developed lexicon-based
sentiment analyzer. The results of the metric calculations for the same three classifiers are presented
in Table 5. The results for the Naïve Bayesian Classifier and the RNN Cmeans Classifier are again
taken from [23].

Table 5
Efficiency estimation of classification results with Slang dictionary and Slang corpus
      Metrics       Lexicon-based Classifier        Naïve Bayesian Classifier     RNN Cmeans Classifier
     Precision                0.915                           0.975                     0.982
      Recall                  0.895                           0.948                     0.965

    A comparison of the metric values in Tables 4 and 5 shows that the efficiency of the lexicon-based
classifier increased by 5-6% rather than by 10-11%. This can be explained by the fact that not all of
the analyzed comments contain slang words: slang is used only in the comments that reflect the
opinion of the people.
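
The gains can be checked directly from the values in Tables 4 and 5. The short sketch below computes both the absolute gain (in percentage points) and the relative gain (in percent), since the text does not say which of the two the quoted percentages refer to; the function and variable names are hypothetical.

```python
# (Precision, Recall) per classifier, copied from Table 4 and Table 5.
BASELINE = {
    "Lexicon-based": (0.861, 0.849),
    "Naive Bayesian": (0.869, 0.853),
    "RNN Cmeans": (0.878, 0.870),
}
WITH_SLANG = {
    "Lexicon-based": (0.915, 0.895),
    "Naive Bayesian": (0.975, 0.948),
    "RNN Cmeans": (0.982, 0.965),
}


def gains(before, after):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = (after - before) * 100
    relative = (after - before) / before * 100
    return absolute, relative


if __name__ == "__main__":
    for name, before in BASELINE.items():
        after = WITH_SLANG[name]
        for metric, b, a in zip(("Precision", "Recall"), before, after):
            absolute, relative = gains(b, a)
            print(f"{name:15s} {metric:9s} +{absolute:.1f} pp  (+{relative:.1f}%)")
```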

5. Conclusions
   The paper presents an approach to evaluating public opinion using a sentiment analyzer. Existing
methods and software for sentiment analysis are analyzed. A comparison of these methods led to the
choice of lexicon-based methods. An original algorithm is proposed for determining the attitude of
public opinion representatives based on their comments on news. To calculate the sentiment of the
comments, a technique that was applied to Ukrainian-language texts for the first time was used,
together with the authors’ own dictionary of sentiment words, subsequently supplemented with slang
words. A functional model of the business process of tonality identification by the sentiment analyzer
and a database model are presented and described. The functionality of the analyzer is shown in the
form of a use-case diagram and described. The efficiency of the developed sentiment analyzer was
assessed using the standard Precision and Recall metrics. A comparative analysis of the efficiency of
the Lexicon-based Classifier, the Naïve Bayesian Classifier and the RNN Cmeans Classifier has been
carried out. It is shown that adding slang words to the sentiment dictionary increases the efficiency
of the Lexicon-based Classifier by 5-6%, while additional training of the two other classifiers on the
Slang corpus increased their efficiency by 10-11%.

6. References
[1] S. Ion, C. Bucur, Applying Supervised Opinion Mining Techniques on Online User Reviews.
     Informatica Economica Journal. 16. URL: https://core.ac.uk/download/pdf/27056535.pdf
[2] D. Vilares, C. Gómez-Rodríguez, M. A. Alonso, Universal, unsupervised (rule-based), uncovered
     sentiment analysis, Knowledge-Based Systems, volume 118, 2017, pp. 45–55, URL:
     https://www.sciencedirect.com/science/article/pii/S0950705116304701. doi:
     10.1016/j.knosys.2016.11.014
[3] M. Taboada, J. Brooke, M. Tofiloski, K. Voll, M. Stede, Lexicon-based methods for sentiment
     analysis. Computational Linguistics. 37. (2011) 267–307. doi: 10.1162/COLI_a_00049
[4] Z. Rahimi, S. Noferesti, M. Shamsfard, Applying data mining and machine learning techniques
     for sentiment shifter identification. Language Resources and Evaluation, volume 53, issue 2, 2019,
     pp. 279–302. doi: 10.1007/s10579-018-9432-0
[5] K. Ravi, V. Ravi, A survey on opinion mining and sentiment analysis: Tasks, approaches and
     applications, Knowledge-Based Systems, volume 89, 2015, pp. 14–46, URL:
     https://www.sciencedirect.com/science/article/pii/S0950705115002336. doi:
     10.1016/j.knosys.2015.06.015
[6] Rosette Sentiment Analyzer. URL:
     https://www.rosette.com/capability/sentiment-analyzer/#overview
[7] About Social Searcher. URL: https://www.social-searcher.com/about/
[8] Social Searcher API V2.0 Released. URL:
     https://www.social-searcher.com/2015/08/04/social-searcher-api-v2-0-released/
[9] Social Searcher pricing. URL: https://www.social-searcher.com/pricing/
[10] Sentiment analysis. Unlock the meaning in your data. URL:
     https://www.repustate.com/sentiment-analysis/
[11] About Social Mention. URL: http://socialmention.com/about/
[12] Social Mention API. URL: http://socialmention.com/api/
[13] Social Mention. Frequently Asked Questions. URL: http://socialmention.com/faq
[14] MeaningCloud’s Sentiment Analysis API. URL:
     https://www.meaningcloud.com/developer/sentiment-analysis
[15] MeaningCloud pricing. URL: https://www.meaningcloud.com/products/pricing
[16] IBM Watson Natural Language Understanding. URL:
     https://www.predictiveanalyticstoday.com/ibm-watson-alchemyapi/
[17] How NLU pricing works. URL:
     https://www.ibm.com/cloud/watson-natural-language-understanding/pricing
[18] Microsoft Azure Cognitive Service Text Analytics API. URL:
     https://www.predictiveanalyticstoday.com/microsoft-azure-text-analytics-api/
[19] Cognitive Services pricing – Text Analytics API. URL:
     https://azure.microsoft.com/en-us/pricing/details/cognitive-services/text-analytics/
[20] Google Cloud Natural Language API. URL:
     https://www.predictiveanalyticstoday.com/google-cloud-natural-language-api/
[21] Google Cloud. Cloud Natural Language. URL: https://cloud.google.com/natural-language/pricing
[22] A. Jurek, M. D. Mulvenna, Y. Bi, Improved lexicon-based sentiment analysis for social media
     analytics. Security Informatics. 4, 9 (2015). URL:
     https://security-informatics.springeropen.com/articles/10.1186/s13388-015-0024-x#article-info.
     doi: 10.1186/s13388-015-0024-x
[23] N.V. Borysova, K.V. Melnyk, Efficiency estimation of methods for sentiment analysis of social
     network messages, Bulletin of National Technical University “KhPI”, Series: System Analysis
     Control and Information Technologies. 2 (2019) 76–81. doi:10.20998/2079-0023.2019.02.13
[24] N. V. Borysova, V. V. Niftilin, Avtomatyzovane stvorennia elektronnogo slovnyka, in: E. I. Sokol
     (Ed.), Proceedings of XXV International scientific-practical conference in Information
     technologies: science, engineering, technology, education, health, MicroCAD-2017: Part 1 (May
     17–19, 2017), NTU “KhPI”, Kharkiv, 2017. p. 32