=Paper=
{{Paper
|id=Vol-2870/paper23
|storemode=property
|title=Usage of Sentiment Analysis to Tracking Public Opinion
|pdfUrl=https://ceur-ws.org/Vol-2870/paper23.pdf
|volume=Vol-2870
|authors=Zoia Kochuieva,Natalia Borysova,Karina Melnyk,Dina Huliieva
|dblpUrl=https://dblp.org/rec/conf/colins/KochuievaBMH21
}}
==Usage of Sentiment Analysis to Tracking Public Opinion==
Zoia Kochuieva, Natalia Borysova, Karina Melnyk and Dina Huliieva
National Technical University “Kharkiv Polytechnic Institute”, Kirpichova, 2, Kharkiv, 61002, Ukraine

Abstract
This study addresses the problem of analyzing public opinion. A description, use cases and an efficiency estimation of software for sentiment analysis of public opinion are presented. The relevance of sentiment analysis as one of the important tasks of computational linguistics is substantiated. An overview of the existing classical methods of sentiment analysis and of some software applications that solve this problem is given. The business process of public opinion analysis is modeled as a BPMN diagram. The principles of operation of the developed classifier, which uses the lexicon-based method, are described. A model for determining the tonality of news is presented in the form of an activity diagram. The efficiency of the developed lexicon-based classifier has been evaluated with the standard metrics Recall and Precision. The obtained results have been compared with the values of the same metrics for the Naïve Bayesian Classifier and the Recurrent Neural Network Cmeans Classifier. Recall and Precision have been calculated for two cases: the sentiment analyzer using a dictionary of affective words without slang words and with slang words. The numerical studies show a 5-6% increase in the efficiency of the sentiment analyzer when a dictionary with slang words is used.

Keywords
Sentiment analysis, sentiment analysis methods, lexicon-based sentiment analysis, sentiment analysis software, automated analysis of public opinion, classifier efficiency estimation
1. Introduction

The problem of public opinion analysis today falls within the interests of many professionals, including marketers, sociologists, political scientists and many others. Public opinion is a form of mass consciousness that reflects the attitude (hidden or overt) of different groups of people to the events and processes of society that affect their interests and needs. Public opinion is expressed publicly. It affects the functioning of society and the political system. At the same time, public opinion is a set of many individual opinions on a specific issue that concerns a group of people. The structure of public opinion includes mass moods, emotions and feelings, as well as evaluations and judgments. In addition, public opinion gives a government the following: an idea of the interests of the population; attitudes to innovations, events and statements of officials, politicians and public figures; mechanisms for surfacing the most acute and significant problems of citizens; and others. People can now express their opinions on the Internet, and the number of statements grows every day. Manual analysis of these opinions is not feasible, because public opinion can change quickly. So, there is an urgent need to automate the process of public opinion analysis. Opinion mining is a research domain dealing with automatic methods of detection and extraction of opinions and sentiments presented in text [1]. This study focuses on sentiment analysis, which can determine the emotional attitude of the author of a statement to any entity (a product, a service, a person, an organization, an event) and/or its properties, signs, parts, etc.

COLINS-2021: 5th International Conference on Computational Linguistics and Intelligent Systems, April 22–23, 2021, Kharkiv, Ukraine
EMAIL: aliseiko@gmail.com (Z. Kochuieva); borysova.n.v@gmail.com (N. Borysova); karina.v.melnyk@gmail.com (K. Melnyk); dgulieva@ukr.net (D. Huliieva)
ORCID: 0000-0002-4300-3370 (Z. Kochuieva); 0000-0002-8834-2536 (N. Borysova); 0000-0001-9642-5414 (K. Melnyk); 0000-0001-8310-745X (D. Huliieva)
©️ 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

2. An overview of existing methods and tools of sentiment analysis

Consider the classic methods and some software applications for sentiment analysis that currently exist.

2.1. A synopsis of methods of sentiment analysis

All methods of automated sentiment analysis can be divided into the following groups:
1. Rule-based methods.
2. Lexicon-based methods.
3. Supervised machine learning methods.
4. Unsupervised machine learning methods.
5. Hybrid methods.

Rule-based methods use sets of rules identified by experts based on the analysis of texts in the subject area. The information system (IS) determines the tone of texts based on these rules. To achieve high classifier accuracy, it is necessary to write a large number of rules, which is a long and labor-intensive process. In addition, the rules describe only a specific domain, so changing the domain requires re-composing the rules. Nevertheless, with a good rule base this approach is the most accurate, because rule-based algorithms are closely tied to word semantics. These methods also give good results on structured or semi-structured texts, such as scientific articles or other grammatically correct texts without spelling errors. However, rule-based methods depend heavily on the language of the texts, i.e. they are not universal [2]. Lexicon-based methods use affective lexicons to analyze texts. A tonal dictionary is a list of words, each with a tonality (positive, negative, neutral) and a weight coefficient (for example, from -5 to 5, or from -10 to 10).
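As an illustration only (not the authors' implementation), a minimal lexicon-based scorer can be sketched in Python; the toy dictionary entries and weights below are invented for the example:

```python
# Minimal lexicon-based sentiment sketch (illustrative only).
# The toy dictionary maps lowercase words to weights; a real system
# would load a curated affective lexicon for the target language.
TONAL_DICT = {
    "good": 6, "great": 8, "bad": -6, "terrible": -9, "ok": 1,
}

def text_tonality(text: str) -> float:
    """Average the weights of dictionary words found in the text;
    return 0.0 (neutral) if no dictionary word occurs."""
    tokens = [t.strip(".,!?;:").lower() for t in text.split()]
    weights = [TONAL_DICT[t] for t in tokens if t in TONAL_DICT]
    return sum(weights) / len(weights) if weights else 0.0
```

For example, `text_tonality("The service was good, not terrible!")` averages the weights 6 and -9 and yields -1.5; real dictionaries, as noted above, also require lemmatization before lookup.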
The IS analyzes a text, finds words from the dictionary and calculates the overall tone of the whole text according to the weights of these words. There are many ways to calculate the tone of a text, for instance, using the arithmetic mean. However, these methods are not universal either, because they depend on the language of the texts as well as on the domain (each domain needs its own dictionary) [3]. Supervised machine learning methods train the classifier on a training sample (text corpora). This set consists of marked texts divided into classes. Based on this sample, the classifier (IS) can determine the tonality of new, unseen texts. The most widely used methods of sentiment analysis are the naive Bayesian classifier and the support vector machine algorithm. Supervised machine learning methods give good results; the accuracy of the algorithms can exceed 90%. The main difficulty of using these methods is creating a training sample to teach the classifier, because the quality of the text corpora influences the effectiveness of the classifier [4]. Unsupervised machine learning methods train the algorithm on a corpus of unmarked texts that are not divided into classes. The biggest weights identify the most common words in the text that, however, appear only in a limited number of texts of the whole set. One of the methods most used in practice is the K-means algorithm. However, this group of methods is not frequently used for determining the tonality of texts because of its lower accuracy in comparison with supervised machine learning methods [4]. Hybrid methods combine methods of different groups. They allow using the advantages of the selected methods and eliminating their disadvantages. An example of such hybridization is a method of sentiment analysis that takes into account the syntactic structure of the text and the relationships between words in a sentence.
The classifier relies on the text structures that are used to express a person’s emotional attitude toward an object. A decision tree and the lexicon-based method are utilized simultaneously for this. It should also be pointed out that the dictionaries can contain positive words, negative words and inverter words. Inverter words are words that can change the polarity of the whole sentence. The nodes of the tree are the words of the sentence. The value of a higher node is calculated from the following: the values of the lower nodes, the ability of the word to invert the tonality, and the tonality of the word from the dictionary. If the IS ignores the sentence structure, it can get a wrong classification result. For example, the attitude to a news item can be defined as negative because of two negative words and one positive one, while the attitude is actually neutral based on the content of the message [5].

2.2. Existing software for sentiment analysis

In addition to the existing methods of sentiment analysis, some sentiment analysis software has been analyzed in this work. This software is based on different approaches to the problem and is designed for use in different conditions. Each product has a number of advantages and disadvantages. To identify the best software solutions for sentiment analysis, independent organizations and experts publish reviews with TOP-10 or TOP-12 lists of sentiment analysis tools based on surveys of a large number of users. Such lists sometimes differ, but some tools are present in all reviews, so the analysis of existing software covers those tools. According to its developers, Rosette Sentiment Analyzer has a machine learning model that was trained on tweets and reviews to detect strong positive and negative sentiments in documents. It also uses entity extraction to identify the set of products in a customer review where the customer mentioned two or more products. Rosette has sentiment analysis and entity extraction models for six languages.
However, a user can add new languages by training Rosette. Rosette Text Analytics is the company that owns Rosette Sentiment Analyzer. It has several price plans for customers: Analytics, Full Stack and Enterprise. All these plans include the sentiment analysis feature. There are three subplans within the Analytics Plan: Starter for $100 per month, Medium for $400 per month and Large for $1,000 per month. The Full Stack Plan has two subplans: Small for $500 per month and Medium for $1,350 per month. The pricing for the Enterprise Plan is revealed upon request [6]. Social Searcher is a free social media search engine. It can be used in two ways: firstly, for searching social networks (such as Twitter, Facebook, Youtube, Instagram, Flickr, Vimeo, etc.) in real time, and secondly, for monitoring social media. Social Searcher provides the following information about posts: sentiment, type of content and language. Its syntax supports phrase searching and the use of operators. Social media monitoring can be done with the Social Searcher API. The API aggregates information about brand mentions and provides access to it. This information can be sorted by date or popularity, filtered by social network, sentiment or content type, collected from chosen posts, exported to CSV format, etc. Users’ data is stored as long as their subscription is valid. There are two types of users that can work with Social Searcher: Free and Premium [7, 8]. Social Searcher can be used for free with 100 searches per day and 2 email alerts.
It has three price plans: Basic for 3.49 € per month with 200 searches per day, 3 email alerts, 3 monitorings, 3,000 posts per month and all mentions on the web; Standard for 8.49 € per month with 400 searches per day, 5 email alerts, 5 monitorings, 20,000 posts per month and all mentions on the web; Professional for 19.49 € per month with 800 searches per day, 10 email alerts, 10 monitorings, 100,000 posts per month and all mentions on the web. At the time of writing there is a special offer on their site: “Start Standard plan 14-day free trial” [9]. Repustate’s multilingual sentiment analysis API uses a combination of machine learning methods to identify sentiment insights in messages from all possible communication channels and users’ data. Repustate performs sentiment analysis in five natural language processing steps:
Step 1: POS-tagging.
Step 2: Lemmatization.
Step 3: Determining the prior polarity and calculating its intensity.
Step 4: Determining negations, amplifiers and other grammatical constructs.
Step 5: Applying machine learning.
Repustate offers two price plans. Standard, for $299 per month, provides English language processing only, document sentiment only, standard document volume, basic support by email and a cloud API. Custom is available upon request and provides processing of all 23 supported languages; document, topic and aspect sentiment analysis; expanded document volume; premium support by phone and email; cloud API / on-premise deployment; customized machine-learned models; named entity recognition; data retrieval (news, social, blogs); a sentiment analysis dashboard; video/audio/image content retrieval; and enterprise semantic search [10]. According to its website, Social Mention is a social media platform for searching and collecting user content from the web. Social Mention monitors more than 100 social network properties.
It provides searching, analysis and daily alerts in social media, third-party APIs and applications. Developers can interact with the Social Mention website using a special API [11, 12]. Social Mention reports results using four characteristics: strength, sentiment, passion and reach. Strength is the likelihood of a certain brand being mentioned in social networks during the last 24 hours. Sentiment is the ratio of all positive mentions to all negative mentions. Passion is the likelihood of repeated mentions of a brand by the same people. Reach is a measure of the range of influence: the ratio of the number of unique authors mentioning a brand to the total number of mentions [13]. Users can work with the API for free if they make fewer than 100 requests per day. Using Social Mention for commercial purposes requires contacting the developers [12]. MeaningCloud’s Sentiment Analysis API is a tool for detailed, attribute-level, aspect-based multilingual sentiment analysis of different texts. It separates texts into three classes: positive, negative and neutral. Aspect-based analysis means that the polarity value for the whole text is calculated from the polarity values of all sentences of the text and the relationships between them. The API can be useful for fact and opinion extraction, irony identification, polarity disagreement detection, etc. It is possible to work with the API using user-defined sentiment dictionaries and sentiment models [14]. Customers can use MeaningCloud’s Sentiment Analysis API for free with 20,000 requests per month, free support and SaaS deployment.
There are also four paid plans: the Start-Up Plan for $99 monthly with 120,000 requests per month, standard support and SaaS deployment; the Professional Plan for $399 monthly with 700,000 requests per month; the Business Plan for $999 monthly with 4,200,000 requests per month; and the Enterprise Plan with custom monthly pricing, custom request volume, premium support, and SaaS and on-premises deployment [15]. In addition, we do not overlook the sentiment analysis software of such global IT giants as IBM, Microsoft and Google. IBM Watson Natural Language Understanding (NLU) allows detecting insights in structured and unstructured data. The NLU simplifies text analysis for extracting metadata from content, including concepts, keywords, categories, entities, semantic roles and relations. The NLU is a good application for recognizing emotions and sentiments, because it returns emotion and sentiment both for the whole text and for keywords in the text for deeper analysis. IBM Watson NLU uses Watson Knowledge Studio to understand texts in nine languages. The NLU also has a conversation feature that enables building and deploying chatbots and virtual agents across different communication channels. It provides the infrastructure for matching individual use cases, so it gives users the support they need [16]. The page [17] provides the necessary pricing information and even a link to a pricing calculator. It is worth noting that IBM Watson can also be used for free. Microsoft Azure Cognitive Service Text Analytics API supplies advanced processing of unstructured natural language texts. The API has four main features: Sentiment Analysis (and Opinion Mining), Key Phrase Extraction, Language Detection and Named Entity Recognition. The API uses classification methods for Sentiment Analysis. The sentiment score is a numeric score between 0 and 1: if the score is close to 1, the text is positive; if the score is close to 0, the text is negative.
English, French, Spanish and Portuguese are supported, with 11 additional languages in preview. For key phrase extraction the API uses techniques from Microsoft Office’s sophisticated natural language processing toolkit. English, German, Spanish and Japanese are supported. Key phrases are used for topic detection. The API can detect the language of a text for 120 languages. The language detection score is a score between 0 and 1: if the score is close to 1, the language is detected with near 100% certainty [18]. Text Analytics can be purchased in tiers [19]. The Free Plan allows 5,000 free transactions per month with three of the four main features (without Named Entity Recognition). The Standard Plan has the same features as the Free Plan, but the number of analyzed text records is larger and the processing price depends on the quantity. The S0-S4 plans have all four main features, including Named Entity Recognition, and cost from $74.71 to $4,999.99 per month. Google Cloud Natural Language API uncovers the structure and meaning of text by using machine learning models in a REST API. It can be used for finding mentions of people, places, events, etc., in texts and documents. It allows understanding sentiment about a brand and/or product on social media, or analyzing customer conversations held in a call center or in messengers. It extracts useful insights on product approbation or user experience from customer conversations in email, chat or social media. It filters inappropriate content and classifies documents by topic; builds relationship graphs of entities extracted from news or Wikipedia articles; and extracts tokens and sentences and then identifies parts of speech to create dependency parse trees for each sentence. The Google Cloud Natural Language API supports 11 languages [20]. API usage follows the principle of paying only for the features you use [21]. The Free Plan allows using all features for free for 5,000 units.
If a text contains fewer than 1,000 Unicode characters, it is considered one “unit”. Prices in the other plans depend on the number of units and on the features; features differ in price, and the more units, the cheaper each one is. The analysis has shown that all the considered software products are multifunctional, but only two of them support the Ukrainian language. Other products allow uploading one’s own model for sentiment analysis and/or a dictionary of sentiment words, but this service is paid, or the developers have set restrictions on the use of custom models and user dictionaries. Thus, the development of our own sentiment analyzer of public opinion for Ukrainian-language texts is an urgent task.

3. The model of the sentiment analyzer of public opinion

In this study, it is proposed to determine the tonality of news using the lexicon-based method. The main idea of this method is to use tonal dictionaries, where each word has one or several weight coefficients. The calculation of the overall tonality of the whole text is based on the weight coefficients of the words from the dictionary. A dictionary of word tonalities has been created for the developed sentiment analyzer. Calculations of the tonality of the text have been carried out according to the methodology proposed in [22]. The research [22] demonstrates determining the tonality of news for English-language texts. However, its authors pointed out the possibility of using their methodology for other languages given an appropriate tonality dictionary. Thus, consider the use of the proposed methodology for Ukrainian-language texts. Let $N$ be the set of news items whose tone needs to be determined according to the comments on them. Denote by $s_i^N$ the tonality of the $i$-th news item, $i \in N$. Denote by $W$ the set of words and collocations of the tonality dictionary, so that $w_j$ ($w_j \in W$) is the $j$-th word of this dictionary.
Each word has its own tonality; denote by $s_j^W$ the tonality of the word $w_j$ from the dictionary $W$. Tonality is measured in the range $[-100; 100]$, where negative values characterize negative tonality and positive values positive tonality, respectively. If the word $w_j$ occurs in the text of a comment together with a negation, then it is necessary to use formula (1), whose efficiency is proved in [22]:

$$s_j^{W\prime} = \begin{cases} \max\left(\dfrac{s_j^W + 100}{2},\ 10\right), & s_j^W < 0, \\ \min\left(\dfrac{s_j^W - 100}{2},\ -10\right), & s_j^W \ge 0. \end{cases} \qquad (1)$$

Denote by $I$ the set of intensifier words, for example: дуже, трохи, доволі, etc. Some words have a positive intensification and belong to the subset $I_P \subset I$; words with a negative intensification are contained in the subset $I_N \subset I$, respectively. Denote by $K$ the set of comments on all news items from the set $N$; then $K_i \subset K$ is the subset of comments on the $i$-th news item, $i \in N$. Denote by $s_{ki}^C$ ($i \in N$, $K_i \subset K$) the tonality of the $k$-th comment on the $i$-th news item. Comments can be written by different groups of people who speak publicly. There are three categories of comments: the opinion of the media (the opinions of authors of articles in various online publications about a particular news item); the opinion of the people (the opinions of ordinary citizens about the news); and the opinion of experts (the opinions of people who are experts in the domain related to the given news). Let $K_i^c \subset K_i$ be the subset of the comments of the $c$-th category:

$$\bigcup_{c=\overline{1,3}} K_i^c = K_i.$$

Then $s_i^{N_c}$ is the tonality of the $i$-th news item in the $c$-th category. In this paper, it is proposed to determine $s_i^{N_c}$ and $s_i^N$ by formulas (2) and (3), respectively:

$$s_i^{N_c} = \frac{1}{r_c} \sum_{k \in K_i^c} s_{ki}^C, \quad i \in N, \qquad (2)$$

$$s_i^N = \frac{1}{3} \sum_{c=\overline{1,3}} s_i^{N_c}, \qquad (3)$$

where $r_c$ is the cardinality of the set $K_i^c$. To determine the tone of the comment, $s_{ki}^C$, it is necessary to find $W_k$ ($W_k \subset W$), the set of words of the $k$-th comment.
The words from $W_k$ are simultaneously elements of the set $W$ and of the sets of intensifier words of the $k$-th comment, $I_P^k$ ($I_P^k \subset I_P$) and $I_N^k$ ($I_N^k \subset I_N$), if such words exist in the $k$-th comment. The cardinalities of these sets are $|I_P^k| = q_P^k$ and $|I_N^k| = q_N^k$, respectively. If all the selected words have only one tonality, for example positive, then the whole comment is considered positive. Accordingly, some methods for determining tonality propose to find $A_P$, the arithmetic mean of all positive words of the $k$-th comment, by formula (4), or $A_N$, the arithmetic mean of the negative words, by formula (5):

$$A_P = \frac{1}{p} \sum_{j \in W_k,\ s_j^W \ge 0} s_j^W, \qquad (4)$$

$$A_N = \frac{1}{n} \sum_{j \in W_k,\ s_j^W < 0} s_j^W, \qquad (5)$$

where $p$ and $n$ are the numbers of positive and negative words in the $k$-th comment, respectively. Thus, the tonality of the comment $s_{ki}^C$ is defined as follows:

$$s_{ki}^C = \begin{cases} A_P, & \forall j \in W_k:\ s_j^W \ge 0, \\ A_N, & \forall j \in W_k:\ s_j^W < 0. \end{cases} \qquad (6)$$

The paper [22] empirically shows the inaccuracy of estimating the tonality of a sentence or text by the arithmetic mean, and its authors propose their own version of determining the tonality of a comment sentence. Consider a model for determining the tonality of news based on the calculation of the tonality of the set of comments on this news, using the model from [22]. Let us introduce additional variables: $X_P$ and $X_N$ are the overall positive and negative sentiment of the $k$-th comment, respectively; $E_P$ and $E_N$ are the overall positive and negative evidence of the $k$-th comment, respectively:

$$X_P = \min\left\{ \frac{A_P}{2 - \lg(3.5p + q_P^k)},\ 100 \right\}, \qquad (7)$$

$$X_N = \max\left\{ \frac{A_N}{2 - \lg(3.5n + q_N^k)},\ -100 \right\}, \qquad (8)$$

$$E_P = \min\left\{ \frac{A_P}{2 - \lg(3.5p)},\ 1 \right\}, \qquad (9)$$

$$E_N = \max\left\{ \frac{A_N}{2 - \lg(3.5n)},\ -1 \right\}. \qquad (10)$$

These variables are needed to determine the tonality of a particular comment (Fig. 1). The process of estimating the tonality of news is shown in the form of an activity diagram using the activity element “Defining the tone of the k-th comment”, whose parameters are $X_P$, $X_N$, $E_P$ and $E_N$.
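As an illustrative sketch (not the authors' code), formulas (1), (4), (5) and (7)-(10) can be transcribed into Python as follows; the function names and the handling of comments with no positive or no negative words are assumptions, and the word weights and intensifier counts are presumed to be prepared upstream:

```python
import math

def negate(s_w: float) -> float:
    """Formula (1): adjust a word weight when it occurs under negation."""
    if s_w < 0:
        return max((s_w + 100) / 2, 10)
    return min((s_w - 100) / 2, -10)

def comment_scores(weights: list[float], q_pk: int, q_nk: int):
    """Formulas (4)-(5) and (7)-(10): overall sentiment (X_P, X_N) and
    evidence (E_P, E_N) for one comment.

    weights -- dictionary weights of the comment's words, in [-100, 100]
    q_pk, q_nk -- numbers of positive / negative intensifier words
    """
    pos = [w for w in weights if w > 0]
    neg = [w for w in weights if w < 0]
    p, n = len(pos), len(neg)
    a_p = sum(pos) / p if p else 0.0   # (4) mean of positive words
    a_n = sum(neg) / n if n else 0.0   # (5) mean of negative words
    # (7)-(10); lg is the decimal logarithm, as in the paper
    x_p = min(a_p / (2 - math.log10(3.5 * p + q_pk)), 100) if p else 0.0
    x_n = max(a_n / (2 - math.log10(3.5 * n + q_nk)), -100) if n else 0.0
    e_p = min(a_p / (2 - math.log10(3.5 * p)), 1) if p else 0.0
    e_n = max(a_n / (2 - math.log10(3.5 * n)), -1) if n else 0.0
    return x_p, x_n, e_p, e_n
```

For instance, a comment with weights 60, 80 and -40 and one positive intensifier gives $A_P = 70$, $A_N = -40$, and the clipping in (9)-(10) drives $E_P$ and $E_N$ to their bounds 1 and -1.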
Thus, the model of determining the tonality of the news in the form of an activity diagram has been considered. Let us consider the process of tracking public opinion based on sentiment analysis in more detail. To develop an effective classifier for a specific domain, it is necessary to create a model of this process. There are many techniques and CASE tools for modelling business processes. This research proposes to use Business Process Model and Notation (BPMN) for formalizing the process of tracking public opinion. Fig. 2 presents the business process model of the given process in the form of a BPMN diagram. To start working with the sentiment analyzer, the user has to add a news item. Then the administrator or another user has to add comments with a defined category for this news item.

Figure 1: The model for determining the tone of the news

Figure 2: The BPMN-diagram of the given business process

The next step is the tokenization and lemmatization of each comment. This stage of analysis allows comparing the found words with the words available in the sentiment dictionary. If a word is in the dictionary, its weight coefficient is taken for the calculations. The tonality of each comment is then calculated according to the proposed model of determining the tonality of the news. The next step is the calculation of the tonality of the comments of a particular category.
The purpose of this step is to determine the attitude of different public opinion leaders to each news item. Finally, the classifier estimates the overall tonality for each news item, i.e. the general tonality of public opinion about the news. To assess the efficiency of the sentiment analyzer with the standard metrics Recall and Precision, it is necessary to obtain the tonality of all comments from experts of the considered domain. Detailed information on this process is presented in Section 4 of this article. Let us consider the functional and non-functional requirements for the sentiment analyzer. There are three user roles: administrator, user and expert. The administrator can add and delete news, comments, comment categories and user accounts, as well as view the tonality of comments and news and the results of evaluating the effectiveness of the sentiment analyzer. The user can add news and comments to them and view all the sentiment assessment results and the effectiveness assessment results. The expert can manually set their own assessment of the sentiment of comments and view the results of assessing the effectiveness of the program. Non-functional requirements include the following: an intuitive and user-friendly interface, reliability of data transfer and storage, usability and high performance. The whole functionality of the developed sentiment analyzer for the different categories of users is presented in the form of a use-case diagram in Fig. 3.
Figure 3: The functional possibilities of the sentiment analyzer

All data and results of the sentiment analyzer’s work are stored in a database. The logical structure of the database allows seeing the relationships between the different objects or entities of the domain. Every business rule of the sentiment analyzer’s work should be the basis for creating the model of the database. There are many different models of a domain database, although the Entity-Relationship Model is the most widely used one. Let us consider the database model for the sentiment analyzer (Fig. 4).
Figure 4: The model of the database

The database model consists of the following entities:
– the entity «Dictionary» represents the list of sentiment words with their weight coefficients;
– the entity «News» describes all news items added by all users, with their tonality expressed both in words and as a numerical value;
– the entity «Comment» represents all comments on all news items, with their categories and tonality expressed both in words and as a numerical value;
– the entity «Category» describes all comment categories;
– the entity «Evaluation_Range» represents all tonality expressions in words;
– the associative entity «Comment_Dictionary» describes the set of words from a comment that are present in the dictionary;
– the associative entity «News_Category» describes the tonality evaluation expressed as a numerical value for each comment category of each news item.
A description of these entities and their attributes is presented in Table 1.

Table 1
Description of database model

| Entity | Attribute | Attribute description |
|---|---|---|
| Dictionary | id | Word’s id |
| | word | Word |
| | word_value | Word’s weight coefficient |
| News | id_news | News’ id |
| | title | News’ title |
| | content | News’ content |
| | Evaluation_Range.id_mark | News’ general tonality expressed in words |
| | news_value | News’ general tonality expressed as a numerical value |
| Comment | id_comment | Comment’s id |
| | name | Comment’s content |
| | Category.id | Comment’s category |
| | News.id_news | Related news id |
| | Evaluation_Range.id_mark | Comment’s tonality expressed in words |
| | comment_value | Comment’s tonality expressed as a numerical value |
| Category | id | Category id |
| | name | Category name |
| Evaluation_Range | id_mark | Tonality evaluation id |
| | name_mark | Tonality evaluation expressed in words |
| Comment_Dictionary | Dictionary.id | Word’s id |
| | Comment.id_comment | Comment’s id |
| News_Category | News.id_news | News’ id |
| | Category.id | Category id |
| | news_value_category | The numerical value of the tonality evaluation |
The efficiency estimation of the developed analyzer

According to the aforementioned information, the standard metrics Precision and Recall have been used to evaluate the efficiency of the sentiment analyzer. To calculate these metrics, it is necessary to find the following indicators: true positive (TP) – the number of answers we expected to see and received in the output; false positive (FP) – the number of answers that we did not expect to see, but the analyzer mistakenly returned them; false negative (FN) – the number of answers that we expected to see, but the analyzer did not return them; true negative (TN) – the number of answers that we did not expect to see, and the analyzer did not return them.

Table 2 presents examples of the assessments of several comments with different categories related to one news item, made by the sentiment analyzer and by experts, together with the matching between the results.

Table 2
The results of assessments of tonality of comments made by expert and sentiment analyzer

№  | Evaluation of the program | Expert assessment | Match status
1  | negative      | negative      | +
2  | positive      | positive      | +
3  | positive      | negative      | -
4  | negative      | negative      | +
5  | negative      | negative      | +
6  | negative      | negative      | +
7  | positive      | very positive | -
8  | positive      | positive      | +
9  | negative      | negative      | +
10 | positive      | positive      | +
11 | negative      | positive      | -
12 | negative      | negative      | +
13 | positive      | positive      | +
14 | very positive | very positive | +
15 | positive      | neutral       | -
16 | negative      | negative      | +
17 | very negative | very negative | +
18 | very negative | very negative | +
19 | positive      | positive      | +
20 | positive      | positive      | +
21 | very positive | positive      | -
22 | positive      | positive      | +
23 | positive      | positive      | +
24 | positive      | positive      | +
25 | positive      | positive      | +
26 | positive      | negative      | -
27 | very positive | very positive | +
28 | positive      | positive      | +
29 | positive      | positive      | +
30 | positive      | positive      | +

Precision is calculated as the proportion of relevant responses in the total volume of all responses issued by the sentiment analyzer:

Precision = TP / (TP + FP)    (11)

Recall is calculated as the proportion of relevant responses returned by the analyzer in the total number of relevant responses:

Recall = TP / (TP + FN)    (12)

Table 3 presents the evaluation by the sentiment analyzer of the same comment texts as in Table 2, but with the results distributed over the tonality classes. A mark "1" in Table 3 indicates the class to which the sentiment analyzer assigned the text. A letter in parentheses next to the mark means that the sentiment analyzer made a mistake and assigned the text to the wrong class; the letter indicates the correct class chosen by the expert.

Table 3
The results of distribution of the texts over classes by the sentiment analyzer

№  | Very negative (nn) | Negative (n) | Neutral (N) | Positive (p) | Very positive (pp)
1  |   | 1    |   |       |
2  |   |      |   | 1     |
3  |   |      |   | 1(n)  |
4  |   | 1    |   |       |
5  |   | 1    |   |       |
6  |   | 1    |   |       |
7  |   |      |   | 1(pp) |
8  |   |      |   | 1     |
9  |   | 1    |   |       |
10 |   |      |   | 1     |
11 |   | 1(p) |   |       |
12 |   | 1    |   |       |
13 |   |      |   | 1     |
14 |   |      |   |       | 1
15 |   |      |   | 1(N)  |
16 |   | 1    |   |       |
17 | 1 |      |   |       |
18 | 1 |      |   |       |
19 |   |      |   | 1     |
20 |   |      |   | 1     |
21 |   |      |   |       | 1(p)
22 |   |      |   | 1     |
23 |   |      |   | 1     |
24 |   |      |   | 1     |
25 |   |      |   | 1     |
26 |   |      |   | 1(n)  |
27 |   |      |   |       | 1
28 |   |      |   | 1     |
29 |   |      |   | 1     |
30 |   |      |   | 1     |

According to the results of the calculations, the following metric values have been obtained by (11)-(12): Precision = 0.861; Recall = 0.849. Moreover, the value of the Precision metric within a class ranges from 0.807 to 0.921, and the Recall metric – from 0.775 to 0.930. Such results indicate that the sentiment analyzer works adequately both in general and within each tonality class. The obtained values of the Precision and Recall metrics of the lexicon-based classifier have been compared with the values of the same metrics for two other classifiers: the Naïve Bayesian Classifier and the RNN Cmeans Classifier, based on a Recurrent Neural Network. The results for the Naïve Bayesian Classifier and the RNN Cmeans Classifier are taken from [23]. All results of the classifiers' efficiency evaluation are presented in Table 4.
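The per-class figures above can be reproduced mechanically from data in the shape of Table 2. The following Python fragment is a minimal sketch (function names and label spellings are illustrative, not taken from the paper's implementation): it builds a Table-3-style confusion matrix and applies formulas (11) and (12) to each tonality class, shown here on the first eight rows of Table 2.

```python
CLASSES = ["very negative", "negative", "neutral", "positive", "very positive"]

def confusion_matrix(analyzer_labels, expert_labels):
    """Cell (a, e) counts texts placed in class `a` by the analyzer
    while the expert chose class `e` (the layout of Table 3)."""
    matrix = {a: {e: 0 for e in CLASSES} for a in CLASSES}
    for a, e in zip(analyzer_labels, expert_labels):
        matrix[a][e] += 1
    return matrix

def per_class_metrics(matrix):
    """Precision (11) and Recall (12) for every tonality class."""
    metrics = {}
    for c in CLASSES:
        tp = matrix[c][c]                             # expected and received
        fp = sum(matrix[c].values()) - tp             # returned but not expected
        fn = sum(matrix[a][c] for a in CLASSES) - tp  # expected but not returned
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        metrics[c] = (precision, recall)
    return metrics

# First eight rows of Table 2 as (analyzer, expert) pairs.
pairs = [("negative", "negative"), ("positive", "positive"),
         ("positive", "negative"), ("negative", "negative"),
         ("negative", "negative"), ("negative", "negative"),
         ("positive", "very positive"), ("positive", "positive")]
analyzer, expert = zip(*pairs)
print(per_class_metrics(confusion_matrix(analyzer, expert))["positive"])
```

On this eight-row excerpt the "positive" class gets Precision = 0.5 (two of four texts the analyzer called positive were wrong) and Recall = 1.0 (both texts the experts called positive were found); on the full corpus the same computation yields the per-class ranges reported above.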
Table 4
Efficiency estimation of classification results

Metrics   | Lexicon-based Classifier | Naïve Bayesian Classifier | RNN Cmeans Classifier
Precision | 0.861 | 0.869 | 0.878
Recall    | 0.849 | 0.853 | 0.870

The authors of the research [23] describe an experiment in which additional training of the classifiers on a slang corpus, in which the tonality of slang words had been marked up, allowed the efficiency of the classifiers to be increased by 10-11%. Therefore, based on this information, we decided to supplement our dictionary with slang words from the dictionary [24] and to re-examine the work of the developed lexicon-based sentiment analyzer. The results of the metric calculations for the same three classifiers are presented in Table 5. The results for the Naïve Bayesian Classifier and the RNN Cmeans Classifier are again taken from [23].

Table 5
Efficiency estimation of classification results with the slang dictionary and slang corpus

Metrics   | Lexicon-based Classifier | Naïve Bayesian Classifier | RNN Cmeans Classifier
Precision | 0.915 | 0.975 | 0.982
Recall    | 0.895 | 0.948 | 0.965

Comparing the metric values from Tables 4 and 5, the efficiency of the lexicon-based classifier increased not by 10-11% but by 5-6%. This can be explained by the fact that not every analyzed comment contains slang words; they occur only in some of the comments that reflect people's opinions.

5. Conclusions

The paper presents an approach to solving the problem of evaluation of public opinion using a sentiment analyzer. The existing methods and software for sentiment analysis are analyzed. The comparative characteristics of these methods have allowed choosing the lexicon-based methods. A proprietary algorithm for determining the attitude of public opinion representatives based on their comments on news is proposed.
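The effect of supplementing the affective dictionary with slang, discussed above, can be illustrated with a minimal lexicon-based scorer. This is a hypothetical sketch: the words and weights below are invented English placeholders (the paper's dictionaries are Ukrainian and far larger), and the function name is not from the paper's implementation.

```python
# Invented placeholder lexicons; the real dictionaries are Ukrainian-language.
BASE_LEXICON = {"good": 1, "excellent": 2, "bad": -1, "terrible": -2}
SLANG_LEXICON = {"lit": 2, "meh": -1}  # slang entries, cf. dictionary [24]

def tonality_score(text, lexicon):
    """Sum the affective weights of the words found in the lexicon;
    words absent from the lexicon contribute nothing to the score."""
    return sum(lexicon.get(word, 0) for word in text.lower().split())

comment = "The update is lit but the interface is bad"
print(tonality_score(comment, BASE_LEXICON))                       # "lit" is missed
print(tonality_score(comment, {**BASE_LEXICON, **SLANG_LEXICON}))  # "lit" counted
```

With the base lexicon alone the slang word contributes nothing and the comment scores negative; with the merged lexicon its tonality flips to positive. Comments that contain no slang score identically under both lexicons, which is consistent with the smaller 5-6% gain observed above.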
To calculate the sentiment of the comments, a technique was used that, together with a custom dictionary of sentiment words (subsequently supplemented with slang words), was applied to Ukrainian-language texts for the first time. A functional model of the business process of tonality identification by the sentiment analyzer and a database model, as well as their descriptions, are presented. All of the analyzer's functionality has been shown in the form of a use-case diagram and described. The efficiency of the developed sentiment analyzer was assessed using the standard Precision and Recall metrics. A comparative analysis of the efficiency of the Lexicon-based Classifier, the Naïve Bayesian Classifier and the RNN Cmeans Classifier has been carried out. It is shown that adding slang words to the sentiment dictionary increases the efficiency of the Lexicon-based Classifier by 5-6%, while additional training of the two other classifiers on a slang corpus showed an increase in efficiency of 10-11%.

6. References

[1] S. Ion, C. Bucur, Applying Supervised Opinion Mining Techniques on Online User Reviews. Informatica Economica Journal. 16. URL: https://core.ac.uk/download/pdf/27056535.pdf
[2] D. Vilares, C. Gómez-Rodríguez, M. A. Alonso, Universal, unsupervised (rule-based), uncovered sentiment analysis, Knowledge-Based Systems, volume 118, 2017, pp. 45–55. URL: https://www.sciencedirect.com/science/article/pii/S0950705116304701. doi: 10.1016/j.knosys.2016.11.014
[3] M. Taboada, J. Brooke, M. Tofiloski, K. Voll, M. Stede, Lexicon-based methods for sentiment analysis. Computational Linguistics. 37 (2011) 267–307. doi: 10.1162/COLI_a_00049
[4] Z. Rahimi, S. Noferesti, M. Shamsfard, Applying data mining and machine learning techniques for sentiment shifter identification. Language Resources and Evaluation, volume 53, issue 2, 2019, pp. 279–302. doi: 10.1007/s10579-018-9432-0
[5] K. Ravi, V. Ravi, A survey on opinion mining and sentiment analysis: Tasks, approaches and applications, Knowledge-Based Systems, volume 89, 2015, pp.
14–46. URL: https://www.sciencedirect.com/science/article/pii/S0950705115002336. doi: 10.1016/j.knosys.2015.06.015
[6] Rosette Sentiment Analyzer. URL: https://www.rosette.com/capability/sentiment-analyzer/#overview
[7] About Social Searcher. URL: https://www.social-searcher.com/about/
[8] Social Searcher API V2.0 Released. URL: https://www.social-searcher.com/2015/08/04/social-searcher-api-v2-0-released/
[9] Social Searcher pricing. URL: https://www.social-searcher.com/pricing/
[10] Sentiment analysis. Unlock the meaning in your data. URL: https://www.repustate.com/sentiment-analysis/
[11] About Social Mention. URL: http://socialmention.com/about/
[12] Social Mention API. URL: http://socialmention.com/api/
[13] Social Mention. Frequently Asked Questions. URL: http://socialmention.com/faq
[14] MeaningCloud’s Sentiment Analysis API. URL: https://www.meaningcloud.com/developer/sentiment-analysis
[15] MeaningCloud pricing. URL: https://www.meaningcloud.com/products/pricing
[16] IBM Watson Natural Language Understanding. URL: https://www.predictiveanalyticstoday.com/ibm-watson-alchemyapi/
[17] How NLU pricing works. URL: https://www.ibm.com/cloud/watson-natural-language-understanding/pricing
[18] Microsoft Azure Cognitive Service Text Analytics API. URL: https://www.predictiveanalyticstoday.com/microsoft-azure-text-analytics-api/
[19] Cognitive Services pricing – Text Analytics API. URL: https://azure.microsoft.com/en-us/pricing/details/cognitive-services/text-analytics/
[20] Google Cloud Natural Language API. URL: https://www.predictiveanalyticstoday.com/google-cloud-natural-language-api/
[21] Google Cloud. Cloud Natural Language. URL: https://cloud.google.com/natural-language/pricing
[22] A. Jurek, M. D. Mulvenna, Y. Bi, Improved lexicon-based sentiment analysis for social media analytics. Security Informatics. 4, 9 (2015). URL: https://security-informatics.springeropen.com/articles/10.1186/s13388-015-0024-x#article-info. doi: 10.1186/s13388-015-0024-x
[23] N. V. Borysova, K. V. Melnyk, Efficiency estimation of methods for sentiment analysis of social network messages, Bulletin of National Technical University “KhPI”, Series: System Analysis, Control and Information Technologies. 2 (2019) 76–81. doi: 10.20998/2079-0023.2019.02.13
[24] N. V. Borysova, V. V. Niftilin, Avtomatyzovane stvorennia elektronnogo slovnyka [Automated creation of an electronic dictionary], in: E. I. Sokol (Ed.), Proceedings of the XXV International scientific-practical conference “Information technologies: science, engineering, technology, education, health” (MicroCAD-2017), Part 1 (May 17–19, 2017), NTU “KhPI”, Kharkiv, 2017, p. 32.