Information System for the Intellectual Assessment
    Customers Text Reviews Tonality Based on Artificial
                     Neural Networks

    Nicolay Rudnichenko1 [0000-0002-7343-8076], Svetlana Antoshchuk1 [0000-0002-9346-145X],
        Vladimir Vychuzhanin1 [0000-0002-6302-1832], Andrii Ben2 [0000-0002-9029-3489],
                            Igor Petrov3 [0000-0002-8740-6198]
                1Odessa National Polytechnic University, Odessa, Ukraine

     nickolay.rud@gmail.com, vint532@yandex.ua, asgonpu@gmail.com
                  2Kherson State Maritime Academy, Kherson, Ukraine

                                     a_ben@i.ua
           3National University "Odessa Maritime Academy", Odessa, Ukraine

                                 firmness@list.ru


       Abstract. This article presents the results of the concept development and soft-
       ware information system for assessing text data tonality implementation by us-
       ers based on artificial neural networks. The main problems in this topic are
       identified, the features of using deep machine learning for the text data mining
       problems are presented. An information system project has been developed, the
       preprocessing procedure and data filtering algorithms have been described, the
       specifics of data normalization for formalizing artificial neural network models
       are formalized. The options for using the information system, the block struc-
       ture, the interface prototype and the procedure for user interaction with the
       software application are developed. The training effectiveness study results and
       the use of an artificial neural network model to solve the tasks are presented, the
       most suitable values of hyperparameters that have a primary impact on the
       model quality are identified and selected.

       Keywords: machine learning, big data, data mining, data science, neural net-
       works, deep learning, nature language processing


1      Introduction

Currently, in the Internet there is a rapid increase in the volume of heterogeneous
data, which is associated with the development and dissemination of social networks,
online stores, thematic blogs and information web systems, which significantly affects
the electronic commerce various areas activity and trade in various electronic goods
(EG) in particular [1].
   In connection with the regular appearance and active development of new com-
mercial and information resources, modern consumers of virtual and physical goods


Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons
License Attribution 4.0 International (CC BY 4.0). ICST-2020
and services are increasingly experiencing difficulties in choosing companies, organi-
zations, manufacturers of technical gadgets and tools specific models [2].
   This creates the need for additional information about the actual functionality and
features of the EG operation from other users and experts. Additional difficulties are
introduced by the need for filtering and analysis of marketing activities of competing
manufacturing companies to identify the most suitable goods and services for the
specific user’s needs, which requires a large number of data computational operations
[3].
   In order to obtain competitive advantages and for better understanding customer’s
needs vendors also have to obtain the most reliable and relevant data extracted from
large amounts of information based on user opinions analysis [4,5].
   A partial solution to the identified problems is represented by existing systems and
information resources, aggregating text reviews, comments and comparative video
reviews of the characteristics and specifics of using EG in different conditions and
modes [6].
   However, these information platforms do not always have a flexible, convenient
and informative interface, a thin search system and visualization of summary statistics
with the formation of aggregated and crosstab reports [7,8]. The analysis of the data
posted on such information resources is often difficult due to the need to view inter-
esting reviews and comments on products in manual mode, which is associated with
large time costs, i.e. analyzing user-generated opinions on the goods and services
offered is a relevant and time-consuming process [9,10].
   In this regard, it is advisable to automate the evaluation process suitable for the us-
er EG, according to his individual preferences, by searching and analyzing the col-
lected data characterizing various products on the basis of solving the classifying
problem with semantic content into relevant groups.
   To solve this task in practice, natural language processing (NLP) existing ap-
proaches are used, in particular, methods for analyzing the text’s tonality, morpholog-
ical analysis of its constituent entities, and evaluating expressions emotional coloring
[11]. Sentiment analysis refers to the use of computational linguistics to identify and
extract subjective information in source materials [12].
   Existing approaches to the analysis of text’s tonality are divided into the following
main categories: definition of keywords, lexical similarity, statistical and conceptual
methods [13].


2      Description of Problem

   In general terms, the task of user reviews types determining for purchased goods is
not fully clear and unambiguous, therefore it is realized by classifying them into sepa-
rate groups in a linguistic form.
   In various works on the classification of user reviews for various modern products
on existing information resources, both standard text classification methods and modi-
fied methods are often used, which take into account the possible inversion of the
valuation word values, the syntactic structure of sentences, the dependencies between
words [14].
   The specificity and main difficulty of applying the classic NLP methods for differ-
ent sets of user reviews is the need to collect enough adequate data to train the select-
ed classifier model, to perform a number of laborious preparatory procedures for data
preprocessing and cleaning to ensure an acceptable level of accuracy and speed of
use. In this regard, it is advisable to analyze modern promising approaches to the
classification of texts.
   Currently, in practice, 2 approaches are used to solve the problem: methods based
on logical rules and machine learning [15].
   According to the results of a comparative analysis of the algorithms [16-19], the
ANN method was chosen as one of the most used in practice and promising in im-
plementation. An additional advantage of this method is the high functionality of
existing libraries for the neural network models implementation from Google, their
constant support and updating, which will provide opportunities for improving the
system in the future.
   Existing solutions in the text content analysis market have significant limitations in
the amount of input data for processing, do not provide flexible settings for collecting
and processing text in different languages, and do not allow evaluating the accuracy
of reviews taking into account semantic topics [20-26].
   In this regard, the urgent task is to develop our own information system (IS) that
implements the functionality for evaluating user feedback on EG.
   The purpose of the work is to study the possibilities of using the apparatus of arti-
ficial neural networks to assess user preferences for groups of acquired goods by au-
tomating their opinions analyzing process based on the classification problem solu-
tion.
   The task of classifying text information is defined as follows. Let a document de-
scription exist d  X , where X - vector document space, and a fixed set of classes
C = {c1 , c2 ,..., cm } . From the training set (many documents with previously known
classes) D = { d , c | d , c  X  C} using the learning method G it is neces-
sary to obtain a classification function G (D ) =  , which maps documents to classes
 : X →C.

3      Information system development

3.1    System concept

   The concept of the developed system is based on a combination of statistical meth-
ods of intelligence analysis and data preprocessing, as well as the artificial neural
networks (ANN) theory [27].
   The classification problem specificity under consideration is to carry out the fol-
lowing procedures for the text data preprocessing:
• Bringing all characters found in the text to lowercase in order to reduce the total
  unique number of terms in the dictionary.
• Exclusion of non-literal characters from the text. Such a procedure significantly
  reduces the number of unique terms in the dictionary, in cases where the text is
  characterized by an abundance of punctuation that does not carry a fundamental
  semantic load. In the considered problem, this can significantly reduce the amount
  of computational operations.
• Duplicate characters exclusion. This allows us to replace existing in the text se-
  quences of identical characters to reduce the dictionary size.
• Isolation of the word base from a input text data set (stemming).
   The listed actions are performed before the text classification process in order to
increase the speed and reduce the iterative and logical complexity of data processing.
   A formal description of the proposed classification concept in a schematic form of
decomposition is shown in Fig. 1.


                           Fig. 1. Formalized system concept

   The first of the system concept indicated stages consists in parsing data from the
specified local or remote sources, on the basis of which a training sample is generated
for the ANN model.
   The second stage consists in filtering data by language and cleaning out extraneous
characters that do not carry a semantic load in the recall (punctuation marks, unions,
special characters). As a result, by means of vector semantics operations, a vector
representation of given dimension text feedback words is formed for further use by
the ANN model. In general terms, the second stage of the proposed method is shown
in Fig. 2.


                              Fig.2. Data filtering algorithm

   The third stage is to bring all the numerical values of the text' vector representation
to the same area of change, whereby they are reduced to a single set of training data
for the neural network model for classifying reviews.
   Actual, the procedure for normalizing input data is being implemented to convert
all elements of the input data set into binary code, which is acceptable for further
processing by an artificial neural network.
   A generalized algorithm for normalizing data when creating a neural network is
shown in Fig. 3.
   The minimax function performs a linear transformation after determining func-
tion’s minimum and maximum values so that the obtained values are in the desired
range from -1 to 1.


                      Fig.3. Generalized data normalization algorithm

   The fourth step is to break down the processed data sets into separate blocks for
model training, testing and validating, taking into account the nature of the data in a
given ratio.
   The fifth step is to initialize the ANN model to classify text reviews into three dif-
ferent classes (positive, neutral, and negative). Initialization of a neural network mod-
el is the process of creating a neural network object, loading a normalized data set,
initializing the learning process and model saving, which is based on the recursive
ANN models usage.
   The sixth step is to numerically evaluate the accuracy of the created ANN model to
solve the text tonality assessing problem. Conducting calculations, on the basis of the
supplied text string by the user, created model analyzes the input data and classifies
them according to the available classes. It is a test of the ANN model operation on a
test sample. At the same time, the result of the classification is converted, the obtained
values are translated into a text view that is understandable for the user. This stage is
based on the use of a reliability metric to determine the proportion of correctly classi-
fied text reviews and the loss function to assess the dependence of training accuracy
on the weight matrix coefficients.
    To ensure convenient and efficient operation IS implements the proposed concept,
it is necessary to introduce a number of restrictions. Due to the fact that text reviews
are of different sizes and carry different semantic load, and processing too large text
fragments can be time-consuming and expensive in computational resources terms, it
is advisable to limit their volume. In particular, the program should support the ability
to analyze the text in Russian, Ukrainian and English, the total text should be up to
2000 characters, the analysis should not exceed 10 seconds.
    The IS input receives text data of user comments and reviews, as a result of pro-
cessing, a text classes table is formed, estimation accuracy level (classification error
by the ANN model), a summary statistics form, and a file with output classification
results in * .xls format are calculated.
    The main stages of the project are as follows:

• Development of a parser software module for searching, receiving, and collecting a
  data set to form a ANN training samples.
• Filtering data by language and cleaning extraneous characters that do not carry a
  semantic load in the recall.
• Export of the obtained sample to the *.csv format for import into the neural net-
  work structure.
• Creation and configuration ANN structure, the selection of training algorithms and
  its work evaluation.
• IS graphical user interface development that includes the functions of entering a
  text commentary and viewing the classification result.
• Text evaluation in one of the possible recall classes.

   The stage of creating and configuring a neural network in a more detailed form is
divided into a number of the following tasks:

• Getting the input string (array of strings) is the process of writing a input text data
  set into a variable.
• Input data normalization, for converting all data set elements into binary code,
  which is acceptable for further processing by an ANN.
• ANN model initialization is the process of creating a neural network object, load-
  ing a normalized data set and initializing the learning process and saving the mod-
  el.
• Conducting calculations, based on the user-supplied text strings of feedback, the
  ANN model analyzes the input data and classifies them according to the available
  classes.
• Transformation of the result of the classification (denormalization), translation of
  the obtained values into a text form, understandable for the user.
• The output of the obtained value during the execution of this stage in the user inter-
  face displays the classification result.
   As the development language we used Python 3.7, which is expanded by the fol-
lowing data structure processing libraries: Numpy, to support the use of multidimen-
sional data arrays and implement the necessary mathematical functions number for
their processing; Pandas, for the implementation of modeling and analysis functions
during data processing and normalization.
   To normalize and denormalize the data, create, configure and train the ANN mod-
el, the keras library and its components are used (tokenizer, TensorBoard, LSTM
modules).
   The PyQt library and the QtDesigner module were used to create a graphical user
interface, layout the necessary widgets and elements of the program form.


3.2    System project implementation

   When forming requirements for the created IS, a use-case diagram was developed
(Fig. 4), in accordance with which the requirements for user roles are formalized (rep-
resented by a typical user and system administrator).


                            Fig. 4. System’s use case diagram
   The user should have the following options for interacting with IS through a graph-
ical interface (form):

• entering and editing the corresponding text review within the corresponding text
  field;
• viewing the result of the review class analysis (positive, negative, neutral);
• exporting the result to a text file.

   The administrator has the ability to parse data from the specified page URL and set
additional parameters for parsing, as well as configure and train the ANN with view-
ing the results. Based on the analysis and determination of IS requirements, a block
structure has been developed (Fig. 5).


                                 Fig.5. IS block structure

  The designed IS includes the following components:

• Subsystem for processing text reviews (data import module and normalized mod-
  ule for imported data).
• Subsystem for classifying user reviews text (neural network training module and
  module for interpreting the recall class).
• The form of the graphical user interface.

   For convenient user working process with IS, the arrangement of widgets on the
form is done in an adaptive style, when resizing the working window, their location is
scaled in proportion to the screen resolution. IS main form graphical user interface is
shown in Fig. 6. The upper part of the form displays informational messages about the
application process, which are automatically saved as an event log in a *.txt file if it is
necessary to track errors or incorrect data processing by the system. The classification
results are displayed in tabular form, for a detailed view of the review text, user must
select the appropriate line.
                                Fig. 6. Software GUI form

   At the bottom of the form is the input field for the source web page URL, as well
as text labels that display the percentage of positive, negative and neutral reviews.


4      Experiments and results analysis

   To carry out a study created IS functioning specifics on the use of artificial neural
networks, a test texts selection for EG from a number of popular online stores was
prepared and aggregated: 120,000 texts (30,000 texts for each of the possible classes).
   The sample was obtained through the development of a specialized data parser that
performs filtering and data cleaning. The assignment of class types for each record
was carried out manually. The entire volume of the text reviews obtained sample was
divided into training, test and validation sets (60%, 20% and 20%, respectively) in
order to evaluate the quality of the model. As part of the IS research process, classifi-
cation accuracy was assessed, i.e. the number of correctly classified text user reviews.
As the numerical characteristics of the performance assessment IS used:

• ACCURACY is a confidence metric that allows us to evaluate the classification
  accuracy, i.e. determine the proportion of correctly classified texts.
• LOSS is a function of losses during neural network operation, this indicator illus-
  trates the dependence of training accuracy on the weight matrix coefficients.

    To conduct numerical studies of the created neural network model use framework
of the developed information system and obtained results graphic the Tensor Board
data analysis tool was deployed. The dependence of the value of assessing the relia-
bility of the neural network (ordinate axis) by the passed training eras (abscissa axis)
is shown in Fig. 7.


                      Fig. 7. Neural network reliability assessing value

   A thin line marks the results of a training sample of reviews, and a thick line shows
the results of using a neural network in a test sample. The overall accuracy of the
created neural network was about 89%. The dependence of the values of the loss
function on the epoch of neural network training is shown in Fig. 8.


                  Fig. 8. Neural network training loss function dependence

    In order to study the possibility of improving the quality of the solution to the clas-
sification problem created by an artificial neural network (text feedback submitted to
it at the input), it is advisable to evaluate the performance of the developed neural
network model for various values of a number of its parameters. As variable parame-
ters were used: max_features, maxlen and batch_size. The results of the model as-
sessment for various parameters are given in table 1. The best result of the Accurancy
value (0.92) was obtained with the following parameter values: max_features - 7000;
maxlen - 100; batch_size - 64. Based on the analysis of the ANN model characteris-
tics with various parameter values, the dependence of the neural network operation
accurancy metric value and the max_features model parameter was studied (Fig.9).
         Table 1. The results of the model assessment

 Test
           Accuracy max_features maxlen batch_size
Number
   1         0.72371         3000           30           8
  2          0.76229         3500           40          16
  3          0.79847         4000           50          24
  4          0.81451         4500           60          32
  5          0.85012         5000           70          64
  6          0.84832         5500           80          128
  7          0.86233         6000           50          256
  8          0.87431         6500           60           8
  9          0.87985         7000           70          16
  10         0.88615         7500           80          24
  11         0.89132         8000           50          32
  12         0.88434         8500           60          64
  13         0.88934         9000           70          128
  14         0.89091         9500           80          256
  15         0.88347         10000          85          64
  16         0.90831         5000           90          128
  17         0.91217         6000           95          256
  18         0.92331         7000          100          64


      Fig. 9. ANN accuracy and max_features dependence
   It should be noted that the classification confidence level increases with the in-
crease in max_features; the peak is reached in the range from 5500 to 8000. As a re-
sult of a IS operation study based on a neural network (a selected recurrent architec-
ture of the LSTM type), classification accuracy of about 92% was achieved.
   This allows us to conclude that for text reviews of the specifics examined in the
EG field, the most significant ANN parameters from the point of view of influence on
classification accuracy are the weight matrix rewriting border size and the number of
words in the reviews text sample, the maximum length of one review is less im-
portant. With batch_size = 64, the highest accuracy is achieved.


5      Conclusion

   The developed information system implements the proposed concept of assessing
the tonality of electronic goods reviews and is a cross-platform solution providing a
fairly high classification accuracy of more than 90%, which indicates the reliability of
the solution to the problem.
   Based on the results of the user reviews classification, it becomes possible to form
an aggregated integrated indicator for evaluating the relevant goods, which can be
used to prioritize the customers preferences in a ranked form in order to support and
facilitate decision-making processes for choosing and buying.
   Large trading floors can use the results of evaluating user opinions to analyze and
select the most reputable and reliable vendors for further cooperation or stopping
purchases from suppliers whose products are regularly criticized by customers.
   The subsequent logical development of the proposed approach to the classification
of user reviews is the integration of analysis mechanisms for the reliability of data
sampling in order to cut off noise and non-informative data, expanding class types
and implementing a number of quantitative indicators corresponding to them to clari-
fy the estimates formed.


Reference
 1. Rudnichenko, N., Vychuzhanin, V., Shybaieva, N., Shybaiev, D., Otradskaya, Т., Petrov,
    I.: The use of machine learning methods to automate the classification of text data arrays
    large amounts. Information management systems and technologies. Problems and solu-
    tions. Ecology, Odessa, pp.31-46 (2019)
 2. Rudnichenko, N., Vychuzhanin, V., Shybaieva, N., Shybaiev, D.: Big data intellectual
    analysis in the diagnosis of the transportation systems technical condition. Systems an-
    means of transport. Problems of operation and diagnostics. KSMA, Kherson, pp.57-69
    (2019)
 3. Rudnichenko, M.D., Gezha, N.I., Belyaev, K.O., Kuzmin, A.D.: Performance analysis of
    machine learning model ensembles. In III All-Ukrainian scientific-practical conference of
    young scientists, students and cadets “Information protection in information and commu-
    nication systems”. Lviv. pp.259-260 (2019)
 4. Adaskina, Yu. V., Panicheva P.V., Popov, A. M .: Sentimental analysis of tweets based on
    syntactic links. In computer linguistics and intellectual technologies: based on the materi-
    als of the annual international conference Dialogue. Moscow, рр.25–35 (2015)
 5. Vasiliev, V.G., Khudyakova, M.V., Davydov, S.: Classification of user reviews using
    fragment rules. In Computational Linguistics and Intellectual Technologies: Based on the
    materials of the annual Dialog International Conference. Moscow, рр.66-78 (2012)
 6. Garshina, V.V., Kalabukhov, K.S., Stepantsov, V.A., Smotrov, S.V.: Development of a
    system for analyzing the tonality of textual information. Gerald of VSU, series: system
    analysis and information technology. vol. 3, pp.185-194 (2017)
 7. Lysenko, V.D.: Text sentiment analysis for forecasting stock market prices. Young scien-
    tist. vol. 22, pp.420-423 (2018)
 8. Pavlov, Yu.N., Maystruk, K.A .: Comparison of text tonality assessment methods. Young
    scientist. vol. 12, pp.59-64 (2016)
 9. Loukachevitch, N., Kotelnikov, E., Rubtsova, Y.: SentiRuEval: testing object-oriented sen-
    timent analysis systems in Russia. In proceedings of International Conference Dialog-
    2015, Moscow, pp. 313 (2015)
10. Rubova, Y.V.: Building a body of texts for tuning the tone classifier. Software Products
    and Systems. vol. 109, pp.72–78 (2015)
11. Menshikov, I.L., Kudryavtsev, A.G.: A review of systems for analyzing the tonality of a
    text in Russian. Young scientist. vol.12, pp.140-143 (2012)
12. Kotelnikov, E.V., Klekovkina, M.V.: Automatic analysis of tonality of texts based on ma-
    chine learning methods. In Computational Linguistics and Intellectual Technologies:
    Based on the materials of the annual International Conference “Dialogue”. Moscow,
    pp.15–21 (2012)
13. Sboev, A.G., Voronina, I.E., Gudovskikh D.V., Selivanov, A.A.: Advanced neural net-
    work models for solving the problem of determining tonality. Bulletin of the Voronezh
    State University. System Analysis and Information Technology. vol. 4, pp.178–183 (2016)
14. Gorban, A.N.: Training of neural networks. Moscow: ParaGraph (2010)
15. Shybaiev, D.S., Otradskaya, T.V., Stepanchuk, M.V., Shybaieva, N.O., Rudnichenko,
    N.D.: Predicting system for the estimated cost of real estate objects development using
    neural networks. ZhSTU Herald. Technical science. vol.83, pp.154-160 (2019)
16. Silge, J., Robinson, D.: Text Mining with R: A Tidy Approach. O'Reilly Media (2017)
17. Chaudhuri, A.: Visual and Text Sentiment Analysis through Hierarchical Deep Learning
    Networks. Springer (2019)
18. Aggarwal, C.C.: Machine Learning for Text. Springer (2018)
19. Rudy P., Thelwall, M., Sentiment analysis: A combined approach. Journal of Informetrics.
    vol. 3, pp.143-157 (2009)
20. Asad, A., Siti, M. S., Shafaatunnur H., Jalil P.: Machine learning-based multi-documents
    sentiment-oriented summarization using linguistic treatment. Expert Systems with Appli-
    cations. vol. 109, pp.66-85 (2018)
21. Yi, C, Qingbao, H., Zejun, L., Jingyun, X, Zhenhong, C., Qing, L.: Recurrent neural net-
    work with pooling operation and attention mechanism for sentiment analysis: A multi-task
    learning approach. Knowledge-Based Systems (2020)
22. Saerom, P., Jaewook L., Kyoungo K.: Semi-supervised distributed representations of doc-
    uments for sentiment analysis. Neural Networks. vol.119, pp.139-151 (2019)
23. Wang, J., Tao, Q.: Machine Learning: The State of the Art. IEEE Intelligent Systems.
    vol.23, pp. 49-55 (2008)
24. Rahul, A., Surabhi, M.: NLP based Machine Learning Approaches for Text Summariza-
    tion. pp.535-538. (2020)
25. Hung, C.C., Song, E., Lan, Y.: Foundation of Deep Machine Learning in Neural Networks
    (2019)
26. Wu, Z., Ding, X., Xu, X., Ju, C.: ECG arrhythmias classification based on deep learning
    approach. ICIC Express Letters, Part B: Applications. pp.843-850 (2017)
27. Miikkulainen, R. Topology of a Neural Network (2011)