A Comparison of Machine Learning Methods of
Sentiment Analysis Based on Russian Language
                Twitter Data

        Andrey Zvonarev[0000−0003−2191−2167] vodogrey@niuitmo.ru and
           Andrey Bilyi[0000−0002−6133−4368] bilyi andrei@mail.ru

         ITMO University, 49 Kronverksky Pr., St. Petersburg, 197101, Russia


        Abstract. The paper is focused on comparing the performance of dif-
        ferent techniques of text tonality analysis, a widely used approach in
        business to conduct social listening research. However, there are still de-
        bates on what type of models perform better for NLP classification tasks.
        On a corpus of Russian language tweets three models were tested to solve
        binary classification problem: Logistic regression (LR), XGBoost classi-
        fier and Convolutional Neural Network (CNN). The paper descriptively
        overviews main techniques useful for data cleaning and preprocessing for
        these methods and covers its possible fitfalls. Based on the study CNN
        showed best results among chosen models, which goes in line with several
        articles in the field for other than Russian languages. Together with high
        predictive results, neural networks exhibit a computational drawback -
        their performance is poor in terms of timings. Besides this, all meth-
        ods showed high sensitivity to the way data is preprocessed, which is a
        product of Russian language variability. This leads to a conclusion that
        there is still a room for improvement - in future research more emphasis
        will be put on hyper parameter tuning for ”boosting type” models and
        extending list of applied methods.

        Keywords: text tonality · sentiment analysis · logistic regression · CNN
        · XGBoost · natural language processing


1     Field overview and prediction task description

Nowadays, more and more communication, services and goods transfer to the
Internet where the information is basically provided in the form of text. In this
regard, the task of determining the emotional state of a person without personal
communication is becoming increasingly crucial. The perspectives of this field
are that, based on the textual information, it is possible to determine a mood of
a person, or to estimate a success of political or economic reforms, or to check
person’s reaction to particular event or decision.
    Copyright c 2019 for this paper by its authors. Use permitted under Creative Com-
    mons License Attribution 4.0 International (CC BY 4.0).
2         A. Zvonarev, A.Bilyi

    Due to in practice it is not possible to identify the emotional coloring of
large number of texts available in the Internet manually and, in addition, it
is too labor-intensive, one of the possible ways to solve this problem is to use
machine learning algorithms.
    At the moment, there are several key methodologies for determining emo-
tional coloring of the text:

    – Analysis using pre-complied dictionary[1]. Such dictionaries consist of
      pre-prepared template words, phrases and their combinations with emotional
      coloring characteristic of each element. Moreover, for determining emotional
      coloring with improved accuracy of tonality assessment corpus linguistics can
      be used[2]. Nevertheless, in paper[3], the authors encountered differences in
      the expression of emotions in English and Russian languages when using
      a bilingual corpus. The positive experience of translation of dictionaries is
      presented in the paper[4], where the authors translated the dictionary of
      emotionally colored vocabulary in English into Chinese. The assessment is
      made on set of positive and negative patterns found. With the explicit al-
      location of one of them to the text or passage, the class that scored more
      points is set. If there is no obvious predominance, the rating is set to neutral.
      The main disadvantage is the procedure for compiling glossaries of terms in-
      dicating the weight of phrases. Also, these dictionaries must be prepared for
      a specific area.
    – Analysis with the use of machine learning methods has recently be-
      come the most widespread because of reducing the influence factor of hu-
      man impact on the assessment. In comparison with pre-complied dictionary
      method where the assessment is set by person, the assessment in machine
      learning method is set by independently identified patterns in the text even
      with achieving recognizing sarcasm and irony.

    This paper compares several machine learning techniques to solve binary
classification problem of text tonality analysis. The dataset used for this task is
a corpus of Russian language tweets. After several data cleaning stages logistic
regression (LR), XGBoost classifier and convolutional neural network (CNN) are
applied. The study compares those models in terms of accuracy and F1 score
and discusses possible pros and cons for these methods.


2      Data preprocessing and models overview

Proposed methods are based on the use of machine learning methods. The de-
veloped prototype allows to assess texts as ‘positively’ or ‘negatively’ coloured.
To train these models a set of short texts were collected. Using RuTweetCorp[5]
dataset, which is a unique source of Russian-language tweets, collected via twit-
ter API interface. The dataset is valuable for being a single open source col-
lection of Russian language texts on the market with predefined texts tonality.
RuTweetCorp’s database allows Russian researchers to test modern modelling
                                     Title Suppressed Due to Excessive Length          3

approaches of NLP on their native language corpus. The dataset of 225 000
tweets2 was randomly split to the training set (70%) and test set (30%).

2.1     Data cleaning and preparation
Before applying discussed above methods, data cleaning stage should be com-
pleted. To prepare the dataset for the models I used in the paper several classical
for the industry approaches are applied:
 – Capitalisation. The purpose of this stage is to make a proper text cleaning
   and frequency calculation. Actually it makes no difference what type of reg-
   ister level to use for the modelling since, mathematically, models’ algorithms
   transform words into digits, thus any case type can be chosen. In the current
   paper I have used lowercase register;
 – Punctuation cleaning. As a part of data preprocessing researchers usually
   drop all punctuation marks, digits, links, etc. from texts since in most of the
   times it does not give any impact on emotional meaning of tweets. Another
   effect of these symbols on the analysis is irrelevant results in terms of word
   frequencies and affinities. Punctuation marks are comparatively often appear
   in sentences which leads to extremely high metrics for them while in reality
   these symbols almost always make no impact on text tonality;
 – Lemmatisation. Other well-known in the industry approaches to reduce
   the number of words carrying similar emotional meanings are lemmatisation
   and stemming. Both of these methods transform words from its full form
   into their short parent version. The difference between methods is that lem-
   matisation derives an infinitive form of the words while stemming simply
   cuts beginnings and endings of words to obtain a root. In the current paper
   I use lemmatisation as I believe it to be more efficient[6];
 – Stop-words deletion. Next step making data less noisy is dropping so-
   called stop-words from the dataset. “Stop-words”[7] are words that extremely
   frequently appear in text but make no impact on text meanings. Classical
   examples of such words are articles (such as the, is, at, which, and on). An-
   other type of such words may be swear words, which may have one infinitive
   form but tens of word variations with both positive and negative meanings.
    Besides well-known in literature methods of texts cleaning in current paper I
have followed additional step improving algorithm performance - concatenation
of word “not” with the following one. It is intuitively understandable that the
negation word “not” can significantly change the emotional meaning of a tweet.
However, it was practically revealed that analysed models captured high correla-
tion between “not”[8] word and negative target value, which afterwards resulted
in high share of incorrect “negative” predictions. To overcome this issue I have
concatenated the word “not” with the following one and used it as a new word,
which led to performance growth.
2
    Out of which ‘obviously positive’ (114,911 records) and ‘obviously negative’ (111,923
    records.) observations
4         A. Zvonarev, A.Bilyi

    Having done all that, tweets have been transformed to vectors consisting of
cleaned infinitives of words. This still resulted in huge amount of unique words,
which led to high memory and time consumption, while big chunk of words did
not give any impact on model score. To optimise the modelling process words
with frequency less than 3 were dropped. After that, a dictionary was created,
which included remaining words to which a unique ID was assigned. Using such
dictionary we have prepared two input datasets:

    – For LR and XGBoost: a TF-IDF matrix (term frequency-inverse docu-
      ment frequency), which is a method of ”Bag of words”(BoW) class. While
      term frequency denotes how many times a word is in a document (in one
      piece of text input), inverse document frequency calculates number of doc-
      uments where a text has appeared and number of total documents in data.
      In this method all the words in the data are transformed into a list. After-
      wards all the words in this list are assigned to each document as a vector.
      For example if there is a dataset of 5000 words, each document has a vector
      of 1 row and 5000 columns. Each column in this vector defines a word. If
      the word appears in the document, number of times the word is present in
      the text is assigned to the column corresponding to that word. When this
      process is done for all documents, illustration of data as per text frequency is
      obtained. Inverse Document Frequency (IDF) is calculated by using number
      of documents where a term has appeared and total number of documents in
      the data.
    – For CNN input: a Document-Term Matrix of special form - rows of this
      matrix were corresponding to tweet ID, columns corresponded to the position
      of the word in a tweet, while in each cell of the matrix were written an ID of
      the word from dictionary. Such matrix may have large number of columns
      in case of presence of very long texts. This leads to high memory and time
      consumption during computation. To reduce the time, the matrix were cut
      to 23 columns, which significantly improved model calculation time on one
      side and left all relevant information on another (23 columns cover 99.65%
      of all tweet lengths)


2.2     Models description

For the purpose of the analysis we have tested several machine learning meth-
ods and compared its efficiency (more on this in section 3). We have compared
three typical models in the text mining field to conduct semantic analysis of
texts: logistic regression, XGboost and CNN. Short overview of these methods
is provided below.
    Logistic regression This is a well known approach used to solve binary
classification problems.[9] Text mining is just one of many fields where logistic
regression may be applied once the task can be transformed to binary classifica-
tion type. Basically, the idea behind the method is to calculate the probability
of tweet to be positive based on rules identified on a large set of data. Since the
                                   Title Suppressed Due to Excessive Length        5

probability function (1) has a logistic form the model also got its name:

                                               1
                                  f (z) =                                        (1)
                                            1 + e−x

where z is a set of model input factors (in our case these are vectors of TF-IDF
matrix). More on this type of regression is given by Cramer[10].
    XGBoost Another model we have decided to test on the dataset is XG-
Boost, which is highly credited by machine learning competitors[11]. This soft-
ware nowadays exists for all popular data analysis languages and provides gradi-
ent boosting framework. XGBoost is used both for regression and classification
problems, thus is expected to perform considerably well on binary classification
problem and data we use in the current paper. More on the method is in original
paper by Chen and Guestrin[12].
    Convolutional Neural Network (CNN) The last but not least model
type we have tested on the dataset is a CNN. Neural networks recently got high
attention among data scientist due to ability to solve almost any type of problem
once it is stated correctly. Text mining is one of the fields, where CNN showed
high performance[13].


3   Results and discussion

To test the results of models performance we estimate shares of correct and
incorrect predictions on train and test set. It is a practical and reliable way
to deal with over- and under-fitting. To measure the learning performance of
methods we use accuracy and F1-score metrics. These two metrices were chosen
based on research in the literature for binary classification[14].
   Below we demonstrate obtained results for all three models. Table 1 shows
Logistic Regression results:


                        Table 1. Logistic Regression results

                                      Accuracy F1    Time
                         Training set 84.7 % 84.9%
                                                     45.2s
                         Testing set 76.7%     76.9%


    Based on the results one can make a conclusion, that model is over-fitted.
This is for sure not a good sign, however, it can be fine-tuned by changing hyper-
parameters of the model and using cross-validation techniques. Overall, the score
on test is considerably high. Another think to note - training of logistic regression
take very small amount of time. This makes the method a good starting model
to test how data preparation stages influence model performance.
    XGBoost training took much more time. Results you can see in the table
below.
6      A. Zvonarev, A.Bilyi

                              Table 2. XGBoost results

                                  Accuracy F1    Time
                     Training set 75.8%    74.6%
                                                 9h 44m 47s
                     Testing set 72.8%     71.3%


   This model shows worse results than logistic regression. Probably, it hap-
pened because it needs a better setup of hyper parameters. Since in literature
XGBoost proved to be one of the best models for these type of tasks, we are
motivated to pay more attention to work with this model in the future.


                  Table 3. Convolutional neural network results

                                  Accuracy F1    Time
                     Training set 82.9%    84.2%
                                                 6h 11m 24s
                     Testing set 79.5%     78.1%


   Table 3 shows convolutional neural network results. CNN demonstrates the
greatest performance on a test set, but you may also notice that training phase
took a lot of time as well. For even better results another tokenizer algorithm
may be used, such as n-grams and add more words and forms into thesaurus for
better lemmatization.


4   Conclusion

The paper provides with results of applying different machine learning tech-
niques to solve the text tonality binary classification task. Three model types
were estimated: logistic regression, XGBoost classifier and convolutional neural
network. Each of them are well-known models among data scientists to solve
tasks of that type. However, not that much studies exist where those models are
applied and compared on social networks data and Russian language. Results
demonstrate that based on F1 measure CNN performs better. However, training
such a model needs much more time than LR. Hence, depending on available
time and computing power for modelling LR may be preferred. Also, it was
a surprise that XGBoost classifier showed significantly lower result than other
models, while the framework demonstrated high performance on many data sci-
ence competitions. We suspect that more time should be spent to find optimal
hyper parameters of the model.
    For future research we are planning to extend the number of models tested -
at least, LightGBM and Word2Vec models are interesting frameworks showing
good results for this type of problems. Besides that, more focus will be allocated
on optimal hyper parameters search, cross-validation techniques and stacking of
models. These we believe will lead to even greater model performance.
                                   Title Suppressed Due to Excessive Length          7

References
1. Alexeeva S., Koltsova E., Koltcov S. Linis-crowd.org: A lexical resource for Russian
   sentiment analysis of social media [Linis-crowd.org: lexicheskij resurs dl’a analiza
   tonal’nosti sotsial’no-politicheskix tekstov], Computational Linguistics and com-
   putantional ontologies: Proceedings of the XVIII joint Conference “Internet and
   modern society (IMS-2015)” [Kompyuternaya lingvistika i vyichislitelnyie ontologii:
   sbornik nauchnyih statey. Trudyi XVIII ob’edinennoy konferentsii Internet i sovre-
   mennoe obschestvo (IMS-2015)], St. Peterburg, pp. 25–34.
2. Zagibalov, T., Belyatskaya, K., Carroll, J.: In Computational Approaches to Sub-
   jectivity and Sentiment Analysis, 2010. – . 67– 72.
3. Meng X. Lost in translations? building sentiment lexicons using context based ma-
   chine translation / Meng X., Wei F.,etc. // COLING, 2012. – . 829–838.
4. Pazel’skaya, A., Solov’ev, A.: Metod opredeleniia emotsii v tekstakh na russkom
   yazike. Dialog-2011. Sb. Nauchnih statei / Vip. 11 (18).- .: RGGU, 2011.– .510-523.
5. Rubtsova, Y.: A method for development and analysis of short text corpus for the re-
   view classification task. In: Proceedings of Conferences Digital Libraries: Advanced
   Methods and Technologies, Digital Collections, RCDL 2013, pp. 269–275 (2013)
6. Stemming and Lemmatization: A Comparison of Retrieval Performances, http:
   //www.lnse.org/papers/134-I3007.pdf. Last accessed 8 Nov 2019
7. Luhn, H. P.:Keyword-in-Context Index for Technical Literature (KWIC In-
   dex). American Documentation. 11 (4): 288–295. CiteSeerX 10.1.1.468.1425.
   https://doi.org/10.1002/asi.5090110403.
8. Sanjiv D., Chen M. Yahoo! for Amazon: Extracting market sentiment from stock
   message boards // Proceedings of the Asia Pacific finance association annual con-
   ference (APFA). — 2001.
9. Logit-analiz, . Last accessed 12 Oct 2019.
10. Cramer, J. S. (2002). The origins of logistic regression (PDF) (Technical report).
   119. Tinbergen Institute. pp. 167–178. https://doi.org/10.2139/ssrn.360300.
11. Awesome XGBoost, https://github.com/dmlc/xgboost/tree/master/demo#
   machine-learning-challenge-winning-solutions. Last accessed 10 Nov 2019
12. Tianqi Chen and Carlos Guestrin. XGBoost: A Scalable Tree Boosting System.
   In Proceedings of the 22nd ACM SIGKDD International Conference on Knowl-
   edge Discovery and Data Mining (KDD ’16). ACM, New York, NY, USA, 785-794.
   https://doi.org/10.1145/2939672.2939785
13. Ronan Collobert and Jason Weston. 2008. A unified architecture for natural lan-
   guage processing: deep neural networks with multitask learning. In Proceedings of
   the 25th international conference on Machine learning (ICML ’08). ACM, New York,
   NY, USA, 160-167. https://doi.org/10.1145/1390156.1390177
14. Sokolova, M., Lapalme, G. (2009). A systematic analysis of performance mea-
   sures for classification tasks. Information Processing Management, 45(4), 427–437.
   https://doi.org/10.1016/j.ipm.2009.03.002