A Novel Machine Learning-based Sentiment Analysis
 Method for Chinese Social Media Considering Chinese
            Slang Lexicon and Emoticons

                      Da Li1 , Rafal Rzepka1 , Michal Ptaszynski2 ,
                                    and Kenji Araki1
                 1
                  Graduate School of Information Science and Technology
                          Hokkaido University, Sapporo, Japan
      2
        Department of Computer Science, Kitami Institute of Technilogy, Kitami, Japan


       Abstract. Internet slang is an informal language used in everyday online com-
       munication which quickly becomes adopted or discarded by new generations.
       Similarly, pictograms (emoticons/emojis) have been widely used in social media
       as a mean for graphical expression of emotions. People can convey delicate nuances
       through textual information when supported with emoticons. Furthermore, we
       also noticed that when people use new words and pictograms, they tend to express
       a kind of humorous emotion which is difficult to clearly classify as positive or
       negative. Therefore, it is important to fully understand the influence of Inter-
       net slang and emoticons on social media. In this paper, we propose a machine
       learning method considering Internet slang and emoticons for sentiment analysis
       of Weibo, the most popular Chinese social media platform. In the first step, we
       collected 448 frequent Internet slang expressions as a slang lexicon, then we con-
       verted the 109 Weibo emoticons into textual features creating Chinese emoticon
       lexicon. To test the capability of recognizing humorous posts, we utilized both
       lexicons with several machine learning approaches, k-Nearest Neighbors, Deci-
       sion Tree, Random Forest, Logistic Regression, Naı̈ve Bayes and Support Vector
       Machine for detecting humorous expressions on Chinese social media. Our exper-
       imental results show that the proposed method can significantly improve the per-
       formance for detecting expressions which are difficult to polarize into positive-
       negative categories.

       Keywords: sentiment analysis · machine learning · social media · Internet slang
       · emoticons


1   Introduction

Nowadays, people have become increasingly accustomed to expressing their opinions
online, especially on social media such as Twitter, Facebook or Weibo - the biggest
Chinese social media network that was launched in 2009. The rapid growth of such
platforms provides rich multimedia data in large quantities for various research oppor-
tunities as sentiment analysis which focuses on automatic sentiment prediction on given
contents. Microblog data contain a vast amount of valuable sentiment information not
only for the commercial use, but also for psychology, cognitive linguistics or political
science. Sentiment analysis has been widely used in real world applications by ana-
lyzing the online user-generated data, such as election prediction, opinion mining and
business-related activity analysis [25]. Sentiment analysis of microblogs becomes an
important area of research in the field of Natural Language Processing. Study of senti-
ment in microblogs in English language has undergone major developments in recent
years [14]. Chinese sentiment analysis research, on the other hand, is still at relatively
early stage [19] especially when it comes to lexicons and emoticons usage.
     Pictograms (emoticons/emojis) have been widely used in social media as a mean for
graphical expression of emotions. According to the study about Instagram, emojis are
present in up to 57% of online messages in many countries3 . For example, “face with
tears of joy”, an emoji that means that somebody is in an extremely good mood, was
regarded as the 2015 word of the year by The Oxford Dictionary [12]. In our opinion
ignoring emoticons in sentiment research is unjustifiable, because they convey a sig-
nificant emotional information and play an important role in expressing emotions and
opinions in social media [13, 5].
     Internet slang is ubiquitous on the Internet. The emergence of new social contexts
like micro-blogs, question-answering forums, and social networks has enabled slang
and non-standard expressions to abound on the web. Despite this, slang has been tradi-
tionally viewed as a form of non-standard language, a form of language that is not the
focus of linguistic analysis and has largely been neglected [8].
     Furthermore, we also noticed that when people use new words and pictograms, they
tend to express a kind of humorous emotion which is difficult to be easily classified as
positive or negative. It seems that some emoticons are used just for fun, self-mockery
or jocosity which expresses an implicit humor which might be characteristic to Chinese
culture. Emoticons and slang seem to play an important role in expressing this kind of
emotion. There is a high possibility that this phenomenon can cause a significant dif-
ficulty in sentiment recognition task. Figure1 shows an example of a Weibo microblog
posted with emoticons and Internet slang. In the second line of the post,              (lei
jue bu ai) is a Chinese informal contraction meaning                                  (hen
lei, gan jue zi ji bu hui zai ai le which means “too tired for romance”). Such abbrevia-
tions are popular and usually extracted from popular phrases and shorten into four char-
acters in general, and become a new chengyu, a type of traditional Chinese idiomatic
expression most of which consist of four characters. Chengyu are considered as col-
lected wisdom of the Chinese culture. Through the insights learned from chengyu, we
can express and discover wise men’s experiences, moral concepts, or admonishments
from the older generations of Chinese. Nowadays, chengyu still plays an important
role in Chinese conversations and education. When a new chengyu is introduced it can
also convey a humorous content. Examples of such abbreviations are: lei jue bu ai, ren
jian bu chai (life is so hard that some lies are better not exposed), xi da pu ben (news
so exhilarating that everyone is celebrating and spreading it around the world) and so
on. When it comes to emoticons, new ones are introduced by social media companies,
but their meaning can change with time. For example             was originally and emoji
meant for expressing “bye-bye” gesture. However, it seems that gradually Weibo users

 3
     https://www.quintly.com/blog/instagram-emoji-study
started using this emoji for expressing artificial smile and refuse or self-mockery4 . In
the research of [9], it was shown that this emoji expresses humorous emotion rather
than negative polarity. For example, in the following post: “After jogging, I’m starving.
Someone sent me a picture of kebab. I’m too tired for romance              ”.


Fig. 1. Example of Weibo post with Internet slang and emoticons. The entry says “After jogging,
I’m starving. Someone sent me a picture of kebab. I’m too tired for romance”.


   To address this phenomenon, in this paper we focus on the Internet slang and emoti-
cons used on Weibo in order to establish if both slang and emoticons improve sentiment
 4
     https://qz.com/944693
analysis by recognizing humorous entries which are difficult to polarize. To perform
experiments, we collected 448 frequent Chinese Internet slang expressions as a slang
lexicon, then we converted 109 Weibo emoticons into textual features creating Chi-
nese emoticon lexicon. Then we utilized both lexicons with several machine learning
approaches, k-Nearest Neighbors, Decision Tree, Random Forest, Logistic Regression,
Naı̈ve Bayes and Support Vector Machine for detecting humorous expressions on Chi-
nese social media. Our experimental results show that the proposed method can signifi-
cantly improve the performance for detecting expressions which are difficult to polarize
into positive-negative categories.
    Our main contributions are as follows:

    – We collected 448 frequent Chinese Internet slang expressions as a Chinese slang
      lexicon.
    – We converted the 109 Weibo emoticons into textual features creating Chinese emoti-
      con lexicon.
    – We empirically confirmed implicit humor characteristic to Chinese culture visible
      on Weibo and utilized both lexicons with several machine learning approaches for
      detecting humorous expressions on Weibo and confirmed that using both slang and
      emoticons improves previously proposed method.


2     Related Research

At present, the sentiment analysis technology generally can be divided into two cate-
gories: rule-based methods relying on sentiment lexicons, and machine learning-based
methods relying on annotated data.


2.1    Rule-based methods

Zhang et al. [24] proposed a rule-based approach with two phases: a) the sentiment of
each sentence is first decided based on word dependency to aggregate the sentences
sentiments and then b) the sentiment of each document is calculated. Zagibalov et al.
[23] presented a method that does not require any annotated corpus training data and
only requires information on commonly occurring negations and adverbials. Li et al.
stated that polarities and strengths judgment of sentiment words comply with a Gaus-
sian distribution, and thus proposed a Normal distribution-based sentiment computation
method which allows quantitative analysis of semantic fuzziness of sentiment words
in Chinese language [10]. Zhuo et al. presented a novel approach based on the fuzzy
semantic model by using an emotion degree lexicon and a fuzzy semantic model [26].
Their model includes text preprocessing, syntactic analysis, and emotion word process-
ing. However, optimal results of Zhuo’s model were achieved only when the task was
clearly defined. Wu et al. presented an approach to leverage Web resources to construct a
English Slang Sentiment Dictionary (SlangSD) that is easy to expand [21]. They empir-
ically showed the advantages of using SlangSD, the newly-built slang sentiment word
dictionary for sentiment classification, and provided examples demonstrating its ease of
use with a sentiment analysis system.
2.2    Machine learning-based methods

Tan and Zhang conducted an empirical study of sentiment categorization on Chinese
documents [18]. They tested four features – mutual information, information gain, chi-
square, and document frequency; and five learning algorithms: centroid classifier, k-
Nearest Neighbor, Winnow classifier, Naı̈ve Bayes (NB) and Support Vector Machine
(SVM). Their results showed that the information gain and SVM features provided the
best performances for sentiment classification coupled with domain or topic depen-
dent classifiers. There are also researchers who have combined the machine learning
approach with the lexicon-based approach. Chen et al. proposed a novel sentiment clas-
sification method which incorporated existing Chinese sentiment lexicon and convo-
lutional neural network [2]. The results showed that their approach outperforms the
convolutional neural network (CNN) model only with word embedding features [7].
However, all these approaches did not consider emoticons.
     Recently, a powerful system utilizing emoji in Twitter sentiment analysis model
called DeepMoji was proposed [4]. Its creators trained 1,246 million tweets containing
one of 64 common emoticons by Bi-directional Long Short-Term Memory (Bi-LSTM)
model and applied it to interpret the meaning behind the online messages. DeepMoji
is also the most advanced sarcasm-detecting model, with an accuracy rate of 82.4%
even outperforming human detectors who managed to acquire 76.1% accuracy rate.
Sarcasm reverses the emotion of the literal text, therefore sarcasm-detecting capability
can play a significant role in sentiment analysis, especially in case of social media.
Although sarcasm and irony tend to convey negative emotions in general, we found
that in Chinese social media (Weibo in our example), in addition to the expression of
positive and negative emotions, people tend to express a kind of humorous emotion that
escapes the traditional bi-polarity.


                    Table 1. Examples of our Chinese Internet Slang Lexicon.


             Type                     Examples (Origin)              English Translation
           Numbers                                                        “laughter”
 Latin alphabet abbreviations                                               “Damn”
                                                                “Life is so hard that some lies
      Chinese contractions
                                                                are better not exposed.”
           Neologisms                                                      “Loser”
      Phrases with altered
                                                                      “Vulgar tycoon”
      or extended meanings
       Puns and wordplay                                                 “Harmony”
       Slang derived from
                                                                          “Brother”
       foreign language
3     Lexicon of Chinese Online Slang

Chinese Internet slang is informal language used to express ideas on the Chinese Inter-
net in response to events, to mass media and foreign cultures. It also expresses a nat-
ural human desire to simplify and update language. Slang that first appears on-line is
often adopted to become widely used in everyday life. It includes content relating to
all aspects of social life, mass media, economic, political situation etc. Internet slang
is arguably the fastest-changing aspect of a language, created by a number of different
influences, technology, mass media and foreign culture amongst others.
    Because Internet slang is not easy to extract automatically, it can cause a significant
difficulty in sentiment detecting task. For improving the performance of Chinese social
media sentiment analysis, we created a Chinese Internet slang lexicon (examples shown
in Table 1). We manually extracted 448 frequent Internet slang terms from the Internet
New Words Ranking List, Baidu Baike5 , Wikipedia6 and social media systems such as
Baidu Tieba7 and Weibo8 between 2010 and 2018, and stored them as Chinese Internet
Slang Lexicon. After analysis we observed that the entries fall under seven following
categories:

    – Numbers: such as 233 (“laughter/lol”: Chinese use 233 to express “can’t stop laugh-
      ing” because 233 is an emotional sign in a Chinese BBS site9 and the sign is the
      NO.233 in the list of all emoticons); 213 (“a person who is very stupid”); 520/521
      (“I love you”).
    – Latin alphabet abbreviations: Chinese users commonly use a QWERTY keyboard
      with pinyin enabled. Upper case letters are quick to type and require no trans-
      formation. (Lower case letters spell words). Latin alphabet abbreviations (rather
      than Chinese characters) are also sometimes used to evade censorship. Such as SB
      (“dumb cunt”); YY (“fantasizing/sexual thoughts”); TT (“condom”).
    – Chinese contractions: e.g. ren jian bu chai (“life is so hard that some lies are bet-
      ter not exposed”: This comes from the lyrics of a song entitled “Shuo Huang”
      (“Lies”), by Taiwanese singer Yoga Lin. This slang reflects that some people, espe-
      cially young people in China, are disappointed by reality); lei jue bu ai (“too tired
      for romance”: this slang phrase is a literal abbreviation of the Chinese phrase “too
      tired to fall in love anymore”. It originated from an article on the Douban website, a
      Chinese social networking service website allowing registered users to record infor-
      mation and create content related to film, books, music, recent events and activities
      in Chinese cities. The article was posted by a 13-year-old boy who grumbled about
      his single status and expressed his weariness and frustration towards romantic love.
      The article went viral on the Chinese Internet, and the phrase was subsequently
      used as a sarcastic way to convey depression when encountering misfortunes or
      setbacks in life); gao da shang (“high-end, impressive, and high-class”: a popular
 5
   https://baike.baidu.com
 6
   https://en.wikipedia.org
 7
   https://tieba.baidu.com
 8
   https://www.weibo.com
 9
   https://www.mop.com
      meme used to describe objects, people, behavior, or ideas that became popular in
      late 2013).
    – Neologisms: diao si (“loser”: The word diao si is used to describe young males who
      were born into a poor family and are unable to improve their financial status. People
      usually use this phrase in an ironic and self-deprecating way); ye shi zui le (“nothing
      to say”: it is a way to gently express your frustration with someone or something
      that is completely unreasonable and unacceptable); dan shen gou (“single dog”: a
      term which single people in China use to poke fun at themselves for being single).
    – Phrases with altered or extended meanings: hao or tu hao (“vulgar tycoon”: This
      word refers to irritating online game players who buy large amounts of game weapons
      in order to be gloried by others. Starting from late 2013, the meaning has changed
      and now is widely used to describe nouveau riche people in China who are wealthy
      but less cultured.); bei tai (“spare tire”: A girlfriend or boyfriend kept as a “backup”,
      “plan B” , just in case of breaking up with the current partner).
    – Puns and wordplay:        (“river crab”: pun on      , another Chinese characters
      pronounced he xie, meaning “harmony”).
    – Slang derived from foreign language:        (The word gong kou comes from the
      Japanese katakana ero, which translated from English “erotic” into the abbreviation
      of the katakana             , meaning “sensual”).


4     Lexicon of Chinese Social Media Emoticons

In the real-life (offline) dialogue between human beings, besides tone changes, we usu-
ally express emotions with body language. In social networks, this can partially be
achieved by using emoticons [1].
    There are many unknown factors in constantly changing moods of human beings,
but communication with emoticons has become a global phenomenon. On the other
hand, because of different ethnic and cultural differences, misunderstandings when
using facial emoticons is not uncommon [16]. We also have noticed previously men-
tioned humorous emotion in Weibo microblog entries containing emoticons which are
often difficult to interpret as positive or negative. It seems that some emoticons are used
just for fun, self-mockery or jocosity which expresses an implicit humor characteristic
in Chinese culture. Emoticons seem to play an important role in expressing this kind of
emotion. There is a high possibility that this phenomenon can cause a significant diffi-
culty in sentiment detecting task, therefore we decided to build a lexicon of emoticons
before adding them to our system for classifying emotions in Weibo.
    When we collected microblog data, we discovered that Weibo emoticons are trans-
formed by API into Chinese characters, for example,             will be convert into
(“smile”). This provided us with the possibility of building Chinese emoticon lexicon.
Therefore, we selected the 109 Weibo emoticons (see Figure 2) which can be trans-
formed into Chinese characters, and converted them into textual features to create Chi-
nese emoticon lexicon. Several examples are shown in Table 2.
      Table 2. Examples of Chinese Emoticon Lexicon.

Emoticon      Textual Feature         Emotion/Implication
                                             “smile”
                                            “lovely”
                                          “too happy”
                                           “applause”
                                           “hee hee”
                                            “ha-ha”
                                     “face with tears of joy”
                                             “wink”
                                            “greedy”
                                     “speechless/awkward”
                                            “sweat”
                                           “nosepick”
                                             “snort”
                                            “anger”
                                      “upset/fell wronged”
                                           “pathetic”
                                       “disappointment”
                                              “sad”
                                             “weep”
                                              “shy”
                                             “filthy”
                                           “love face”
                                          “kissy face”
                                             “leer”
                                          “lick screen”
                                           “longing”
                                          “dog leash”
                                          “smugshrug”
        Fig. 2. 109 Weibo emoticons which can be transformed into Chinese characters.


5     Machine Learning approaches
Inspired by above mentioned works on Internet slang and emoticons, in order to test the
influence of them, we utilized both lexicons with several machine learning approaches,
k-Nearest Neighbors (k-NN), Decision Tree (DT), Random Forest (RF), Logistic Regres-
sion (LR), Naı̈ve Bayes (NB) and Support Vector Machine (SVM) for detecting humor-
ous expressions on social media. We did not tested deep learning approaches as the data
size was not sufficient.
    In the first step, we add the Chinese slang lexicon and Chinese emoticon lexicon
to segmentation tool for matching new words and emoticons. Then we use the updated
tool to segment the sentences of large data set. Second, we apply the segmentation
results into the word embedding tool for training word vectors. Next, we apply the word
embedding model which considered Internet slang and emoticons to train a machine
learning model with training data. Finally, we input testing data into machine learning
model, and we can obtain the sentiment probability of a Weibo post which considers
the effect of emoticons and Internet slang.


6     Experiments
In order to verify the validity of our proposed method, we performed series of experi-
ments described below.

6.1   Preprocessing
Initializing word vectors with those obtained from an unsupervised neural language
model is a popular method to improve performance in the absence of a large supervised
training set. For our experiment we collected a large dataset (7.6 million posts) from
Weibo API from May 2015 to July 2017 to be used for calculating word embeddings.
First, we deleted the images, and videos treating them as noise. Second, we applied
Chinese Internet slang lexicon and Chinese emoticon lexicon into the dictionary of
Python Chinese word segmentation module Jieba10 . Next, we used Jieba to segment the
sentences of the microblogs, and applied the segmentation results into the word2vec
model [11] for training word vectors. The vectors have dimensionality of 300 and were
trained using the continuous skip-gram model.
     Next, we collected 3,000 Weibo posts containing the emoticons. To use these posts
as our training data, we asked three Chinese native speakers to annotate them into two
categories: “humorous”, and “non-humorous”. After one annotator labelled polarities of
all posts, two other native speakers confirmed correctness of his annotations. Whenever
there was a disagreement, all decided the final polarity through discussion.

6.2    Applied Classifiers
Logistic Regression Logistic regression model is confirmed to be used in many tasks
such as document classification [22]. In Logistic regression model, we generally correct
overfitting with regularization. Regularization adds a penalty term on model to reduce
the freedom of the model. Hence, the model will be less likely to fit the noise of the
training data and will improve the generalization abilities of the model. We train the
model with L2 penalty regularization called Ridge regression in our experiments.

Support Vector Machine Support vector machine [3] is a supervised learning model
with associated learning algorithms that analyzes data used for classification. An SVM
model is a representation of the examples as points in space, mapped so that the exam-
ples of the separate categories are divided by a clear gap that is as wide as possible.
New examples are then mapped into that same space and predicted to belong to a cate-
gory based on which side of the gap they fall. In addition, it uses kernel trick, implicitly
mapping their inputs into high-dimensional feature spaces. In our experiments, we used
the radial basis function kernel.

Naı̈ve Bayes Naı̈ve Bayes classifier is based on applying Bayes theorem with strong
independence assumptions between the features. Naı̈ve Bayes has been studied exten-
sively since the 1950s. It was introduced under a different name into the text retrieval
community in the early 1960s, and remains a baseline method for text categorization
[15], the problem of judging documents as belonging to one category or the other with
word frequencies as the features. With appropriate pre-processing, it is competitive in
text classification task with more advanced methods including support vector machines.
In our experiments, we set the parameter of alpha to 0.01.

k-Nearest Neighbors In pattern recognition, the k-Nearest Neighbors algorithm is a
non-parametric method used for classification and regression. In both cases, the input
10
     https://github.com/fxsjy/jieba
consists of the k closest training examples in the feature space [20]. The output depends
on whether k-NN is used for classification or regression. The number of neighbors is
set to 5 in our experiments.

Random Forest Random forests is an ensemble learning method for classification,
regression and other tasks, that operate by constructing a multitude of decision trees at
training time and outputting the class that is the mode of the classes (classification) or
mean prediction (regression) of the individual trees [6]. Random decision forests correct
for decision trees habit of overfitting to their training set.

Decision Tree Decision tree is a decision support tool that uses a tree-like graph or
model of decisions and their possible consequences, including chance event outcomes,
resource costs, and utility. Decision tree is commonly used in operations research,
specifically in decision analysis to help identify a strategy most likely to reach a goal,
but is also a popular tool in machine learning [17].

6.3   Performance Test
Using trained word2vec model, we passed word vectors of training data into the machine
learning models to train the model. We collected and annotated 300 Weibo entries with
emoticons as a test set, deleted images, and videos. Then we used the above mentioned
methods to calculate scores of the precision, recall and F1-score. We compared the
results of humorous detecting by machine learning only, machine learning considering
Internet slang only, and machine learning approaches considering emoticons only. The
results are shown in Table 3, Table 4 and Table 5, respectively. The Table 6 introduces
results of the experiment where both Internet slang and emoticons were used, and Table
7 shows the results of F1-score with above methods.

Table 3. Comparison results of machine learning without emoticons and Internet slang lexicon.

                                 Precision         Recall         F1-score
                    DT            56.21%           55.56%         55.88%
                    RF            60.26%           54.97%         57.49%
                   k-NN           63.48%           66.08%         64.76%
                    NB            59.17%           83.04%         69.10%
                    LR            60.16%           86.55%         70.98%
                   SVM            57.00%          100.00%         72.61%


   The results show that considering Internet slang and emoticons. Limited to small
annotated data, the precision of the humor / non-humor classification was relatively
low, but by considering Internet slang and emoticons, the F1-score of each classifier
outperformed previous method by 1.39% (LR), 2.13% (SVM), 2.90% (NB), 0.69%
(k-NN), 0.84% (RF) and 3.89% (DT). Our proposed approach has improved the per-
formance showing that low-cost, small-scale data labeling is able to outperform widely
  Table 4. Comparison results of machine learning with Internet slang lexicon only.

                             Precision             Recall        F1-score
                RF               59.76%            57.31%            58.51%
               k-NN              60.10%            69.59%            64.50%
                DT               66.46%            63.74%            65.07%
                NB               59.92%            83.04%            69.61%
                LR               60.16%            88.30%            71.56%
               SVM               57.00%            100.0%            72.61%


Table 5. Comparison results of machine learning with Chinese emoticons lexicon only.

                             Precision             Recall        F1-score
                RF               61.59%            54.39%            57.76%
                DT               63.37%            63.74%            63.56%
               k-NN              60.70%            71.35%            65.59%
                NB               59.92%            83.04%            69.61%
                LR               60.08%            88.89%            71.70%
               SVM               58.44%           100.00%            73.77%


   Table 6. Comparison results of machine learning with both emoticons and slang.

                             Precision             Recall        F1-score
                RF               61.29%            54.75%            58.33%
                DT               62.50%            57.26%            59.77%
               k-NN              61.37%            70.11%            65.45%
                NB               63.60%            82.96%            72.00%
                LR               62.30%            86.31%            72.37%
               SVM               59.67%           100.00%            74.74%


            Table 7. Comparison results of F scores between feature sets.

                      Baseline            Slang         Emoticons              Both
        RF            57.49%              58.51%            57.76%            58.33%
        DT            55.88%              65.07%            63.56%            59.77%
       k-NN           64.76%              64.50%            65.59%            65.45%
        NB            69.10%              69.61%            69.61%            72.00%
        LR            70.98%              71.56%            71.70%            72.37%
       SVM            72.61%              72.61%            73.77%            74.74%
used state-of-the-art when emoticon and slang information is added to the learning pro-
cess.


7   Considerations


In our proposed approach, we paid more attention to the emoticons and Internet slang
in microblogs and investigated how adding these features separately and together influ-
ences the previously proposed method for recognizing humorous posts which are prob-
lematic when it comes to semantic analysis. Figure 3) shown an example of a microblog
which was correctly classified by our proposed method as “humorous” while the base-
line recognized it incorrectly as non-humorous. This post contains word               (yi
ke sai ting which is a homophone of English word “exciting”). The baseline does not
know this expression and the parser divides it as                (yi ke / sai ting which
means “a rowing boat”). When this expression is accompanied by            emoticon, they
both improve the performance of classification and predict the implicit humorous mean-
ing.


                 Fig. 3. Example of correct classification of humorous post.


    Error analysis showed that some posts were wrongly predicted due to proper nouns
missing in the parser’s dictionary which brought clearly negative impact on the results.
In Figure 4 we show an example of such misclassification into “non-humorous” cate-
gory annotated as “humorous” by annotators. Name of a ticketing website Da mai wang
was parsed incorrectly, and one shifted character caused mis-recognition of humorous
word. Weibo microblogs contain numerous ideograms deliberately altered from their
everyday meaning, what makes them difficult to parse and match. We think that adding
new named entities into the parser’s dictionary may significantly improve the results in
the future. We observed that when emotions are expressed online, emoticons might play
a greater role than it is usually considered, therefore we will experiment with weight of
the emoticons in the future.
           Fig. 4. Example of wrong classification into “non-humorous” category.


8   Conclusions and Future Work


In this paper, we proposed adding Chinese Internet slang and emoticons for automatic
classification of humorous posts on social media platform Weibo in order to sepa-
rate them from clearly positive and negative ones. We collected 448 frequent Internet
slang expressions and created a slang lexicon, then we converted the 109 Weibo emoti-
cons into textual features creating Chinese emoticon lexicon. To test the influence of
slang and emoticons on sentiment analysis task, we utilized both lexicons with several
machine learning-based classifiers, namely k-Nearest Neighbors, Decision Tree, Ran-
dom Forest, Logistic Regression, Naı̈ve Bayes and Support Vector Machine for detect-
ing humorous expressions on Chinese social media. Our experimental results show that
the proposed additions can significantly improve the F1-score for detecting humorous
expressions which are difficult to polarize into positive-negative categories.
    For improving the performance of the proposed method, in near future we are going
to increase the size of both slang and emoticon lexicons to improve further classifi-
cation results. Furthermore, we plan to add image processing for classifying stickers
which also seem to convey rich emotional information. Our ultimate goal is to investi-
gate how much the newly introduced features are beneficial for sentiment analysis by
feeding them to a deep learning model which should allow us to construct a high-quality
sentiment recognizer for wider spectrum of sentiment in Chinese language.
9    Acknowledgment
This work was supported by JSPS KAKENHI Grant Number 17K00295.


References
 1. Aldunate, N., González-Ibáñez, R.: An integrated review of emoticons in computer-mediated
    communication. Frontiers in psychology 7, 2061 (2017)
 2. Chen, Z., Xu, R., Gui, L., Lu, Q.: Combining convolution neural network and word senti-
    ment sequence features for chinese text sentiment analysis. Journal of Chinese Information
    Processing (2015)
 3. Cortes, C., Vapnik, V.: Support-vector networks. Machine learning 20(3), 273–297 (1995)
 4. Felbo, B., Mislove, A., Søgaard, A., Rahwan, I., Lehmann, S.: Using millions of emoji occur-
    rences to learn any-domain representations for detecting sentiment, emotion and sarcasm.
    arXiv preprint arXiv:1708.00524 (2017)
 5. Guibon, G., Ochs, M., Bellot, P.: From emojis to sentiment analysis. In: WACAI 2016 (2016)
 6. Ho, T.K.: Random decision forests. In: Document analysis and recognition, 1995., proceed-
    ings of the third international conference on. vol. 1, pp. 278–282. IEEE (1995)
 7. Kim, Y.: Convolutional neural networks for sentence classification. arXiv preprint
    arXiv:1408.5882 (2014)
 8. Kulkarni, V., Wang, W.Y.: Tfw, damngina, juvie, and hotsie-totsie: On the linguistic and
    social aspects of internet slang. arXiv preprint arXiv:1712.08291 (2017)
 9. Li, D., Rzepka, R., Ptaszynski, M., Araki, K.: Emoticon-aware recurrent neural network
    model for chinese sentiment analysis. In: The Ninth IEEE International Conference on
    Awareness Science and Technology (iCAST 2018) (2018)
10. Li, R., Shi, S., Huang, H., Su, C., Wang, T.: A method of polarity computation of chinese
    sentiment words based on gaussian distribution. In: International Conference on Intelligent
    Text Processing and Computational Linguistics. pp. 53–61. Springer (2014)
11. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in
    vector space. arXiv preprint arXiv:1301.3781 (2013)
12. Moschini, I.: The” face with tears of joy” emoji. a socio-semiotic and multimodal insight into
    a japan-america mash-up. HERMES-Journal of Language and Communication in Business
    (55), 11–25 (2016)
13. Novak, P.K., Smailović, J., Sluban, B., Mozetič, I.: Sentiment of emojis. PloS one 10(12),
    e0144296 (2015)
14. Peng, H., Cambria, E., Hussain, A.: A review of sentiment analysis research in chinese lan-
    guage. Cognitive Computation 9(4), 423–435 (2017)
15. Rish, I., et al.: An empirical study of the naive bayes classifier. In: IJCAI 2001 workshop on
    empirical methods in artificial intelligence. vol. 3, pp. 41–46. IBM New York (2001)
16. Rzepka, R., Okumura, N., Ptaszynski, M.: Worlds linking faces – meaning and possibili-
    ties of contemporary pictograms. Journal of the Japanese Society for Artificial Intelligence
    (2017)
17. Sharma, P., Kaur, M.: Classification in pattern recognition: A review. International Journal
    of Advanced Research in Computer Science and Software Engineering 3(4) (2013)
18. Tan, S., Zhang, J.: An empirical study of sentiment analysis for chinese documents. Expert
    Systems with applications 34(4), 2622–2629 (2008)
19. Wang, X., Zhang, C., Ji, Y., Sun, L., Wu, L., Bao, Z.: A depression detection model based on
    sentiment analysis in micro-blog social network. In: Pacific-Asia Conference on Knowledge
    Discovery and Data Mining. pp. 201–213. Springer (2013)
20. Weinberger, K.Q., Saul, L.K.: Distance metric learning for large margin nearest neighbor
    classification. Journal of Machine Learning Research 10(Feb), 207–244 (2009)
21. Wu, L., Morstatter, F., Liu, H.: Slangsd: Building and using a sentiment dictionary of slang
    words for short-text sentiment classification. arXiv preprint arXiv:1608.05129 (2016)
22. Yu, H.F., Huang, F.L., Lin, C.J.: Dual coordinate descent methods for logistic regression and
    maximum entropy models. Machine Learning 85(1-2), 41–75 (2011)
23. Zagibalov, T., Carroll, J.: Automatic seed word selection for unsupervised sentiment clas-
    sification of chinese text. In: Proceedings of the 22nd International Conference on Compu-
    tational Linguistics-Volume 1. pp. 1073–1080. Association for Computational Linguistics
    (2008)
24. Zhang, C., Zeng, D., Li, J., Wang, F.Y., Zuo, W.: Sentiment analysis of chinese documents:
    From sentence to document level. Journal of the American Society for Information Science
    and Technology 60(12), 2474–2487 (2009)
25. Zhao, P., Jia, J., An, Y., Liang, J., Xie, L., Luo, J.: Analyzing and predicting emoji usages
    in social media. In: Companion of the The Web Conference 2018 on The Web Conference
    2018. pp. 327–334. International World Wide Web Conferences Steering Committee (2018)
26. Zhuo, S., Wu, X., Luo, X.: Chinese text sentiment analysis based on fuzzy semantic model.
    In: Cognitive Informatics & Cognitive Computing (ICCI* CC), 2014 IEEE 13th International
    Conference on. pp. 535–540. IEEE (2014)