[CL-AFF Shared Task]
         Multi-label Text Classification Using an
               Emotion Embedding Model

          Jiwung Hyun1 , Byung-Chull Bae2 , and Yun-Gyung Cheong1
     1
         College of Computing, Sungkyunkwan University, Suwon-si, South Korea
                           {kabbi159, aimecca}@skku.edu
             2
               School of Games, Hongik University, Sejong-si, South Korea
                                 byuc@hongik.ac.kr


         Abstract. In this paper, we propose a deep learning model-based ap-
         proach that combines a language embedding model and an emotion em-
         bedding model in the classification of text for the CL-AFF Shared Task
         2020. The task aims to predict the disclosure and supportiveness labels
         of the comments (to the posts) in the OffMyChest dataset which consists
         of a small labeled dataset and a large unlabeled dataset. We investigate
         the effectiveness of the BERT, Glove, and Emotional Glove embedding
         models, to represent the text for label prediction. We also propose to use
         the original posts in the dataset as contextual information. We evaluated
         our approach and report the results.

         Keywords: Semisupervised learning · Emotion embedding · BERT.


1    Introduction
This paper describes our approach to solve the CL-AFF (Computational Lin-
guistics - Affect Understanding) Shared Task 2020. In the CL-AFF 2020 task,
the OffMyChest conversation dataset is introduced to help understand the role
of emotion in conversations (see [3] for details). The dataset consists of top
posts and the comments to these posts collected from the CasualConversations
and the OffMyChest communities on Reddit. A small portion of the comments
are labeled as informational disclosure, emotional disclosure, and supportive-
ness, where supportiveness is further characterized as general, informational,
and emotional. As a result, a comment text is annotated with a total of 6 labels,
where a single comment can have multiple labels. Therefore, this task involves
multi-label classification problems.
    Text classification in natural language processing has traditionally employed
classification algorithms in machine learning such as support vector machines,
Bayesian classifiers, decision trees, etc. A variety of features in text are given
as the input of the classifiers, which are crucial to traditional text classification
algorithms. For the representation of text, word frequency-based approaches
(e.g., bag-of-words feature) or sequence-based approaches (e.g., N-grams feature)
have been commonly used.


 Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License
 Attribution 4.0 International (CC BY 4.0). In: N. Chhaya, K. Jaidka, J. Healey, L. H. Ungar, A. Sinha
 (eds.): Proceedings of the 3rd Workshop of Affective Content Analysis, New York, USA, 07-
 FEB-2020, published at http://ceur-ws.org
2      J. Hyun et al.

    Text classification using neural models are comprised of embedding models
and classification models. Along with the success of word embedding models
such as Word2vec [6] and Glove [7] in text classification, advances of various
deep learning algorithms have lead to more complex embedding models, such
as contextual language model, also known as ELMo [8] and BERT [1]. Deep
neural networks are used not only to extract features from text but also to
construct classifiers. Kim [4], for example, presented great performances in text
classification by applying 1D CNN on sentence classification problems.
    In this paper, focusing on emotional words in the data, we propose a deep
learning model-based text classification approach using an embedding model,
which has the combination of a language model and an emotion embedding
model. We particularly focus on emotional words from OffMyChest dataset to
improve learning emotional labels (emotional disclosure, emotional support). We
combine labeled and unlabeled comments data for our semi-supervised method
to help supervised learning. And then the sentence features extracted from the
embedding model are given as TextCNN[4] input for text classification.
    Furthermore, to improve the classification performance, we apply EDA (Easy
Data Augmentation)[10] on small labeled data in the training step. This paper
presents our baseline and variables in our experiment, as well as the evaluation
models for binary classification of disclosure and supportiveness (Task 1).


2   The OffMyChest Conversation Dataset

OffMyChest conversation dataset is comprised of three sets - labeled training
set, unlabeled training set, and test set: unlabeled training data set includes
unlabeled posts and comments; labeled training set and unlabeled test set include
about 10,000 labeled sentences and 3,000 unlabeled sentences respectively from
the top commented posts. Table 1 lists several excerpts sampled from the labeled
data.

           Text                                                       id  ED ID S GS IS ES
           Hope you have a nice day                                 91px39 0 0 1 1 0 0
           My wife came in when I was around half way through 91px39 1 1 0 0 0 0
           this and asked why I was all choked up and watery eyed,
           so we read it together and now we’re both crying.
           I am crying a lot of happy tears right now.              91px39 1 0 0 0 0 0
           He’s my father in every sense of the word but name, I 946qw9 1 0 0 0 0 0
           still call him by his first name but only because we are
           both used to it and he doesn’t mind a bit.
           Stepdad will be the one walking me down the aisle when 946qw9 1 1 0 0 0 0
           I get married.
           dThat’s wonderful. :) My step-dad has been around for 946qw9 1 1 1 1 0 1
           30 yrs now.

Table 1: Example Sentences sampled from the labeled dataset, where ED de-
notes Emotional Disclosure, ID denotes Informational Disclosure, S denotes
Support, GS denotes General Support, IS denotes Informational Support, and
ES denotes Emotional Support.
                                   Title Suppressed Due to Excessive Length      3

         Category                 Number of label 1 Percentage in the category
         Emotional Disclosure          3,948                   31%
         Informational Disclosure      4,891                   38%
         Support                       3,226                   25%
         General Support                680                     5%
         Informational Support         1,250                   10%
         Emotional Support             1,006                    8%
         None (all label is 0)         4,157                   32%
         All                          12,860

       Table 2: A simple statistical analysis of the labeled training data


                      No. of Occurrences in No. of Occurrences in
                                                                  Normalized
           Words      Emotional Disclosure Emotional Support
                                                                  Frequency
                           (N =3,948)            (N =1,006)
           know                203                   66             117.02
           really              227                   55             112.17
           get                 213                   57             110.61
           hope                 99                   82             106.59
           sorry                73                   88             105.97
           people              219                   45             100.20
           you                  82                   65             85.382
           even                139                   32             67.017
           right                83                   36             56.808
           things              111                   28             55.949

Table 3: Top 10 words in the emotional categories, ranked by the frequency the
occurrences normalized by the number of labels in each category. N denotes the
number of 1 labels in the category.


    To understand the data, we examine the number of labeled data in each
category (Table 1). As seen in the table, the number of the data labeled as class
1 in the ‘Support’ category groups (e.g., support, general support, informational
support, emotional support) are far less than the number of data labeled as class
0. For instance, the data labeled as 1 in the ‘General Support’ category occupy
only 5% of the a the data. This means that the data exhibits class imbalance
problems, severely in the labels of ‘general support’, ‘informational support’,
‘emotional support’. Although these categories seem to be the sub-types of the
‘Support’ category, we treat them independently for the classification because
the sum of all their instances with label 1 does not match with the number of
instances in the ‘Support’ category (N =3,226).
    Furthermore, we investigate the word usage in the emotional categories. First,
we extract the top 100 most frequently used words in each emotional category.
Then, we remove the words that also appear frequently in the non-emotional
categories (e.g., informational disclosure, support, general support, informational
support). Finally, we normalize the raw count by the number of labels of the
4         J. Hyun et al.


    (a) Emotional disclosure   (b) Informational disclosure   (c) Emotional support


(d) Informational support          (e) general support             (f) no label

                       Fig. 1: Word clouds for each label group


corresponding category and sum up the two numbers. Table 4 shows the top 10
words ranked by the normalized frequencies. Therefore, these words can be used
to characterize the emotional categories.
    Finally, Figure 1 shows the word cloud for each category, visualizing the
frequently used words in the categories. While only nouns are generally used to
construct word clouds, we also use adjectives as they can represent emotion. The
word clouds show that many words (e.g., life, time, good) overlap across different
categories. There are some words unique to each group: for example, ‘way’ in the
informational disclosure category, and ‘sorry’ in the emotional support category.


3      Approach
This section details our approach which employs pre-trained embedding models
to generate the vectors representing the text. These vectors serve as the input
for the textCNN [4] model for multi-label classification. We also investigate if
the use of the post as contextual information can enhance the label prediction
performance Figure 2 illustrates the overall architecture of our system.

3.1     Word Embedding Models
As our word embedding model, we utilize the pre-trained BERT [1] and an
emotional embedding scheme [9]. The original posts of comments are generally
lengthy. Therefore, we summarize them into 3∼5 sentences using LexRank [2]
before text processing. As the first text pre-processing step, the sentences are
                                 Title Suppressed Due to Excessive Length      5


                      Fig. 2: Overall System Architecture


tokenized using the BERT tokenizer. We set the max sequence length of tokens
as 64, because the average sequence length of comments data is 19 and only 1%
of the total training data exceeds 64 tokens. As a result, the feature vector of
a single comment consists of the 64 tokens representing the comment itself and
the 64 tokens representing its corresponding post.
    Additionally, we use an emotional embedding model that incorporates emo-
tional information into a word embedding model such as Word2vec [6], Glove
[7], etc. Emotional embedding refers to a new vector space by fitting emotional
information into pre-trained word vectors. Constraint set constructed by all
word/emotion relations are used for training. It is learned to get closer to the
pairs of words that are in positive relation. We use pre-trained emotional em-
bedding Glove [9]. As the vocabulary in BERT is different from that of Glove,
we set zero embedding when the token tokenized by BERT is not present in the
Glove vocabulary. Features from the pre-trained BERT and emotional embed-
ding Glove are concatenated. As a result, the features vector for a comment has
(62, 1068) shape, representing 62 tokens of 1068 dimensions (768 BERT dimen-
sion + 300 emotional Glove dimension 300). When the post is used as a context,
the input of our CNN model has (124, 1068) shape.


3.2   TextCNN

Our textCNN model uses one-dimension convolutions with the filter size of 256
and the kernel size of 3, 4, 5. After max pooling on convolutions result, the
features are concatenated and flatten to be connected with a fully-connected
layer as the final layer for the classification. The output dimension is 6, and
the sigmoid function is used as the activation function. The output represents
prediction probability of each class; the probability of greater than or equal to
6      J. Hyun et al.

0.5 is labeled as 1, and otherwise labeled as 0. We adopt the Adam optimizer
with learning rate 1e-4 and epsilon 1e-8 in this study. The binary cross entropy
loss function is used to train our model.


3.3   Data Augmentation for Training

Via informal experimentations, we discovered that the prediction performance
of ‘General support’ is very low. We attribute this to the small number of the
label 1 data. On the other hand, the numbers of the ‘Info support’ and the
‘Emo support’ labels are 1250 and 1006 respectively. Those labels show half
the performance of the rest of labels. To address this class imbalance issue, we
apply EDA (Easy Data Augmentation) [10] to the general support, informational
support, and the emotional support categories. EDA uses synonym replacement,
random insertion, random swap, and random deletion for augmentation. We
augment 9 sentences for each sentence of the support group categories during
the training stage in the system run.


         Fig. 3: Our Data Augmentation and Semi-supervised Method


3.4   Semi-supervised Learning Method

The OffMyChest dataset also provides unlabeled comment data, which contain
over 420,000 sentences. To make use of the data, we assign pseudo-labels to the
unlabeled comments data using the best classification model. Then we re-train
the model with the labeled data along with the pseudo-labeled data to improve
the classification performance following the semi-supervised method in [5].
    When training, we want the model to learn from the labeled data more than
the pseudo-labels as the pseudo-labels can be incorrect. Thus, we build the ini-
tial model trained with the augmented labeled data on 10 epochs. Then, we
re-train the model with the pseudo-labeled data on 3 epochs. While we applied
the semi-supervised learning for the system run submission, the experiment re-
sults presented below are obtained without the semi-supervised learning scheme
because our limited computing facility cannot allow the 10-fold cross validation.
                                 Title Suppressed Due to Excessive Length       7

4     Evaluation

To evaluate the proposed approach in this paper, we use 10-fold cross validation
on the labeled training data. In this experimentation, EDA and semi-supervised
learning were not used due to limited computing resources for 10-fold cross
validation. The different conditions examined in our experiments are described
as below.

   Glove [7]: use of the pre-trained Glove trained with Twitter data. That
was our first approach to solve this problem, however not used exclude this
experiment. Since the data we use in this study were collected from the Internet
community Reddit, we expected that the pre-trained Glove model may show
good performance. The pre-trained model uses a 200 dimension vector with 27B
tokens.
   Emotional Glove [9]: emotional embedding model using an approach com-
bining emotional information with Glove embedding. This model uses a 300
dimension vector with 6B tokens.
   BERT [1]: using the state-of-the-art language model to generate the feature
vector. Embedding dimension is 768, and max sequence length is 64.
   Context: utilizing the posts as a contextual information to test if it can
enhance the prediction performance.

4.1   Results

Table 4 reports the accuracy of label prediction. Overall, the BERT embed-
ding model without using the post shows the best performance in accuracy for
three categories. The combination of BERT with Emotional Glove results in the
best performance in the emotional disclosure category (without context) and
the general support category (with context). The combination of BERT with
Glove results outperforms the other models in the information support category.
The performance in accuracy seems promising in the support group categories,
ranging from 0.814(support) to 0.938(general support). Yet, their F1 scores show
different findings.
    Table 5 shows the F1 scores of our model in each category. Overall, the
‘BERT + Emotional Glove’ model and the ‘BERT + Glove’ model show the best
performance. The performance of the support subgroups (i.e., General Support,
Informational Support, and Emotional Support) are poor, as low as 0.05 (for the
general support label prediction), which are not sufficient for its practical use.
Meanwhile, its corresponding accuracy is 0.934. This means that accuracy is not
a good metric when the class is imbalanced. Precision and recall performances
of the selected model are described in the following Table 6.
8         J. Hyun et al.

                       Emotional Information         General Information Emotional Micro
Methods                                      Support
                       Disclosure Disclosure         Support Support      Support Average
Glove (Baseline)          0.662      0.645    0.767   0.935      0.882     0.915   0.807
Emotional Glove           0.672      0.666    0.773   0.937      0.891     0.919   0.816
BERT                      0.693     0.673     0.814   0.935      0.897     0.924   0.829
BERT + context            0.682      0.664    0.785   0.937      0.893     0.915   0.819
BERT + Glove              0.684      0.663    0.808   0.934     0.898      0.922   0.825
BERT + Emotional Glove   0.696       0.662    0.811   0.935      0.896     0.922   0.827
BERT + Emotional Glove
                          0.677      0.664    0.789   0.938      0.892     0.915   0.819
+ context

                               Table 4: Accuracy of our models


                       Emotional Information         General Information Emotional Micro
Methods                                      Support
                       Disclosure Disclosure         Support Support      Support Average
Glove (Baseline)          0.383      0.470    0.464   0.033      0.161     0.163   0.422
Emotional Glove           0.323      0.505    0.422   0.021      0.089     0.137   0.406
BERT                      0.438      0.534    0.544   0.049      0.222     0.215   0.494
BERT + context            0.270      0.448    0.500   0.023      0.189     0.206   0.412
BERT + Glove             0.464      0.544     0.525   0.050      0.228     0.222   0.503
BERT + Emotional Glove    0.445      0.526    0.557   0.045     0.251      0.237   0.499
BERT + Emotional Glove
                          0.350      0.467    0.485   0.006      0.216     0.163   0.429
+ context

                                Table 5: F1 score of our models


                       Emotional Information                General Information Emotional
                                                Support                                        Micro Average
Methods                Disclosure Disclosure                Support      Support    Support
                         P     R     P     R     P     R     P     R     P     R     P     R     P      R
BERT + Emotional Glove 0.507 0.396 0.575 0.484 0.591 0.527 0.051 0.040 0.285 0.224 0.264 0.215 0.575  0.441
BERT + Emotional Glove
                       0.430 0.295 0.577 0.392 0.546 0.437 0.007 0.005 0.250 0.190 0.192 0.142 0.555  0.349
+ context

                   Table 6: Precision and recall of our selected model


4.2     Discussions

The evaluation results indicate the followings:

    – As for the evaluation measures, accuracy is a poor metric to evaluate the
      proposed approach due to class imbalance problems. Therefore, we propose
      to use an F1 score, a harmonic mean of precision and recall, instead.
    – For the word embedding model, the BERT models perform better than the
      Glove models.
                                 Title Suppressed Due to Excessive Length      9

 – The combination of BERT and Glove enhances the classification perfor-
   mance.
 – The use of emotional Glove improves the classification performance for the
   categories where classes are imbalanced (i.e., support, information support,
   emotional support).
 – Opposed to our initial assumption, the use of the original post as a context
   did not contribute to enhancing the prediction performance. We postulate
   that it is because our method of concatenating one comment to one context
   (i.e., a post associated with the comment) fails to give proper weight to
   contexts, as the comments data are presumably labeled without considering
   the contexts.
    For the CL-Aff Shared task 2020 competition, we used ‘BERT + Emotional
Glove’ models as the system run model, as it reports the best F1 scores without
considering context. In the system run we applied EDA (Easy Data Augmen-
tation) and semi-supervised learning as described in Section 3.3 and 3.4. Our
submission contains the labels generated using 4 different settings from the com-
bination of [‘with context’ and ‘without context’] and [‘with pseudo-label’ and
‘without pseudo-label’].


5   Conclusion
This paper describes our approach that combines the word and emotion embed-
ding models to predict ‘Disclosure’ and ‘Supportiveness’ in OffMyChest dataset.
Three language embedding models - BERT, Glove, and Emotional Glove - were
compared, and BERT showed a better performance (about 10%) for the label
prediction than Glove. Our evaluation results also indicate that the combination
of embedding models can improve the performance. It is particularly noted that
Emotional Glove better represents the text than Glove when the class is im-
balanced. We adopted the original posts along with their associated comments
to increase the prediction performance. However, the result shows that using
the post makes no contribution to increasing the model’s classification perfor-
mance. In the future, we plan to investigate how to convey the context of a
post in an efficient manner rather than just concatenating to increase prediction
performance.


Acknowledgement
This research was partially supported by Basic Science Research Program through
the National Research Foundation of Korea(NRF) funded by the Ministry of
Education(2016R1D1A1B03933002). This work was partially supported by the
National Research Foundation of Korea(NRF) grant funded by the Korea govern-
ment(MEST) (No. 2019R1A2C1006316). This work was also partially supported
by Basic Science Research Program through the National Research Foundation of
Korea (NRF) funded by the Ministry of Science and ICT (2017R1A2B4010499).
10      J. Hyun et al.

References
 1. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidi-
    rectional transformers for language understanding (2018)
 2. Erkan, G., Radev, D.R.: Lexrank: Graph-based lexical centrality as salience
    in text summarization. J. Artif. Int. Res. 22(1), 457–479 (Dec 2004),
    http://dl.acm.org/citation.cfm?id=1622487.1622501
 3. Jaidka, K., Singh, I., Jiahui, L., Chhaya, N., Ungar, L.: A report of the CL-Aff
    OffMyChest Shared Task at Affective Content Workshop @ AAAI. In: Proceedings
    of the 3rd Workshop on Affective Content Analysis @ AAAI (AffCon2020). New
    York, New York (February 2020)
 4. Kim, Y.: Convolutional neural networks for sentence classification. In: Pro-
    ceedings of the 2014 Conference on Empirical Methods in Natural Lan-
    guage Processing (EMNLP). pp. 1746–1751. Association for Computational
    Linguistics, Doha, Qatar (Oct 2014). https://doi.org/10.3115/v1/D14-1181,
    https://www.aclweb.org/anthology/D14-1181
 5. Lee, D.H.: Pseudo-label : The simple and efficient semi-supervised learning method
    for deep neural networks. ICML 2013 Workshop : Challenges in Representation
    Learning (WREPL) (07 2013)
 6. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed rep-
    resentations of words and phrases and their compositionality. In: Proceedings
    of the 26th International Conference on Neural Information Processing Sys-
    tems - Volume 2. pp. 3111–3119. NIPS’13, Curran Associates Inc., USA (2013),
    http://dl.acm.org/citation.cfm?id=2999792.2999959
 7. Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word repre-
    sentation. In: Empirical Methods in Natural Language Processing (EMNLP). pp.
    1532–1543 (2014), http://www.aclweb.org/anthology/D14-1162
 8. Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettle-
    moyer, L.: Deep contextualized word representations. In: Proc. of NAACL (2018)
 9. Seyeditabari, A., Tabari, N., Gholizadeh, S., Zadrozny, W.: Emotional embed-
    dings: Refining word embeddings to capture emotional content of words. CoRR
    abs/1906.00112 (2019), http://arxiv.org/abs/1906.00112
10. Wei, J.W., Zou, K.: EDA: easy data augmentation techniques for boost-
    ing performance on text classification tasks. CoRR abs/1901.11196 (2019),
    http://arxiv.org/abs/1901.11196