A Convolutional Neural Network for Ranking Advice Quality in Texts for a Motivational Dialogue System

Patrycja Swieczkowska, Rafal Rzepka and Kenji Araki
Graduate School of Information Science and Technology, Hokkaido University, Japan
{swieczkowska,rzepka,araki}@ist.hokudai.ac.jp

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

In this paper, we present research into advisory texts which eventually will be used to create a dialogue system providing motivational support to the user. We studied advisory comments from the online platform Reddit, including those containing motivational advice. Utilizing advice features identified in previous studies, we were able to correctly rank these comments within groups of three based on the quality of their advice content. Our convolutional neural network achieved a mean accuracy of 0.97 in 10-fold cross-validation experiments. The contributions of this research are gaining further insight into the advice features possessed by advisory comments and creating a novel way of ranking advisory texts.

1 Introduction

Lack of motivation is an important issue and there have been numerous studies on the topic conducted in professional [Badubi, 2017; Gerhart and Fang, 2015] and academic [Elmelid et al., 2015; Litalien et al., 2015] settings, as well as within the context of mental health issues [Fussner et al., 2018; Hershenberg, 2017]. However, to date there have been few experiments involving motivational dialogue agents. Therefore, the main goal of our research is to create a dialogue system that would be able to motivate the user to perform their everyday tasks. It was already established that creating such a system is not trivial [Swieczkowska et al., 2017]. Previous studies [Swieczkowska et al., 2018] identified 14 features that distinguish advisory texts from regular ones. It was proven that there are significant differences in feature scores between these two types of texts and that these features can be used to classify online user comments as motivational/advisory or regular. This was done to create a classification and selection algorithm for data that will then be used in training and testing a motivational dialogue system.

In this paper, we describe further research involving the 14 advice features. They were chosen through a quality analysis of online comments containing advice; details are given in section 3.2 of this paper. Since they proved to be useful in selecting texts with advisory content, we now studied their correlation with the quality of the advice contained in the text. Using the same data as previous studies, namely online comments provided by users of Reddit (https://www.reddit.com/), we created a neural network that ranks the quality of advice texts within groups of three. In other words, given any 3 advice texts, our algorithm is able to select the best one according to points awarded by Reddit users. That advice can then be given to the user of the end-goal motivational dialogue system. For example, in response to the user's problem, the system can produce 3 different candidate advice texts (by choosing appropriate advice sentences from a corpus) and then use the ranking component to select the best one. The ranking algorithm is one of the main contributions of this paper; another is deeper insight into the 14 advice features and their relation to advice quality.

The paper is organized as follows. Section 2 describes related work in the field of motivation in dialogue systems as well as text ranking. Section 3 presents our datasets and features. Section 4 describes the architecture of our system. Section 5 presents the details of our experiments and their results. Section 6 provides error analysis and discussion of our findings. Section 7 concludes the paper.

2 Related Work

2.1 User Motivation

There are numerous studies suggesting approaches to influencing motivational states in users; however, few of them contain actual experiments. Most of them propose frameworks without verifying their usefulness (for example [Callejas and Griol, 2016] or [He et al., 2010]). Papers describing empirical studies include research on motivating users to do indoor cycling every day for a specified period of time with a robot companion [Süssenbach et al., 2014] or encouraging users to perform a longer planking exercise by giving them acknowledging feedback from a robot that exercised together with them [Schneider and Kummert, 2016]. However, the authors of these studies scripted the agent dialogue and limited it to a handful of topics relevant to the task. Since both studies dealt only with exercise, their very specific findings cannot be generalized to other everyday activities. In contrast, we plan our system to not be limited to one or a few topics.

Among studies not involving exercise, Kaptein et al. [2012] describe a study where subjects were persuaded to reduce snacking via personalized short text messages. The messages were tailored to the user based on the user's score on the Susceptibility to Persuasion Scale [Kaptein et al., 2009]. However, these messages were again crafted by the researchers and involved no natural language processing. Our goal is to create a system that produces motivational advice by itself, composing it from fragments of highly rated advisory texts obtained with crowdsourcing.

2.2 Text Ranking

Studies concerning text ranking usually involve criteria like relevance to the user's query and are conducted as part of research in information retrieval. Documents in a given database are ranked according to their usefulness in providing the user with information about a particular topic. Recent developments in this field include improving the tf-idf weighted ranking method with association rules [Jabri et al., 2018], incorporating user browsing patterns into sorting query search results [Sethi and Dixit, 2018] and utilizing Hadoop and MapReduce platforms to improve search precision [Malhotra et al., 2018]. However, these studies are only partially relevant to our research problem. An effective document ranking algorithm, such as the ones mentioned above, would be useful in retrieving appropriate advice for the user based on their query and will be a point of focus for our next step. In contrast, this paper describes an algorithm for ranking advice quality, which is a different issue, and therefore a different approach must be used. The main difference is that our method ranks the texts against each other rather than by relevance to some external search term.

There are also numerous papers describing approaches to ranking texts for purposes other than answering the user's query. Fang et al. [2017] present a sentence ranking algorithm for extractive text summarization. Vajjala and Meurers [2016] rank sentences in a text based on their readability while studying text simplification. Myangah and Rezai [2016] rank Persian texts based on their vocabulary richness and use this information to determine the genre of the text. However, to the best of our knowledge, there is no proposed algorithm for ranking advice texts based on advice quality.

3 Datasets and Features

3.1 Datasets

As our source of advice texts, we have used the online discussion platform Reddit. It is a place where people share opinions, discuss matters or ask for advice on different topics. The platform is divided into numerous so-called subreddits, each with a different purpose and topic. A user can post in any appropriate subreddit and other users can comment on their post. Users can also vote on both posts and comments.

Subreddit          Posts #   Comments #
r/getdisciplined   624       1,872
r/Advice           3,066     5,850
Total              3,690     11,070

Table 1: Datasets used in our study

In our experiments, we have studied comments downloaded specifically from subreddits where authors of posts ask for advice, so the comments are bound to contain advisory content. This data was chosen because such user posts are closest to what we imagine as input to our system, while other users' comments are closest to the ideal output. The subreddits we used were r/getdisciplined (https://www.reddit.com/r/getdisciplined/), where people ask for motivational advice, and r/Advice (https://www.reddit.com/r/Advice/), where people ask for general advice on a variety of topics. From each subreddit, we obtained posts, each with its 3 best rated comments; such a comment triple was our basic unit of data. This method ensured that all comments within a triple contained advice on the same topic and as such could be compared against each other. Table 1 presents the breakdown of the amount of data downloaded from each subreddit.

3.2 Features

We have utilized the 14 advice features that proved useful in previous studies [Swieczkowska et al., 2018]. The features were determined through a quality analysis of top r/getdisciplined comments, which were the closest to the ideal data used in previous research. Such comments usually contained imperative or advice expressions and their content was rather specific. The authors of the comments also said that they related to the problem personally. We coded these qualities into the Imperative Score, Advice Score, Specificity Scores and Relatability Score described below. We then added sentiment analysis using Sentic scores to complete the list of features.

All comments were pre-processed by detecting sentence boundaries and assigning part-of-speech tags. For some feature calculations, we also removed stopwords. In the following paragraphs, wordlist_withstops means a list of all words in the comment, wordlist_nostops means the same list with stopwords removed and sent_list means a list of sentences in the comment. The features were calculated on each comment separately in the following manner.

Sentics scores of aptitude, attention, pleasantness and sensitivity were measured automatically using the Sentic library for Python (https://pypi.org/project/senticnet/) on wordlist_nostops. The library is an API to the SenticNet knowledge base [Cambria et al., 2018] containing information on sentiment values of words. All the sentics values fall on a scale between -1 and 1.

Relatability score was measured by the percentage of first-person pronouns, including possessive pronouns, in wordlist_withstops. The score range was 0 to 1.

Imperative score was measured by the percentage of imperative expressions in the comment text. Specifically, we looked for expressions such as clauses beginning with non-infinitive verbs, the word please preceding a verb, and phrases comprised of you or OP (which stands for Original Poster and is a popular way of referring to the author of the post on Reddit) and a modal verb. We excluded sentences that started with the auxiliary verbs do and have to avoid counting questions with syntax similar to imperative expressions. We calculated the percentage over the number of all words divided by 2, since most of the imperative expressions we looked for were bigrams. The score range was 0 to 1.

you need to           have you thought about
op needs to           have you tried
you have to           how about
op has to             if I were you
it might be worth     recommend
I would               suggest
it would be good to   advise
it might be good to   you could always
you had better        have you considered
you'd better          why not
your only option is   why don't you
[link]

Table 2: A complete list of advice expressions used by the Advice Score feature.
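As a concrete illustration of the Relatability score defined above, the calculation can be sketched in a few lines of Python. This is a minimal sketch under our own assumptions (the pronoun set and the name relatability_score are ours), not the authors' implementation:

```python
# Share of first-person pronouns (possessives included) among all tokens
# of the comment, i.e. wordlist_withstops in the paper's terms.
FIRST_PERSON = {"i", "me", "my", "mine", "myself",
                "we", "us", "our", "ours", "ourselves"}

def relatability_score(wordlist_withstops):
    """Fraction of first-person pronouns in the token list, in [0, 1]."""
    if not wordlist_withstops:
        return 0.0
    hits = sum(1 for w in wordlist_withstops if w.lower() in FIRST_PERSON)
    return hits / len(wordlist_withstops)

tokens = "I think you should plan my day".lower().split()
score = relatability_score(tokens)  # 2 of the 7 tokens are first person
```

The Imperative and Advice scores follow the same pattern of counting matching expressions and dividing by a length-based denominator.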
Advice Score was defined as the number of advisory expressions in the text. For this purpose, we prepared a list of possible advisory expressions, including phrases like you need to, it might be worth, or if I were you, along with words like recommend or suggest. The full list is given in Table 2. This feature also counted website links, which we discovered to be a way of offering advice in many comments in our datasets. In the preprocessing stage we replaced all links with the token [link], which then also counted as an advisory expression. The overall comment Advice Score was divided by 10 to scale it down to the level of the other features.

Specificity Scores included six different features. They were first proposed by Deshpande et al. [2010] to extract suggestions and complaints from employee surveys and online product reviews. The goal was to find sentences containing specific content. We adapted these features into our study because initial analysis showed that comments containing specific advice were among the best rated on any given post. The calculations were performed for each sentence in the comment using sent_list, and the final scores for the entire comment were obtained by adding up all the sentence scores. The features were: Average Semantic Depth (ASD), Average Semantic Height (ASH), Total Occurrence Count (TOC), Count of Named Entities (CNE), Count of Proper Nouns (CPN) and Sentence Length (LEN).

We only slightly modified the calculations provided by [Deshpande et al., 2010]. For both ASD and ASH, we had to retrieve hypernymy/hyponymy hierarchies from the WordNet ontology (https://wordnet.princeton.edu) for each content word (meaning nouns, verbs, adjectives and adverbs). For each word, the longest path in the hierarchy that led from the word to its highest hypernym determined the Semantic Depth of that word. Similarly, the shortest path from the word to its lowest hyponym determined its Semantic Height. To obtain the ASD score for the entire sentence, we added the ASDs of all the content words and divided the sum by the total number of content words in the sentence (by which we mean a sentence from sent_list with stopwords removed). We performed the same calculations for ASHs.

In contrast to [Deshpande et al., 2010], we added a new feature, ASHD, which combined Average Semantic Depth and Average Semantic Height by deducting the overall ASH score from the overall ASD score for each comment. This was done to reflect the difference between the two features, which were found to be important in previous studies.

Another change compared to [Deshpande et al., 2010] was that we did not change all content words into nouns before looking them up in WordNet. Nominalization was supposed to help with the lookup, as in 2010 the WordNet ontology was rather developed for nouns but scarce for other parts of speech. We felt no need to do this anymore in 2019 because WordNet has grown immensely since that time. We only lemmatized the words.

Total Occurrence Count (TOC) meant the number of times a word occurs in the WordNet ontology. We measured it by obtaining the occurrence count from WordNet for each lemmatized content word and adding up the three lowest scores in a sentence. Count of Named Entities (CNE) meant the number of named entities in the sentence, determined with the NLTK named entity tagger (https://www.nltk.org/). Count of Proper Nouns (CPN) was measured by the number of proper nouns (tagged as NNP, NNPS or CD) in the sentence with stopwords removed. CD stands for Cardinal Number and as such is not a proper noun, but it was included in the calculations provided by [Deshpande et al., 2010], so we decided to keep it. Sentence Length (LEN) was the number of words in the sentence with stopwords removed.

We divided ASD, ASH, ASHD, TOC and LEN by 100 and CNE and CPN by 10 to put the scores in the same numerical range as the other features.

Table 3 contains an example taken from the r/getdisciplined portion of our dataset. Both the original post and its three comments are included, as well as their respective feature scores and the general scores given to them by Reddit users. To conserve space in the table, we removed the new line breaks, but otherwise we kept the text intact.

In addition to our 14 advice features, we used word2vec word embeddings of the texts. Specifically, we obtained a vector for each word in the text and took the average to represent the entire text. The word2vec embeddings had 100 dimensions. We concatenated them with our advice features, ending up with 114 features in total for each comment text.

Post text (Post score: 15): I am addicted to sleeping. I think the reason for that is because I cannot tolerate my thoughts and the real world. But after spending years like this, I feel awful for sleeping so much. It's not like I sleep 15 hours a day but this habit of mine leads to being absent for classes twice a month and skipping half of gym sessions. Above all I don't bother to improve my life style. With this attitude of mine seeing any kind of future for myself is impossible! Can you give me tips and suggestions how to overcome this bad addiction? If you introduce a reading source, also I would appreciate it a lot. Edit: I don't sleep 15 hours a Day but I am sure I am addicted to sleeping!

Comment text (Comment score: 20, rank 0): If you're sleeping 15 hours a Day regularly for no apparent reason you need to see a doctor
  aptitude 0.094, attention -0.119, pleasantness 0.123, sensitivity -0.035, Relatability score 0.000, Imperative score 0.000, Advice score 0.100
  ASD 0.157, ASH 0.152, ASHD 0.005, TOC 0.000, CNE 0.000, CPN 0.000, LEN 0.090

Comment text (Comment score: 7, rank 1): You associate your sleeping to not tolerating your thoughts. To achieve higher capacity in managing your thoughts, have you heard of Mindfulness work? It's a simple technique with effects showing already after a short while.
  aptitude 0.179, attention 0.245, pleasantness 0.136, sensitivity -0.013, Relatability score 0.000, Imperative score 0.048, Advice score 0.000
  ASD 0.460, ASH 0.364, ASHD 0.096, TOC 0.000, CNE 0.000, CPN 0.000, LEN 0.180

Comment text (Comment score: 3, rank 2): I can relate a little bit, as I too love sleep and try to avoid being alone with my own thoughts. I still love sleep, but finding podcasts I really like has helped me with the avoiding my thoughts part. Then I can use them as a bribe to myself..."I can only listen to this on my drive to work/walk to class." "I can only listen to this one at the gym." YMMV, but if you can find an addictive one, or one you find genuinely funny/entertaining, the bribery works. And then if you are one of those "I'm fine as long as I GET there" people for class/gym/work, you can look forward to the getting there part.
  aptitude -0.041, attention 0.029, pleasantness -0.084, sensitivity 0.114, Relatability score 0.100, Imperative score 0.042, Advice score 0.000
  ASD 1.188, ASH 1.106, ASHD 0.082, TOC 0.080, CNE 0.000, CPN 0.400, LEN 0.540

Table 3: A single data example from our dataset.

4 System Architecture

The basic unit of our data was a triple of comments coming from the same post.

Each comment text had 114 features. To construct our input data, in each comment triple we concatenated the feature vectors into one vector of length 342 (= 3 x 114). Before concatenation, we shuffled each triple and obtained all 6 permutations of their order. Therefore, each triple was present in the dataset 6 times, each time with a different order of comments. This was done to lessen the impact of order on the results of the network.
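The construction of permuted training examples described above can be sketched as follows. This is an illustrative reimplementation under our own naming (make_training_examples is a hypothetical helper), not the authors' code:

```python
from itertools import permutations

def make_training_examples(triple):
    """triple: three (feature_vector, rank) pairs, one per comment, where
    each feature_vector has 114 entries and rank is 0, 1 or 2 (0 = best).
    Returns the 6 permuted (input, target) pairs: each input concatenates
    the three vectors into one vector of length 342, and each target lists
    the ranks in the permuted comment order, e.g. [2, 0, 1]."""
    examples = []
    for order in permutations(range(3)):
        x = [value for i in order for value in triple[i][0]]
        y = [triple[i][1] for i in order]
        examples.append((x, y))
    return examples
```

Presenting every triple in all 6 orders is what discourages the network from associating a particular input slot with a particular rank.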
As output, the network produced a vector of length 3, where each position gave a number 0, 1 or 2 depending on the rank and order of the comments. For example, if the first 114 features of the input vector represented a comment of rank 2, the next 114 features represented a comment of rank 0 and the last 114 represented a comment of rank 1, then the expected output was the vector [2, 0, 1].

Each comment had been rated by users, so by comparing their scores we were able to rank the comments with the numbers 0, 1 and 2, where 0 represented the best rated comment and 2 represented the lowest rated one. The ratings were not representative across the entire dataset; for example, a comment ranked 0 in its own comment triple may have been ranked 2 in a different triple (given they pertained to the same topic). However, this was not an issue, since our purpose was to select the best comment in the given fixed set of three.

To ensure that each comment text went through the same initial calculations, we constructed a convolutional neural network. The first layer had 342 units that matched our input vector. Then, we used a filter of length 114 and stride of 114, which essentially meant that each set of 114 comment features went through the same filter. This ensured that no matter the order of the comments, each one received equal treatment and had equal chances of being assigned any of the three ranks.

On top of the first convolutional layer, we had a second one, followed by three fully connected layers. Table 4 gives an overview of the network along with the data shape produced by each layer and operation. As we conducted the research using PyTorch, the order of dimensions for convolutional layers follows the PyTorch convention, which is: depth (= number of channels), height, width. The kernel size gives only height and width; the depth is exactly the same as that of the input to the given layer.

Layer     Units #        Output shape
Input     ---            (m, 1, 1, 342)
Conv1     114*(1, 114)   (m, 114, 1, 3)
Reshape   ---            (m, 3, 1, 114)
Conv2     3*(1, 3)       (m, 3, 1, 38)
Reshape   ---            (m, 3, 38)
Fc1       20             (m, 3, 20)
Fc2       10             (m, 3, 10)
Fc3       3              (m, 3, 3)
Reshape   ---            (3m, 3)
Argmax    ---            (3m, 1)

Table 4: Overview of the network. Conv stands for convolutional layers and Fc stands for fully connected layers. For convolutional layers, the number of units is the number of filters multiplied by the filter size used on the layer.

We reshaped the output of the first convolutional layer before passing it on to the next layer. The first layer gave output of shape (114, 1, 3) for each data entry. Essentially, this was a vector of length 3, where each position represented one comment and had a depth of 114, because each comment had been convolved by all 114 kernels. We reshaped the output into (3, 1, 114) and fed it as input to Conv2. This way, the kernels in the second convolutional layer operated on a vector of length 114, where each position represented one Conv1 kernel and had a depth of 3 representing the three comments. Each Conv2 kernel processed three of the 114 positions (with each position incorporating calculations from all comments), yielding 38 results from the processing. The point of this reshaping operation was to allow the Conv2 kernels to process subsets of Conv1 kernel results with all three comments each, instead of subsets of comments with all 114 Conv1 kernel results each. It was important to include information about all three comments for each Conv2 kernel operation, because our results depend on all feature relationships within the comment triple. We then reshaped the Conv2 layer results to reduce the number of dimensions from four to three so that we could pass them to a fully connected layer.

Each layer had the tanh activation function except for the last one, which used a softmax. The output shape from the last layer was (m, 3, 3), where m represents the batch size. Essentially, for each training example there were three comments to rank and each of these comments received its own softmax vector of length 3, where the rank was indicated by the position of the 1. For example, if the softmax produced the output [0, 1, 0] for a comment, that comment got rank 1, and if the output was [0, 0, 1] then the comment got rank 2. Therefore, for each training example the output shape was (3, 3), where the first 3 represents the three comments and the second 3 represents the length of the softmax. We then reshaped this output so that each comment became its own entry (shape of (3m, 3)). At the very end, we used the argmax function to reduce the output to shape (3m, 1), meaning one rank for each comment in the dataset.

5 Experiments and Results

5.1 Experiment Setup

We trained the network for 1000 epochs using the Adadelta [Zeiler, 2012] optimization algorithm with no changes to its default hyperparameters (this means that we did not set a learning rate manually). We also divided our training data into minibatches of 512. These hyperparameters were decided based on performance.

Overall, we had 22,140 examples in our dataset (a total of 3,690 downloaded triples, where each triple was present in the dataset 6 times). We performed 10-fold cross-validation with 19,926 training examples and 2,214 test examples in each fold. Before training and testing, the features were normalized using L2 normalization.

5.2 Results

We measured the performance of our system with accuracy. Table 5 presents the results broken down by fold. We also looked at our 14 advice features to see whether they correlated with the ranks. Although no single feature showed a significant linear correlation with the ranks (as measured by the Pearson coefficient), there are small differences in their mean and median values between ranks. Table 6 shows the comparison of raw feature values across the three ranks. We included more decimal points in the table to better reflect the differences. Table 7 presents statistical significance scores of the differences between feature values across rank pairs.

Fold      Training loss   Training accuracy   Test loss   Test accuracy
1         0.016           0.995               0.037       0.991
2         0.038           0.989               0.056       0.984
3         0.276           0.903               0.423       0.889
4         0.031           0.992               0.049       0.989
5         0.072           0.980               0.081       0.975
6         0.060           0.990               0.059       0.987
7         0.084           0.947               0.166       0.931
8         0.032           0.994               0.043       0.992
9         0.020           0.993               0.052       0.989
10        0.036           0.989               0.066       0.986
Average   0.067           0.977               0.103       0.971

Table 5: Results of loss and accuracy values across all folds.

Rank        aptitude   attention  pleasantness  sensitivity  Relatability score  Imperative score  Advice score
0  Mean     0.119163   0.087951   0.087890      0.051194     0.025531            0.069907          0.025881
   Median   0.128197   0.082036   0.087306      0.039533     0.009709            0.053333          0.000000
1  Mean     0.130136   0.087693   0.099144      0.051375     0.027642            0.066100          0.027425
   Median   0.141360   0.084107   0.107313      0.037422     0.012500            0.048780          0.000000
2  Mean     0.134223   0.082651   0.100406      0.050223     0.028502            0.066240          0.026667
   Median   0.139275   0.082451   0.095032      0.037500     0.012085            0.048780          0.000000

Rank        ASD        ASH        ASHD       TOC        CNE        CPN        LEN
0  Mean     1.000910   0.943619   0.057292   0.141257   0.000108   0.049593   0.354938
   Median   0.622520   0.588571   0.030000   0.000000   0.000000   0.000000   0.200000
1  Mean     1.047904   0.988323   0.059582   0.140745   0.000108   0.052602   0.378932
   Median   0.657738   0.620119   0.032348   0.000000   0.000000   0.000000   0.220000
2  Mean     1.004613   0.947719   0.056894   0.159900   0.000054   0.047940   0.359827
   Median   0.640000   0.602750   0.031667   0.000000   0.000000   0.000000   0.220000

Table 6: Feature values across different ranks. We bolded the highest mean value for each feature.
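The weight sharing behind the width-114, stride-114 filters of Conv1 described in Section 4 can be demonstrated in plain Python. This is a toy sketch under our own naming (conv1_shared is hypothetical), not the PyTorch model itself:

```python
def conv1_shared(x, filters):
    """x: flat input of length 342, i.e. three 114-feature comments.
    filters: weight vectors of length 114 (the full model uses 114 of
    them). With kernel width 114 and stride 114, each step of a filter
    covers exactly one comment, so all three comments pass through the
    same shared weights. Returns one activation per (filter, comment)
    pair, matching the (114, 1, 3) output shape of Conv1."""
    blocks = [x[i * 114:(i + 1) * 114] for i in range(3)]
    return [[sum(w * v for w, v in zip(f, b)) for b in blocks]
            for f in filters]
```

Because the weights are shared, permuting the three comment blocks only permutes the per-comment activations, which is exactly the equal-treatment property the paper relies on.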
6 Error Analysis and Discussion

For a research problem posed in this way, it was important that each of the three comment texts went through the same initial calculations. This could have been achieved by using a recurrent neural network, where each timestep, in our case a comment text, is processed by the same unit (for example a GRU [Cho et al., 2014] or LSTM [Hochreiter and Schmidhuber, 1997]) that has its parameters adjusted during the training process. However, our attempts at using an RNN were unsuccessful. Although the algorithm trained well (training set accuracy was usually above 0.95), these results did not generalize to the test set. Test accuracy was always around 0.33, which in this setting is random chance level. One reason for this may be that RNNs are particularly sensitive to the order of the timesteps, and even shuffling the comments did not help in alleviating this issue. The network kept overfitting the training set over the course of many epochs, but then was not able to make correct predictions on the previously unseen data of the test set. With the RNN, the prediction for each comment relied heavily on calculations made for the previous one(s), instead of the network looking at the triple as a group rather than as a sequence. Using a convolutional network solved this problem.

Furthermore, we made some interesting observations during training. First, the network worked only with very specific settings, namely with the Adadelta optimization algorithm and the tanh activation function. While searching for the best optimization method and activation function is routinely performed to yield the best results for a given network, the differences in accuracy between various choices were unusually large in our case. Despite repeated training, the algorithm did not converge with any other optimization algorithm, and the ReLU activation function, which we initially tried instead of tanh, caused the network to get stuck in a local minimum at a high error level. Second, around epoch 700-800 the error would briefly rise and then fall again to an even lower level. The tendency can be observed in Figure 1. This temporary drop in performance is usually caused by a learning rate that is too large for the given stage of training. It can be assumed that after Adadelta adjusted the learning rate, the network performance was able to rise again in the last epochs. Perhaps this is the reason why this particular optimization algorithm worked best in our case: other algorithms like Adam [Kingma and Ba, 2014] or SGD [Robbins and Monro, 1951] would get stuck and be unable to overcome this issue.

As is evident from the results presented in Table 5, we were able to achieve very high accuracy on our task. Perhaps this was caused by the relatively big amount of data; at over 22,000 examples the network had more than enough data to learn how to rank the comment texts.

Feature              Ranks 0-1   Ranks 1-2   Ranks 0-2
Aptitude             0.035       0.422       0.003
Attention            0.948       0.209       0.174
Pleasantness         0.028       0.803       0.014
Sensitivity          0.956       0.720       0.761
Relatability score   0.017       0.354       0.001
Imperative score     0.042       0.940       0.044
Advice score         0.234       0.560       0.542
ASD                  0.113       0.157       0.901
ASH                  0.112       0.162       0.884
ASHD                 0.228       0.163       0.837
TOC                  0.970       0.178       0.193
CNE                  1.000       0.564       0.414
CPN                  0.378       0.147       0.579
LEN                  0.054       0.111       0.682

Table 7: Statistical significance of differences between feature values across rank pairs. We bolded p values of 0.05 or lower.

Fold      Training loss   Training accuracy   Test loss   Test accuracy
1         0.263           0.888               0.403       0.866
2         0.347           0.866               0.375       0.840
3         0.150           0.945               0.181       0.929
4         0.353           0.849               0.439       0.828
5         0.184           0.919               0.250       0.898
6         0.190           0.943               0.211       0.932
7         0.212           0.923               0.282       0.901
8         0.207           0.920               0.245       0.903
9         0.376           0.880               0.390       0.849
10        0.248           0.923               0.287       0.899
Average   0.253           0.906               0.306       0.885

Table 8: Results of loss and accuracy values across all folds for the model trained only on word2vec features.
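The linear correlation check reported in Section 5.2 uses the standard Pearson coefficient. The sketch below is our own helper, not the authors' code; it shows the calculation that would be applied to one feature's values against the comment ranks:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples,
    e.g. one advice feature's values and the comment ranks (0, 1, 2).
    Returns a value in [-1, 1]; values near 0 mean no linear relation."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)
```

A near-zero coefficient for every single feature, as the paper reports, is consistent with the ranking signal lying in feature combinations rather than in any one feature alone.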
Figure 1: Overview of cost progression across 1000 training epochs for all folds. Each fold is marked with a different color.

[...] how to rank the comment texts. However, this task was performed on text triples, which means that the results are valid only in a very specific setting. Ideally, we would like to have a network able to rank the quality of advice contained in any given single text. However, our network specifically takes a triple of texts as input, and it was not trained to recognize objective advice quality, but rather to select the best advice text from a given triple regardless of the overall quality level in that triple. Therefore, right now it cannot be used to judge how good a piece of advice is without any comparison. Constructing a network capable of accomplishing this task based on our current findings is a topic for future studies. Likewise, we assumed that all advice comments were on topic, because they were downloaded from their respective threads as responses to another user's post, but this may not always be the case with raw data obtained in a different manner. Therefore, further experiments will involve determining whether the advice is thematically appropriate for the given problem.

We performed error analysis on all the folds. First, we prepared confusion matrices to see which ranks were most commonly confused with each other. We found no clear tendency across all folds. However, we calculated mean numbers of misranked comments for all confusion matrices and found that the numbers were slightly higher for comments that were misranked as 2 despite the true label being 0 or 1. In other words, for comments with true label 0 there were more comments misranked as 2 than misranked as 1, and likewise, for comments with true label 1 there were more misranked as 2 than as 0. While this result is not conclusive in any way, it shows an interesting quality of our algorithm.

We also analyzed the content of the misranked comments. Since in our setup each triple of comments was present in the dataset six times, each comment could appear multiple times in the test set. In such cases, comments that were misranked once tended to be misranked again on some of their subsequent appearances in the test set as well. This suggests that some single comments may be a bit troublesome for our algorithm, although the percentage of such comments in the overall dataset is negligible. Moreover, once again we found no clear characteristics of such misranked comments compared to those that were ranked correctly. This was also the case with other comments that got misranked only once. Such findings suggest that our algorithm has no defined bias in ranking, but makes mistakes randomly, which can be expected with a neural network.

Other misranked comments were of the [deleted] or [removed] kind. Comments with this kind of content were either removed by the moderators of the subreddit or deleted by the users themselves. Such comments are usually inappropriate or rude and would not receive a lot of points, which means they would usually rank the lowest in any given triple in our dataset. However, it is possible that some such comments contained content that was upvoted by people agreeing with the rude or inappropriate message, so that at the time we downloaded the data the comment had a high score despite being already deleted or removed. It is also possible that a user shared advice that was good enough to get a lot of upvotes, but then deleted it from the discussion because they decided it revealed too much about them after all. This is an occasional occurrence not only in the advice subreddits, but also in subreddits concerning other personal issues, for example mental health. Whatever the cause, the [deleted] and [removed] comments were misleading to our algorithm, as the features were calculated not from the original content of the comment (which was no longer available), but from the single words "deleted" and "removed" respectively. As a result, the features were not informative enough to perform the ranking correctly. We did foresee the problems that such comments might pose when gathering the data, but removing any of them from the dataset would result in removing the entire triple, which we wanted to avoid. Moreover, the [deleted] and [removed] comments were only a small fraction of our misranked comments from the test set. They most likely did not hinder the training process either, as neural networks are rather robust against occasional noise in the data.

It must be noted here that the algorithm performed the final ranking by taking the argmax of a 3x3 matrix with one-hot vector rows, so the rankings were not interchangeable, but independent at this point. In other words, a misranked comment in the triple did not translate to another comment being misranked by exchanging their mutual rankings. This means that even if a triple contained a misranked comment, other comments in that triple could be (and usually were) ranked correctly.
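The row-wise argmax step can be sketched as follows; the network outputs shown are invented for illustration, and rank 0 is assumed to denote the best-rated comment:

```python
import numpy as np

# Hypothetical network output for one triple: row i holds the predicted
# rank distribution for comment i (near one-hot after training).
output = np.array([
    [0.1, 0.2, 0.7],   # comment 0: most likely rank 2
    [0.8, 0.1, 0.1],   # comment 1: most likely rank 0
    [0.2, 0.6, 0.2],   # comment 2: most likely rank 1
])

# Row-wise argmax: each comment's rank is decided independently, so one
# misranked comment does not force another comment to swap ranks with it.
ranks = output.argmax(axis=1)   # -> array([2, 0, 1])

# The comment assigned rank 0 would be handed to the dialogue system.
best = int(np.argmin(ranks))    # -> 1
```

Because each row is decoded on its own, the three predictions are independent, which is exactly why a single misranking need not corrupt the rest of the triple.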
As can be seen in Tables 6 and 7, the differences in feature values between ranks were relatively small. For many features, like Advice Score or the Specificity Scores, those differences were not statistically significant. This suggests that perhaps the network would not be able to rank advice texts based on these features alone and that it benefitted from the word2vec features as well.

Following up on these findings, we conducted additional experiments using solely word2vec features to see how much impact our 14 advice features had on the algorithm. We slightly adjusted the network architecture to accommodate the new input shape, which was (m, 1, 1, 300) instead of (m, 1, 1, 342). Therefore, the Conv1 layer had 114 filters of shape (1, 100) instead of (1, 114). After that point, the input/output dimensions and further layers remained the same as in our models for all features.
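The shape arithmetic behind this adjustment can be checked with a short calculation. The non-overlapping stride (equal to the filter width, so that each filter position covers exactly one comment's feature block) is an assumption made for illustration:

```python
def conv_output_width(input_width, filter_width, stride):
    """Number of valid (no-padding) filter positions along one axis."""
    return (input_width - filter_width) // stride + 1

# Full model: 3 comments x (14 advice + 100 word2vec) = 342 features,
# filters of width 114 -> one filter position per comment.
full = conv_output_width(342, 114, 114)       # 3 positions

# word2vec-only model: 3 comments x 100 features = 300,
# filters of width 100 -> again one position per comment.
w2v_only = conv_output_width(300, 100, 100)   # 3 positions
```

Under this assumption both variants yield three filter positions, one per comment, which is consistent with the later layers keeping the same input/output dimensions.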
We trained the model with exactly the same hyperparameters and exactly the same number of epochs, which was 1000. The results of these experiments can be seen in Table 8. The average test accuracy was only 0.89, compared to 0.97 in Table 5. This shows that even though the word2vec features were important in our study, our 14 advice features also played a significant role in achieving good accuracy in the experiments.

We were also able to identify the most important features in our study: aptitude, pleasantness, Relatability Score and Imperative Score, since the differences in their values between ranks were statistically significant. The sentics are associated with dichotomies such as ecstasy-grief for pleasantness and admiration-loathing for aptitude. Our findings suggest that these emotions may be more important in advice texts than others, like vigilance-amazement associated with attention or rage-terror associated with sensitivity. A lot of posts in our dataset described the author's dissatisfaction with their current life and their desire to change and be happier. Because of this, emotions such as fear, anger or anticipation may have been less present in the advice comments, while talk about sadness, trust or joy seemed to be more prominent, especially when it comes to motivational advice. The statistical significance of Imperative Score and Relatability Score is noteworthy as well; we designed these features to reflect the fact that the best rated advice comments tended to contain a lot of imperative expressions and were authored by people who related to the given problem. On the other hand, the lack of significant differences between ranks in Advice Score is not surprising; all comments contained some form of advice, so advice expressions were obviously present in all of them. It seems that, rather than the presence of advice expressions, the manner of giving advice was more significant, specifically how many first-person pronouns and imperative phrases were included in the text. As can be seen in Table 5, the best ranked comments had the highest Imperative Score, while the lowest rated comments had the highest Relatability Score. This suggests that while it is important to use imperative expressions when giving advice and to relate to the given problem, too much self-talk detracts from the advice's quality.

Finally, the data from Table 5 sheds new light on previous findings concerning the features. Error analysis of experiments conducted in [Swieczkowska et al., 2018] revealed differences in feature values between texts containing advice and regular ones. For almost all advice features except Average Semantic Height (which at that point was calculated differently than described here), the values were perceptibly higher for advice texts than for regular texts. This is why they could be used for the classification task mentioned in the Introduction section. We assumed that, similarly, the values of the advice features would be higher for good quality advice compared to lower quality advice. However, the difference in features between rank 0 and rank 2 is small. Interestingly, some features are highest for the middle rank 1, for example sensitivity, Advice Score or ASHD. All these findings suggest that the relationship between these features and advice quality is complicated and not readily visible, which is in line with our findings about the lack of linear correlations between any given feature and advice rank.

7 Conclusions

In this paper we have presented a convolutional neural network able to rank online comments containing advice based on advice quality, as judged by other online users. While this method cannot be used to determine the objective quality of a piece of advice, it is useful for selecting the best advice in a given group of texts. This can be useful in creating a motivational dialogue system, for example by choosing the best advice from three candidate outputs of the system and presenting that advice to the user as the final output. We were also able to identify specific measurable qualities of a good advice text, such as scoring high on aptitude, pleasantness and Imperative Score while maintaining Relatability Score on a lower level.

Acknowledgements

The authors of this paper would like to thank Prof. Hiroyuki Iizuka of Hokkaido University for his invaluable advice on this project. This work was supported by JSPS KAKENHI Grant Number 17K00295.
References

[Badubi, 2017] Reuben M. Badubi. Theories of motivation and their application in organizations: A risk analysis. International Journal of Innovation and Economic Development, 3(3):43–50, 2017.

[Callejas and Griol, 2016] Zoraida Callejas and David Griol. An affective utility model of user motivation for counselling dialogue systems. In International Workshop on Future and Emerging Trends in Language Technology, pages 86–97. Springer, 2016.

[Cambria et al., 2018] Erik Cambria, Soujanya Poria, Devamanyu Hazarika, and Kenneth Kwok. SenticNet 5: Discovering conceptual primitives for sentiment analysis by means of context embeddings. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, pages 1795–1802, 2018.

[Cho et al., 2014] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.

[Deshpande et al., 2010] Shailesh Deshpande, Girish Keshav Palshikar, and G. Athiappan. An unsupervised approach to sentence classification. In International Conference on Management of Data (COMAD), page 88, 2010.

[Elmelid et al., 2015] Andrea Elmelid, Andrew Stickley, Frank Lindblad, Mary Schwab-Stone, Christopher C. Henrich, and Vladislav Ruchkin. Depressive symptoms, anxiety and academic motivation in youth: Do schools and families make a difference? Journal of Adolescence, 45:174–182, 2015.

[Fang et al., 2017] Changjian Fang, Dejun Mu, Zhenghong Deng, and Zhiang Wu. Word-sentence co-ranking for automatic extractive text summarization. Expert Systems with Applications, 72:189–195, 2017.

[Fussner et al., 2018] Lauren M. Fussner, Kathryn J. Mancini, and Aaron M. Luebbe. Depression and approach motivation: Differential relations to monetary, social, and food reward. Journal of Psychopathology and Behavioral Assessment, 40(1):117–129, 2018.

[Gerhart and Fang, 2015] Barry Gerhart and Meiyu Fang. Pay, intrinsic motivation, extrinsic motivation, performance, and creativity in the workplace: Revisiting long-held beliefs. Annual Review of Organizational Psychology and Organizational Behavior, 2(1):489–521, 2015.

[He et al., 2010] Helen Ai He, Saul Greenberg, and Elaine M. Huang. One size does not fit all: Applying the transtheoretical model to energy feedback technology design. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 927–936. ACM, 2010.

[Hershenberg, 2017] Rachel Hershenberg. Activating happiness: A jump-start guide to overcoming low motivation, depression, or just feeling stuck. New Harbinger Publications, 2017.

[Hochreiter and Schmidhuber, 1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[Jabri et al., 2018] Siham Jabri, Azzeddine Dahbi, Taoufiq Gadi, and Abdelhak Bassir. Ranking of text documents using tf-idf weighting and association rules mining. In Proceedings of the 4th International Conference on Optimization and Applications (ICOA), pages 1–6. IEEE, 2018.

[Kaptein et al., 2012] Maurits Kaptein, Boris De Ruyter, Panos Markopoulos, and Emile Aarts. Adaptive persuasive systems: A study of tailored persuasive text messages to reduce snacking. ACM Transactions on Interactive Intelligent Systems (TiiS), 2(2):10, 2012.

[Kaptein et al., 2009] Maurits Kaptein, Panos Markopoulos, Boris de Ruyter, and Emile Aarts. Can you be persuaded? Individual differences in susceptibility to persuasion. In Proceedings of the IFIP Conference on Human-Computer Interaction, pages 115–118. Springer, 2009.

[Kingma and Ba, 2014] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[Litalien et al., 2015] David Litalien, Frederic Guay, and Alexandre J. Morin. Motivation for PhD studies: Scale development and validation. Learning and Individual Differences, 41:1–13, 2015.

[Malhotra et al., 2018] Dheeraj Malhotra, Monica Malhotra, and O. P. Rishi. An innovative approach of web page ranking using Hadoop and MapReduce-based cloud framework. In Proceedings of Big Data Analytics, pages 421–427. Springer, 2018.

[Myangah and Rezai, 2016] Tayebeh Mosavi Myangah and Mohammad Javad Rezai. Persian text ranking using lexical richness indicators. Glottometrics, 35:6–15, 2016.

[Robbins and Monro, 1951] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.

[Schneider and Kummert, 2016] Sebastian Schneider and Franz Kummert. Motivational effects of acknowledging feedback from a socially assistive robot. In Proceedings of the International Conference on Social Robotics, pages 870–879. Springer, 2016.

[Sethi and Dixit, 2019] Shilpa Sethi and Ashutosh Dixit. A novel page ranking mechanism based on user browsing patterns. In Proceedings of Software Engineering, pages 37–49. Springer, 2019.

[Süssenbach et al., 2014] Luise Süssenbach, Nina Riether, Sebastian Schneider, Ingmar Berger, Franz Kummert, Ingo Lütkebohle, and Karola Pitsch. A robot as a fitness companion: Towards an interactive action-based motivation model. In Proceedings of the 23rd IEEE International Symposium on Robot and Human Interactive Communication, pages 286–293. IEEE, 2014.

[Swieczkowska et al., 2017] Patrycja Swieczkowska, Jolanta Bachan, Rafal Rzepka, and Kenji Araki. Asystent – A prototype of a motivating electronic assistant. In Proceedings of the Linguistic And Cognitive Approaches To Dialog Agents (LaCATODA 2017), pages 11–19. CEUR Workshop Proceedings, 2017.

[Swieczkowska et al., 2018] Patrycja Swieczkowska, Rafal Rzepka, and Kenji Araki. Analyzing motivation techniques in emotionally intelligent dialogue systems. In Proceedings of the Biologically Inspired Cognitive Architectures Meeting, pages 355–360. Springer, Cham, 2018.

[Vajjala and Meurers, 2016] Sowmya Vajjala and Detmar Meurers. Readability-based sentence ranking for evaluating text simplification. arXiv preprint arXiv:1603.06009, 2016.

[Zeiler, 2012] Matthew Zeiler. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.