A Convolutional Neural Network for Ranking Advice Quality in Texts for a Motivational Dialogue System

Patrycja Swieczkowska, Rafal Rzepka and Kenji Araki
Graduate School of Information Science and Technology, Hokkaido University, Japan
{swieczkowska,rzepka,araki}@ist.hokudai.ac.jp

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

In this paper, we present research into advisory texts which eventually will be used to create a dialogue system providing motivational support to the user. We studied advisory comments from the online platform Reddit, including those containing motivational advice. Utilizing advice features identified in previous studies, we were able to correctly rank these comments within groups of three based on the quality of their advice content. Our convolutional neural network achieved a mean accuracy of 0.97 in 10-fold cross-validation experiments. The contributions of this research are gaining further insight into the advice features possessed by advisory comments and creating a novel way of ranking advisory texts.

1 Introduction

Lack of motivation is an important issue and there have been numerous studies on the topic conducted in professional [Badubi, 2017; Gerhart and Fang, 2015] and academic [Elmelid et al., 2015; Litalien et al., 2015] settings, as well as within the context of mental health issues [Fussner et al., 2018; Hershenberg, 2017]. However, to date there have been few experiments involving motivational dialogue agents. Therefore, the main goal of our research is to create a dialogue system that would be able to motivate the user to perform their everyday tasks. It was already established that creating such a system is not trivial [Swieczkowska et al., 2017]. Previous studies [Swieczkowska et al., 2018] identified 14 features that distinguish advisory texts from regular ones. It was proven that there are significant differences in feature scores between these two types of texts and that these features can be used to classify online user comments as motivational/advisory or regular. This was done to create a classification and selection algorithm for data that will then be used in training and testing a motivational dialogue system.

In this paper, we describe further research involving the 14 advice features. They were chosen through a quality analysis of online comments containing advice; details are given in section 3.2 of this paper. Since they proved to be useful in selecting texts with advisory content, we now studied their correlation with the quality of the advice contained in the text. Using the same data as previous studies, namely online comments provided by users of Reddit (https://www.reddit.com/), we created a neural network that ranks the quality of advice texts within groups of three. In other words, given any 3 advice texts, our algorithm is able to select the best one according to points awarded by Reddit users. That advice can then be given to the user of the end-goal motivational dialogue system. For example, in response to the user's problem, the system can produce 3 different candidate advice texts (by choosing appropriate advice sentences from a corpus) and then use the ranking component to select the best one. The ranking algorithm is one of the main contributions of this paper; another is deeper insight into the 14 advice features and their relation to advice quality.

The paper is organized as follows. Section 2 describes related work in the field of motivation in dialogue systems as well as text ranking. Section 3 presents our datasets and features. Section 4 describes the architecture of our system. Section 5 presents the details of our experiments and their results. Section 6 provides error analysis and discussion of our findings. Section 7 concludes the paper.

2 Related Work

2.1 User Motivation

There are numerous studies suggesting approaches to influencing motivational states in users; however, few of them contain actual experiments. Most of them propose frameworks without verifying their usefulness (for example [Callejas and Griol, 2016] or [He et al., 2010]). Papers describing empirical studies include research on motivating users to do indoor cycling every day for a specified period of time with a robot companion [Süssenbach et al., 2014] or encouraging users to perform a longer planking exercise by giving them acknowledging feedback from a robot that exercised together with them [Schneider and Kummert, 2016]. However, the authors of these studies scripted the agent dialogue and limited it to a handful of topics relevant to the task. Since both studies dealt only with exercise, their very specific findings cannot be generalized to other everyday activities. In contrast, we plan our system to not be limited to one or a few topics.

Among studies not involving exercise, Kaptein et al. [2012] describe a study where subjects were persuaded to reduce snacking via personalized short text messages. The messages were tailored to the user based on the user's score on the Susceptibility to Persuasion Scale [Kaptein et al., 2009]. However, these messages were again crafted by the researchers and involved no natural language processing. Our goal is to create a system that produces motivational advice by itself, composing it from fragments of highly rated advisory texts obtained with crowdsourcing.

2.2 Text Ranking

Studies concerning text ranking usually involve criteria like relevance to the user's query and are conducted as part of research in information retrieval. Documents in a given database are ranked according to their usefulness in providing the user with information about a particular topic. Recent developments in this field include improving the tf-idf weighted ranking method with association rules [Jabri et al., 2018], incorporating user browsing patterns into sorting query search results [Sethi and Dixit, 2018] and utilizing Hadoop and MapReduce platforms to improve search precision [Malhotra et al., 2018]. However, these studies are only partially relevant to our research problem. An effective document ranking algorithm, such as the ones mentioned above, would be useful in retrieving appropriate advice for the user based on their query and will be a point of focus for our next step. In contrast, this paper describes an algorithm for ranking advice quality, which is a different issue, and therefore a different approach must be used. The main difference is that our method ranks the texts against each other rather than by relevance to some external search term.

There are also numerous papers describing approaches to ranking texts for purposes other than answering the user's query. Fang et al. [2017] present a sentence ranking algorithm for extractive text summarization. Vajjala and Meurers [2016] rank sentences in a text based on their readability while studying text simplification. Myangah and Rezai [2016] rank Persian texts based on their vocabulary richness and use this information to determine the genre of the text. However, to the best of our knowledge, there is no proposed algorithm for ranking advice texts based on advice quality.

3 Datasets and Features

3.1 Datasets

As our source of advice texts, we have used the online discussion platform Reddit. It is a place where people share opinions, discuss matters or ask for advice on different topics. The platform is divided into numerous so-called subreddits, each with a different purpose and topic. A user can post in any appropriate subreddit and other users can comment on their post. Users can also vote on both posts and comments.

Subreddit          Posts #   Comments #
r/getdisciplined   624       1,872
r/Advice           3,066     5,850
Total              3,690     11,070

Table 1: Datasets used in our study

In our experiments, we have studied comments downloaded specifically from subreddits where authors of posts ask for advice, so the comments are bound to contain advisory content. This data was chosen because such user posts are closest to what we imagine as input to our system, while other users' comments are closest to the ideal output. The subreddits we used were r/getdisciplined (https://www.reddit.com/r/getdisciplined/), where people ask for motivational advice, and r/Advice (https://www.reddit.com/r/Advice/), where people ask for general advice on a variety of topics. From each subreddit, we obtained posts, each with its 3 best rated comments; such a comment triple was our basic unit of data. This method ensured that all comments within a triple contained advice on the same topic and as such could be compared against each other. Table 1 presents the breakdown of the amount of data downloaded from each subreddit.

3.2 Features

We have utilized the 14 advice features that proved useful in previous studies [Swieczkowska et al., 2018]. The features were determined through a quality analysis of top r/getdisciplined comments, which were the closest to the ideal data used in previous research. Such comments usually contained imperative or advice expressions and their content was rather specific. The authors of the comments also said that they related to the problem personally. We coded these qualities into the Imperative Score, Advice Score, Specificity Scores and Relatability Score described below. We then added sentiment analysis using Sentic scores to complete the list of features.

All comments were pre-processed by detecting sentence boundaries and assigning part-of-speech tags. For some feature calculations, we also removed stopwords. In the following paragraphs, wordlist_withstops means a list of all words in the comment, wordlist_nostops means the same list with stopwords removed and sent_list means a list of sentences in the comment. The features were calculated on each comment separately in the following manner.

Sentics scores of aptitude, attention, pleasantness and sensitivity were measured automatically using the Sentic library for Python (https://pypi.org/project/senticnet/) on wordlist_nostops. The library is an API to the SenticNet knowledge base [Cambria et al., 2018] containing information on sentiment values of words. All the sentics values fall on a scale between -1 and 1.

Relatability score was measured by the percentage of first-person pronouns, including possessive pronouns, in wordlist_withstops. The score range was 0 to 1.

Imperative score was measured by the percentage of imperative expressions in the comment text. Specifically, we looked for expressions such as clauses beginning with non-infinitive verbs, the word please preceding a verb, and phrases comprised of you or OP (which stands for Original Poster and is a popular way of referring to the author of the post on Reddit) and a modal verb. We excluded sentences that started with the auxiliary verbs do and have to avoid counting questions with syntax similar to imperative expressions. We calculated the percentage over the number of all words divided by 2, since most of the imperative expressions we looked for were bigrams. The score range was 0 to 1.

you need to           have you thought about
op needs to           have you tried
you have to           how about
op has to             if I were you
it might be worth     recommend
I would               suggest
it would be good to   advise
it might be good to   you could always
you had better        have you considered
you'd better          why not
your only option is   why don't you
[link]

Table 2: A complete list of advice expressions used by the Advice Score feature.
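As a concrete illustration of the Relatability score defined above, the calculation can be sketched in a few lines of Python. This is a minimal sketch under our own assumptions (the pronoun set and the name relatability_score are ours), not the authors' implementation:

```python
# Share of first-person pronouns (possessives included) among all tokens
# of the comment, i.e. wordlist_withstops in the paper's terms.
FIRST_PERSON = {"i", "me", "my", "mine", "myself",
                "we", "us", "our", "ours", "ourselves"}

def relatability_score(wordlist_withstops):
    """Fraction of first-person pronouns in the token list, in [0, 1]."""
    if not wordlist_withstops:
        return 0.0
    hits = sum(1 for w in wordlist_withstops if w.lower() in FIRST_PERSON)
    return hits / len(wordlist_withstops)

tokens = "I think you should plan my day".lower().split()
score = relatability_score(tokens)  # 2 of the 7 tokens are first person
```

The Imperative and Advice scores follow the same pattern of counting matching expressions and dividing by a length-based denominator.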
Advice Score was defined as the number of advisory expressions in the text. For this purpose, we prepared a list of possible advisory expressions, including phrases like you need to, it might be worth, or if I were you, along with words like recommend or suggest. The full list is given in Table 2. This feature also counted website links, which we discovered to be a way of offering advice in many comments in our datasets. In the preprocessing stage we replaced all links with the token [link], which then also counted as an advisory expression. The overall comment Advice Score was divided by 10 to scale it down to the level of the other features.

Specificity Scores included six different features. They were first proposed by Deshpande et al. [2010] to extract suggestions and complaints from employee surveys and online product reviews. The goal was to find sentences containing specific content. We adapted these features into our study because initial analysis showed that comments containing specific advice were among the best rated on any given post. The calculations were performed for each sentence in the comment using sent_list, and the final scores for the entire comment were obtained by adding up all the sentence scores. The features were: Average Semantic Depth (ASD), Average Semantic Height (ASH), Total Occurrence Count (TOC), Count of Named Entities (CNE), Count of Proper Nouns (CPN) and Sentence Length (LEN).

We only slightly modified the calculations provided by [Deshpande et al., 2010]. For both ASD and ASH, we had to retrieve hypernymy/hyponymy hierarchies from the WordNet ontology (https://wordnet.princeton.edu) for each content word (meaning nouns, verbs, adjectives and adverbs). For each word, the longest path in the hierarchy that led from the word to its highest hypernym determined the Semantic Depth of that word. Similarly, the shortest path from the word to its lowest hyponym determined its Semantic Height. To obtain the ASD score for the entire sentence, we added the ASDs of all the content words and divided the sum by the total number of content words in the sentence (by which we mean a sentence from sent_list with stopwords removed). We performed the same calculations for ASHs.

In contrast to [Deshpande et al., 2010], we added a new feature, ASHD, which combined Average Semantic Depth and Average Semantic Height by deducting the overall ASH score from the overall ASD score for each comment. This was done to reflect the difference between the two features, which were found to be important in previous studies.

Another change compared to [Deshpande et al., 2010] was that we did not change all content words into nouns before looking them up in WordNet. Nominalization was supposed to help with the lookup, as in 2010 the WordNet ontology was rather developed for nouns but scarce for other parts of speech. We felt no need to do this anymore in 2019 because WordNet has grown immensely since that time. We only lemmatized the words.

Total Occurrence Count (TOC) meant the number of times a word occurs in the WordNet ontology. We measured it by obtaining the occurrence count from WordNet for each lemmatized content word and adding up the three lowest scores in a sentence. Count of Named Entities (CNE) meant the number of named entities in the sentence, determined with the NLTK named entity tagger (https://www.nltk.org/). Count of Proper Nouns (CPN) was measured by the number of proper nouns (tagged as NNP, NNPS or CD) in the sentence with stopwords removed. CD stands for Cardinal Number and as such is not a proper noun, but it was included in the calculations provided by [Deshpande et al., 2010], so we decided to keep it. Sentence Length (LEN) was the number of words in the sentence with stopwords removed.

We divided ASD, ASH, ASHD, TOC and LEN by 100 and CNE and CPN by 10 to put the scores in the same numerical range as the other features.

Table 3 contains an example taken from the r/getdisciplined portion of our dataset. Both the original post and its three comments are included, as well as their respective feature scores and the general scores given to them by Reddit users. To conserve space in the table, we removed the new line breaks, but otherwise we kept the text intact.

In addition to our 14 advice features, we used word2vec word embeddings of the texts. Specifically, we obtained a vector for each word in the text and took the average to represent the entire text. The word2vec embeddings had 100 dimensions. We concatenated them with our advice features, ending up with 114 features in total for each comment text.

Post text (Post score: 15): I am addicted to sleeping. I think the reason for that is because I cannot tolerate my thoughts and the real world. But after spending years like this, I feel awful for sleeping so much. It's not like I sleep 15 hours a day but this habit of mine leads to being absent for classes twice a month and skipping half of gym sessions. Above all I don't bother to improve my life style. With this attitude of mine seeing any kind of future for myself is impossible! Can you give me tips and suggestions how to overcome this bad addiction? If you introduce a reading source, also I would appreciate it a lot. Edit: I don't sleep 15 hours a Day but I am sure I am addicted to sleeping!

Comment text (Comment score: 20, rank 0): If you're sleeping 15 hours a Day regularly for no apparent reason you need to see a doctor
  aptitude 0.094, attention -0.119, pleasantness 0.123, sensitivity -0.035, Relatability score 0.000, Imperative score 0.000, Advice score 0.100
  ASD 0.157, ASH 0.152, ASHD 0.005, TOC 0.000, CNE 0.000, CPN 0.000, LEN 0.090

Comment text (Comment score: 7, rank 1): You associate your sleeping to not tolerating your thoughts. To achieve higher capacity in managing your thoughts, have you heard of Mindfulness work? It's a simple technique with effects showing already after a short while.
  aptitude 0.179, attention 0.245, pleasantness 0.136, sensitivity -0.013, Relatability score 0.000, Imperative score 0.048, Advice score 0.000
  ASD 0.460, ASH 0.364, ASHD 0.096, TOC 0.000, CNE 0.000, CPN 0.000, LEN 0.180

Comment text (Comment score: 3, rank 2): I can relate a little bit, as I too love sleep and try to avoid being alone with my own thoughts. I still love sleep, but finding podcasts I really like has helped me with the avoiding my thoughts part. Then I can use them as a bribe to myself..."I can only listen to this on my drive to work/walk to class." "I can only listen to this one at the gym." YMMV, but if you can find an addictive one, or one you find genuinely funny/entertaining, the bribery works. And then if you are one of those "I'm fine as long as I GET there" people for class/gym/work, you can look forward to the getting there part.
  aptitude -0.041, attention 0.029, pleasantness -0.084, sensitivity 0.114, Relatability score 0.100, Imperative score 0.042, Advice score 0.000
  ASD 1.188, ASH 1.106, ASHD 0.082, TOC 0.080, CNE 0.000, CPN 0.400, LEN 0.540

Table 3: A single data example from our dataset.

4 System Architecture

The basic unit of our data was a triple of comments coming from the same post.

Each comment text had 114 features. To construct our input data, in each comment triple we concatenated the feature vectors into one vector of length 342 (= 3 x 114). Before concatenation, we shuffled each triple and obtained all 6 permutations of their order. Therefore, each triple was present in the dataset 6 times, each time with a different order of comments. This was done to lessen the impact of order on the results of the network.
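The construction of permuted training examples described above can be sketched as follows. This is an illustrative reimplementation under our own naming (make_training_examples is a hypothetical helper), not the authors' code:

```python
from itertools import permutations

def make_training_examples(triple):
    """triple: three (feature_vector, rank) pairs, one per comment, where
    each feature_vector has 114 entries and rank is 0, 1 or 2 (0 = best).
    Returns the 6 permuted (input, target) pairs: each input concatenates
    the three vectors into one vector of length 342, and each target lists
    the ranks in the permuted comment order, e.g. [2, 0, 1]."""
    examples = []
    for order in permutations(range(3)):
        x = [value for i in order for value in triple[i][0]]
        y = [triple[i][1] for i in order]
        examples.append((x, y))
    return examples
```

Presenting every triple in all 6 orders is what discourages the network from associating a particular input slot with a particular rank.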
As output, the network produced a vector of length 3, where each position gave a number 0, 1 or 2 depending on the rank and order of the comments. For example, if the first 114 features of the input vector represented a comment of rank 2, the next 114 features represented a comment of rank 0 and the last 114 represented a comment of rank 1, then the expected output was the vector [2, 0, 1].

Each comment had been rated by users, so by comparing their scores we were able to rank the comments with the numbers 0, 1 and 2, where 0 represented the best rated comment and 2 represented the lowest rated one. The ratings were not representative across the entire dataset; for example, a comment ranked 0 in its own comment triple may have been ranked 2 in a different triple (given they pertained to the same topic). However, this was not an issue, since our purpose was to select the best comment in the given fixed set of three.

To ensure that each comment text went through the same initial calculations, we constructed a convolutional neural network. The first layer had 342 units that matched our input vector. Then, we used a filter of length 114 and stride of 114, which essentially meant that each set of 114 comment features went through the same filter. This ensured that no matter the order of the comments, each one received equal treatment and had equal chances of being assigned any of the three ranks.

On top of the first convolutional layer, we had a second one, followed by three fully connected layers. Table 4 gives an overview of the network along with the data shape produced by each layer and operation. As we conducted the research using PyTorch, the order of dimensions for convolutional layers follows the PyTorch convention, which is: depth (= number of channels), height, width. The kernel size gives only height and width; the depth is exactly the same as that of the input to the given layer.

Layer     Units #        Output shape
Input     ---            (m, 1, 1, 342)
Conv1     114*(1, 114)   (m, 114, 1, 3)
Reshape   ---            (m, 3, 1, 114)
Conv2     3*(1, 3)       (m, 3, 1, 38)
Reshape   ---            (m, 3, 38)
Fc1       20             (m, 3, 20)
Fc2       10             (m, 3, 10)
Fc3       3              (m, 3, 3)
Reshape   ---            (3m, 3)
Argmax    ---            (3m, 1)

Table 4: Overview of the network. Conv stands for convolutional layers and Fc stands for fully connected layers. For convolutional layers, the number of units is the number of filters multiplied by the filter size used on the layer.

We reshaped the output of the first convolutional layer before passing it on to the next layer. The first layer gave output of shape (114, 1, 3) for each data entry. Essentially, this was a vector of length 3, where each position represented one comment and had a depth of 114, because each comment had been convolved by all 114 kernels. We reshaped the output into (3, 1, 114) and fed it as input to Conv2. This way, the kernels in the second convolutional layer operated on a vector of length 114, where each position represented one Conv1 kernel and had a depth of 3 representing the three comments. Each Conv2 kernel processed three of the 114 positions (with each position incorporating calculations from all comments), yielding 38 results from the processing. The point of this reshaping operation was to allow the Conv2 kernels to process subsets of Conv1 kernel results with all three comments each, instead of subsets of comments with all 114 Conv1 kernel results each. It was important to include information about all three comments for each Conv2 kernel operation, because our results depend on all feature relationships within the comment triple. We then reshaped the Conv2 layer results to reduce the number of dimensions from four to three so that we could pass them to a fully connected layer.

Each layer had the tanh activation function except for the last one, which used a softmax. The output shape from the last layer was (m, 3, 3), where m represents the batch size. Essentially, for each training example there were three comments to rank and each of these comments received its own softmax vector of length 3, where the rank was indicated by the position of the 1. For example, if the softmax produced the output [0, 1, 0] for a comment, that comment got rank 1, and if the output was [0, 0, 1] then the comment got rank 2. Therefore, for each training example the output shape was (3, 3), where the first 3 represents the three comments and the second 3 represents the length of the softmax. We then reshaped this output so that each comment became its own entry (shape of (3m, 3)). At the very end, we used the argmax function to reduce the output to shape (3m, 1), meaning one rank for each comment in the dataset.

5 Experiments and Results

5.1 Experiment Setup

We trained the network for 1000 epochs using the Adadelta [Zeiler, 2012] optimization algorithm with no changes to its default hyperparameters (this means that we did not set a learning rate manually). We also divided our training data into minibatches of 512. These hyperparameters were decided based on performance.

Overall, we had 22,140 examples in our dataset (a total of 3,690 downloaded triples, where each triple was present in the dataset 6 times). We performed 10-fold cross-validation with 19,926 training examples and 2,214 test examples in each fold. Before training and testing, the features were normalized using L2 normalization.

5.2 Results

We measured the performance of our system with accuracy. Table 5 presents the results broken down by fold. We also looked at our 14 advice features to see whether they correlated with the ranks. Although no single feature showed a significant linear correlation with the ranks (as measured by the Pearson coefficient), there are small differences in their mean and median values between ranks. Table 6 shows the comparison of raw feature values across the three ranks. We included more decimal points in the table to better reflect the differences. Table 7 presents statistical significance scores of the differences between feature values across rank pairs.

Fold      Training loss   Training accuracy   Test loss   Test accuracy
1         0.016           0.995               0.037       0.991
2         0.038           0.989               0.056       0.984
3         0.276           0.903               0.423       0.889
4         0.031           0.992               0.049       0.989
5         0.072           0.980               0.081       0.975
6         0.060           0.990               0.059       0.987
7         0.084           0.947               0.166       0.931
8         0.032           0.994               0.043       0.992
9         0.020           0.993               0.052       0.989
10        0.036           0.989               0.066       0.986
Average   0.067           0.977               0.103       0.971

Table 5: Results of loss and accuracy values across all folds.

Rank        aptitude   attention  pleasantness  sensitivity  Relatability score  Imperative score  Advice score
0  Mean     0.119163   0.087951   0.087890      0.051194     0.025531            0.069907          0.025881
   Median   0.128197   0.082036   0.087306      0.039533     0.009709            0.053333          0.000000
1  Mean     0.130136   0.087693   0.099144      0.051375     0.027642            0.066100          0.027425
   Median   0.141360   0.084107   0.107313      0.037422     0.012500            0.048780          0.000000
2  Mean     0.134223   0.082651   0.100406      0.050223     0.028502            0.066240          0.026667
   Median   0.139275   0.082451   0.095032      0.037500     0.012085            0.048780          0.000000

Rank        ASD        ASH        ASHD       TOC        CNE        CPN        LEN
0  Mean     1.000910   0.943619   0.057292   0.141257   0.000108   0.049593   0.354938
   Median   0.622520   0.588571   0.030000   0.000000   0.000000   0.000000   0.200000
1  Mean     1.047904   0.988323   0.059582   0.140745   0.000108   0.052602   0.378932
   Median   0.657738   0.620119   0.032348   0.000000   0.000000   0.000000   0.220000
2  Mean     1.004613   0.947719   0.056894   0.159900   0.000054   0.047940   0.359827
   Median   0.640000   0.602750   0.031667   0.000000   0.000000   0.000000   0.220000

Table 6: Feature values across different ranks. We bolded the highest mean value for each feature.
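The weight sharing behind the width-114, stride-114 filters of Conv1 described in Section 4 can be demonstrated in plain Python. This is a toy sketch under our own naming (conv1_shared is hypothetical), not the PyTorch model itself:

```python
def conv1_shared(x, filters):
    """x: flat input of length 342, i.e. three 114-feature comments.
    filters: weight vectors of length 114 (the full model uses 114 of
    them). With kernel width 114 and stride 114, each step of a filter
    covers exactly one comment, so all three comments pass through the
    same shared weights. Returns one activation per (filter, comment)
    pair, matching the (114, 1, 3) output shape of Conv1."""
    blocks = [x[i * 114:(i + 1) * 114] for i in range(3)]
    return [[sum(w * v for w, v in zip(f, b)) for b in blocks]
            for f in filters]
```

Because the weights are shared, permuting the three comment blocks only permutes the per-comment activations, which is exactly the equal-treatment property the paper relies on.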
6 Error Analysis and Discussion

For a research problem posed in this way, it was important that each of the three comment texts went through the same initial calculations. This could have been achieved by using a recurrent neural network, where each timestep, in our case a comment text, is processed by the same unit (for example a GRU [Cho et al., 2014] or LSTM [Hochreiter and Schmidhuber, 1997]) that has its parameters adjusted during the training process. However, our attempts at using an RNN were unsuccessful. Although the algorithm trained well (training set accuracy was usually above 0.95), these results did not generalize to the test set. Test accuracy was always around 0.33, which in this setting is random chance level. One reason for this may be that RNNs are particularly sensitive to the order of the timesteps, and even shuffling the comments did not help in alleviating this issue. The network kept overfitting the training set over the course of many epochs, but then was not able to make correct predictions on the previously unseen data of the test set. With the RNN, the prediction for each comment relied heavily on calculations made for the previous one(s), instead of the network looking at the triple as a group rather than as a sequence. Using a convolutional network solved this problem.

Furthermore, we made some interesting observations during training. First, the network worked only with very specific settings, namely with the Adadelta optimization algorithm and the tanh activation function. While searching for the best optimization method and activation function is routinely performed to yield the best results for a given network, the differences in accuracy between various choices were unusually large in our case. Despite repeated training, the algorithm did not converge with any other optimization algorithm, and the ReLU activation function, which we initially tried instead of tanh, caused the network to get stuck in a local minimum at a high error level. Second, around epoch 700-800 the error would briefly rise and then fall again to an even lower level. The tendency can be observed in Figure 1. This temporary drop in performance is usually caused by a learning rate that is too large for the given stage of training. It can be assumed that after Adadelta adjusted the learning rate, the network performance was able to rise again in the last epochs. Perhaps this is the reason why this particular optimization algorithm worked best in our case: other algorithms like Adam [Kingma and Ba, 2014] or SGD [Robbins and Monro, 1951] would get stuck and be unable to overcome this issue.

As is evident from the results presented in Table 5, we were able to achieve very high accuracy on our task. Perhaps this was caused by the relatively big amount of data; at over 22,000 examples the network had more than enough data to learn how to rank the comment texts.

Feature              Ranks 0-1   Ranks 1-2   Ranks 0-2
Aptitude             0.035       0.422       0.003
Attention            0.948       0.209       0.174
Pleasantness         0.028       0.803       0.014
Sensitivity          0.956       0.720       0.761
Relatability score   0.017       0.354       0.001
Imperative score     0.042       0.940       0.044
Advice score         0.234       0.560       0.542
ASD                  0.113       0.157       0.901
ASH                  0.112       0.162       0.884
ASHD                 0.228       0.163       0.837
TOC                  0.970       0.178       0.193
CNE                  1.000       0.564       0.414
CPN                  0.378       0.147       0.579
LEN                  0.054       0.111       0.682

Table 7: Statistical significance of differences between feature values across rank pairs. We bolded p values of 0.05 or lower.

Fold      Training loss   Training accuracy   Test loss   Test accuracy
1         0.263           0.888               0.403       0.866
2         0.347           0.866               0.375       0.840
3         0.150           0.945               0.181       0.929
4         0.353           0.849               0.439       0.828
5         0.184           0.919               0.250       0.898
6         0.190           0.943               0.211       0.932
7         0.212           0.923               0.282       0.901
8         0.207           0.920               0.245       0.903
9         0.376           0.880               0.390       0.849
10        0.248           0.923               0.287       0.899
Average   0.253           0.906               0.306       0.885

Table 8: Results of loss and accuracy values across all folds for the model trained only on word2vec features.
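The linear correlation check reported in Section 5.2 uses the standard Pearson coefficient. The sketch below is our own helper, not the authors' code; it shows the calculation that would be applied to one feature's values against the comment ranks:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples,
    e.g. one advice feature's values and the comment ranks (0, 1, 2).
    Returns a value in [-1, 1]; values near 0 mean no linear relation."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)
```

A near-zero coefficient for every single feature, as the paper reports, is consistent with the ranking signal lying in feature combinations rather than in any one feature alone.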
Figure 1: Overview of cost progression across 1000 training epochs for all folds. Each fold is marked with a different color.

[...] how to rank the comment texts. However, this task was performed on text triples, which means that the results are valid only in a very specific setting. Ideally, we would like to have a network able to rank the quality of advice contained in any given single text. However, our network specifically takes a triple of texts as input, and it was not trained to recognize objective advice quality, but rather to select the best advice text from a given triple regardless of the overall quality level in that triple. Therefore, right now it cannot be used to judge how good a piece of advice is without any comparison. Constructing a network capable of accomplishing this task based on our current findings is a topic for future studies. Likewise, we assumed that all advice comments were on topic, because they were downloaded from their respective threads as responses to another user's post, but this may not always be the case with raw data obtained in a different manner. Therefore, further experiments will involve determining whether the advice is thematically appropriate for the given problem.

We performed error analysis on all the folds. First, we prepared confusion matrices to see which ranks were most commonly confused with each other. We found no clear tendency across all folds. However, we calculated mean numbers of misranked comments for all confusion matrices and found that the numbers were slightly higher for comments that were misranked as 2 despite the true label being 0 or 1. In other words, for comments with true label 0 there were more comments misranked as 2 than misranked as 1, and likewise, for comments with true label 1 there were more misranked as 2 than as 0. While this result is not conclusive in any way, it shows an interesting quality of our algorithm.

We also analyzed the content of the misranked comments. Since in our setup each triple of comments was present in the dataset six times, each comment could appear multiple times in the test set. In such cases, comments that were misranked once tended to be misranked again on some of their subsequent appearances in the test set as well. This suggests that some single comments may be a bit troublesome for our algorithm, although the percentage of such comments in the overall dataset is negligible. Moreover, once again we found no clear characteristics of such misranked comments compared to those that were ranked correctly. This was also the case with other comments that got misranked only once. Such findings suggest that our algorithm has no defined bias in ranking, but makes mistakes randomly, which can be expected with a neural network.

Other misranked comments were of the [deleted] or [removed] kind. Comments with this kind of content were either removed by the moderators of the subreddit or deleted by the users themselves. Such comments are usually inappropriate or rude and would not receive a lot of points, which means they would usually rank the lowest in any given triple in our dataset. However, it is possible that some such comments contained content that was upvoted by people agreeing with the rude or inappropriate message, so that at the time we downloaded the data the comment had a high score despite being already deleted or removed. It is also possible that a user shared advice that was good enough to get a lot of upvotes, but then deleted it from the discussion because they decided it revealed too much about them after all. This is an occasional occurrence not only in the advice subreddits, but also in subreddits concerning other personal issues, for example mental health. Whatever the cause, the [deleted] and [removed] comments were misleading to our algorithm, as the features were calculated not from the original content of the comment (which was no longer available), but from the single words "deleted" and "removed" respectively. As a result, the features were not informative enough to perform the ranking correctly. We did foresee the problems that such comments might pose when gathering the data, but removing any of them from the dataset would result in removing the entire triple, which we wanted to avoid. Moreover, the [deleted] and [removed] comments were only a small fraction of our misranked comments from the test set. They most likely did not hinder the training process either, as neural networks are rather robust against occasional noise in the data.

It must be noted here that the algorithm performed the final ranking by taking the argmax of a 3x3 matrix with one-hot vector rows, so the rankings were not interchangeable, but independent at this point. In other words, a misranked comment in the triple did not translate to another comment being misranked by exchanging their mutual rankings. This means that even if a triple contained a misranked comment, other comments in that triple could be (and usually were) ranked correctly.
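The row-wise argmax step can be sketched as follows; the network outputs shown are invented for illustration, and rank 0 is assumed to denote the best-rated comment:

```python
import numpy as np

# Hypothetical network output for one triple: row i holds the predicted
# rank distribution for comment i (near one-hot after training).
output = np.array([
    [0.1, 0.2, 0.7],   # comment 0: most likely rank 2
    [0.8, 0.1, 0.1],   # comment 1: most likely rank 0
    [0.2, 0.6, 0.2],   # comment 2: most likely rank 1
])

# Row-wise argmax: each comment's rank is decided independently, so one
# misranked comment does not force another comment to swap ranks with it.
ranks = output.argmax(axis=1)   # -> array([2, 0, 1])

# The comment assigned rank 0 would be handed to the dialogue system.
best = int(np.argmin(ranks))    # -> 1
```

Because each row is decoded on its own, the three predictions are independent, which is exactly why a single misranking need not corrupt the rest of the triple.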
As can be seen in Tables 6 and 7, the differences in feature values between ranks were relatively small. For many features, like Advice Score or the Specificity Scores, those differences were not statistically significant. This suggests that perhaps the network would not be able to rank advice texts based on these features alone and that it benefitted from the word2vec features as well.

Following up on these findings, we conducted additional experiments using solely word2vec features to see how much impact our 14 advice features had on the algorithm. We slightly adjusted the network architecture to accommodate the new input shape, which was (m, 1, 1, 300) instead of (m, 1, 1, 342). Therefore, the Conv1 layer had 114 filters of shape (1, 100) instead of (1, 114). After that point, the input/output dimensions and further layers remained the same as in our models for all features.
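The shape arithmetic behind this adjustment can be checked with a short calculation. The non-overlapping stride (equal to the filter width, so that each filter position covers exactly one comment's feature block) is an assumption made for illustration:

```python
def conv_output_width(input_width, filter_width, stride):
    """Number of valid (no-padding) filter positions along one axis."""
    return (input_width - filter_width) // stride + 1

# Full model: 3 comments x (14 advice + 100 word2vec) = 342 features,
# filters of width 114 -> one filter position per comment.
full = conv_output_width(342, 114, 114)       # 3 positions

# word2vec-only model: 3 comments x 100 features = 300,
# filters of width 100 -> again one position per comment.
w2v_only = conv_output_width(300, 100, 100)   # 3 positions
```

Under this assumption both variants yield three filter positions, one per comment, which is consistent with the later layers keeping the same input/output dimensions.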
We trained the model with exactly the same hyperparameters and exactly the same number of epochs, which was 1000. The results of these experiments can be seen in Table 8. The average test accuracy was only 0.89, compared to 0.97 in Table 5. This shows that even though the word2vec features were important in our study, our 14 advice features also played a significant role in achieving good accuracy in the experiments.

We were also able to identify the most important features in our study: aptitude, pleasantness, Relatability Score and Imperative Score, since the differences in their values between ranks were statistically significant. The sentics are associated with dichotomies such as ecstasy-grief for pleasantness and admiration-loathing for aptitude. Our findings suggest that these emotions may be more important in advice texts than others, like vigilance-amazement associated with attention or rage-terror associated with sensitivity. A lot of posts in our dataset described the author's dissatisfaction with their current life and their desire to change and be happier. Because of this, emotions such as fear, anger or anticipation may have been less present in the advice comments, while talk about sadness, trust or joy seemed to be more prominent, especially when it comes to motivational advice. The statistical significance of Imperative Score and Relatability Score is noteworthy as well; we designed these features to reflect the fact that the best rated advice comments tended to contain a lot of imperative expressions and were authored by people who related to the given problem. On the other hand, the lack of significant differences between ranks in Advice Score is not surprising; all comments contained some form of advice, so advice expressions were obviously present in all of them. It seems that, rather than the presence of advice expressions, the manner of giving advice was more significant, specifically how many first-person pronouns and imperative phrases were included in the text. As can be seen in Table 5, the best ranked comments had the highest Imperative Score, while the lowest rated comments had the highest Relatability Score. This suggests that while it is important to use imperative expressions when giving advice and to relate to the given problem, too much self-talk detracts from the advice's quality.

Finally, the data from Table 5 sheds new light on previous findings concerning the features. Error analysis of experiments conducted in [Swieczkowska et al., 2018] revealed differences in feature values between texts containing advice and regular ones. For almost all advice features except Average Semantic Height (which at that point was calculated differently than described here), the values were perceptibly higher for advice texts than for regular texts. This is why they could be used for the classification task mentioned in the Introduction section. We assumed that, similarly, the values of the advice features would be higher for good quality advice compared to lower quality advice. However, the difference in features between rank 0 and rank 2 is small. Interestingly, some features are highest for the middle rank 1, for example sensitivity, Advice Score or ASHD. All these findings suggest that the relationship between these features and advice quality is complicated and not readily visible, which is in line with our findings about the lack of linear correlations between any given feature and advice rank.

7 Conclusions

In this paper we have presented a convolutional neural network able to rank online comments containing advice based on advice quality, as judged by other online users. While this method cannot be used to determine the objective quality of a piece of advice, it is useful for selecting the best advice in a given group of texts. This can be useful in creating a motivational dialogue system, for example by choosing the best advice from three candidate outputs of the system and presenting that advice to the user as the final output. We were also able to identify specific measurable qualities of a good advice text, such as scoring high on aptitude, pleasantness and Imperative Score while maintaining Relatability Score on a lower level.

Acknowledgements

The authors of this paper would like to thank Prof. Hiroyuki Iizuka of Hokkaido University for his invaluable advice on this project. This work was supported by JSPS KAKENHI Grant Number 17K00295.
References

[Badubi, 2017] Reuben M. Badubi. Theories of motivation and their application in organizations: A risk analysis. International Journal of Innovation and Economic Development, 3(3):43–50, 2017.

[Callejas and Griol, 2016] Zoraida Callejas and David Griol. An affective utility model of user motivation for counselling dialogue systems. In International Workshop on Future and Emerging Trends in Language Technology, pages 86–97. Springer, 2016.

[Cambria et al., 2018] Erik Cambria, Soujanya Poria, Devamanyu Hazarika, and Kenneth Kwok. SenticNet 5: Discovering conceptual primitives for sentiment analysis by means of context embeddings. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, pages 1795–1802, 2018.

[Cho et al., 2014] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.

[Deshpande et al., 2010] Shailesh Deshpande, Girish Keshav Palshikar, and G. Athiappan. An unsupervised approach to sentence classification. In International Conference on Management of Data (COMAD), page 88, 2010.

[Elmelid et al., 2015] Andrea Elmelid, Andrew Stickley, Frank Lindblad, Mary Schwab-Stone, Christopher C. Henrich, and Vladislav Ruchkin. Depressive symptoms, anxiety and academic motivation in youth: Do schools and families make a difference? Journal of Adolescence, 45:174–182, 2015.

[Fang et al., 2017] Changjian Fang, Dejun Mu, Zhenghong Deng, and Zhiang Wu. Word-sentence co-ranking for automatic extractive text summarization. Expert Systems with Applications, 72:189–195, 2017.

[Fussner et al., 2018] Lauren M. Fussner, Kathryn J. Mancini, and Aaron M. Luebbe. Depression and approach motivation: Differential relations to monetary, social, and food reward. Journal of Psychopathology and Behavioral Assessment, 40(1):117–129, 2018.

[Gerhart and Fang, 2015] Barry Gerhart and Meiyu Fang. Pay, intrinsic motivation, extrinsic motivation, performance, and creativity in the workplace: Revisiting long-held beliefs. Annual Review of Organizational Psychology and Organizational Behavior, 2(1):489–521, 2015.

[He et al., 2010] Helen Ai He, Saul Greenberg, and Elaine M. Huang. One size does not fit all: Applying the transtheoretical model to energy feedback technology design. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 927–936. ACM, 2010.

[Hershenberg, 2017] Rachel Hershenberg. Activating happiness: A jump-start guide to overcoming low motivation, depression, or just feeling stuck. New Harbinger Publications, 2017.

[Hochreiter and Schmidhuber, 1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[Jabri et al., 2018] Siham Jabri, Azzeddine Dahbi, Taoufiq Gadi, and Abdelhak Bassir. Ranking of text documents using tf-idf weighting and association rules mining. In Proceedings of the 4th International Conference on Optimization and Applications (ICOA), pages 1–6. IEEE, 2018.

[Kaptein et al., 2012] Maurits Kaptein, Boris De Ruyter, Panos Markopoulos, and Emile Aarts. Adaptive persuasive systems: A study of tailored persuasive text messages to reduce snacking. ACM Transactions on Interactive Intelligent Systems (TiiS), 2(2):10, 2012.

[Kaptein et al., 2009] Maurits Kaptein, Panos Markopoulos, Boris de Ruyter, and Emile Aarts. Can you be persuaded? Individual differences in susceptibility to persuasion. In Proceedings of the IFIP Conference on Human-Computer Interaction, pages 115–118. Springer, 2009.

[Kingma and Ba, 2014] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[Litalien et al., 2015] David Litalien, Frederic Guay, and Alexandre J. Morin. Motivation for PhD studies: Scale development and validation. Learning and Individual Differences, 41:1–13, 2015.

[Malhotra et al., 2018] Dheeraj Malhotra, Monica Malhotra, and O. P. Rishi. An innovative approach of web page ranking using Hadoop and MapReduce-based cloud framework. In Proceedings of Big Data Analytics, pages 421–427. Springer, 2018.

[Myangah and Rezai, 2016] Tayebeh Mosavi Myangah and Mohammad Javad Rezai. Persian text ranking using lexical richness indicators. Glottometrics, 35:6–15, 2016.

[Robbins and Monro, 1951] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.

[Schneider and Kummert, 2016] Sebastian Schneider and Franz Kummert. Motivational effects of acknowledging feedback from a socially assistive robot. In Proceedings of the International Conference on Social Robotics, pages 870–879. Springer, 2016.

[Sethi and Dixit, 2019] Shilpa Sethi and Ashutosh Dixit. A novel page ranking mechanism based on user browsing patterns. In Proceedings of Software Engineering, pages 37–49. Springer, 2019.

[Süssenbach et al., 2014] Luise Süssenbach, Nina Riether, Sebastian Schneider, Ingmar Berger, Franz Kummert, Ingo Lütkebohle, and Karola Pitsch. A robot as a fitness companion: Towards an interactive action-based motivation model. In Proceedings of the 23rd IEEE International Symposium on Robot and Human Interactive Communication, pages 286–293. IEEE, 2014.

[Swieczkowska et al., 2017] Patrycja Swieczkowska, Jolanta Bachan, Rafal Rzepka, and Kenji Araki. Asystent – A prototype of a motivating electronic assistant. In Proceedings of the Linguistic And Cognitive Approaches To Dialog Agents (LaCATODA 2017), pages 11–19. CEUR Workshop Proceedings, 2017.

[Swieczkowska et al., 2018] Patrycja Swieczkowska, Rafal Rzepka, and Kenji Araki. Analyzing motivation techniques in emotionally intelligent dialogue systems. In Proceedings of the Biologically Inspired Cognitive Architectures Meeting, pages 355–360. Springer, Cham, 2018.

[Vajjala and Meurers, 2016] Sowmya Vajjala and Detmar Meurers. Readability-based sentence ranking for evaluating text simplification. arXiv preprint arXiv:1603.06009, 2016.

[Zeiler, 2012] Matthew Zeiler. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.