Quote Recommendation for Dialogs and Writings

Yeonchan Ahn*, Hanbit Lee*, Heesik Jeon**, Seungdo Ha* and Sang-goo Lee*
*School of Computer Science and Engineering, Seoul National University, Seoul, Korea
**Artificial Intelligence Team, Samsung Electronics Co., Ltd., Seoul, Korea
*{skcheon, acha21, seungtto, sglee}@europa.snu.ac.kr, **heesik.jeon@samsung.com

ABSTRACT
Citing proverbs and (famous) statements of other people can provide support, shed new perspective, and/or add humor to one's arguments in writings or dialogs. Recommending a quote for a dialog or a piece of writing can be done by considering various features of the current text, which we call the context. We present five new approaches to quote recommendation: 1) methods that adjust the matching granularity for better context matching, 2) a random forest based approach that utilizes word discrimination, 3) a convolutional neural network based approach that captures important local semantic features, 4) a recurrent neural network based approach that reflects the ordering of sentences and words in the context, and 5) rank aggregation of these algorithms for maximum performance. As baselines, we adopt the state of the art in citation recommendation and quote recommendation. Experiments show that our rank aggregation method outperforms the best baseline by up to 46.7%. As candidate quotes, we use famous proverbs and famous statements of other people in dialogs and writings. The quotes and their contexts were extracted from Twitter, Project Gutenberg, and a Web blog corpus.

CCS Concepts
• Information systems ~ Recommender systems • Natural language processing

Keywords
Quote recommendation; Context matching; Random forest; Convolutional Neural Network; Recurrent Neural Network; Rank aggregation

1. INTRODUCTION
Citing proverbs and (famous) statements of other people is an important part of conversation and writing. Such quotes or quotations can provide support, shed new perspective, and/or add humor to one's arguments. However, it is not easy for a person to find an appropriate quote for a given context from a large number of candidates, since the words in a quote are usually metaphorical. Quote recommendation can be viewed as the task of searching for or recommending short texts that are appropriate to the current writing or dialog context.

[Figure 1. An example of quote usage in a Twitter thread]

On investigating our collected datasets, we found that various features of the context, such as keywords, topic, n-grams, and latent semantics, can be exploited in the recommendation. For example, a word matching-based algorithm that ranks quotes by the cosine similarity between the query and the contexts of each quote is able to find the correct quote in Figure 1, since many contexts of the same quote in the training dataset mention keywords such as casino and luck, while the contexts of other quotes do not. Also, some quotes are closely related to specific situations, topics, or the semantics behind the query, not only to its keywords.

In this paper, we present five new approaches for quote recommendation based on observations in our datasets: 1) methods that adjust the matching granularity for better context matching, 2) a Random Forest (RF) based approach that utilizes word discrimination, 3) a convolutional neural network (CNN) based approach that captures important local semantic features, 4) a recurrent neural network (RNN) based approach that reflects the ordering of sentences and words in the context, and 5) rank aggregation of these algorithms for maximum performance. As baselines, we adopt previous works on citation recommendation [1, 3] and quote recommendation [7]. Experiments show that the proposed approaches significantly outperform the baseline methods on real-world datasets.

CBRecSys 2016, September 16, 2016, Boston, MA, USA.
Copyright remains with the authors and/or original copyright holders.

2. RELATED WORKS
Quote recommendation in writing was introduced by Tan, et al. [7]. Quote recommendation is the task of recommending a ranked list of quotes that are relevant to the current body of text, which we call the context. We separate the context into pre-context and post-context, which refer to the text that appears before and after a quote within a certain fixed length, respectively. For dialogs, unlike for writings, we only use pre-contexts, because post-contexts are usually unavailable for on-the-fly recommendation of quotes during a conversation in real-world applications. We define a query as a context for which the user desires a list of recommended quotes. Figure 1 shows an example of quote usage in our Twitter dataset. In this example, the block of text that appears before the quote 'Strike while the iron is hot' is the pre-context.

Most related work is on citation recommendation for academic articles [1, 3], which recommends relevant reference articles for academic writing. For citation recommendation, rich information about each paper, such as its title, abstract, full text, and venue, can be exploited. In contrast, such rich information is not available in quote recommendation, which makes the task more challenging. Tan, et al. [7] were the first to present a method for recommending quotes. They apply a learning-to-rank approach with several features, namely quote-context, context-context (or context-query), and quote features. Their experiments show that the algorithm depends heavily on the context-context feature. However, we argue that the context-context features have not been explored sufficiently. We therefore focus on how to mine the semantics in the contexts of quotes for recommendation.
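The pre-context/post-context split described above can be sketched as follows (a minimal illustration; the function name and the fixed 150-character window are our assumptions, not the authors' code):

```python
def split_context(text, quote, window=150):
    """Split `text` around the first occurrence of `quote` into a
    fixed-length pre-context and post-context (either may be empty)."""
    pos = text.find(quote)
    if pos < 0:
        return None  # the quote does not occur in this text
    pre = text[max(0, pos - window):pos]
    post = text[pos + len(quote):pos + len(quote) + window]
    return pre, post

pre, post = split_context(
    "We should act now. Strike while the iron is hot. Tomorrow may be too late.",
    "Strike while the iron is hot.")
# `pre` holds the text before the quote, `post` the text after it.
```

For a dialog, only `pre` would be used as the query, since the post-context is not yet available during the conversation.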
3. APPROACHES
In this section, we describe four approaches and our rank aggregation method, which combines the four approaches for quote recommendation.

3.1 Matching Granularity Adjustment
In this section we discuss methods for handling the contexts of quotes when measuring the relevance between a query and the set of contexts of a quote, which we call matching granularity adjustment. Since the words used in a quote (and their usage) differ from those in its contexts, the state-of-the-art methods in quote/citation recommendation [1, 3, 7] measure the relevance between the query and the contexts of a quote. More specifically, all of them compare each individual context of a quote to the query. A drawback of this approach is that it suffers from a sparsity problem: the words in a query often do not match any individual context of the correct quote. In order to alleviate this sparsity problem, we propose methods that adjust the matching unit of the contexts to the given query. We believe that more semantics can be exploited if the contexts of a quote are treated collectively.

First, we propose a method called context clustering, which groups the contexts of a quote into context clusters that represent (latent) topics. In the collected dataset, we observed that a number of quotes can be used for different topics. For example, the quote 'All work and no play makes Jack a dull boy' can be used in very different situations, such as 'overworking in the workplace' or 'educating children'. Thus, when dealing with a query about a specific topic, we need to consider only the contexts related to that topic among the different topics of a quote. In context clustering, we first cluster the contexts of each quote, and then exploit the context clusters to measure the relevance of the quote. For clustering, we adopt the affinity propagation algorithm, which is known to perform better than other algorithms on short text clustering [6].
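A minimal sketch of this per-quote clustering step, using scikit-learn's `AffinityPropagation` on TF-IDF vectors (the toy contexts and default parameters are illustrative; the paper does not specify an implementation):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AffinityPropagation

# Toy contexts of one quote, drawn from two different topics.
contexts = [
    "my boss makes me work every weekend",
    "overtime again, no rest at this workplace",
    "kids need playtime, not only homework",
    "let children play after school",
]

# Represent each context as a bag-of-words TF-IDF vector.
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(contexts)

# Affinity propagation chooses the number of clusters itself,
# which is convenient since the topic count per quote is unknown.
ap = AffinityPropagation(random_state=0).fit(X.toarray())
labels = ap.labels_  # cluster id of each context
```

The contexts sharing a label would then be concatenated to form one matching unit per cluster.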
Based on context clustering, we propose a scoring function for a given query q:

score(q, t) = max_i sim(q, c_i(t))

where c_i(t) is the concatenated text of the i-th context cluster of quote t, and sim(·,·) is the cosine similarity between the TF-IDF vector representations of its arguments.

In order to further address the sparsity problem, we present another method, called context lumping, to adjust the matching granularity. In context lumping, we simply concatenate all the contexts of each quote and use the result as the matching unit for the query. The lumped context of a quote is then compared to the query by the cosine similarity of their TF-IDF vector representations. In both context clustering and context lumping, quotes are sorted by the proposed similarity scores in descending order.

3.2 Random Forest
In the collected dataset, we observe that some simple rules, such as checking whether the given context contains certain words, are reliable cues to its correct label. For example, in the Twitter dataset, given that a context contains the keywords invite, join, come over, or any of their morphemes, there is a 40.2% probability that the context is labeled with the proverb 'the more the merrier'. From this observation, we explore the possibility of adopting a tree-based classification algorithm for the quote recommendation task. Among various decision tree algorithms, RF [5] is an ensemble learning method that has had notable success in various fields due to its resilience to over-fitting and its tendency to exhibit low variance and bias. RF constructs decision trees by training each tree on a random subset of the features, which lets each tree learn the most discriminating quotes at each split, and aggregates the results by voting. In the case of our dataset, we view contexts as 'documents' and use bag-of-words TF-IDF values as the features of each context. Then, we train the Random Forest classifier using the TF-IDF vectors and their correct labels, i.e., quotes. To the best of our knowledge, this is the first time RF classification has been used for quote recommendation.

3.3 Convolutional Neural Network
Word matching-based methods such as the context-aware relevance model [1] and the citation translation model [3] have difficulty exploiting n-gram features because of the sparsity problem, so they only use unigram-based features. However, n-gram features are important because many phrases are meaningful only when their terms stay together. For example, the phrasal verb give up loses its meaning when it is tokenized into give and up. Unlike matching-based methods, a CNN-based approach can exploit important n-gram features in the context by learning the parameters of fixed-size filters over n-grams. Generally, a CNN is composed of several pairs of convolution and max-pooling layers, which capture local patterns in the training examples and down-sample the extracted features in order to prevent overfitting. When a CNN is applied to natural language sentences, it captures significant local semantics, i.e., n-grams. We adopted a single-layer CNN, mainly inspired by [4], which reports that a simple CNN model performs similarly to complex models with several convolution-pooling layers, to capture distinguishing n-gram features in the contexts of quotes. Our CNN model takes a context in the form of a list of word embedding vectors for the words in the context. This input matrix is fed to a layer composed of a single convolution layer and a max-pooling layer, after which the output vector is fed to a fully connected softmax layer that computes the probability of each candidate quote and ranks the quotes. We use a filter size of 3 and 500 hidden nodes in the hidden layer, and we apply dropout to prevent overfitting.
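A minimal numpy sketch of this single-layer architecture on one context (randomly initialized weights and toy dimensions; the real model uses 500 filters of width 3 over pretrained embeddings and is trained with dropout, both omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)

emb_dim, n_filters, width, n_quotes, seq_len = 8, 4, 3, 5, 10

X = rng.normal(size=(seq_len, emb_dim))           # context as word embeddings
W = rng.normal(size=(n_filters, width, emb_dim))  # convolution filters (width-3 n-grams)
U = rng.normal(size=(n_quotes, n_filters))        # fully connected softmax layer

# Convolution: slide each filter over every 3-gram window of the context.
conv = np.array([[np.sum(W[f] * X[i:i + width])
                  for i in range(seq_len - width + 1)]
                 for f in range(n_filters)])
pooled = conv.max(axis=1)          # max-pooling over time

logits = U @ pooled
probs = np.exp(logits - logits.max())
probs /= probs.sum()               # probability of each candidate quote
ranking = np.argsort(-probs)       # quote ids, best first
```

Training would fit `W`, `U`, and optionally the embeddings by minimizing cross-entropy against the correct quote label.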
3.4 Recurrent Neural Network
We use an RNN to tackle the quote recommendation problem from the perspective of language modeling: we treat each quote as a special token or word and compute its probability given the context. While none of the above approaches uses the order information of the words in the context, an RNN-based approach can model such word sequences recursively. We use long short-term memory (LSTM) units [2], a recurrent neural network whose three gates (forget, input, output) control the network so that it learns long-term dependencies without loss of information. The input vector of each time step passes through the three gates and updates the latent vectors that the LSTM retains. In our model, we recurrently feed the LSTM the sequence of words of the context in the form of a list of word embedding vectors, using pre-trained word embeddings to map each word to its vector. The output vector of the LSTM layer is passed to a fully connected layer and a softmax layer in order to compute the probability of each target quote to be recommended. We use a 500-dimensional hidden vector in the LSTM and also apply dropout.

3.5 Rank Aggregation
We observed that the previously proposed algorithms produce different recommendation results depending on the query (we will discuss this in the experiments section). This suggests that instead of relying on the single best ranking algorithm, it is better to aggregate the rank values of all of the single algorithms to produce an accurate and robust ranking, which is called rank aggregation (RA). We propose two methods that aggregate the individual ranking results of the previously proposed algorithms. The traditional rank aggregation method Borda [8] assigns each candidate quote a score inversely proportional to its position in the ranked list of each individual algorithm, and the scores of each quote are added up into its final score. We observed that Borda cannot handle the case where one or two inaccurate rankings from individual algorithms lower the accuracy of the final aggregated rank.
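The Borda scheme just described can be sketched as follows (toy ranked lists; ties broken arbitrarily):

```python
from collections import defaultdict

def borda(rankings):
    """Aggregate ranked lists of quote ids.

    Each list of length n contributes (n - position) points to a quote,
    so earlier positions earn higher scores."""
    scores = defaultdict(int)
    for ranked in rankings:
        n = len(ranked)
        for pos, quote in enumerate(ranked):
            scores[quote] += n - pos
    return sorted(scores, key=scores.get, reverse=True)

# Three individual rankers over quotes A, B, C.
final = borda([["A", "B", "C"], ["B", "A", "C"], ["A", "C", "B"]])
# final == ["A", "B", "C"]
```

Because the scores are simply summed, one badly wrong ranker can still drag a correct quote down, which motivates the multiplicative variants below.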
In order to cope with this issue, we propose a rank aggregation method called Rank Multiplication (RM), which multiplies the ranks assigned to each quote by the individual algorithms. With this method, a quote keeps a high final rank only when all of the individual rankers consistently rank it high, and the result of an inaccurate ranking algorithm is given less weight. The final RM score is defined as follows:

score_RM(q, t) = 1 / ∏_{a ∈ A} r(a, q, t)

where r(a, q, t) is the position of candidate quote t in the ranked list of individual ranking algorithm a given query q, and A is the set of algorithms. The quotes are ordered by this score in descending order.

We assume that the high ranks of the individual ranking algorithms are more dependable than their lower ranks. From this assumption, we propose a second rank aggregation method, called top-k Rank Multiplication (top-k RM), that multiplies only the k rank values of a quote from the k single algorithms that rank it highest for a query. The final top-k RM score is defined as follows:

score_topk(q, t) = 1 / ∏_{a ∈ A_k(q, t)} r(a, q, t)

where A_k(q, t) is the set of k algorithms that yield the k highest rank positions given query q and quote t.
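A minimal sketch of the RM and top-k RM scores for one quote (1-based rank positions; the toy values are illustrative):

```python
import math

def rm_score(positions):
    """Rank Multiplication: inverse of the product of all rank positions."""
    return 1.0 / math.prod(positions)

def topk_rm_score(positions, k):
    """top-k RM: multiply only the k best (smallest) rank positions."""
    return 1.0 / math.prod(sorted(positions)[:k])

# A quote placed at positions 1, 2 and 40 by three rankers:
rm = rm_score([1, 2, 40])              # one inaccurate ranker hurts the score
topk = topk_rm_score([1, 2, 40], k=2)  # the outlier position is ignored
```

Quotes are then sorted by these scores in descending order, exactly as with the cosine-based scores of Section 3.1.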
4. EXPERIMENTS
4.1 Data Construction
We collected 439,655 quotes from three sources: Wikiquote¹, the Oxford Concise Dictionary of Proverbs², and the Library of Quotes³. For the context data, we searched for blocks of text containing these quotes in three different corpora: 2 million tweet threads from Twitter (~2015.11.15), 20GB of electronic books from the Project Gutenberg database⁴, and the 190GB ICWSM Spinn3r 2009 blog dataset⁵. In the tweet corpus, in order to extract dialogs only, we selected threads in which only two users are involved. Next, we chose the top 400 quotes from each corpus by number of contexts, in order to reflect the characteristics of the quotes that appear frequently in the different corpora. Finally, we generated three datasets: the Twitter dataset, the Gutenberg dataset, and the Blog dataset.

Table 1. Number of contexts per quote in each dataset
Datasets   Avg  Std dev  Max    Min
Twitter    556  971      10764  15
Gutenberg  89   122      1366   14
Blog       230  543      5923   24

Table 1 shows the average, standard deviation, maximum, and minimum number of contexts per quote in each dataset. From Table 1, we see that the most frequently appearing quotes from each corpus cover a wide range of frequencies, which lets us study recommendation both for quotes with only a small number of contexts and for quotes with many. We divide each dataset into training, validation, and test sets in the proportion 8:1:1. We create the test sets by hiding the quotes that the contexts are paired with.

¹ https://en.wikiquote.org/
² Oxford University Press, 1998
³ http://www.libraryofquotes.com/
⁴ http://www.gutenberg.org/
⁵ http://icwsm.cs.umbc.edu/data/icwsm2009/

4.2 Evaluation Metric
We consider a plausible application that recommends only a few quotes. Since the position of the correct quote within such a short list is not important, we use Recall@k as our evaluation metric.

Recall@k: Since there is only one correct (hidden) quote for each query in the original test set, Recall@k is the number of cases in which the gold quote is recommended in the top-k results, divided by the total number of test cases. We set k to five.

4.3 Baselines and Parameter Settings
We compare our approaches with three state-of-the-art approaches from the quote and citation recommendation domains. Learning-to-recommend quote (LRQ) [7] is an algorithm for recommending quotes for writing. The context-aware relevance model (CRM) [1] and the citation translation model (CTM) [3] are algorithms for recommending citations for scientific papers. A popularity-based method (Popularity) and a cosine similarity-based method (Cosine similarity) are also adopted as baselines; these two methods are used to reveal the different levels of difficulty of the datasets. The methods are described in detail below.

LRQ exploits an existing learning-to-rank framework for quote recommendation with quote-based features, quote-query similarity features, and context-query similarity features.

CRM recommends quotes according to the average of the squared cosine similarities between the contexts of each quote and the query.

CTM recommends quotes according to the probability that the query context would be translated into the quote.

Popularity ranks the quotes according to their frequency in the contexts of the training set.

Cosine similarity ranks the quotes by comparing each individual context of a quote with the given query using a bag-of-words representation.

We implemented these methods and set their parameters to the optimum values specified in the respective papers. Specifically, we truncate each half-context (pre-context or post-context) to at most 150 characters for LRQ, 50 words for CRM, and one sentence for CTM, as the respective authors suggest. For our approaches, we set the length of the half-context to the value that gives the best result on the validation set: 1) 150 characters of pre-context and post-context with word truncation for context clustering and context lumping, 2) 50 words for RF, and 3) 30 words of pre-context for CNN and RNN. As stated in the introduction, we used the pre-context and post-context as the query for the Gutenberg and Blog datasets, and the pre-context only for the Twitter dataset. The hyperparameters of the single algorithms are set using the validation set.
For rank aggregation, we used the five proposed algorithms (context clustering, context lumping, RF, CNN, and RNN); top-k RM showed the best results with k=3.

4.4 Results and Discussions
The results of the experiments are listed in Table 2, which reports Recall@5 and the improvement ratio of each algorithm over the best baseline on each dataset. The individual algorithms context lumping and CNN outperform the baselines on all of the datasets, even without rank aggregation. Surprisingly, the simple context lumping method is the best performer on the Gutenberg and Blog datasets, beating LRQ by up to 35%. Context clustering outperforms CRM and Cosine similarity, which do not treat the contexts of a quote collectively. These results for context lumping and context clustering show the effectiveness of adjusting the context matching granularity. One can observe that the performance of the Cosine similarity baseline on the Twitter dataset is worse than on the Gutenberg and Blog datasets. This means that the sparsity problem is more serious on Twitter, where tweets contain more infrequent words. On the Twitter dataset, the deep learning algorithms (CNN and RNN) outperform CTM by up to 43%. From this result, we can see that the deep learning algorithms are able to mitigate this serious sparsity problem because they are not based on word matching. The results of RF show that it is competitive with CTM. In fact, in our preliminary experiments on a top-100 Twitter dataset, RF outperformed CNN; on the large dataset, however, the algorithm did not generalize as expected, which is an area for future investigation.

Table 2. Recall@5 of the different methods
Approaches          Twitter        Gutenberg      Blog
Context clustering  0.190 (-30%)   0.299 (-1%)    0.494 (0%)
Context lumping     0.286* (+5%)   0.409* (+35%)  0.521* (+5%)
RF                  0.244 (-11%)   0.246 (-19%)   0.470 (-5%)
CNN                 0.390* (+43%)  0.326* (+8%)   0.506 (+2%)
RNN                 0.389* (+42%)  0.294 (-3%)    0.473 (-4%)
RM RA               0.424* (+55%)  0.445* (+47%)  0.640* (+30%)
top-k RM RA (k=3)   0.436* (+60%)  0.451* (+49%)  0.648* (+31%)
LRQ                 0.196          0.302          0.494
CRM                 0.119          0.237          0.382
CTM                 0.273          0.257          0.441
Popularity          0.156          0.111          0.223
Cosine similarity   0.196          0.248          0.469
(* indicates that the algorithm outperforms the best baseline with a statistically significant increase at p < 0.01 in a two-tailed t-test)

Although some of our single algorithms outperform the others on specific datasets, no single algorithm outperforms all the others. Moreover, even within one dataset, there is a portion of queries on which each single recommendation algorithm is exclusively correct; see Table 3. These observations justify our motivation for adopting rank aggregation, and, as expected, the improvements attained by rank aggregation (RM RA and top-k RM RA) over the best baseline average 44.0% and 46.7%, respectively.

Table 3. Number of correct cases of the single algorithms on the Twitter dataset
Approaches        # correct cases (A)  # exclusively correct (B)  (B) / (A)
Context lumping   6,031 (0.286)        1,122                      0.186
RF                5,186 (0.244)        493                        0.095
CNN               8,213 (0.390)        935                        0.114
RNN               8,191 (0.389)        1,023                      0.125

In conclusion, although some of our single algorithms, such as context clustering and RF, do not outperform the baselines, there are cases in which each single algorithm is exclusively correct, and we believe our proposed rank aggregation method is able to exploit these cases.

5. CONCLUSIONS
In this paper, we tackled quote recommendation by exploring four single recommendation approaches that consider different aspects of the context, and we presented new rank aggregation methods for maximizing performance. On our datasets, we showed that the proposed algorithm (top-k RM RA) outperforms the best baseline by up to 46.7%. In the future, we plan to extend our research to recommending common phrases, which has wider applications in the real world.

6. REFERENCES
[1] He, Q., Pei, J., Kifer, D., Mitra, P., and Giles, C. L. 2010. Context-aware citation recommendation. In Proceedings of the 19th International Conference on World Wide Web (Raleigh, NC, USA, April 26-30). WWW '10. ACM, New York, NY, USA, 421-430. DOI=http://dx.doi.org/10.1145/1772690.1772734
[2] Hochreiter, S. and Schmidhuber, J. 1997. Long Short-Term Memory. Neural Computation (Nov. 1997), 1735-1780.
[3] Huang, W., Kataria, S., Caragea, C., Mitra, P., Giles, C. L., and Rokach, L. 2012. Recommending citations: translating papers into references. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management (Maui, HI, USA, October 29 - November 02, 2012). CIKM '12. ACM, New York, NY, USA, 1910-1914. DOI=http://dx.doi.org/10.1145/2396761.2398542
[4] Kim, Y. 2014. Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (Doha, Qatar, October 25-29, 2014). EMNLP '14. ACL.
[5] Liaw, A. and Wiener, M. 2002. Classification and regression by randomForest. R News 2(3), 18-22.
[6] Rangrej, A., Kulkarni, S., and Tendulkar, A. V. 2011. Comparative study of clustering techniques for short text documents. In Proceedings of the 20th International Conference Companion on World Wide Web (Hyderabad, India, March 28 - April 01, 2011). WWW '11. ACM, New York, NY, USA, 111-112. DOI=http://dx.doi.org/10.1145/1963192.1963249
[7] Tan, J., Wan, X., and Xiao, J. 2015. Learning to recommend quotes for writing. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (Austin, Texas, January 25-30, 2015). AAAI '15. AAAI Press, 2453-2459.
[8] Young, H. P. 1974. An axiomatization of Borda's rule. J. Econ. Theory 9, 1, 43-52.