Quote Recommendation for Dialogs and Writings

Yeonchan Ahn*, Hanbit Lee*, Heesik Jeon**, Seungdo Ha* and Sang-goo Lee*
*School of Computer Science and Engineering, Seoul National University, Seoul, Korea
**Artificial Intelligence Team, Samsung Electronics Co., Ltd., Seoul, Korea
*{skcheon, acha21, seungtto, sglee}@europa.snu.ac.kr, **heesik.jeon@samsung.com

ABSTRACT
Citing proverbs and (famous) statements of other people can provide support, shed new perspective, and/or add humor to one's arguments in writings or dialogs. Recommending a quote for a dialog or a piece of writing can be done by considering various features of the current text, which we call the context. We present five new approaches to quote recommendation: 1) methods that adjust the matching granularity for better context matching, 2) a random forest based approach that utilizes word discrimination, 3) a convolutional neural network based approach that captures important local semantic features, 4) a recurrent neural network based approach that reflects the ordering of sentences and words in the context, and 5) rank aggregation of these algorithms for maximum performance. As baselines, we adopt the state of the art in citation recommendation and quote recommendation. Experiments show that our rank aggregation method outperforms the best baseline by up to 46.7%. As candidate quotes, we use famous proverbs and famous statements of other people in dialogs and writings. The quotes and their contexts were extracted from Twitter, Project Gutenberg, and a Web blog corpus.

CCS Concepts
• Information systems ~ Recommender systems • Natural language processing

Keywords
Quote recommendation; Context matching; Random forest; Convolutional Neural Network; Recurrent Neural Network; Rank aggregation

1. INTRODUCTION
Citing proverbs and (famous) statements of other people is an important part of conversation and writing. Such quotes or quotations can provide support, shed new perspective, and/or add humor to one's arguments. However, it is not easy for a person to find an appropriate quote for a given context from a large number of candidates, since the words in a quote are usually metaphorical. Quote recommendation can be viewed as the task of searching for or recommending short texts that are appropriate to the current writing or dialog context.

[Figure 1. An example of quote usage in a Twitter thread]

On investigating our collected datasets, we found that various features of the context, such as keywords, topic, n-grams, and latent semantics, can be exploited in the recommendation. For example, a word matching-based algorithm that ranks quotes by the cosine similarity between the query and the contexts of each quote is able to find the correct quote in Figure 1, since many contexts of the same quote in the training dataset mention keywords such as casino and luck, while the contexts of other quotes do not. Also, some quotes are closely related to specific situations, topics, or the semantics behind the query, not only to its keywords.

In this paper, we present five new approaches for quote recommendation based on observations in our datasets: 1) methods that adjust the matching granularity for better context matching, 2) a Random Forest (RF) based approach that utilizes word discrimination, 3) a convolutional neural network (CNN) based approach that captures important local semantic features, 4) a recurrent neural network (RNN) based approach that reflects the ordering of sentences and words in the context, and 5) rank aggregation of these algorithms for maximum performance. As baselines, we adopt previous works on citation recommendation [1, 3] and quote recommendation [7]. Experiments show that the proposed approaches significantly outperform the baseline methods on real-world datasets.

CBRecSys 2016, September 16, 2016, Boston, MA, USA.
Copyright remains with the authors and/or original copyright holders.

2. RELATED WORKS
Quote recommendation in writing was introduced by Tan, et al. [7]. Quote recommendation is the task of recommending a ranked list of quotes that are relevant to the current body of text, which we call the context. We separate the context into pre-context and post-context, which refer to the text that appears before and after a quote within a certain fixed length, respectively. For dialogs, unlike for writings, we only use pre-contexts, because post-contexts are usually unavailable for on-the-fly recommendation of quotes during a conversation in real-world applications. We define a query as a context for which the user desires a list of recommended quotes. Figure 1 shows an example of quote usage in our Twitter dataset. In this example, the block of text that appears before the quote 'Strike while the iron is hot' is the pre-context.

Most related work is on citation recommendation for academic articles [1, 3], which recommends relevant reference articles for academic writing. For citation recommendation, rich information about each paper, such as its title, abstract, full text, and venue, can be exploited. In contrast, such rich information is not available in quote recommendation, which makes the task more challenging. Tan, et al. [7] were the first to present a method for recommending quotes. They apply a learning-to-rank approach with several features, namely quote-context, context-context (or context-query), and quote features. Their experiments show that the algorithm depends heavily on the context-context feature. However, we argue that the context-context features have not been explored sufficiently. We therefore focus on how to mine the semantics in the contexts of quotes for recommendation.
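The pre-context/post-context split described above can be sketched as follows (a minimal illustration; the function name and the fixed 150-character window are our assumptions, not the authors' code):

```python
def split_context(text, quote, window=150):
    """Split `text` around the first occurrence of `quote` into a
    fixed-length pre-context and post-context (either may be empty)."""
    pos = text.find(quote)
    if pos < 0:
        return None  # the quote does not occur in this text
    pre = text[max(0, pos - window):pos]
    post = text[pos + len(quote):pos + len(quote) + window]
    return pre, post

pre, post = split_context(
    "We should act now. Strike while the iron is hot. Tomorrow may be too late.",
    "Strike while the iron is hot.")
# `pre` holds the text before the quote, `post` the text after it.
```

For a dialog, only `pre` would be used as the query, since the post-context is not yet available during the conversation.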
3. APPROACHES
In this section, we describe four approaches and our rank aggregation method, which combines the four approaches for quote recommendation.

3.1 Matching Granularity Adjustment
In this section we discuss methods for handling the contexts of quotes when measuring the relevance between a query and the set of contexts of a quote, which we call matching granularity adjustment. Since the words used in a quote (and their usage) differ from those in its contexts, the state-of-the-art methods in quote/citation recommendation [1, 3, 7] measure the relevance between the query and the contexts of a quote. More specifically, all of them compare each individual context of a quote to the query. A drawback of this approach is that it suffers from a sparsity problem: the words in a query often do not match any individual context of the correct quote. In order to alleviate this sparsity problem, we propose methods that adjust the matching unit of the contexts to the given query. We believe that more semantics can be exploited if the contexts of a quote are treated collectively.

First, we propose a method called context clustering, which groups the contexts of a quote into context clusters that represent (latent) topics. In the collected dataset, we observed that a number of quotes can be used for different topics. For example, the quote 'All work and no play makes Jack a dull boy' can be used in very different situations, such as 'overworking in the workplace' or 'educating children'. Thus, when dealing with a query about a specific topic, we need to consider only the contexts related to that topic among the different topics of a quote. In context clustering, we first cluster the contexts of each quote, and then exploit the context clusters to measure the relevance of the quote. For clustering, we adopt the affinity propagation algorithm, which is known to perform better than other algorithms on short text clustering [6].
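A minimal sketch of this per-quote clustering step, using scikit-learn's `AffinityPropagation` on TF-IDF vectors (the toy contexts and default parameters are illustrative; the paper does not specify an implementation):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AffinityPropagation

# Toy contexts of one quote, drawn from two different topics.
contexts = [
    "my boss makes me work every weekend",
    "overtime again, no rest at this workplace",
    "kids need playtime, not only homework",
    "let children play after school",
]

# Represent each context as a bag-of-words TF-IDF vector.
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(contexts)

# Affinity propagation chooses the number of clusters itself,
# which is convenient since the topic count per quote is unknown.
ap = AffinityPropagation(random_state=0).fit(X.toarray())
labels = ap.labels_  # cluster id of each context
```

The contexts sharing a label would then be concatenated to form one matching unit per cluster.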
Based on context clustering, we propose a scoring function for a given query q:

score(q, t) = max_i sim(q, c_i(t))

where c_i(t) is the concatenated text of the i-th context cluster of quote t, and sim(·,·) is the cosine similarity between the TF-IDF vector representations of its arguments.

In order to further address the sparsity problem, we present another method, called context lumping, to adjust the matching granularity. In context lumping, we simply concatenate all the contexts of each quote and use the result as the matching unit for the query. The lumped context of a quote is then compared to the query by the cosine similarity of their TF-IDF vector representations. In both context clustering and context lumping, quotes are sorted by the proposed similarity scores in descending order.

3.2 Random Forest
In the collected dataset, we observe that some simple rules, such as checking whether the given context contains certain words, are reliable cues to its correct label. For example, in the Twitter dataset, given that a context contains the keywords invite, join, come over, or any of their morphemes, there is a 40.2% probability that the context is labeled with the proverb 'the more the merrier'. From this observation, we explore the possibility of adopting a tree-based classification algorithm for the quote recommendation task. Among various decision tree algorithms, RF [5] is an ensemble learning method that has had notable success in various fields due to its resilience to over-fitting and its tendency to exhibit low variance and bias. RF constructs decision trees by training each tree on a random subset of the features, which lets each tree learn the most discriminating quotes at each split, and aggregates the results by voting. In the case of our dataset, we view contexts as 'documents' and use bag-of-words TF-IDF values as the features of each context. Then, we train the Random Forest classifier using the TF-IDF vectors and their correct labels, i.e., quotes. To the best of our knowledge, this is the first time RF classification has been used for quote recommendation.

3.3 Convolutional Neural Network
Word matching-based methods such as the context-aware relevance model [1] and the citation translation model [3] have difficulty exploiting n-gram features because of the sparsity problem, so they only use unigram-based features. However, n-gram features are important because many phrases are meaningful only when their terms stay together. For example, the phrasal verb give up loses its meaning when it is tokenized into give and up. Unlike matching-based methods, a CNN-based approach can exploit important n-gram features in the context by learning the parameters of fixed-size filters over n-grams. Generally, a CNN is composed of several pairs of convolution and max-pooling layers, which capture local patterns in the training examples and down-sample the extracted features in order to prevent overfitting. When a CNN is applied to natural language sentences, it captures significant local semantics, i.e., n-grams. We adopted a single-layer CNN, mainly inspired by [4], which reports that a simple CNN model performs similarly to complex models with several convolution-pooling layers, to capture distinguishing n-gram features in the contexts of quotes. Our CNN model takes a context in the form of a list of word embedding vectors for the words in the context. This input matrix is fed to a layer composed of a single convolution layer and a max-pooling layer, after which the output vector is fed to a fully connected softmax layer that computes the probability of each candidate quote and ranks the quotes. We use a filter size of 3 and 500 hidden nodes in the hidden layer, and we apply dropout to prevent overfitting.
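A minimal numpy sketch of this single-layer architecture on one context (randomly initialized weights and toy dimensions; the real model uses 500 filters of width 3 over pretrained embeddings and is trained with dropout, both omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)

emb_dim, n_filters, width, n_quotes, seq_len = 8, 4, 3, 5, 10

X = rng.normal(size=(seq_len, emb_dim))           # context as word embeddings
W = rng.normal(size=(n_filters, width, emb_dim))  # convolution filters (width-3 n-grams)
U = rng.normal(size=(n_quotes, n_filters))        # fully connected softmax layer

# Convolution: slide each filter over every 3-gram window of the context.
conv = np.array([[np.sum(W[f] * X[i:i + width])
                  for i in range(seq_len - width + 1)]
                 for f in range(n_filters)])
pooled = conv.max(axis=1)          # max-pooling over time

logits = U @ pooled
probs = np.exp(logits - logits.max())
probs /= probs.sum()               # probability of each candidate quote
ranking = np.argsort(-probs)       # quote ids, best first
```

Training would fit `W`, `U`, and optionally the embeddings by minimizing cross-entropy against the correct quote label.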
3.4 Recurrent Neural Network
We use an RNN to tackle the quote recommendation problem from the perspective of language modeling: we treat each quote as a special token or word and compute its probability given the context. While none of the above approaches uses the order information of the words in the context, an RNN-based approach can model such word sequences recursively. We use long short-term memory (LSTM) units [2], a recurrent neural network whose three gates (forget, input, output) control the network so that it learns long-term dependencies without loss of information. The input vector of each time step passes through the three gates and updates the latent vectors that the LSTM retains. In our model, we recurrently feed the LSTM the sequence of words of the context in the form of a list of word embedding vectors, using pre-trained word embeddings to map each word to its vector. The output vector of the LSTM layer is passed to a fully connected layer and a softmax layer in order to compute the probability of each target quote to be recommended. We use a 500-dimensional hidden vector in the LSTM and also apply dropout.

3.5 Rank Aggregation
We observed that the previously proposed algorithms produce different recommendation results depending on the query (we will discuss this in the experiments section). This suggests that instead of relying on the single best ranking algorithm, it is better to aggregate the rank values of all of the single algorithms to produce an accurate and robust ranking, which is called rank aggregation (RA). We propose two methods that aggregate the individual ranking results of the previously proposed algorithms. The traditional rank aggregation method Borda [8] assigns each candidate quote a score inversely proportional to its position in the ranked list of each individual algorithm, and the scores of each quote are added up into its final score. We observed that Borda cannot handle the case where one or two inaccurate rankings from individual algorithms lower the accuracy of the final aggregated rank.
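The Borda scheme just described can be sketched as follows (toy ranked lists; ties broken arbitrarily):

```python
from collections import defaultdict

def borda(rankings):
    """Aggregate ranked lists of quote ids.

    Each list of length n contributes (n - position) points to a quote,
    so earlier positions earn higher scores."""
    scores = defaultdict(int)
    for ranked in rankings:
        n = len(ranked)
        for pos, quote in enumerate(ranked):
            scores[quote] += n - pos
    return sorted(scores, key=scores.get, reverse=True)

# Three individual rankers over quotes A, B, C.
final = borda([["A", "B", "C"], ["B", "A", "C"], ["A", "C", "B"]])
# final == ["A", "B", "C"]
```

Because the scores are simply summed, one badly wrong ranker can still drag a correct quote down, which motivates the multiplicative variants below.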
In order to cope with this issue, we propose a rank aggregation method called Rank Multiplication (RM), which multiplies the ranks assigned to each quote by the individual algorithms. With this method, a quote keeps a high final rank only when all of the individual rankers consistently rank it high, and the result of an inaccurate ranking algorithm is given less weight. The final RM score is defined as follows:

score_RM(q, t) = 1 / ∏_{a ∈ A} r(a, q, t)

where r(a, q, t) is the position of candidate quote t in the ranked list of individual ranking algorithm a given query q, and A is the set of algorithms. The quotes are ordered by this score in descending order.

We assume that the high ranks of the individual ranking algorithms are more dependable than their lower ranks. From this assumption, we propose a second rank aggregation method, called top-k Rank Multiplication (top-k RM), that multiplies only the k rank values of a quote from the k single algorithms that rank it highest for a query. The final top-k RM score is defined as follows:

score_topk(q, t) = 1 / ∏_{a ∈ A_k(q, t)} r(a, q, t)

where A_k(q, t) is the set of k algorithms that yield the k highest rank positions given query q and quote t.
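A minimal sketch of the RM and top-k RM scores for one quote (1-based rank positions; the toy values are illustrative):

```python
import math

def rm_score(positions):
    """Rank Multiplication: inverse of the product of all rank positions."""
    return 1.0 / math.prod(positions)

def topk_rm_score(positions, k):
    """top-k RM: multiply only the k best (smallest) rank positions."""
    return 1.0 / math.prod(sorted(positions)[:k])

# A quote placed at positions 1, 2 and 40 by three rankers:
rm = rm_score([1, 2, 40])              # one inaccurate ranker hurts the score
topk = topk_rm_score([1, 2, 40], k=2)  # the outlier position is ignored
```

Quotes are then sorted by these scores in descending order, exactly as with the cosine-based scores of Section 3.1.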
4. EXPERIMENTS
4.1 Data Construction
We collected 439,655 quotes from three sources: Wikiquote¹, the Oxford Concise Dictionary of Proverbs², and the Library of Quotes³. For the context data, we searched for blocks of text containing these quotes in three different corpora: 2 million tweet threads from Twitter (~2015.11.15), 20GB of electronic books from the Project Gutenberg database⁴, and the 190GB ICWSM Spinn3r 2009 blog dataset⁵. In the tweet corpus, in order to extract dialogs only, we selected threads in which only two users are involved. Next, we chose the top 400 quotes from each corpus by number of contexts, in order to reflect the characteristics of the quotes that appear frequently in the different corpora. Finally, we generated three datasets: the Twitter dataset, the Gutenberg dataset, and the Blog dataset.

Table 1. Number of contexts per quote in each dataset
Datasets   Avg  Std dev  Max    Min
Twitter    556  971      10764  15
Gutenberg  89   122      1366   14
Blog       230  543      5923   24

Table 1 shows the average, standard deviation, maximum, and minimum number of contexts per quote in each dataset. From Table 1, we see that the most frequently appearing quotes from each corpus cover a wide range of frequencies, which lets us study recommendation both for quotes with only a small number of contexts and for quotes with many. We divide each dataset into training, validation, and test sets in the proportion 8:1:1. We create the test sets by hiding the quotes that the contexts are paired with.

¹ https://en.wikiquote.org/
² Oxford University Press, 1998
³ http://www.libraryofquotes.com/
⁴ http://www.gutenberg.org/
⁵ http://icwsm.cs.umbc.edu/data/icwsm2009/

4.2 Evaluation Metric
We consider a plausible application that recommends only a few quotes. Since the position of the correct quote within such a short list is not important, we use Recall@k as our evaluation metric.

Recall@k: Since there is only one correct (hidden) quote for each query in the original test set, Recall@k is the number of cases in which the gold quote is recommended in the top-k results, divided by the total number of test cases. We set k to five.

4.3 Baselines and Parameter Settings
We compare our approaches with three state-of-the-art approaches from the quote and citation recommendation domains. Learning-to-recommend quote (LRQ) [7] is an algorithm for recommending quotes for writing. The context-aware relevance model (CRM) [1] and the citation translation model (CTM) [3] are algorithms for recommending citations for scientific papers. A popularity-based method (Popularity) and a cosine similarity-based method (Cosine similarity) are also adopted as baselines; these two methods are used to reveal the different levels of difficulty of the datasets. The methods are described in detail below.

LRQ exploits an existing learning-to-rank framework for quote recommendation with quote-based features, quote-query similarity features, and context-query similarity features.

CRM recommends quotes according to the average of the squared cosine similarities between the contexts of each quote and the query.

CTM recommends quotes according to the probability that the query context would be translated into the quote.

Popularity ranks the quotes according to their frequency in the contexts of the training set.

Cosine similarity ranks the quotes by comparing each individual context of a quote with the given query using a bag-of-words representation.

We implemented these methods and set their parameters to the optimum values specified in the respective papers. Specifically, we truncate each half-context (pre-context or post-context) to at most 150 characters for LRQ, 50 words for CRM, and one sentence for CTM, as the respective authors suggest. For our approaches, we set the length of the half-context to the value that gives the best result on the validation set: 1) 150 characters of pre-context and post-context with word truncation for context clustering and context lumping, 2) 50 words for RF, and 3) 30 words of pre-context for CNN and RNN. As stated in the introduction, we used the pre-context and post-context as the query for the Gutenberg and Blog datasets, and the pre-context only for the Twitter dataset. The hyperparameters of the single algorithms are set using the validation set.
For rank aggregation, we used the five proposed algorithms (context clustering, context lumping, RF, CNN, and RNN); top-k RM showed the best results with k=3.

4.4 Results and Discussions
The results of the experiments are listed in Table 2, which reports Recall@5 and the improvement ratio of each algorithm over the best baseline on each dataset. The individual algorithms context lumping and CNN outperform the baselines on all of the datasets, even without rank aggregation. Surprisingly, the simple context lumping method is the best performer on the Gutenberg and Blog datasets, beating LRQ by up to 35%. Context clustering outperforms CRM and Cosine similarity, which do not treat the contexts of a quote collectively. These results for context lumping and context clustering show the effectiveness of adjusting the context matching granularity. One can observe that the performance of the Cosine similarity baseline on the Twitter dataset is worse than on the Gutenberg and Blog datasets. This means that the sparsity problem is more serious on Twitter, where tweets contain more infrequent words. On the Twitter dataset, the deep learning algorithms (CNN and RNN) outperform CTM by up to 43%. From this result, we can see that the deep learning algorithms are able to mitigate this serious sparsity problem because they are not based on word matching. The results of RF show that it is competitive with CTM. In fact, in our preliminary experiments on a top-100 Twitter dataset, RF outperformed CNN; on the large dataset, however, the algorithm did not generalize as expected, which is an area for future investigation.

Table 2. Recall@5 of the different methods
Approaches          Twitter        Gutenberg      Blog
Context clustering  0.190 (-30%)   0.299 (-1%)    0.494 (0%)
Context lumping     0.286* (+5%)   0.409* (+35%)  0.521* (+5%)
RF                  0.244 (-11%)   0.246 (-19%)   0.470 (-5%)
CNN                 0.390* (+43%)  0.326* (+8%)   0.506 (+2%)
RNN                 0.389* (+42%)  0.294 (-3%)    0.473 (-4%)
RM RA               0.424* (+55%)  0.445* (+47%)  0.640* (+30%)
top-k RM RA (k=3)   0.436* (+60%)  0.451* (+49%)  0.648* (+31%)
LRQ                 0.196          0.302          0.494
CRM                 0.119          0.237          0.382
CTM                 0.273          0.257          0.441
Popularity          0.156          0.111          0.223
Cosine similarity   0.196          0.248          0.469
(* indicates that the algorithm outperforms the best baseline with a statistically significant increase at p < 0.01 in a two-tailed t-test)

Although some of our single algorithms outperform the others on specific datasets, no single algorithm outperforms all the others. Moreover, even within one dataset, there is a portion of queries on which each single recommendation algorithm is exclusively correct; see Table 3. These observations justify our motivation for adopting rank aggregation, and, as expected, the improvements attained by rank aggregation (RM RA and top-k RM RA) over the best baseline average 44.0% and 46.7%, respectively.

Table 3. Number of correct cases of the single algorithms on the Twitter dataset
Approaches        # correct cases (A)  # exclusively correct (B)  (B) / (A)
Context lumping   6,031 (0.286)        1,122                      0.186
RF                5,186 (0.244)        493                        0.095
CNN               8,213 (0.390)        935                        0.114
RNN               8,191 (0.389)        1,023                      0.125

In conclusion, although some of our single algorithms, such as context clustering and RF, do not outperform the baselines, there are cases in which each single algorithm is exclusively correct, and we believe our proposed rank aggregation method is able to exploit these cases.

5. CONCLUSIONS
In this paper, we tackled quote recommendation by exploring four single recommendation approaches that consider different aspects of the context, and we presented new rank aggregation methods for maximizing performance. On our datasets, we showed that the proposed algorithm (top-k RM RA) outperforms the best baseline by up to 46.7%. In the future, we plan to extend our research to recommending common phrases, which has wider applications in the real world.

6. REFERENCES
[1] He, Q., Pei, J., Kifer, D., Mitra, P., and Giles, C. L. 2010. Context-aware citation recommendation. In Proceedings of the 19th International Conference on World Wide Web (Raleigh, NC, USA, April 26-30). WWW '10. ACM, New York, NY, USA, 421-430. DOI=http://dx.doi.org/10.1145/1772690.1772734
[2] Hochreiter, S. and Schmidhuber, J. 1997. Long Short-Term Memory. Neural Computation (Nov. 1997), 1735-1780.
[3] Huang, W., Kataria, S., Caragea, C., Mitra, P., Giles, C. L., and Rokach, L. 2012. Recommending citations: translating papers into references. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management (Maui, HI, USA, October 29 - November 02, 2012). CIKM '12. ACM, New York, NY, USA, 1910-1914. DOI=http://dx.doi.org/10.1145/2396761.2398542
[4] Kim, Y. 2014. Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (Doha, Qatar, October 25-29, 2014). EMNLP '14. ACL.
[5] Liaw, A. and Wiener, M. 2002. Classification and regression by randomForest. R News 2(3), 18-22.
[6] Rangrej, A., Kulkarni, S., and Tendulkar, A. V. 2011. Comparative study of clustering techniques for short text documents. In Proceedings of the 20th International Conference Companion on World Wide Web (Hyderabad, India, March 28 - April 01, 2011). WWW '11. ACM, New York, NY, USA, 111-112. DOI=http://dx.doi.org/10.1145/1963192.1963249
[7] Tan, J., Wan, X., and Xiao, J. 2015. Learning to recommend quotes for writing. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (Austin, Texas, January 25-30, 2015). AAAI '15. AAAI Press, 2453-2459.
[8] Young, H. P. 1974. An axiomatization of Borda's rule. J. Econ. Theory 9, 1, 43-52.