=Paper=
{{Paper
|id=Vol-1673/paper7
|storemode=property
|title=Quote Recommendation for Dialogs and Writings
|pdfUrl=https://ceur-ws.org/Vol-1673/paper7.pdf
|volume=Vol-1673
|authors=Yeonchan Ahn,Hanbit Lee,Heesik Jeon,Seungdo Ha,Sang-Goo Lee
|dblpUrl=https://dblp.org/rec/conf/recsys/AhnLJHL16
}}
==Quote Recommendation for Dialogs and Writings==
Yeonchan Ahn*, Hanbit Lee*, Heesik Jeon**, Seungdo Ha* and Sang-goo Lee*
*School of Computer Science and Engineering, Seoul National University, Seoul, Korea
**Artificial Intelligence Team, Samsung Electronics Co., Ltd., Seoul, Korea
*{skcheon, acha21, seungtto, sglee}@europa.snu.ac.kr, **heesik.jeon@samsung.com
ABSTRACT
Citing proverbs and (famous) statements of other people can provide support, shed new perspective, and/or add humor to one's arguments in writings or dialogs. Recommending a quote for a dialog or a piece of writing can be done by considering various features of the current text, which we call the context. We present five new approaches to quote recommendation: 1) methods to adjust the matching granularity for better context matching, 2) a random forest based approach that utilizes word discrimination, 3) a convolutional neural network based approach that captures important local semantic features, 4) a recurrent neural network based approach that reflects the ordering of sentences and words in the context, and 5) rank aggregation of these algorithms for maximum performance.
As baselines, we adopt the state of the art in citation recommendation and quote recommendation. Experiments show that our rank aggregation method outperforms the best baseline by up to 46.7%. As candidate quotes, we use famous proverbs and famous statements of other people in dialogs and writings. The quotes and their contexts were extracted from Twitter, Project Gutenberg, and a Web blog corpus.

CCS Concepts
• Information systems ~ Recommender systems • Natural language processing.

Keywords
Quote recommendation; Context matching; Random forest; Convolutional Neural Network; Recurrent Neural Network; Rank aggregation

CBRecSys 2016, September 16, 2016, Boston, MA, USA. Copyright remains with the authors and/or original copyright holders.

1. INTRODUCTION
Citing proverbs and (famous) statements of other people is an important part of conversation and writing. Such quotes or quotations can provide support, shed new perspective, and/or add humor to one's arguments. However, it is not easy for a person to find, from a large number of quotes, an appropriate one for a given context, since the words in a quote are usually metaphorical.

Quote recommendation in writing was introduced by Tan et al. [7]. Quote recommendation is the task of recommending a ranked list of quotes that are relevant to the current body of text, which we call the context. We separate the context into pre-context and post-context, which refer to the texts that appear before and after a quote within a certain fixed length, respectively. For dialogs, unlike for writings, we only use pre-contexts because post-contexts are usually unavailable for on-the-fly recommendation of quotes during a conversation in real-world applications. We define the query as a context for which the user desires a list of recommended quotes. Figure 1 shows an example of quote usage in our Twitter dataset. In this example, the block of text that appears before the quote 'Strike while the iron is hot' is the pre-context.

[Figure 1: An example of quote usage in a Twitter thread]

On investigating our collected datasets, we found that various features of the context, such as keywords, topic, n-grams, and latent semantics, can be exploited in the recommendation. For example, a word matching-based algorithm, such as ranking by cosine similarity between the query and the contexts of a quote, was able to find the correct quote in Figure 1, since many contexts of the same quote in the training dataset mention keywords such as casino and luck while the contexts of other quotes do not. Also, some quotes are closely related to specific situations, topics, or the semantics behind the query rather than to keywords alone.

In this paper, we present five new approaches for quote recommendation based on observations in our datasets: 1) methods to adjust the matching granularity for better context matching, 2) a Random Forest (RF) based approach that utilizes word discrimination, 3) a convolutional neural network (CNN) based approach that captures important local semantic features, 4) a recurrent neural network (RNN) based approach that reflects the ordering of sentences and words in the context, and 5) rank aggregation of these algorithms for maximum performance. As baselines, we adopt previous works on citation recommendation [1, 3] and quote recommendation [7]. Experiments show that the proposed approaches significantly outperform the baseline methods on real-world datasets.

2. RELATED WORKS
Quote recommendation can be viewed as the task of searching for, or recommending, short texts that are appropriate to the given writing or dialog context. The most closely related work is citation recommendation for academic articles [1, 3], which recommends relevant reference articles for academic writing. For citation recommendation, rich information about a paper, such as its title, abstract, full text, and venue, can be exploited. In contrast, such rich information is not available in quote recommendation, which makes the task more challenging. Tan et al. [7] present the first method for recommending quotes. They apply a learning-to-rank approach with several features, namely quote-context, context-context (or context-query), and quote features. In their experiments, they show that the algorithm heavily depends on the context-context features. However, we argue that the context-context features have not been explored sufficiently. For this reason, we focus on how to mine the semantics in the contexts of quotes for recommendation.
3. APPROACHES
In this section, we describe four approaches and our rank aggregation method, which combines the four approaches for quote recommendation.

3.1 Matching Granularity Adjustment
In this section we discuss methods to deal with the contexts of quotes when measuring the relevance between a query and a set of contexts of quotes, which we call matching granularity adjustment. As the usage of words, or the words themselves, in a quote differ from those in its contexts, the state-of-the-art methods in quote/citation recommendation [1, 3, 7] measure the relevance between the query and the contexts of a quote. More specifically, all of them compare each individual context of a quote to the query. A drawback of this approach is that it suffers from a sparsity problem: the words in the query may not match any individual context of the correct quote. In order to alleviate this sparsity problem, we propose methods to adjust the matching unit of contexts to the given query. We believe that more semantics can be exploited if the contexts of a quote are treated collectively.

Firstly, we propose a method called context clustering, which groups the contexts into context clusters that represent (latent) topics. In the collected dataset, we observed that a number of quotes can be used in different topics. For example, the quote 'All work and no play makes Jack a dull boy' can be used in very different situations such as 'overworking in the workplace' or 'educating children'. Thus, when dealing with a query about a specific topic, we need to consider the contexts related to it among the different topics of a quote. In context clustering, we first cluster the contexts of each quote. We then exploit the context clusters to measure the relevance of a quote. For context clustering, we adopt the affinity propagation clustering algorithm, which is known to perform better than others in short-text clustering [6]. Based on context clustering, we propose a scoring function given a query q:

score(q, t) = max_i cos(q, c_i^t),

where c_i^t is the concatenated text of the i-th context cluster of quote t, and cos is the cosine similarity between their TF-IDF vector representations.

In order to solve the sparsity problem, we present another method called context lumping to adjust the matching granularity. In context lumping, we simply concatenate all the contexts of each quote and make the result a single matching unit for the query. The lumped context of a quote is then compared to the query with cosine similarity over TF-IDF vector representations. In both context clustering and context lumping, quotes are sorted by the proposed similarities in descending order.
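As an illustration, the following is a minimal sketch of the two granularity adjustments using scikit-learn. The data layout (a dict mapping each quote to its list of training contexts), the default TfidfVectorizer settings, and the use of AffinityPropagation on a precomputed cosine-similarity matrix are our assumptions for exposition, not the authors' implementation.

```python
from sklearn.cluster import AffinityPropagation
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def rank_by_context_lumping(query, contexts_by_quote):
    """Context lumping: concatenate all contexts of a quote into one document."""
    quotes = list(contexts_by_quote)
    lumped = [" ".join(contexts_by_quote[q]) for q in quotes]
    vec = TfidfVectorizer()
    doc_matrix = vec.fit_transform(lumped)
    scores = cosine_similarity(vec.transform([query]), doc_matrix).ravel()
    return sorted(zip(quotes, scores), key=lambda x: -x[1])


def rank_by_context_clustering(query, contexts_by_quote):
    """Context clustering: score a quote by its best-matching cluster of contexts."""
    vec = TfidfVectorizer()
    vec.fit([c for ctxs in contexts_by_quote.values() for c in ctxs])
    query_vec = vec.transform([query])
    scored = []
    for quote, ctxs in contexts_by_quote.items():
        # Affinity propagation over a precomputed cosine-similarity matrix of the contexts.
        labels = AffinityPropagation(affinity="precomputed").fit_predict(
            cosine_similarity(vec.transform(ctxs)))
        best = 0.0
        for lab in set(labels):
            cluster_text = " ".join(c for c, l in zip(ctxs, labels) if l == lab)
            best = max(best, cosine_similarity(query_vec, vec.transform([cluster_text]))[0, 0])
        scored.append((quote, best))
    return sorted(scored, key=lambda x: -x[1])
```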
3.2 Random Forest
In the collected dataset, we observe that some simple rules, such as checking whether the given context contains certain words, are reliable cues to its correct label. For example, in the Twitter dataset, given that a context contains the keywords invite, join, come over, or any of their morphemes, there is a 40.2% probability that the context is labeled with the proverb 'the more the merrier'. From this observation, we explore the possibility of adopting a tree-based classification algorithm for the quote recommendation task. Among various decision tree algorithms, RF [5] is an ensemble learning method that has had notable success in various fields due to its resilience to over-fitting and its tendency to exhibit low variance and bias. RF constructs decision trees by training each tree on samples of a random subset of the features, populating each tree with the most discriminating quotes at each stage, and aggregates the results by voting. In the case of our dataset, we view contexts as 'documents' and use bag-of-words TF-IDF as the features of each context. Then, we train the Random Forest classifier using the TF-IDF vectors and their correct labels, i.e., quotes. To the best of our knowledge, this is the first time RF classification has been used for quote recommendation.
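As a rough illustration of this setup, the sketch below trains a scikit-learn RandomForestClassifier on TF-IDF context vectors and ranks quotes by predicted class probability. The helper names and the number of trees are illustrative assumptions, not values from the paper.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer


def train_rf_recommender(train_contexts, train_quotes):
    """train_contexts: list of context strings; train_quotes: the quote label of each context."""
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(train_contexts)       # bag-of-words TF-IDF features
    forest = RandomForestClassifier(n_estimators=200)  # n_estimators is an illustrative choice
    forest.fit(X, train_quotes)
    return vectorizer, forest


def rank_quotes(vectorizer, forest, query_context):
    """Rank candidate quotes by the forest's predicted class probabilities."""
    probs = forest.predict_proba(vectorizer.transform([query_context]))[0]
    return sorted(zip(forest.classes_, probs), key=lambda x: -x[1])
```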
3.3 Convolutional Neural Network
Word matching-based methods such as the context-aware relevance model [1] and the citation translation model [3] have difficulty exploiting n-gram features because of the sparsity problem, so they only use unigram-based features. But n-gram features are important because many phrases are meaningful only when their terms stay together. For example, the phrasal verb give up loses its meaning when it is tokenized into give and up. Unlike matching-based methods, a CNN-based approach can exploit important n-gram features in the context by learning the parameters of fixed-size filters for each n-gram. Generally, a CNN is composed of several pairs of convolution and max-pooling layers, which capture local patterns from the training examples and down-sample the extracted features in order to prevent overfitting. When a CNN is applied to natural language sentences, it captures the significant local semantics, i.e., n-grams. We adopted a single-layer CNN, mainly inspired by [4], which reports that a simple CNN model performs similarly to a complex one with several convolution-pooling layers, in order to capture the distinguishing n-gram features in the contexts of quotes. Our CNN model takes a context in the form of a list of word embedding vectors of the words in the context. The input matrix, i.e., the list of word vectors of a context, is fed to a layer composed of a single convolution layer and a max-pooling layer. After that, the output vector is fed to a fully connected softmax layer in order to compute the probabilities of the candidate quotes and rank them. We use a filter size of 3 and 500 hidden nodes in the hidden layer. We also use dropout to prevent overfitting.
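A minimal Keras sketch of such a single-layer CNN (filter width 3, 500 feature maps, dropout, softmax over the 400 candidate quotes) is given below. The vocabulary size and embedding dimension are placeholders, and the framework choice is ours, not the authors'.

```python
from tensorflow.keras import layers, models

MAX_LEN, VOCAB_SIZE, EMB_DIM, NUM_QUOTES = 30, 50000, 300, 400  # VOCAB_SIZE and EMB_DIM are placeholders

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),                  # a context as a padded sequence of word ids
    layers.Embedding(VOCAB_SIZE, EMB_DIM),           # initialised from pre-trained word vectors in practice
    layers.Conv1D(500, kernel_size=3, activation="relu"),  # width-3 (trigram) filters, 500 feature maps
    layers.GlobalMaxPooling1D(),                     # max-over-time pooling
    layers.Dropout(0.5),
    layers.Dense(NUM_QUOTES, activation="softmax"),  # probability over candidate quotes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```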
3.4 Recurrent Neural Network
We use an RNN to tackle our quote recommendation problem from the perspective of language modeling, meaning that we treat each quote as a special token or word and compute its probability given the context. While none of the above approaches uses the order information of the words in the context, an RNN-based approach can model such a sequence of words recursively. We use long short-term memory (LSTM) units [2], a recurrent neural network consisting of three gates (forget, input, output) that control the network so that it learns long-term dependencies without loss of information. The input vector of each time step passes through the three gates and updates the latent vectors that the LSTM retains. In our model, we recurrently feed the LSTM with the sequence of words in the context in the form of a list of word embedding vectors. We use pre-trained word embeddings to map each word to a word vector. The output vector of the LSTM layer is passed to a fully connected layer and a softmax layer in order to compute the probabilities of the target quotes to be recommended. We use a 500-dimensional hidden vector in the LSTM and also use dropout.
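Under the same assumptions as the CNN sketch (placeholder vocabulary size and embedding dimension, framework choice ours), a comparable Keras sketch of the LSTM model is:

```python
from tensorflow.keras import layers, models

MAX_LEN, VOCAB_SIZE, EMB_DIM, NUM_QUOTES = 30, 50000, 300, 400  # VOCAB_SIZE and EMB_DIM are placeholders

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    layers.Embedding(VOCAB_SIZE, EMB_DIM),  # mapped from pre-trained word embeddings in the paper
    layers.LSTM(500),                       # 500-dimensional hidden state; gates handle long-term dependencies
    layers.Dropout(0.5),
    layers.Dense(NUM_QUOTES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```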
3.5 Rank Aggregation
We observed that the previously proposed algorithms show different recommendation results depending on the query (we will discuss this in the experiment section). This suggests that instead of relying on the single best ranking algorithm, it is better to aggregate the rank values of all of the single algorithms to produce an accurate and robust ranking, which is called rank aggregation (RA).

We propose two methods that can aggregate the individual ranking results of the previously proposed algorithms. The traditional rank aggregation method Borda [8] assigns to a candidate quote a score inversely proportional to its position in the ranked list of each individual algorithm, and the scores of each quote are added up to produce the final score. We observed that Borda cannot handle the case where one or two inaccurate ranks from individual algorithms lower the accuracy of the final aggregated rank. In order to cope with this issue, we propose a rank aggregation method called Rank Multiplication (RM), which multiplies the ranks of each quote submitted by the individual algorithms. With this method, quotes that all of the individual rankers consistently rank high keep a high score, while the results of an inaccurate ranking algorithm are given less weight. The final score using RM is defined as follows:

score_RM(q, t) = 1 / ∏_{a ∈ A} r_a(q, t),

where r_a(q, t) is the position of candidate quote t in the ranked list of the a-th individual ranking algorithm given query q, and A is the set of individual algorithms. The quotes are ordered by this score in descending order.

We assume that high ranks from the individual ranking algorithms are more dependable than lower ranks. From this assumption we propose a second rank aggregation method, called top-k Rank Multiplication (top-k RM), which multiplies only the k rank values of a quote coming from the k single algorithms that rank it highest for a query. The final score of top-k RM is defined as follows:

score_topk(q, t) = 1 / ∏_{a ∈ A_k(q,t)} r_a(q, t),

where A_k(q, t) is the set of k algorithms that yield the k highest rank positions given query q and quote t.
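The following sketch illustrates RM and top-k RM under the assumption that each single algorithm supplies a dict mapping every candidate quote to its rank position (1 = best) for the current query; the data layout is ours, for exposition only.

```python
import math


def rank_multiplication(rank_positions):
    """RM: score(t) = 1 / prod_a rank_a(t).
    rank_positions is a list of dicts, one per single algorithm, mapping quote -> rank (1 = best)."""
    quotes = rank_positions[0].keys()
    scores = {t: 1.0 / math.prod(r[t] for r in rank_positions) for t in quotes}
    return sorted(quotes, key=scores.get, reverse=True)


def top_k_rank_multiplication(rank_positions, k=3):
    """top-k RM: multiply only the k best (smallest) ranks each quote receives."""
    quotes = rank_positions[0].keys()
    scores = {t: 1.0 / math.prod(sorted(r[t] for r in rank_positions)[:k]) for t in quotes}
    return sorted(quotes, key=scores.get, reverse=True)
```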
4. EXPERIMENTS

4.1 Data Construction
We have collected 439,655 quotes from three sources: Wikiquote (https://en.wikiquote.org/), the Oxford Concise Dictionary of Proverbs (Oxford University Press, 1998), and Library of Quotes (http://www.libraryofquotes.com/). For the context data, we searched for blocks of text that contain these quotes in three different corpora: 2 million tweet threads from Twitter (collected up to 2015.11.15), 20GB of electronic books from the Project Gutenberg Database (http://www.gutenberg.org/), and the 190GB ICWSM Spinn3r 2009 blog dataset (http://icwsm.cs.umbc.edu/data/icwsm2009/). In the tweet corpus, in order to extract dialogs only, we selected threads in which only two users are involved. Next, we chose the top 400 quotes from each corpus according to the number of contexts, in order to reflect the characteristics of the quotes that appear frequently in the different corpora. Finally, we generate three datasets: the Twitter dataset, the Gutenberg dataset, and the Blog dataset.

Table 1 Number of contexts per quote in each dataset
Datasets    Avg    Std dev    Max     Min
Twitter     556    971        10764   15
Gutenberg   89     122        1366    14
Blog        230    543        5923    24

Table 1 shows the number of contexts per quote in each dataset, reporting the average, standard deviation, maximum, and minimum number of contexts per quote. From Table 1, we see that the most frequently appearing quotes from each corpus cover a wide range of context frequencies, which lets us examine recommending quotes from a small number of contexts as well as from a large number of contexts. We divide each dataset into training, validation, and test sets in the proportion 8:1:1. We create the test sets by hiding the quotes with which the contexts are paired.

4.2 Evaluation Metric
We consider a plausible application that recommends only a few quotes. In such an application, since the position of the correct quote within the list is not important, we use Recall@k as our evaluation metric.

Recall@k: Since there is only one correct (hidden) quote for each query in the original test set, Recall@k is the number of cases in which the gold quote is recommended in the top-k result, divided by the total number of test cases. We set k to five.
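A small sketch of this metric, assuming each test query comes with its ranked recommendation list and its single gold quote:

```python
def recall_at_k(ranked_lists, gold_quotes, k=5):
    """Fraction of test queries whose (single) gold quote appears in the top-k recommendations."""
    hits = sum(gold in ranked[:k] for ranked, gold in zip(ranked_lists, gold_quotes))
    return hits / len(gold_quotes)
```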
4.3 Baselines and Parameter Settings
We compare our approaches with three state-of-the-art approaches from the quote and citation recommendation domains. Learning-to-recommend quotes (LRQ) [7] is an algorithm for recommending quotes for writing. The context-aware relevance model (CRM) [1] and the citation translation model (CTM) [3] are algorithms for recommending citations for scientific papers. A popularity-based method (Popularity) and a cosine similarity-based method (Cosine similarity) are also adopted as baselines; these two are used in order to reveal the different levels of difficulty of the datasets. The methods are described in detail below.

LRQ exploits an existing learning-to-rank framework for quote recommendation with quote-based features, quote-query similarity features, and context-query similarity features.

CRM recommends quotes according to the average of the squared cosine similarities between the contexts of each quote and the query.

CTM recommends quotes according to the probability that the query context would be translated into the quote.

Popularity ranks the quotes according to their frequency in the contexts of the training set.

Cosine similarity ranks the quotes by comparing each individual context of a quote with the given query using a bag-of-words representation.
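For concreteness, a hedged sketch of the CRM score and the Cosine similarity baseline as we read the descriptions above; TF-IDF vectors are assumed, and taking the maximum over individual contexts is our interpretation of "comparing each individual context", not a detail stated in the cited papers.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def crm_score(query_vec, context_matrix):
    """CRM as described above: mean squared cosine similarity between the query
    and every training context of a quote (rows of context_matrix)."""
    sims = cosine_similarity(query_vec, context_matrix).ravel()
    return float(np.mean(sims ** 2))


def cosine_baseline_score(query_vec, context_matrix):
    """Cosine-similarity baseline: best match against any individual context
    (the max is our assumption about how individual matches are combined)."""
    return float(cosine_similarity(query_vec, context_matrix).max())
```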
We implement these methods and set their parameters to the optimum values specified in the respective papers. Specifically, we truncate each half-context (pre-context or post-context) longer than 150 characters for LRQ, 50 words for CRM, and one sentence for CTM, as the respective authors suggest. For our approaches, we set the length of the half-context to the value that gives the best result on the validation set:
1) 150 characters of pre-context and post-context with word truncation for context clustering and context lumping, 2) 50 words for RF, and 3) 30 words of pre-context for CNN and RNN. As stated in the introduction, we used the pre-context and post-context as the query for the Gutenberg and Blog datasets, and the pre-context only for the Twitter dataset. The hyperparameters of the single algorithms are set using the validation set. For rank aggregation, we used the five proposed algorithms (context clustering, context lumping, RF, CNN and RNN); top-k RM showed the best results when k=3.
4.4 Results and Discussions
The results of the experiments are listed in Table 2, which reports Recall@5 and the improvement ratio of each algorithm over the best baseline in each dataset. The individual algorithms context lumping and CNN, even without rank aggregation, outperform the baselines on all of the datasets. Surprisingly, the simple context lumping method is the best single performer on the Gutenberg and Blog datasets, beating LRQ by up to 35%. Context clustering outperforms CRM and Cosine similarity, which do not treat the contexts of a quote collectively. These better results of context lumping and context clustering show the effectiveness of adjusting the context matching granularity. One can also observe that the performance of the Cosine similarity baseline on the Twitter dataset is worse than on the Gutenberg and Blog datasets. This means that the sparsity problem is more serious on Twitter, where tweets contain more infrequent words than the other corpora. On the Twitter dataset, the deep learning algorithms (CNN and RNN) outperform CTM by up to 43%. From this result, we can see that the deep learning algorithms are able to mitigate this serious sparsity problem because they are not based on word matching. The results of RF show that it is competitive with the CTM algorithm. In fact, in our preliminary experiments on a top-100 Twitter dataset, RF outperformed CNN. However, on the larger dataset the algorithm did not generalize as well as expected; this is an area for future investigation.

Although some of our single algorithms outperform the others on specific datasets, there is no single algorithm that outperforms all the others. Also, even within one dataset, there is a portion of queries for which each single recommendation algorithm is exclusively correct (see Table 3). These observations justify our motivation for adopting rank aggregation, and, as expected, the rank aggregation methods (RM RA and top-k RM RA) outperform the best baseline algorithm by 44.0% and 46.7% on average, respectively.

Table 3 Number of correct cases of single algorithms in the Twitter dataset
Approaches          # correct cases (A)    # cases exclusively correct (B)    (B)/(A)
Context lumping     6,031 (0.286)          1,122                              0.186
RF                  5,186 (0.244)          493                                0.095
CNN                 8,213 (0.390)          935                                0.114
RNN                 8,191 (0.389)          1,023                              0.125

In conclusion, although some of our single algorithms, such as context clustering or RF, do not outperform the baselines, there are cases where each single algorithm is able to exclusively answer correctly, which we believe our proposed rank aggregation method is able to exploit.

5. CONCLUSIONS
In this paper, we tackled quote recommendation by exploring four single recommendation approaches that consider different aspects of the context, and we presented new rank aggregation methods for maximizing performance. Over our datasets, we showed that the proposed algorithm (top-k RM RA) outperforms the best baseline by up to 46.7%. In the future, we plan to extend our research to recommending common phrases, which has wider applications in the real world.
Table 2 Results of Recall@5 of the different methods (columns are the context sources; improvement over the best baseline in parentheses)
Approaches              Twitter          Gutenberg        Blog
Context clustering      0.190 (-30%)     0.299 (-1%)      0.494 (0%)
Context lumping         0.286* (+5%)     0.409* (+35%)    0.521* (+5%)
RF                      0.244 (-11%)     0.246 (-19%)     0.470 (-5%)
CNN                     0.390* (+43%)    0.326* (+8%)     0.506 (+2%)
RNN                     0.389* (+42%)    0.294 (-3%)      0.473 (-4%)
RM RA                   0.424* (+55%)    0.445* (+47%)    0.640* (+30%)
top-k RM RA (k=3)       0.436* (+60%)    0.451* (+49%)    0.648* (+31%)
LRQ                     0.196            0.302            0.494
CRM                     0.119            0.237            0.382
CTM                     0.273            0.257            0.441
Popularity              0.156            0.111            0.223
Cosine similarity       0.196            0.248            0.469
(* indicates that the algorithm outperforms the best baseline with a statistically significant increase at p < 0.01 in a two-tailed t-test)

6. REFERENCES
[1] He, Q., Pei, J., Kifer, D., Mitra, P., and Giles, C. L. 2010. Context-aware citation recommendation. In Proceedings of the 19th International Conference on World Wide Web (Raleigh, NC, USA, April 26-30, 2010). WWW '10. ACM, New York, NY, USA, 421-430. DOI=http://dx.doi.org/10.1145/1772690.1772734
[2] Hochreiter, S. and Schmidhuber, J. 1997. Long Short-Term Memory. Neural Computation (Nov. 1997), 1735-1780.
[3] Huang, W., Kataria, S., Caragea, C., Mitra, P., Giles, C. L., and Rokach, L. 2012. Recommending citations: translating papers into references. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management (Maui, HI, USA, October 29 - November 2, 2012). CIKM '12. ACM, New York, NY, USA, 1910-1914. DOI=http://dx.doi.org/10.1145/2396761.2398542
[4] Kim, Y. 2014. Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (Doha, Qatar, October 25-29, 2014). EMNLP '14. ACL.
[5] Liaw, A. and Wiener, M. 2002. Classification and regression by randomForest. R News 2, 3 (2002), 18-22.
[6] Rangrej, A., Kulkarni, S., and Tendulkar, A. V. 2011. Comparative study of clustering techniques for short text documents. In Proceedings of the 20th International Conference Companion on World Wide Web (Hyderabad, India, March 28 - April 1, 2011). WWW '11. ACM, New York, NY, USA, 111-112. DOI=http://dx.doi.org/10.1145/1963192.1963249
[7] Tan, J., Wan, X., and Xiao, J. 2015. Learning to recommend quotes for writing. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (Austin, Texas, January 25-30, 2015). AAAI '15. AAAI Press, 2453-2459.
[8] Young, H. P. 1974. An axiomatization of Borda's rule. J. Econ. Theory 9, 1, 43-52.