Learning-to-Rank in Research Paper CBF Recommendation: Leveraging Irrelevant Papers

Anas Alzoghbi, Victor A. Arrascue Ayala, Peter M. Fischer, Georg Lausen
Department of Computer Science, University of Freiburg, Germany
{alzoghba, arrascue, peter.fischer, lausen}@informatik.uni-freiburg.de

ABSTRACT
Suggesting relevant literature to researchers has become an active area of study, typically relying on content-based filtering (CBF) over the rich textual features available. Given the high dimensionality and the sparsity of the training samples inherent to this domain, the focus has so far been on heuristic-based methods. In this paper, we argue for the model-based approach and propose a learning-to-rank method that leverages publicly available publication meta-data to produce an effective prediction model. The proposed method is systematically evaluated on a scholarly paper recommendation dataset and compared against state-of-the-art model-based approaches as well as current, domain-specific heuristic methods. The results show that our approach clearly outperforms state-of-the-art research paper recommendations while utilizing only publicly available meta-data.

CCS Concepts
• Information systems → Learning to rank; Recommender systems

Keywords
Research paper recommendation; Learning-to-Rank; Content-based recommendation; Model-based user profile

1. INTRODUCTION
Scholars and researchers are confronted with an overwhelming number of newly published research papers in their domain of expertise. Although advantageous for restricting the domain, the keyword-based search tools typically available in digital libraries offer only limited help to researchers in locating relevant content. As a result, researchers need to manually search within unspecific search results to identify papers of interest. This is the situation where recommender systems have great potential, and indeed plenty of works have adopted different techniques to tackle this problem. A recent extensive survey of this domain [3] identified content-based filtering (CBF) as the predominant approach for research paper recommendation because of the rich textual features available. For learning user profiles, the focus has been almost exclusively on relevance feedback approaches, building on the assumption that the papers appearing in a user's preference list contribute an equal (or a presumed) share to the underlying user taste. Thus, user profiles are constructed as an aggregation of the relevant papers' keywords. Based on the classification suggested by Adomavicius et al. in [1], these approaches are referred to as heuristic-based. In contrast, model-based approaches depend on a learning method to fit the underlying user model (profile). This enables a better modeling of the researcher-keyword relation in user profiles, but it requires a large body of training data, which is not readily available in this domain. As a result, little work on applying model-based approaches exists for this problem.

In this paper, we employ pairwise learning-to-rank [4] as a model-based technique for learning the user profile. We incorporate both relevant papers and irrelevant "peer" papers (papers published in the relevant papers' conferences) to formulate pairwise preferences and enrich the training set. Our main contributions include:

• We investigate and customize learning-to-rank for CBF research paper recommendation.
• We incorporate only a small set of data, restricted to publicly available metadata of papers. This makes our approach suitable for a much larger domain than previous approaches, which require the papers' full text.
• We perform an initial, yet systematic study on a real-world dataset in which we show that our approach clearly outperforms existing heuristic- and model-based algorithms.

The rest of this paper is organized as follows: Section 2 provides an overview of existing related work. In Section 3 we present our approach, and in Section 4 we describe the experimental setup and results. Finally, we conclude in Section 5 by summarizing our findings and situating this work within our future plans.
2. RELATED WORK
A rich body of related work has tackled the problem of research paper recommendation. Collaborative filtering (CF) approaches [8, 13, 14] showed a successful application of model-based methods, incorporating knowledge from other "similar" users. However, we restrict our scope to content-based scenarios that consider only information from the active user. In this domain, the main focus in learning user profiles has been on heuristic-based approaches with a wide adoption of relevance feedback and cosine similarity [3]: papers are recommended which are most similar to one or more previously published or liked papers. In [10], De Nart et al. used terms (keyphrases) extracted from a user's liked papers to construct the user profile. The profile has a graph representation, and the focus was on the keyphrase extraction method and the graph structure. Lee et al. [6] proposed a memory-based CBF in which a user's papers are clustered based on their similarity, and candidate papers are ranked based on the distance from the user's clusters. Sugiyama et al. [11, 12] applied a relevance feedback approach utilizing all terms from the full text of the researcher's publications, in addition to terms from the citing and the referenced papers, in order to build profiles. All of these works are heuristic-based, where the weights in the user profile are set by aggregating individual keyword scores of relevant papers.

On the contrary, model-based approaches depend on machine learning techniques to learn the user's affinity towards keywords, promising a more representative user profile. In a previous work [2], we showed the superiority of a model-based method over relevance feedback methods for CBF research paper recommendation. We applied multivariate linear regression to learn researchers' profiles from their previous publications. Yet, that work was tailored to researchers with previous publications and did not consider irrelevant papers. In [9], Minkov et al. presented a collaborative ranking approach for event recommendation. They compared it with a content-based baseline that applies pairwise learning-to-rank on pairs of relevant and irrelevant events. In our work, we follow a similar approach in applying learning-to-rank to pairs of relevant and irrelevant papers. However, we push it further and investigate the quality of these pairs and their effect on the model performance.

3. PROPOSED APPROACH
This work targets users who have previously interacted with scientific papers and identified some as papers of interest (relevant papers). Having a set of relevant papers for a user, the recommendation process can start and a machine learning method is applied to fit a user profile (model). The learned model is used to rank a set of candidate papers and recommend the top-ranked papers to the user. Our approach is to employ the pairwise learning-to-rank technique to build the user profile. We chose this method because of its desirable properties: it was proven to be successful in solving ranking tasks in similar problem domains like online advertising [7], and it also shows good performance on problems with sparse data. The main idea of pairwise learning-to-rank is to build pairs of preferences out of the training set. Each pair consists of a positive and a negative instance. Afterwards, the pairs are fed as training instances to a learning algorithm, which in turn learns the desired model.

In the underlying problem, papers marked as interesting by users are the positive instances. However, the negative instances, or irrelevant papers, are usually not explicitly provided by the users. This makes pairwise learning-to-rank not directly applicable to this setup. In our contribution, we seek implicit information about the irrelevant papers. For this, we start from the following hypothesis: when users identify relevant papers, they, to some extent, implicitly rate other papers published at the same conference (we call them peer papers) as irrelevant (later, we introduce a validation process that checks the correctness of this hypothesis for each pair). Based on this hypothesis, we utilize peer papers as irrelevant papers as follows: for each user, we build pairs of preferences out of relevant and peer papers. Such pairs are called pairwise preferences or, for simplicity, pairs; we use these terms interchangeably throughout the paper. Afterwards, we feed these pairs as training examples to a learning algorithm in order to fit the user's model. This model is later used to rank candidate papers and recommend the top-ranked ones to the user.

Before delving deeper into the method details, we first introduce some notation. The function peer(.) is defined over the interest set P^r_int of a user r. It delivers, for a paper p ∈ P^r_int, the set of p's peer papers. In practice, this can be retrieved via digital libraries like the DBLP registry (http://dblp.uni-trier.de). For the paper modeling, we adopt a vector space model representation. Having the domain-related keywords extracted from a paper's title, abstract and keyword list as features, each paper p is a vector p = ⟨s_{p,v_1}, ..., s_{p,v_|V|}⟩, where v_i ∈ V is a domain-related vocabulary term and s_{p,v_i} is a score reflecting the importance of v_i in p. We adopt the TF-IDF score as the weighting scheme. Based on this representation, the similarity between two papers is calculated as the cosine similarity between the papers' vectors.
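As an illustration of this representation, the following is a minimal sketch (not the authors' code) of mapping a paper's extracted keywords to a TF-IDF vector over a fixed domain vocabulary V and computing the cosine similarity between two such vectors; the function names and the exact TF-IDF variant are assumptions.

```python
import math
from collections import Counter

def tfidf_vector(tokens, vocabulary, doc_freq, n_docs):
    """TF-IDF vector of a paper over a fixed domain vocabulary.

    tokens: keywords extracted from the paper's title, abstract and keyword list.
    doc_freq: document frequency of each vocabulary term in the corpus.
    """
    counts = Counter(tokens)
    vec = []
    for term in vocabulary:
        tf = counts.get(term, 0)
        # Smoothed IDF; the paper does not specify its exact TF-IDF variant.
        idf = math.log((1 + n_docs) / (1 + doc_freq.get(term, 0))) + 1.0
        vec.append(tf * idf)
    return vec

def cosine_similarity(a, b):
    """Cosine similarity between two paper vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```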
3.1 Method Steps
An overview of the proposed approach is depicted in Figure 1. For the experimental setup only, we split the user's interest set P^r_int into training and test sets P^r_train and P^r_test, respectively. This step is dropped in the non-experimental recommendation scenario, where the first step receives the complete interest set P^r_int.

[Figure 1 omitted: pipeline from the user's interest list (80% training / 20% test) through (1) peer papers augmenting, (2) forming pairwise preferences, (3) preferences validation, (4) model learning, and (5) ranking & evaluation against the user's model (profile).]
Figure 1: Overview of the proposed approach steps

1. Peer papers augmenting: in this step, the peer papers are retrieved for all relevant papers. The retrieved peer papers serve as potential negative instances and are important for enabling the learning algorithm to construct a better understanding of the user's taste.

2. Forming pairwise preferences: here we apply the concept of pairwise learning from learning-to-rank. The training set in this step is reformulated as a set of pairs P, where each pair consists of two components: a relevant paper and an irrelevant paper. That is, each relevant paper p ∈ P^r_train is paired with all papers from peer(p) (a sketch of this pairing follows the list below):

   P = {(p, p') | ∀p ∈ P^r_train, ∀p' ∈ peer(p)}

   A pair (p, p') ∈ P depicts a preference in the user's taste and implies that p has a higher relevance to user r than p'.

3. Preferences validation: in the first step, we introduced the peer papers as negative instances based on the hypothesis stated earlier in this section. Yet, this hypothesis cannot be adopted as ground truth because (a) users have not explicitly affirmed that they are not interested in the peer papers, and (b) some peer papers might be of interest to the user but might have been overlooked. With this in mind, not all pairwise preferences formulated in the previous step have the same level of correctness. Therefore, this step examines the pairwise preferences and makes sure to pass only valid ones to model learning. We propose two different mechanisms to accomplish this validation: pruning based validation and weighting based validation. We explain these techniques in the next section.

4. Model learning: in this step, we apply a pairwise learning-to-rank method (Ranking SVM [5]) to train a user model ŵ_r. Using the validated pairwise preferences from the previous step, we seek the ŵ_r that minimizes the objective function

   ŵ_r = arg min_{w_r} (1/2) ||w_r||^2 + C · L(w_r)

   where C ≥ 0 is a penalty parameter and L(w_r) is the pairwise hinge loss function:

   L(w_r) = Σ_{(p,p') ∈ P} max(0, 1 − w_r^T (p − p'))^2        (*)

5. Ranking & Evaluation: given the user's model resulting from the previous step, we apply the prediction to candidate papers. For the experimental setup, these are the papers in the test set, which is constructed out of the relevant papers P^r_test (the positive instances) together with their peer papers as irrelevant papers (the negative instances).
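The following is a minimal sketch of steps 1 and 2 under stated assumptions: the peer(.) lookup is taken as given (in practice it would be backed by DBLP metadata), and `get_peers` is a hypothetical helper, not an API from the paper.

```python
def build_pairwise_preferences(train_papers, get_peers):
    """Form the pair set P = {(p, p') | p in P^r_train, p' in peer(p)}.

    train_papers: list of relevant paper ids in P^r_train.
    get_peers: callable returning, for a paper id, the ids of its peer papers
               (papers published at the same conference); assumed to exist.
    """
    pairs = []
    for p in train_papers:               # step 1: peer papers augmenting
        for p_prime in get_peers(p):     # step 2: one pair per (relevant, peer) combination
            pairs.append((p, p_prime))
    return pairs
```

Each returned pair would then pass through one of the validation mechanisms of Section 3.2 before model learning.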
3.2 Preferences Validation Methods
As pairwise learning-to-rank expects pairs that show a contrast between the negative and positive instances, pairs with "wrongly assigned" peers pose potential noise to the learning process. After all, the validity of a pairwise preference (p, p') depends on the correctness of considering its peer paper p' irrelevant. The pair's relevant paper p forms the ground truth and can hence be considered the reference point for deciding whether p' is irrelevant or not. For each pair (p, p') ∈ P we measure the similarity between p and p' and adopt two methods to validate the pair based on this similarity:

Weighting Based Validation (WBV). This strategy gives pairwise preferences different weights based on the dissimilarity between the pair's components. It boosts the importance of pairs with dissimilar components and ensures that the more similar the pair's components are, the less important the pair is for model learning. Therefore, we weight the importance of each pair according to the distance (1 − similarity) between the relevant paper and the peer paper, and redefine the loss function from (*) to consider the pairs' weights as follows:

   L(w_r) = Σ_{(p,p') ∈ P} max(0, 1 − w_r^T ((1 − similarity(p, p')) (p − p')))^2

Pruning Based Validation (PBV). Here we filter out invalid pairwise preferences. Validity is judged based on the dissimilarity between the pair's components: if they prove to be similar, then we do not consider p' an irrelevant paper and, consequently, the pair (p, p') is not eligible for model learning. A similarity threshold τ is applied, and a pair (p, p') is pruned if similarity(p, p') > τ. In our experiments, we empirically test a range of values for τ and discuss the corresponding effect on the model.
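To tie steps 3 and 4 together, here is a simplified sketch of validating pairs via WBV or PBV and fitting the user model ŵ_r. It is not the authors' implementation: the paper trains with Ranking SVM [5], whereas this stand-in minimizes the same regularized squared-hinge objective with plain gradient descent, and all names are illustrative.

```python
import numpy as np

def validate_pairs(pairs, similarity, tau=None):
    """Preferences validation (Section 3.2), returning (p, p', weight) triples.

    PBV: if a threshold tau is given, drop pairs whose components have
         similarity greater than tau (weight stays 1).
    WBV: otherwise keep every pair, weighted by the distance 1 - similarity.
    """
    validated = []
    for p, p_prime in pairs:
        sim = similarity(p, p_prime)
        if tau is not None:                      # pruning based validation
            if sim <= tau:
                validated.append((p, p_prime, 1.0))
        else:                                    # weighting based validation
            validated.append((p, p_prime, 1.0 - sim))
    return validated

def learn_user_model(validated_pairs, vectors, dim, C=1.0, lr=0.01, epochs=100):
    """Fit w_r by minimizing 1/2||w||^2 + C * sum max(0, 1 - w^T(weight * (p - p')))^2.

    A plain gradient-descent stand-in for Ranking SVM; vectors maps paper ids
    to their TF-IDF feature vectors (numpy arrays of length dim).
    """
    w = np.zeros(dim)
    for _ in range(epochs):
        grad = w.copy()                          # gradient of the 1/2||w||^2 regularizer
        for p, p_prime, weight in validated_pairs:
            d = weight * (vectors[p] - vectors[p_prime])
            slack = max(0.0, 1.0 - w.dot(d))
            grad -= C * 2.0 * slack * d          # gradient of the squared hinge term
        w -= lr * grad
    return w
```

Candidate papers can then be ranked by the score w_r^T p, as in step 5.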
4. EXPERIMENTS

4.1 Dataset & Setup
We evaluated the proposed approach on the scholarly publication recommendation dataset from [12], including the extensions applied in our previous work [2]: papers are identified and enriched with meta-data from the DBLP registry, namely titles, abstracts, keywords and the publishing conference. The dataset contains 69,762 candidate papers, as well as the lists of relevant papers for 48 researchers. The number of relevant papers per researcher ranges from 8 to 208 with an average of 71 papers. After augmenting peer papers, we obtained a skewed distribution, as the ratio of relevant papers to peer papers ranges from 0.45% to 3% with an average of 1.2%. We performed offline experiments with 5-fold cross validation following the steps outlined in Figure 1. For each researcher, we randomly split the interest list into training and test sets; then we learn the researcher's model as described in Section 3; finally, we evaluate the learned models on the test set. The test set consists of (a) positive instances, the test relevant papers (20% of the researcher's interest list), and (b) negative instances, the peer papers of the positive instances. This applies to all of our experiments, except for the experiments on the pruning based validation method (PBV). In PBV, we filter out of the training set those pairs whose components have a similarity higher than τ. Therefore, we apply the same rule to the test set and filter out peer papers based on their similarity to the corresponding relevant paper. For example, given a similarity threshold τ and a relevant paper p from the test set, a peer paper p' ∈ peer(p) is added as an irrelevant paper to the test set if and only if similarity(p, p') ≤ τ.

4.2 Metrics
We measured the following metrics to determine the performance for top-k ranking as well as overall classification, reporting the averages over all researchers for each metric:
Mean Reciprocal Rank (MRR): evaluates the position of the first relevant paper in the ranked result.
Normalized Discounted Cumulative Gain (nDCG): nDCG@k indicates how good the top k results of the ranked list are. We report nDCG for k ∈ {5, 10}.
AUC and Recall: used to study the behavior of the validation strategies PBV and WBV and of the baseline algorithms, Logistic Regression and SVM.
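For reference, the following is a minimal sketch of how MRR and nDCG@k could be computed per researcher (and then averaged across researchers) with binary relevance; the paper does not spell out its exact formulas, so the standard definitions below are an assumption.

```python
import math

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant paper in the ranked list (0 if none appears)."""
    for rank, paper_id in enumerate(ranked_ids, start=1):
        if paper_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """nDCG@k with binary relevance: DCG of the top-k list divided by the DCG
    of an ideal ranking that places all relevant papers first."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, paper_id in enumerate(ranked_ids[:k], start=1)
              if paper_id in relevant_ids)
    ideal_hits = min(k, len(relevant_ids))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0
```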
4.3 Results & Discussion
In total, we performed three different experiments. The first experiment (with the results shown in Table 1) shows a superior performance of our weighting based validation method (WBV) over the state-of-the-art heuristic-based work (Sugiyama [12]) and the model-based approach (PubRec [2]). The experiments were performed using the same features and datasets as in these works and show a clear lead over all metrics.

                          MRR     nDCG@5   nDCG@10
  WBV                     0.728   0.471    0.391
  PubRec                  0.717   0.445    0.382
  Sugiyama [12] via [2]   0.577   0.345    0.285

Table 1: WBV compared to state-of-the-art model-based and heuristic-based approaches

The second experiment compares the performance of our approach with other baseline classification algorithms, SVM and logistic regression, to provide a more general understanding of its capabilities. As shown in Figure 2, logistic regression showed a weak performance on all metrics, particularly on Recall: it did not succeed in identifying relevant papers even when fed with a balanced training set. SVM showed a better ability to recognize the relevant papers with a better Recall value, but produced many false positives, which is clear from its lower MRR and nDCG values. In contrast, all variants of our method showed a superior performance on all metrics.

[Figure 2 omitted: MRR, AUC, nDCG@5, nDCG@10 and Recall plotted against the similarity threshold τ for PBV, WBV, logistic regression (LR) and SVM.]
Figure 2: WBV and PBV compared with Logistic regression and Support Vector Machine

Finally, we compare the suggested pair validation techniques WBV and PBV, including tuning the latter by varying the similarity threshold τ from 1 (where no pairs are filtered; this case represents the CBF approach of [9]) down to 4 · 10^-4 (where many "noisy" pairs are pruned from the training set). WBV showed in general a very good performance, beating PBV for higher values of τ on all metrics except Recall. There, PBV gives a slightly better Recall even without filtering any pairs (when τ = 1). This is due to the fact that weighting the pairs in WBV causes the model to miss some relevant papers, while PBV makes the models more capable of recognizing the relevant papers by eliminating the noisy pairs from the training set. When decreasing τ, PBV shows very good scores, but these results need additional investigation before leading to a clear conclusion. As mentioned earlier in this section, reducing τ also leads to a smaller number of irrelevant papers in the test set. This reduces the underlying bias in the test set, which has an (additional) positive impact on the metrics, even though a clear bias (the relevant/peer ratio is on average 11.2%) is still present at the lowest τ values.

5. CONCLUSION
In this paper, we investigated the application of learning-to-rank to research paper recommendation. We proposed a novel approach that leverages irrelevant papers to produce more accurate user models. Offline experiments showed that our method outperforms state-of-the-art CBF research paper recommendations while utilizing only publicly available meta-data. Our future steps will focus on further understanding the effect of the similarity threshold in pruning based validation (PBV) on the model quality, and on studying the suitability of pairwise learning-to-rank algorithms other than Ranking SVM for this problem.
6. REFERENCES
[1] G. Adomavicius, Z. Huang, and A. Tuzhilin. Personalization and Recommender Systems. 2014.
[2] A. Alzoghbi, V. A. A. Ayala, P. M. Fischer, and G. Lausen. PubRec: Recommending publications based on publicly available meta-data. In LWA, 2015.
[3] J. Beel, B. Gipp, S. Langer, and C. Breitinger. Research-paper recommender systems: a literature survey. IJDL, 2015.
[4] L. Hang. A short introduction to learning to rank. IEICE Transactions on Information and Systems, 2011.
[5] R. Herbrich, T. Graepel, and K. Obermayer. Large margin rank boundaries for ordinal regression. 2000.
[6] J. Lee, K. Lee, J. G. Kim, and S. Kim. Personalized academic paper recommendation system. In SRS, 2015.
[7] C. Li, Y. Lu, Q. Mei, D. Wang, and S. Pandey. Click-through prediction for advertising in Twitter timeline. In KDD, 2015.
[8] S. M. McNee et al. On the recommending of citations for research papers. In CSCW, 2002.
[9] E. Minkov, B. Charrow, J. Ledlie, S. Teller, and T. Jaakkola. Collaborative future event recommendation. In CIKM, 2010.
[10] D. D. Nart and C. Tasso. A personalized concept-driven recommender system for scientific libraries. Procedia Computer Science, 2014.
[11] K. Sugiyama and M.-Y. Kan. Scholarly paper recommendation via user's recent research interests. In JCDL, 2010.
[12] K. Sugiyama and M.-Y. Kan. Exploiting potential citation papers in scholarly paper recommendation. In JCDL, 2013.
[13] A. Vellino. A comparison between usage-based and citation-based methods for recommending scholarly research articles. In ASIS&T, 2010.
[14] C. Wang and D. M. Blei. Collaborative topic modeling for recommending scientific articles. In KDD, 2011.