<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Learning to Rank Research Articles</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Daniel Kershaw</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Benjamin Pettit</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maya Hristakeva</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kris Jack</string-name>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>75</fpage>
      <lpage>88</lpage>
      <abstract>
        <p>Online academic repositories help millions of researchers discover relevant articles, a domain in which there are many potential signals of relevance, including text, citation links, and how recently an article was published. In this paper we present a case study of productionizing learning to rank for large-scale recommendation, which utilises these diverse feature sets to increase user engagement. We first introduce item-to-item collaborative filtering (CF), then describe how these recommendations are rescored with a LtR model. We then describe offline and online evaluation, which are essential for productionizing any recommender. The online results show that learning to rank significantly increased user engagement with the recommender. Finally, we show through post-hoc analysis that the original CF solution tended to promote older articles with lower traffic. However, by learning from subjective user interactions with the recommender system, our relevance model reversed those trends.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The rate of scientific discovery is ever increasing, with new methods, theory and
practice being published each day. This growth poses a challenge for researchers,
who need to stay up to date. Online academic catalogues, such as
ScienceDirect, give users access to large amounts of peer-reviewed scientific publications.
However, the experience of using such catalogues can be characterised as a
combination of information overload [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and information shortage [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. First, when
encountering a large catalogue of information there is no simple way for the
user to read, comprehend and critique all the documents. Additionally, when
browsing they may not be discovering content that they would deem relevant.
      </p>
      <p>
        The specific use case we aim to address is to help users of ScienceDirect
find additional relevant articles that go beyond explicitly encoded relationships
such as authorship or publication venue. In contrast to the personalised research
article recommendations in platforms such as Mendeley [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and CiteULike [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ],
we want an approach that works in the absence of profile data or reading history.
      </p>
      <p>In this paper, we describe how we built an initial system using item-based
collaborative filtering (IBCF). Once this system was in production, we collected
data on which recommendations users preferred, and then trained a Learning
to Rank (LtR) model to re-rank the recommendations using a range of article
features and similarity metrics. We focus on the evaluation methodology and
how the recommendations surfaced by LtR differed from the pure collaborative
filtering system. These give insight into what the LtR system achieves that could
not be done with collaborative filtering (CF) alone.</p>
      <p>
        Recommender Systems (RS) have become key tools in a researcher’s
content discovery toolkit, as they not only allow researchers to navigate
large and ever-growing catalogues more efficiently, but also to discover content that they would not
have seen otherwise. Previous work has resulted in a distinction between
methods used for personalised and non-personalised approaches. This can be seen
in the use of implicit feedback and CF for personalised recommendations [
        <xref ref-type="bibr" rid="ref13 ref19">13,
19</xref>
        ], whereas content-based methods are used predominantly for non-personalised
recommendations [
        <xref ref-type="bibr" rid="ref16 ref20 ref21">20, 21, 16</xref>
        ]. In combining CF and LtR we overcome some of
the limitations of CF by reducing the dependence on current navigation
patterns, allowing us to take into account the content of the article [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and how
recently it was published. Unlike purely content-based recommendations, this
system adapts to how the articles are being used and which recommendations
users engage with.
      </p>
      <p>Through this research, we ultimately found that to train and evaluate a
better model of article relevance, it is necessary to go beyond trying to predict
what users will browse next. While the implicit feedback from article browsing
is valuable for CF, it is limited by the very same information shortage problem
that the recommender system attempts to mitigate. In addition to developing
the model, we present some post-hoc analysis that sheds light on some of the
biases in the CF results that are reversed by applying LtR. Our contributions
can be summarised by the following points:
1. Offline evaluation should be matched to the online challenge: By
comparing two offline evaluation scenarios with an online experiment, we
show that higher accuracy at predicting browsing behaviour does not
correspond to having the highest engagement from users.</p>
      <sec id="sec-1-1">
        <title>2. The winning ranking model depends on text, usage, article age, and the citation network</title>
        <p>By training a number of LtR models we show
that the relevance of a document to a user is a combination of textual similarity,
recency, popularity, usage similarity, and proximity in the citation network.</p>
        <p>Traditional bibliometric features have limited impact.</p>
        <p>3. The ranking model increases diversity: On average, the ranking model
increases the number of distinct journals in each list of related items.</p>
      </sec>
      <sec id="sec-1-2">
        <title>4. The ranking model promotes recently published items that have more traffic in the past year</title>
        <p>
          Although other collaborative filtering systems have reported a popularity bias [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], in this case collaborative filtering has a bias towards unpopular items, which the LtR system reverses.
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        A wide variety of techniques have been used to generate recommendations for
academic publications. [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] and [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] focused on using hierarchical clustering of
citation networks to make recommendations that distinguish between core
papers in a discipline and important papers in sub-fields. Meanwhile, [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]
used author-defined keywords and tags to identify similar documents. Explicit
article metadata are not the only inputs used to generate recommendations: Mendeley
Suggest [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] (https://www.mendeley.com/suggest) used user-based collaborative filtering (UBCF) to generate
recommendations based on a user’s library of documents, which showed improvements
over content and citation based methods. Another example comes from Wang et
al. [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], who used implicit feedback from CiteULike (https://www.citeulike.com) in conjunction with Latent
Dirichlet allocation (LDA) applied to article text. For a comprehensive survey
of research article recommender systems, we refer the readers to [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        LtR has traditionally been used in information retrieval (IR) systems such as
search engines [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. For example, Walmart demonstrated that LtR could be used
to improve the ranking of grocery search results [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. They used features mined
from images of the products, and experimented with optimising the models for
different actions such as clicking on the item or buying the item.
      </p>
      <p>
        To aid research into LtR, a number of frameworks have been developed which
allow for the training and testing of models. One example is Lerot [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], an
“online learning to rank framework”. Additionally, RankLib [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] is a Java framework
that contains a number of standard LtR models, which will be discussed next.
These frameworks have allowed research to be reproducible and applied easily
across a number of domains, and they implement a range of point-wise, pair-wise
and list-wise algorithms.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Data sets</title>
      <p>The recommendation generation has two distinct phases. First we generate the
recommendation candidates using CF and then re-rank them with a LtR model
to get the final recommendation list. We use different data sets within these two
distinct phases.</p>
      <sec id="sec-3-1">
        <title>Article browsing logs</title>
        <p>The main data set for CF contains implicit feedback from users as they browse
ScienceDirect. This usage data set is in the format of &lt;sessionID, articleID,
accessTime&gt;. We apply IBCF on this data set to find candidate related articles
based on co-usage patterns. High-traffic users are removed, as these can represent
public access machines in institutions. Additionally, we remove traffic that was
elicited by the recommender system, in order to avoid a positive feedback loop.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Recommender logs</title>
        <p>LtR requires labelled training data that represents user preferences in relation to
the recommendation list. We computed relevance labels by aggregating clicks and
impressions from the live CF recommender on ScienceDirect. When users visit a
ScienceDirect article page, a set of related-article recommendations are displayed
(i.e. impressions) and they may click on one or more of the recommendations to
view or download the article. We ignored impression data for page loads where
none of the recommendations were clicked, or where all of the recommendations
were clicked. For each query article, we aggregated the recommended articles
across all user sessions. For simplicity we treat relevance as a binary label: 1 if
the recommended article was clicked at least once, otherwise 0.</p>
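A sketch of this label aggregation, assuming each logged page load is a (query article, displayed recommendations, clicked recommendations) record (the exact log schema is an assumption):

```python
def relevance_labels(page_loads):
    """Aggregate clicks and impressions into binary relevance labels.

    page_loads: iterable of (query_id, shown_ids, clicked_ids) records.
    Page loads where none or all recommendations were clicked are ignored,
    as described above. Returns {query_id: {rec_id: 0 or 1}}.
    """
    labels = {}
    for query_id, shown, clicked in page_loads:
        shown, clicked = list(shown), set(clicked)
        if not clicked or clicked == set(shown):
            continue  # skip all-or-nothing page loads
        per_query = labels.setdefault(query_id, {})
        for rec in shown:
            # clicked at least once across any session -> label 1
            per_query[rec] = max(per_query.get(rec, 0), int(rec in clicked))
    return labels
```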
      </sec>
      <sec id="sec-3-3">
        <title>Article metadata</title>
        <p>Each article has a large amount of data and metadata associated with it. This
includes titles, authors, abstracts, references, and various article-level metrics,
journal impact metrics and citation information. These additional data sets are
used to generate features for the LtR models.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Recommendation Method</title>
      <sec id="sec-4-1">
        <title>Collaborative Filtering</title>
        <p>
          At the core of IBCF is the method proposed by [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], which finds similar items
(documents) based on the similarity between their usage vectors. This means
that for a given document d, the function sim(d, D) returns a set of M
similar documents based on their co-usage patterns. Within this nearest-neighbour
method, we use cosine similarity to identify documents that have been browsed
in the same sessions. Here, S_d is the set of sessions in which document d was
browsed.
        </p>
        <p>cosine(d, d′) = |S_d ∩ S_d′| / √(|S_d| · |S_d′|)   (1)</p>
        <p>Using cosine similarity to score neighbours ignores the statistical confidence
in the correlation between usage patterns. For example, many documents were not
viewed in very many sessions, so a similar document d′ may have only one or
two sessions in common with the focal document d, but nonetheless rank highly
in terms of cosine(d, d′) because |S_d′| is also small. Therefore, we scale cosine
similarity with a significance weighting computed from the number of sessions in
common:</p>
        <p>score(d, d′) = min(1, |S_d ∩ S_d′| / q) · cosine(d, d′)   (2)</p>
        <p>If two documents have fewer than q sessions in common, then their contribution
to the CF score is scaled down. This means that preference is given to
recommendations that are generated from high co-occurrence neighbours, which are more
likely to be related to the focal document d. An alternative would be to discard pairs
with low co-occurrence, but to keep catalogue coverage high we chose to keep
them and reduce their scores instead.</p>
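The two scoring equations can be sketched directly from the per-document session sets (an illustrative sketch; the value of q is an assumption, as it is not stated in the text):

```python
from math import sqrt

def cosine(s_d, s_e):
    """Equation 1: cosine similarity between the session sets of two documents."""
    overlap = len(s_d.intersection(s_e))
    if overlap == 0:
        return 0.0
    return overlap / sqrt(len(s_d) * len(s_e))

def cf_score(s_d, s_e, q=5):
    """Equation 2: cosine similarity damped by a significance weighting
    when the documents share fewer than q sessions."""
    overlap = len(s_d.intersection(s_e))
    return min(1.0, overlap / q) * cosine(s_d, s_e)
```

With q sessions or more in common the weight is 1 and the score reduces to plain cosine similarity.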
      </sec>
      <sec id="sec-4-2">
        <title>Learning to Rank</title>
        <p>Once the set of recommendations R_d has been generated for each document
d ∈ D, it is re-scored using a pre-trained LtR model. The premise of LtR is to
rank higher the items that users are more likely to engage with, learned by observing
their past actions. Training takes the form of n labelled query documents q_i (i =
1, …, n), each with associated recommended documents represented as feature vectors
with relevance judgements x_i = {x_j^(i)}, j = 1, …, m^(i), where m^(i) is the number
of recommendations for query i.</p>
        <p>
          For this work we focus on methods which have been implemented in the
LtR package RankLib [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. This is a Java application which can apply popular
LtR methods to an SVMrank-formatted file. The algorithms we compared
included RankNet, LambdaRank, MART, and LambdaMART. These LtR models
represent point-wise, pair-wise and list-wise objectives.
        </p>
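For context, RankLib consumes the plain-text SVMrank format, one labelled example per line. A hedged sketch of serialising a feature vector into that format (the helper name is ours):

```python
def svmrank_line(label, qid, features):
    """Format one training example as an SVMrank/RankLib line:
    'label qid:QUERY 1:value 2:value ...' with 1-based feature indices."""
    parts = [f"{i}:{value:g}" for i, value in enumerate(features, start=1)]
    return f"{label} qid:{qid} " + " ".join(parts)
```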
        <p>
          RankNet [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] is a pair-wise neural network algorithm. The objective function
is cross entropy, which aims to minimise the number of inversions in the
ranking. However, this pair-wise objective does not optimise for the whole list, unlike
list-wise methods such as LambdaRank and LambdaMART. LambdaRank [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]
built on RankNet by modifying the cost function to use gradients, λ, which
also take into account the change in a list-wise IR metric such as NDCG.
LambdaMART [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] combines LambdaRank and multiple additive regression trees
(MART), using the cost function from LambdaRank rather than from MART,
thus optimizing for the whole list. Out of the available models, LambdaMART is
generally considered state-of-the-art, and has performed well in competitions [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
        <p>Feature Extraction: The features for the LtR models are taken from the query
document and the recommended document, as well as from the interaction
between the two. These features can be grouped into seven categories.</p>
        <p>CF score: A measure of co-usage of the query document and the candidate
recommendation (Equation 2).</p>
        <p>Popularity: Popularity (number of views) of a document potentially
indicates its quality or its future engagement.</p>
        <p>Citations: If two documents (the query document and the recommendation)
share references, this could indicate a quality recommendation, and likewise
if two articles are both cited by the same article. To quantify this, we compute
two measures, the first being the Jaccard index between the citation
neighbourhoods:</p>
        <p>cite_sim(C_d, C_d′) = |C_d ∩ C_d′| / |C_d ∪ C_d′|   (3)</p>
        <p>Here the neighbourhood C_d is the set of articles that either cite document d
or are cited by document d, plus document d itself. The second measure is the
total number of citations a document has received.</p>
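A sketch of the Jaccard measure over citation neighbourhoods, assuming each neighbourhood is available as a Python set:

```python
def cite_sim(c_d, c_e):
    """Jaccard index of two citation neighbourhoods (Equation 3).

    Each neighbourhood is the set of articles citing or cited by a
    document, plus the document itself."""
    union = c_d.union(c_e)
    if not union:
        return 0.0
    return len(c_d.intersection(c_e)) / len(union)
```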
        <p>Journal Metrics: Impact factors are potential predictors of the quality of
the research, although a weakness is the huge variation in article impact within a
journal. We added to the feature set several impact metrics of the journal in which
the recommended article was published.</p>
        <p>Temporal: We represent age as the number of years since the cover date.</p>
        <p>Topics: All articles published are tagged with a scientific taxonomy,
indicating which topics and subjects they cover. We calculate a binary similarity between
the sets of topics associated with each document.</p>
        <p>Text: We included the similarity of the recommended document text to the
query document text, where text is represented as an n-gram tf-idf vector of a
document’s title and abstract.</p>
        <p>Training: We used query-recommendation pairs with relevance labels inferred
from recommender logs, as described in Section 3.2. We held out 20% of the
query articles from the training data as a validation set, and trained on the
remaining 80%. The validation set was used for hyper-parameter tuning and
feature selection.</p>
        <p>For hyper-parameter tuning, LambdaMART and MART require choosing
the learning rate, the number of trees, the maximum leaves per tree, and the minimum
training examples per leaf. With LambdaMART, we used 20 leaves per tree,
with at least 200 examples per leaf and a learning rate of 0.1. The number of
trees (approximately 250) was determined by early stopping, again based on the
validation set. The final model used was LambdaMART, optimising for NDCG@3.</p>
        <p>We pruned features through backwards elimination: removing the least
important feature (highest NDCG@3 when removed), and then repeating the process
as long as NDCG@3 increased on the validation set. In Section 6.3 we compare the
importance of the different types of features.</p>
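The pruning loop can be sketched as follows, where score(features) stands in for training a model and measuring NDCG@3 on the validation set (the helper is illustrative, not the authors' code):

```python
def backwards_elimination(features, score):
    """Greedily drop the feature whose removal yields the highest validation
    score, repeating while the score does not decrease."""
    current = list(features)
    best = score(current)
    while len(current) > 1:
        # try removing each remaining feature in turn
        trials = [[f for f in current if f != removed] for removed in current]
        trial_best = max(trials, key=score)
        if score(trial_best) >= best:
            best, current = score(trial_best), trial_best
        else:
            break
    return current, best
```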
      </sec>
      <sec id="sec-4-3">
        <title>Dithering</title>
        <p>
          The candidate recommendations are ranked using the LtR model or, in the initial
version of the system, by their CF scores. Before selecting the top portion of the
list to display to the user, we apply dithering [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], so that a larger proportion
of the list is explored. Dithering is the process of adding Gaussian noise to the
items’ ranks, thus slightly shuffling the list: new_score = log(rank) + N(0, log ε),
where typically ε ∈ [1.5, 3]. Over time, items at lower ranks will
eventually be shown to users, allowing us to collect feedback on their quality as
recommendations.
        </p>
        <p>We used two different tasks to evaluate LtR offline. First, we test it on
recommendation click prediction, the same task that the model was trained on. Then
we test whether the model transfers to a session prediction task.
        </p>
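The dithering step described above can be sketched as follows (a minimal illustration; we treat log ε as the noise variance, and the default ε = 2 is our choice):

```python
import math
import random

def dither(ranked_items, epsilon=2.0, rng=random):
    """Shuffle a ranked list slightly: new_score = log(rank) + N(0, log(epsilon)),
    then re-sort by the noisy score. Larger epsilon means more shuffling."""
    sd = math.sqrt(math.log(epsilon))  # standard deviation of the noise
    noisy = [(math.log(rank) + rng.gauss(0.0, sd), item)
             for rank, item in enumerate(ranked_items, start=1)]
    return [item for _score, item in sorted(noisy)]
```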
        <p>In the recommendation click prediction task, each test case consists of a query
article and its recommended articles that were displayed to users. Some of the
recommendations were clicked (labelled 1) and some were not clicked (labelled
0). The recommender is evaluated on its ability to rank the clicked items higher
than the non-clicked, using NDCG@k. This is the same ranking task that the LtR
model was trained on, but the test data came from a time interval after the
training and validation data. Its limitation is that it only evaluates performance
on the recommendations that were displayed by the incumbent recommender
system, so it cannot judge articles that a new ranker will introduce into the top
k.</p>
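Both tasks score rankings with NDCG@k. With the binary labels used here it reduces to the following (a standard formulation, not code from the paper):

```python
import math

def ndcg_at_k(labels, k=3):
    """NDCG@k for one query, where labels[i] is the relevance (0 or 1)
    of the item the ranker placed at position i."""
    def dcg(ordered):
        return sum((2 ** rel - 1) / math.log2(pos + 2)
                   for pos, rel in enumerate(ordered[:k]))
    ideal = dcg(sorted(labels, reverse=True))
    return dcg(labels) / ideal if ideal > 0 else 0.0
```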
        <p>In the session prediction task, each test case consists of a sequence of articles
that were browsed in the same user session, excluding browsed articles elicited
by the recommender system. The top k recommendations for the first item are
compared to what the user actually browsed next. The recommender is evaluated
on its ability to rank the browsed items highly (NDCG@k). We have found this
evaluation procedure useful for tuning CF candidate selection, where it correctly
predicted which CF variant had higher click through rate (CTR).</p>
        <p>For both types of evaluation, we split the data set on a time boundary τ,
train the model on data generated before τ, and then test the model to see if
it predicts the actions after τ. This means that before τ we use the clicks in sessions
(browsing between articles) to train the CF model, and then clicks on the already
deployed CF recommender are used to train the LtR model. We use actions after
τ as the ground truth, where browsed sessions are used for session prediction
and clicks on the deployed recommender for click prediction. Although they were
collected in the same time window, the test sets from the two evaluation tasks
use mutually exclusive subsets of the logs, e.g. the first only uses recommender
clicks, whereas the second excludes recommender clicks.</p>
        <p>We calculated a random baseline for each evaluation task. For session
prediction, the baseline is a random permutation of the candidate recommendation
list for each query article. For recommendation click prediction, the baseline is
a random permutation of the displayed recommendations for each query article.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Results</title>
      <sec id="sec-5-1">
        <title>Offline evaluation</title>
        <p>LambdaMART performed best in terms of NDCG@3 on a time-split test set of
recommender clicks and impressions (Table 1).</p>
        <p>Having chosen a modelling approach for LtR, we tested whether
LambdaMART’s performance gains transferred to the session-prediction task. As shown
in Figure 1, the ranking using LambdaMART was an improvement over CF score
on the click prediction task, but not on the session prediction task. The
LambdaMART and CF score rankings outperformed the random baseline in both
tasks (Figure 1). Nonetheless, we decided to proceed with online evaluation,
because the click prediction task is based on user behaviour in the recommendation
context, and we therefore thought it would be a better proxy for online
performance. We discuss the discrepancy between the two offline evaluation tasks in
more detail in Section 7.</p>
      </sec>
      <sec id="sec-5-2">
        <title>Online evaluation</title>
        <p>Our best candidate algorithms from offline evaluation are compared against the
current production model via A/B testing. Prior to the work on LtR, we had
A/B tested a number of iterative improvements on the CF stage, for example
by varying the duration of log data ingested to generate the recommendations.
A/B tests performed included increasing the time period of input data for CF
candidate selection, which resulted in a CTR change of +9.6%, and re-ranking using
the LambdaMART model, which increased CTR by +9.1%.</p>
        <p>[Figure 2: NDCG@3 on the validation set when each type of feature is removed
(CF score, Popularity, Text, Age, Citations, Topics, Journal metrics), compared
to the model with all features.]</p>
        <p>For LtR, we carried out an online A/B experiment comparing LtR (using
the LambdaMART model in Figure 1) to our best CF algorithm. Each user is
randomly allocated to one of two treatments, either receiving recommendations
from the incumbent CF system or from the LtR variant. The LtR variant resulted
in a statistically significant 9.1% increase in CTR. This result agreed with the
offline performance of LtR on the click prediction task, but it disagreed with
the relative performance on the session prediction task (Figure 1).</p>
      </sec>
      <sec id="sec-5-3">
        <title>Feature importance</title>
        <p>Figure 2 shows the effect of removing each category of features, as introduced
in Section 3. Using backwards elimination (Section 4.2), we obtained a slight
increase in the validation metric (NDCG@3). Using fewer features also reduced
training time and simplified the feature extraction pipeline. The journal metrics
were all removed, and we retained one or two features from each of the other
categories. The evaluation steps described above used this tuned LambdaMART
model with a reduced feature set.</p>
        <p>As one can see in Figure 2, CF score was the most important feature, in that
removing it resulted in the greatest drop in NDCG@3 on the held-out validation
set. The feature importance for our ranking model should not be interpreted
as what makes an article “relevant” in general, because the training data were
collected through a specific user interface that displayed some article metadata
and not others, and furthermore the order of the articles was determined by CF
score. Instead, the impact of features within the model guided us as to which
could be safely discarded (e.g. journal impact metrics).
</p>
        <p>[Figure 3: comparing methods cf and ltr: (a) mean age of items at a given rank;
(b) correlation between age and rank; (c) correlation between popularity and rank.]</p>
        <p>Item Age: One of the main motivations behind research article discovery tools is
to help people stay up to date with recent developments in their fields. Therefore
we examined the mean age of recommended documents in each rank position,
where age is defined as the current year minus the publication year of the
recommended article. As shown in Figure 3a, recommendations ranked by CF scores
are biased against newly published articles, with higher-ranking articles tending
to be older. When ranked by the LtR model, there is a bias towards younger
articles instead.</p>
        <p>The bias towards older articles, when ranked by CF, is especially pronounced
where the query article is relatively cold, in other words when it had less usage data
available for collaborative filtering (Figure 3b). Figure 3b bins the query items
based on their popularity within the CF input data, then for each query item
it computes the Spearman’s rank correlation (ρ_s) between the recommendation
age and rank (ordered either by LtR or CF score), and finally takes the mean
of ρ_s within each bin. A positive value of ρ_s indicates that older recommended
articles tend to be ranked higher.</p>
        <p>
However, for recommendations ranked by LtR this bias towards
older articles has been reduced (lower values of ρ_s in every bin), resulting in
warm articles having recommendations that favour newer publications.
        </p>
        <p>
          Popularity: In addition to favouring older articles, our CF method also tends
to give higher scores to less popular articles (Figure 3c), where popularity is
measured as the number of unique users viewing the item in the 12 months of
input data for CF (binned using the same method as in Figure 3b).
This is the opposite pattern to the popularity bias often found in UBCF [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
Like the age bias, it is most pronounced for colder source articles in the
bottom two quartiles of popularity. LtR reverses this trend and instead biases the
recommendations towards more popular items. Although this is not
necessarily a desirable outcome for all recommender systems, it is not so surprising in
this case, where popularity metrics were available as a feature for the relevance
model. Despite the bias towards popular items, LtR caused only a 2.1%
reduction in item coverage@3, with over 90% of recommendable items appearing in
the top three recommendations for at least one query article.
        </p>
        <p>
          Diversity One of the challenges of RS is to highlight content from across a
catalogue. Diversity in recommendations has been shown to increase users’
interactions and perception of fairness [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. However, one of the issues with our
system is that traffic can be concentrated within the same journal, which leads
to some recommendation lists being all from the same journal. Thus we aim to
understand how LtR and CF affect diversity within the set of recommendations.
        </p>
        <p>To quantify diversity within our recommendations we look at the number of
recommendations that come from different journals, with journals identified by
distinct ISSNs.</p>
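The diversity measure can be sketched as counting distinct ISSNs in the top of each list (a simple illustration with assumed data shapes):

```python
def journal_diversity(recommendations, issn_of, k=3):
    """Number of distinct journals (ISSNs) among the top-k recommendations.

    recommendations: ranked list of article IDs; issn_of: article ID -> ISSN."""
    return len({issn_of[article] for article in recommendations[:k]})
```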
        <p>As one can see in Table 2, LtR caused a slight increase in diversity, in terms of
the number of different journals in the top three recommendations for each query
article. To confirm the significance of this change in diversity, we performed a
chi-square test (χ²) on the contingency table, which showed a significant dependency
between the ranker and the number of distinct journals (p &lt; 10⁻¹⁶).</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Discussion</title>
      <p>Designing offline evaluation raises the question of what the recommender
system is designed to do. The session-prediction task assumes the goal of
recommendation is to predict human browsing behaviour. Although it was a good proxy for
online performance of the candidate selection step, the session-prediction task
did not agree with online results when it came to choosing a ranking method.</p>
      <p>One explanation for why CF outperformed the LtR model on the session
prediction task is that users’ browsing behaviour outside of the recommender
system is at odds with what they actually find useful as recommendations. Users
might struggle to find the most relevant articles, which is the very problem of
information shortage that motivated us to create the recommender. For example,
a recently published article may be less well known within its field, and yet when
it is presented in a recommendations list it attracts more attention than older
articles.</p>
      <p>The post-hoc analysis demonstrated some trends in the types of items
recommended by IBCF, which LtR counteracted. Cold start and data sparsity are
well-known challenges in collaborative filtering. Even among warm items with
usage data, those with fewer users are expected to have less accurate
recommendations. We found that the less popular the query article, the more the CF
system tended to promote old or unpopular articles as recommendations. Based
on which articles users clicked, the LtR system learned to promote articles that
were published more recently or had more activity in the past year.</p>
      <p>Based on these results, we should be able to improve the model by
including as a feature the popularity of the query article (not just the recommended
article). That would allow the tree-ensemble model to learn different relevance
functions depending on how warm the source article is, for example giving less
importance to CF score for items with sparser usage data. In theory, the model
could also counteract the unpopularity-bias for colder articles without adding a
popularity bias to the warmer articles’ recommendations.</p>
    </sec>
    <sec id="sec-7">
      <title>Conclusion</title>
      <p>In this work we have demonstrated the benefits of LtR for improving upon
a production CF system, which is especially important in a domain such as
research articles, where there are vast catalogues of recommendable items, each
of which has high-quality structured metadata. We showed how changes to the
production system were ultimately decided on the basis of online evaluation, but
given the number of models and parameter choices, we also needed an offline
evaluation that was a good proxy for online performance.</p>
      <p>The winning LambdaMART model combines co-usage, semantic similarity,
shared citations, popularity, and recency. The original IBCF solution tended to
promote older articles with lower traffic, but by learning from subjective user
interactions with the recommender system, our relevance model reversed those
trends and therefore overcame some of the limitations of CF.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>A.</given-names>
            <surname>Schuth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hofmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Whiteson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>de Rijke</surname>
          </string-name>
          .
          <article-title>Lerot: an online learning to rank framework</article-title>
          .
          <source>In Living Labs for Information Retrieval Evaluation workshop at CIKM '13</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Himan</given-names>
            <surname>Abdollahpouri</surname>
          </string-name>
          , Robin Burke, and
          <string-name>
            <given-names>Bamshad</given-names>
            <surname>Mobasher</surname>
          </string-name>
          .
          <article-title>Controlling popularity bias in learning-to-rank recommendation</article-title>
          .
          <source>In Proceedings of the Eleventh ACM Conference on Recommender Systems, RecSys '17</source>
          , pages
          <fpage>42</fpage>
          -
          <lpage>46</lpage>
          , New York, NY, USA,
          <year>2017</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Joeran</given-names>
            <surname>Beel</surname>
          </string-name>
          , Bela Gipp, Stefan Langer, and Corinna Breitinger.
          <article-title>Research-paper recommender systems: a literature survey</article-title>
          .
          <source>International Journal on Digital Libraries</source>
          ,
          <volume>17</volume>
          (
          <issue>4</issue>
          ):
          <fpage>305</fpage>
          -
          <lpage>338</lpage>
          ,
          <year>November 2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Christian</given-names>
            <surname>Borgs</surname>
          </string-name>
          , Jennifer Chayes, Brian Karrer, Brendan Meeder,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ravi</surname>
          </string-name>
          , Ray Reagans, and
          <string-name>
            <given-names>Amin</given-names>
            <surname>Sayedi</surname>
          </string-name>
          .
          <article-title>Game-theoretic models of information overload in social networks</article-title>
          .
          In Ravi Kumar and Dandapani Sivakumar, editors,
          <source>Algorithms and Models for the Web-Graph</source>
          , pages
          <fpage>146</fpage>
          -
          <lpage>161</lpage>
          , Berlin, Heidelberg,
          <year>2010</year>
          . Springer Berlin Heidelberg.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Chris</given-names>
            <surname>Burges</surname>
          </string-name>
          , Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and
          <string-name>
            <given-names>Greg</given-names>
            <surname>Hullender</surname>
          </string-name>
          .
          <article-title>Learning to rank using gradient descent</article-title>
          .
          <source>In Proceedings of the 22nd International Conference on Machine Learning, ICML '05</source>
          , pages
          <fpage>89</fpage>
          -
          <lpage>96</lpage>
          , New York, NY, USA,
          <year>2005</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Christopher J. C.</given-names>
            <surname>Burges</surname>
          </string-name>
          .
          <article-title>From RankNet to LambdaRank to LambdaMART: An overview</article-title>
          .
          <source>Technical Report MSR-TR-2010-82, Microsoft Research</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Fidel</given-names>
            <surname>Cacheda</surname>
          </string-name>
          , Víctor Carneiro, Diego Fernández, and
          <string-name>
            <given-names>Vreixo</given-names>
            <surname>Formoso</surname>
          </string-name>
          .
          <article-title>Comparison of collaborative filtering algorithms: Limitations of current techniques and proposals for scalable, high-performance recommender systems</article-title>
          .
          <source>ACM Trans. Web</source>
          ,
          <volume>5</volume>
          (
          <issue>1</issue>
          ):
          <fpage>2:1</fpage>
          -
          <lpage>2:33</lpage>
          ,
          <year>February 2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8. Van Dang.
          <source>RankLib</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Ted</given-names>
            <surname>Dunning</surname>
          </string-name>
          and
          <string-name>
            <given-names>Ellen</given-names>
            <surname>Friedman</surname>
          </string-name>
          .
          <source>Practical Machine Learning: Innovations in Recommendation</source>
          . O'Reilly Media, Inc.,
          <year>August 2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>Felice</given-names>
            <surname>Ferrara</surname>
          </string-name>
          , Nirmala Pudota, and
          <string-name>
            <given-names>Carlo</given-names>
            <surname>Tasso</surname>
          </string-name>
          .
          <article-title>A keyphrase-based paper recommender system</article-title>
          . In Maristella Agosti, Floriana Esposito, Carlo Meghini, and Nicola Orio, editors,
          <source>Digital Libraries and Archives</source>
          , pages
          <fpage>14</fpage>
          -
          <lpage>25</lpage>
          , Berlin, Heidelberg,
          <year>2011</year>
          . Springer Berlin Heidelberg.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>Balázs</given-names>
            <surname>Hidasi</surname>
          </string-name>
          , Alexandros Karatzoglou, Linas Baltrunas, and
          <string-name>
            <given-names>Domonkos</given-names>
            <surname>Tikk</surname>
          </string-name>
          .
          <article-title>Session-based recommendations with recurrent neural networks</article-title>
          .
          <source>CoRR</source>
          , abs/1511.06939
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>Liangjie</given-names>
            <surname>Hong</surname>
          </string-name>
          , Ron Bekkerman, Joseph Adler, and
          <string-name>
            <given-names>Brian D.</given-names>
            <surname>Davison</surname>
          </string-name>
          .
          <article-title>Learning to rank social update streams</article-title>
          .
          <source>In Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '12</source>
          , pages
          <fpage>651</fpage>
          -
          <lpage>660</lpage>
          , New York, NY, USA,
          <year>2012</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>Maya</given-names>
            <surname>Hristakeva</surname>
          </string-name>
          , Daniel Kershaw, Marco Rossetti, Petr Knoth, Benjamin Pettit, Saúl Vargas, and
          <string-name>
            <given-names>Kris</given-names>
            <surname>Jack</surname>
          </string-name>
          .
          <article-title>Building recommender systems for scholarly information</article-title>
          .
          <source>In Proceedings of the 1st Workshop on Scholarly Web Mining, SWM '17</source>
          , pages
          <fpage>25</fpage>
          -
          <lpage>32</lpage>
          , New York, NY, USA,
          <year>2017</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>Thorsten</given-names>
            <surname>Joachims</surname>
          </string-name>
          .
          <article-title>Optimizing search engines using clickthrough data</article-title>
          .
          <source>In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '02</source>
          , pages
          <fpage>133</fpage>
          -
          <lpage>142</lpage>
          , New York, NY, USA,
          <year>2002</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>Shubhra Kanti</given-names>
            <surname>Karmaker Santu</surname>
          </string-name>
          , Parikshit Sondhi, and ChengXiang Zhai.
          <article-title>On application of learning to rank for e-commerce search</article-title>
          .
          <source>In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '17</source>
          , pages
          <fpage>475</fpage>
          -
          <lpage>484</lpage>
          , New York, NY, USA,
          <year>2017</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>Jimmy</given-names>
            <surname>Lin</surname>
          </string-name>
          and
          <string-name>
            <given-names>W. John</given-names>
            <surname>Wilbur</surname>
          </string-name>
          .
          <article-title>PubMed related articles: a probabilistic topic-based model for content similarity</article-title>
          .
          <source>BMC Bioinformatics</source>
          ,
          <volume>8</volume>
          (
          <issue>1</issue>
          ):
          <fpage>423</fpage>
          ,
          <year>October 2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <given-names>Greg</given-names>
            <surname>Linden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Brent</given-names>
            <surname>Smith</surname>
          </string-name>
          , and Jeremy York.
          <article-title>Amazon.com recommendations: Item-to-item collaborative filtering</article-title>
          .
          <source>IEEE Internet Computing</source>
          ,
          <volume>7</volume>
          (
          <issue>1</issue>
          ):
          <fpage>76</fpage>
          -
          <lpage>80</lpage>
          ,
          <year>January 2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <given-names>Cristiano</given-names>
            <surname>Nascimento</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Alberto H. F.</given-names>
            <surname>Laender</surname>
          </string-name>
          , Altigran S. da Silva, and Marcos André Gonçalves.
          <article-title>A source independent framework for research paper recommendation</article-title>
          .
          <source>In Proceedings of the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries, JCDL '11</source>
          , pages
          <fpage>297</fpage>
          -
          <lpage>306</lpage>
          , New York, NY, USA,
          <year>2011</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <given-names>Chong</given-names>
            <surname>Wang</surname>
          </string-name>
          and
          <string-name>
            <given-names>David M.</given-names>
            <surname>Blei</surname>
          </string-name>
          .
          <article-title>Collaborative topic modeling for recommending scientific articles</article-title>
          .
          <source>In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '11</source>
          , pages
          <fpage>448</fpage>
          -
          <lpage>456</lpage>
          , New York, NY, USA,
          <year>2011</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <given-names>Ian</given-names>
            <surname>Wesley-Smith</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jevin D.</given-names>
            <surname>West</surname>
          </string-name>
          .
          <article-title>Babel: A platform for facilitating research in scholarly article discovery</article-title>
          .
          <source>In Proceedings of the 25th International Conference Companion on World Wide Web, WWW '16 Companion</source>
          , pages
          <fpage>389</fpage>
          -
          <lpage>394</lpage>
          , Republic and Canton of Geneva, Switzerland,
          <year>2016</year>
          . International World Wide Web Conferences Steering Committee.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <given-names>J. D.</given-names>
            <surname>West</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Wesley-Smith</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C. T.</given-names>
            <surname>Bergstrom</surname>
          </string-name>
          .
          <article-title>A recommendation system based on hierarchical clustering of an article-level citation network</article-title>
          .
          <source>IEEE Transactions on Big Data</source>
          ,
          <volume>2</volume>
          (
          <issue>2</issue>
          ):
          <fpage>113</fpage>
          -
          <lpage>123</lpage>
          ,
          <year>June 2016</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>