Scalable Recommendation of Wikipedia Articles to Editors
                    Using Representation Learning
             Oleksii Moskalenko                                             Denis Parra                                  Diego Saez-Trumper
         Ukrainian Catholic University                        Pontificia Universidad Catolica de                          Wikimedia Foundation
                 Lviv, Ukraine                                           Chile & IMFD                                      San Francisco, USA
                                                                        Santiago, Chile

ABSTRACT                                                                                 monthly just in the English Wikipedia) and articles [12] ii) having
Wikipedia is edited by volunteer editors around the world. Consid-                       good coverage of articles beyond just the most popular articles, and,
ering the large amount of existing content (e.g. over 5M articles in                     iii) being able to provide good recommendations for newcomers,
English Wikipedia), deciding what to edit next can be difficult, both                    facing the classical user cold-start problem [30].
for experienced users that usually have a huge backlog of articles to                        To address these problems, we have created an efficient and
prioritize, as well as for newcomers who that might need guidance                        scalable implementation of a state-of-art convolutional graph em-
in selecting the next article to contribute. Therefore, helping edi-                     bedding algorithm [14] that is able to deal with the large Wikipedia
tors to find relevant articles should improve their performance and                      article graph. We combine this with a document embedding model
help in the retention of new editors. In this paper, we address the                      that allows us to learn representations for articles and editors, and
problem of recommending relevant articles to editors. To do this,                        does not require retraining when new users are added in the system.
we develop a scalable system on top of Graph Convolutional Net-                          With only a few edited articles then, the system is able to produce
works and Doc2Vec, learning how to represent Wikipedia articles                          personalized recommendations, similar to Youtube deep recom-
and deliver personalized recommendations for editors. We test our                        mendation model [7]. We test our algorithm in English Wikipedia
model on editors’ histories, predicting their most recent edits based                    (the largest one with almost 6 million articlesshowing that we can
on their prior edits. We outperform competitive implicit-feedback                        overcame well-established content-based filtering methods as well
collaborative-filtering methods such as WMRF based on ALS, as                            as collaborative filtering approaches. Moreover, we evaluate our
well as a traditional IR-method such as content-based filtering based                    recommendation measuring the top-100 items, to support a robust
on BM25. All of the data used on this paper is publicly available,                       evaluation against popularity bias [31].
including graph embeddings for Wikipedia articles, and we release                            In summary, the main contributions of this paper are: (i) Intro-
our code to support replication of our experiments. Moreover, we                         duce a model which learns representations (graph and content-
contribute with a scalable implementation of a state-of-art graph                        based) of Wikipedia articles and makes personalized recommenda-
embedding algorithm as current ones cannot efficiently handle the                        tions to editors; (ii) Evaluate it with a large corpus, comparing with
sheer size of the Wikipedia graph.                                                       competitive baselines (iii) and release a scalable implementation
                                                                                         of GraphSage, a state-of-art graph embedding system, that in pre-
KEYWORDS                                                                                 vious implementations was unable to deal with the large graph of
                                                                                         Wikipedia page2 .
Wikipedia, RecSys, Graph Convolutional Neural Network, Repre-
sentation Learning
                                                                                         2    RELATED WORK
1    INTRODUCTION                                                                        There are several projects trying to solve the task of recommending
Wikipedia is edited by hundreds of thousands of volunteers around                        items to users at real-world scales of millions of users and mil-
the world. While the level of expertise, motivations, and time ded-                      lions of items. For instance, Ying et al. for Pinterest [34] created
icated to that task varies among users, most of them experience                          an extension of GraphSAGE [14], a type of Graph Convolutional
challenges in deciding which articles to edit next. For example,                         Network [18] (GCN); researchers at YouTube [7] built a system
many experienced users have huge backlogs1 of work with a large                          based on regular deep neural networks that jointly learns users’
number of articles to improve or review. Prioritizing the articles in                    and items’ representations from users’ previous history of views.
this backlog, as via a personalized article ranking system, would                        However, in both examples, the model learns in a supervised setup,
potentially be of great help for these editors. On the other hand,                       whereas we lack a sufficiently comprehensive dataset of previous
newcomers might experience difficulties deciding what do to af-                          interactions because 94% of Wikipedia contributors are associated
ter their first contribution, as evidenced by the many efforts to                        with less that 10 interactions in last 3 years [12]. eBay’s recommen-
understand how to improve the retention of newcomers [5, 22].                            dation system covers a similar gap by using TF-IDF for similar item
   While previous work on recommendations in Wikipedia has                               search, which does not require training [4].
focused on finding articles for translation [32] or content to be                           On the document representation task, we can highlight several
added to existing articles [26], there are still important unsolved                      approaches: Doc2Vec[19] is method for obtaining content-based
problems: i) creating a scalable recommender system that can deal
efficiently with the large number of Wikipedia editors (over 400K
1 https://en.wikipedia.org/wiki/Wikipedia:Backlog                                        2 https://github.com/digitalTranshumant/WikiRecNet-ComplexRec2020

Copyright (c) 2020 for this paper by its authors. Use permitted under Creative Commons
License Attribution 4.0 International (CC BY 4.0).
                                                                                  Oleksii Moskalenko, Denis Parra, and Diego Saez-Trumper


Figure 1: Flow of Candidate Generation for Wikipedia articles recommendation: Doc2Vec embeddings are trained on Wiki
Text Corpus and then passed as input into GraphSAGE model. Received articles’ representations are then used in Nearest
Neighbors Search to produce candidates


representations of paragraph or longer text in vector space. How-         by the articles that they edited, then we generate a list of candidates
ever, one main advantage of our dataset is the availability of struc-     from the article pool by comparing that user representation with
tural knowledge [6] - i.e. links among articles that could potentially    the article representations. Next, we sort the article candidates
tell more about the article beyond its content. Those links can be        accordingly to the user preferences and generate a list of top-n best
represented as a graph, where nodes are articles and edges are links      candidates recommendation.
between them. Thus, the task of learning document representa-
tions can be transformed into learning the representation of a node       3.1     Article and user representation
in the graph. Node2vec [13] is a recent approach to learn such a          The primary challenge for our system is producing good user and
representation. However, its scalability is still limited [35] and the    article representations. It is an especially big problem for user rep-
main drawback for our use case is the necessity of full retraining        resentation since most of Wikipedia contributors do not fill any
after changes in the structure of the graph. Node2vec also omit the       additional information about themselves except their login creden-
content part of articles (node features), which is a huge part of our     tials 3 , and around 28% of all revisions in our English Wikipedia
dataset.                                                                  dataset, are done by anonymous users [12]. The only useful infor-
    GCN [10, 18] are a recent approach to solve many machine              mation that could uniquely characterize the user is the history of
learning tasks like classification or clustering of a graph’s nodes via   his editions. Hence, most of our efforts were dedicated to learning
a message-passing architecture that uses shared filters on each pass.     articles’ representations, and then representing the user based on
It combines initial node features and structural knowledge to learn       the articles edited. One effective approaches to construct good user
comprehensive representations of the nodes. However, the original         and item representations is to learn them with recommendation su-
GCN architecture is still not applicable to large-scale graphs because    pervision [7, 34]. However, it is not possible to follow this approach
it implies operations with a full adjacency matrix of the graph. To       due to the lack of the required comprehensive-enough dataset of
tackle these limitations, GraphSAGE model was introduced [14] in          previous interactions. History of users editions in Wikipedia is far
a way that only some fixed-sized sample of neighbors is utilized on       from exhaustive (88% of users of English Wikipedia have done less
the convolutional Layer. Because of the fixed-size samples, we also       than 5 major editions [12]) and too sparse, in a way that it is hard
have fixed-sized weights that are generalized and could be applied        to model user’s area of interest. Therefore, the additional challenge
to a new, unknown part of the graph or even completely different          is to conduct representation learning [2] in an unsupervised way
graph. Thus, with inductive learning, we can train the model on a         in relation to our final task.
sub-graph, which means less computation resources are required,
and evaluate generalization on the full graph.
                                                                          3.2     Candidate Generation
                                                                          Similar to YouTube’s deep learning recommender [7], WikiRec-
3   WIKIRECNET DESCRIPTION                                                Net first generates candidates for a final personalized ranking in a
Here we introduce WikiRecNet, a scalable system for providing             second stage. To generate candidates we first calculate representa-
personalized article recommendations in Wikipedia, built on top           tion vectors (content-based and graph-based) for all articles in our
of GCN and Doc2Vec. The design of our solution is inspired by a
classic Information Retrieval architecture. First we represent users      3 https://en.wikipedia.org/wiki/Wikipedia:Wikipedia_is_anonymous
Scalable Recommendation of Wikipedia Articles to Editors


                                                                      Nearest Neighbors search with user representation as a query in
                                                                      the articles’ representation database, a procedure we call Candidate
                                                                      Generation and which is conducted online, as shown in Figure 2.

                                                                      Content-based articles representation: Doc2Vec. For learning
                                                                      the content-based article representation, text features are needed to
                                                                      be extracted first. This can be conducted with traditional document
                                                                      vector space model [29] or by using word embeddings such as
                                                                      Word2vec [21] and GloVe [25] and performing an additional step of
                                                                      aggregation. Another option is using directly a full text embedding
                                                                      model and with that goal we use Doc2Vec [19]. There are two
                                                                      distinct approaches for learning document embedding with this
                                                                      model. One is Paragraph Vector Distributed Bag-of-Words model
                                                                      (PV-DBOW) model, which is based on word2vec’s Continuous
                                                                      Bag-of-Words approach [21] but instead of word input it accepts
                                                                      paragraph vector and predicts context words for this paragraph.
                                                                      In the second approach, Distributed Memory (PV-DM), which is
                                                                      based on word2vec skip-gram model, the model predicts middle
                                                                      word based on context and paragraph vector given as input. Later
                                                                      on this paper (Section 5) we show that PV-DBOW is the best fit
                                                                      for our task. We train Doc2Vec-DBOW model on the Corpus of all
                                                                      Wikipedia articles in a given language. Output vectors of Doc2Vec
                                                                      are being passed as input features to the GNC.

                                                                      Graph-based article representation: GraphSAGE. GraphSAGE
Figure 2: Candidate Ranking: user history along with can-             has been used as GCN due to its ability to learn with an inductive
didate are passed through articles’ representation data-              approach and construct embeddings for unseen nodes. During the
base (Embedding Layer) and then through several fully-                pre-processing of the input dataset –snapshot of Wikipedia Dataset–
connected layers to train in the log-regression setup.                we create a graph 𝐺 (𝑉 , 𝐸) where 𝑉 denotes the set of articles, and
                                                                      𝐸 the set of links between them. GraphSAGE utilizes structural
Table 1: Performance of different algorithms for K-NN                 knowledge from graph 𝐺 and produces new vectors that preserve
search. All tests were conducted with English Wikipedia ar-           both text and structural representations. Due to the inductive nature
ticles (|𝑉 | = 5, 251, 875). Setup is measured in seconds. Secs.      of GraphSAGE architecture, we do not need to retrain the model
req. means seconds per request.                                       every time after adding a new article into the database, this is very
                                                                      important for applying WikiRecNet in real scenarios, where new
         Algorithm    Setup Secs./req. Recall MRR                     articles are constantly added [12].
         Exact search 3.91    0.81     0.224 0.0220                      After producing the document vector and updating the Graph 𝐺
       . IVF          207.02  0.07     0.206 0.0212                   structure, we can run GraphSAGE model as is, with already trained
         HNSW         232.68  0.04     0.224 0.0220                   weights. GCN is a multi-layer network, where each layer can be
         LSH          472.31  0.15     0.215 0.0219                   formulated as:
                                                                                                         1        1
                                                                                       𝐻 (𝑙+1) = 𝜎 (𝐷˜ − 2 𝐴˜𝐷˜ − 2 𝐻 (𝑙) 𝑊 (𝑙+1) )
      Table 2: Specifications of built Wikipedia Graph                where 𝐴˜ = 𝐴 + 𝐼 is the adjacency matrix with self-connections
                                                                      (𝐼 ), 𝐷˜ = 𝑗 𝐴˜𝑖 𝑗 , 𝑊 are trainable weights and 𝐻 is the output of
                                                                                Í

         Specification               English Wikipedia                previous layer or 𝐻 (0) = 𝑋 is input, 𝑋 represents node features.
         Amount of vertices (|𝑉 |)       5,251,875                    An intuitive explanation of this process is that each node collects
         Amount of Edges (|𝐸|)          458,867,626                   features of its neighbors that were propagated through trainable
         Average Degree (𝑑𝑎𝑙𝑙 )             174                       filters (convolutions) so called, message passing. On each step node
         Median Degree (𝑑g
                         𝑎𝑙𝑙 )               60                       collects knowledge of its neighborhood and propagates its state
         Approx. Diameter (D)                23                       further on the next step. Thus, properties of 1st, 2nd, ..., nth proxim-
         Amount of labeled nodes         4,652,604                    ity are being incorporated into node’s state along with preserving
                                                                      original features of node’s community.
                                                                      Optimizing candidate retrieval. In serving time, recommenda-
dataset, a process conducted off-line which is presented as Repre-    tion candidates will be produced by applying K-Nearest Neighbors
sentation Learning in Figure 1. Then, for every user we define her    (K-NN) search to find the most similar articles to the user represen-
representation as an aggregation of representation vectors of cor-    tation vector in the pre-computed database of all articles’ represen-
responding articles that were edited by this user. Next, we conduct   tations. K-NN search is one of the main parts of candidate article
                                                                                    Oleksii Moskalenko, Denis Parra, and Diego Saez-Trumper


generation, since its performance in terms of time and resource con-       parsed to organize this data. During pre-processing stage, all links
sumption is very critical for online recommendation in a high-load         to redirect pages5 were replaced by their actual destinations. "Cate-
system. We conducted experiments with different optimizations for          gory pages", that consists only of links to other pages and do not
K-NN candidate search using FAISS library [16] : Locality-Sensitive        have their own content, were detected and filtered out. We used
Hashing (LSH), Inverted file with exact post-verification (IVF), Hi-       Apache Spark and GraphX for parallel parsing of SQL dumps and
erarchical Navigable Small World graph exploration (HNSW). Our             discovering and cleaning Article Graph respectively. The output
tests showed that HNSW gives the best speed along with exactly             Graph was converted into binary format with graph-tool [24] to
the same recall and MRR as exact search, so with no trade-off in           achieve fast processing (see Table 2). For extracting the articles’
performance we achieved 20x times improvement in speed. Results            texts we took the latest revision6 per each article from XML dump.
of these experiments are shown in Table 1.                                 We used Gensim [27] to tokenize and lemmatize text and prepare
                                                                           for the Doc2vec training.
3.3     Ranking of Candidate Articles                                          To the facilitate the evaluation in an end-to-end fashion we
After learning content and graph-based representations for Wikipedia       reorganized the data into a revisions-per-user dataset. Only revisions
articles, in the second part of our system we are trying to model          that were created after Jan. 2015 were kept in this dataset, so our
user preferences based on the previous edit history of Wikipedia           recommendations that are based on the latest snapshot of article
contributors. With given previous editions and articles, we pro-           graph (Oct. 2018) will not recommend too many articles that did
duce a relevant a list of candidates, ranked by its relevance for a        not exists on that moment.
given user. Our model is trained on binary labels - relevant / not             We found that 88% of contributors are not regular users, since
relevant (logistic regression), as shown in Figure 1, but on serving       they edited fewer than 5 different articles for selected dates. We
time it will produce probabilities of user interest, which are used        also calculated diversity [3] of users’ contribution based on vector
as a preference ranking score.                                             representations obtained from Doc2vec. The set of contributors that
   This approach is inspired by Pointwise ranking [20] and is im-          fits to our needs has mostly edited from 5 to 40 different articles,
plemented in many similar recommender systems: YouTube[7],                 though diversity of those articles is rather high. That is the main
eBay[4]. The model is shown on Figure 2 and consists of several            cause our representations cannot be trained against this data like it
fully-connected layers with Batch Normalization and ReLU acti-             was done in previous work [7]. Unlike the work by Covington et
vation after each layer except for the last layer, where sigmoid           al. [7], our training dataset is small (around 60K users) and users’
activation is used. The final model’s architecture was selected as         areas of interest are very diverse.
follows: 1024 ReLU -> 512 ReLU -> 256 ReLU. As input model ac-
cept a concatenated vector of user and candidate representations.          4.2      Training
   Preference score We define our preference ranking score as              For all training experiments with GraphSAGE we generated a sam-
the probability that a user 𝑢 finds a wikipedia article 𝑎𝑖 relevant:       pled adjacency matrix based on Articles’ graph 𝐺. The width of
                                                         1                 the adjacency matrix was determined by our experiments, based
          𝑠𝑐𝑜𝑟𝑒 (𝑢, 𝑎𝑖 ) = 𝑃 (𝑎𝑖 = 𝑟𝑒𝑙𝑒𝑣𝑎𝑛𝑡 |𝑢) =                 (1)
                                                  1 + 𝑒 −Φ(𝑎𝑖 ,𝑢)          available memory resources as well as on graph statistics (Table 2).
where 𝑢 represents the user, 𝑎𝑖 a candidate wikipedia article, 𝐴 is        We selected 128 as maximum amount of neighbors in this matrix. If
the set of all articles to be ranked, and Φ(·) a weighted sum of the       a node had more than that, then we used random subsampling. Our
values in the last network layer of the Candidate ranking neural           GCN architecture consists of two convolutional layers. On each
network, shown in Figure 2. We train the model with a traditional          convolutional step we picked a random sample of 25 neighbors
loss for a binary logistic regression.                                     from this adjacency matrix. This 25-neighbors sample is being re-
                                                                           sampled on each new batch. For better generalization we used a
4     EXPERIMENTS                                                          batch size of 512, since experiments with dropout between convo-
                                                                           lutional layers led to no improvement in generalization. We set the
We worked with the English version of Wikipedia not only because
                                                                           size for all output vectors to 512 considering a balance between
is the most popular one. In addition, is the most challenging in
                                                                           better resolution and available memory.
terms of data processing, and if our system is able to deal with the
                                                                              Document representations from Doc2Vec-DBOW, trained with
largest Wikipedia it would be easy to apply later in smaller projects.
                                                                           vector size 300 and window size 8, were passed as initial node states
                                                                           and graph edges played the role of labels when the model was
4.1     Dataset
                                                                           trying to predict those edges. We utilized max-margin loss [34],
All data used has been downloaded from the official Wikimedia              as target for training GraphSAGE in link-prediction setup. Model
Dumps [11] which are snapshots of the full Wikipedia. Some of the          parameters were tuned with the Adam optimizer[17].
objects in the dump (like articles’ links) are stored in SQL format,
                                                                                      𝐽 (𝑧𝑢 𝑧𝑖 ) = 𝐸 𝑣𝑛 ∼𝑃𝑛 (𝑢) 𝑚𝑎𝑥 {0, 𝑧𝑢 · 𝑧 𝑣𝑢 − 𝑧𝑢 · 𝑧𝑖 + Δ}                     (2)
others, with deeper structure (like articles’ text) are stored in XML.
   First of all, for representation learning we built a graph 𝐺 (𝑉 , 𝐸),     For training the ranking model, a dataset from users’ history
where the set of nodes 𝑉 is the set of all Wikipedia pages belonging       was constructed. As input this model takes 5 articles edited by a
to article namespace4 and 𝐸 is the set of directed links between           user ( representing users’ preferences) and 1 candidate that might
them. The SQL dumps of page, pagelinks, and redirects tables were
                                                                           5 https://en.wikipedia.org/wiki/Wikipedia:Redirect
4 https://en.wikipedia.org/wiki/Wikipedia:Namespace                        6 Article’s revision is a specific version of article’s content after each modification
Scalable Recommendation of Wikipedia Articles to Editors


interest the user. The model tries to predict the probability of rele-     model (merge, mean-pool, max-pool), as well as the method for
vance of this candidate to the current user. Those 6 input articles        ranking (cosine similarity and Deep-Rank).
are passed through an Embedding Layer populated with represen-                 The results in Table 3 show that WikiRecNet, using merge aggre-
tations received from GraphSAGE and then concatenated into one             gation and Deep-Rank ranking, outperforms the other methods in
vector. We chose positive candidates from actual user history and          all metrics. We highlight the following the aspects in the evaluation:
generated negative candidates with kNN search on constructed
articles’ representations. Logistic (binary cross-entropy) regression           • ALS implicit feedback collaborative filtering performs the
with class-weights (due to high class imbalance) was used as loss                 worst among all methods. This result must be due to the
function.                                                                         extreme high sparsity of the dataset.
                                                                                • BM25, despite being a simple and traditional content-based
                                                                                  filtering method, performs well and remains very competi-
4.3    Evaluation                                                                 tive.
To prepare the evaluation dataset, we subsampled windows of size                • The simple K-NN based on Doc2Vec representation per-
10 from user’s history (from users that were not previously used                  forms better than ALS, and mean-pool reports better results
for training or testing the Deep Ranking model). Our assumption if                than merge but only at higher ranking positions (MAP@50,
that the first 5 articles denoted users’ area of interest. To compute a           nDCG@50, Recall@50).
single user vector we took element-wise average of representations              • Among the WikiRecNet variations, the max-pool aggrega-
from the first 5 articles (GraphSAGE representations). We were                    tion seems to be the least helpful. In terms of nDCG@50 and
trying to predict the rest 5. Algorithm can be expressed as follows: (i)          nDCG@100 (the metric most robust to popularity bias [31]),
take first 5 articles per user. Calculate average of their embeddings             merge aggregation seems more effective than mean-pool,
vectors, output this as the user vector representation, (ii) generate             and then the combination with DeepRank produce the best
candidates by nearest neighbors search of user representation, (iii)              performance, with a 100% increase compared to the Doc2vec
sort candidates according to ranking algorithm and select the top                 mean-pool reference method.
𝐾. In our evaluation we compare two ranking techniques: sort by
cosine similarity, and sort by probability from Deep Ranking model,        6    CONCLUSION
and (iv) compare Top-K recommendations with the 5 articles in the          In this article we have introduced WikiRecNet, a neural-based model
test set (from second half of the sampled window).                         which aims at recommending Wikipedia articles to editors, in order
    To measure the results we used several metrics: mean average           to help them dealing with the sheer volume of potential articles that
precision (MAP), normalized discounted cumulative gain (nDCG)[1]           might need their attention. Our approach uses representation learn-
and Recall@k [8]. We calculate these metrics at high k values,             ing, i.e., finding alternative ways to represent the Wikipedia articles
𝑘 = 50 an 𝑘 = 100. Unlike traditional research on top-k recommen-          in order to produce a useful recommendation without requiring
dation systems usually focusing on small k values (k=10,20,30), we         more information than the previous articles edited by targeted
are specially interested in preventing popularity bias, i.e., having       users. For this purpose, we used Doc2Vec [19] for a content-based
WikiRecNet biased to recommend mostly popular items. Valcarce et           representation and GraphSage [14], a graph convolutional network,
al. [31] showed recently that usual top-k ranking metrics measured         for a graph-based representation.
at higher values of k (50, 100) are specially robust to popularity            WikiRecNet architecture is composed of two networks, a candi-
bias, and that is why we use them here.                                    date generation network and a ranking network, and our imple-
    .                                                                      mentation is able to deal with larges volumes of data, improving
                                                                           existing implementations that were not capable to work in such
                                                                           scenarios. Also, our approach does not need to be retrained when
5     RESULTS
                                                                           new items are added, facilitating its application in dynamic environ-
Results of the evaluation are presented in Table 3. We first describe      ments such as Wikipedia. To best of our knowledge, this is the first
the competing methods:                                                     recommender system especially designed for Wikipedia editors
    Baselines. We used two well established methods. The first one         that takes in account such applications constrains, and therefore,
is BM25[28], a probabilistic method used on information retrieval          can be applied in real world scenarios.
but also applied for content-based filtering in the area of recommen-         In order to contribute to the community, we provide our code
dation [23]. A second baseline is implicit feedback collaborative          and the graph embedding of each Wikipedia page used in this
filtering optimized with Alternative Least Squares (ALS) [15].             experiment7 available in a public repository, as well as a working
    K-NN recommender. In addition, we implemented a simple                 demo that can be tested by the Wikipedia editors community8 . With
K-NN recommender where the Wikipedia articles are represented              respect to text embeddings, there have been important progresses
by the Doc2vec vector embeddings. Each user 𝑢 is represented by            in the latest years, so another idea for future work will be testing
the articles she has edited, and we test two forms of aggregation to       models like BERT [9] or XLNet [33].
represent the user model: merging the user-edited articles (merge)
and calculating the mean at each dimension of the document (mean-
pool). We rank recommended articles by cosine similarity.
    Aggregations. Finally,WikiRecNet is presented in 5 versions by         7 Embeddings in other languages would be also available under request.

varying the type of aggregation of articles to represent the user          8 https://github.com/digitalTranshumant/WikiRecNet-ComplexRec2020
                                                                                                    Oleksii Moskalenko, Denis Parra, and Diego Saez-Trumper

Table 3: Offline evaluation of generated recommendations on the task of predicting next 5 articles edited by user with per-
centage improvement over content-based model Doc2Vec (mean-pool) with cosine similarity.

                                                       K=50                             K=100
                Model      Aggregate Rank       MAP nDCG Recall MAP                nDCG           Recall
                WikiRecNet mean      cosine    0.0221 0.1361 0.0846 0.0238 (+78%)  0.1468 (+66%)  0.1179 (+99%)
                           mean      deep-rank 0.0228 0.1363 0.0841 0.0243 (+82%)  0.1493 (+70%)  0.1134 (+92%)
                           max       cosine    0.0192 0.1196 0.0672 0.0206 (+54 %) 0.1299 (+47%)  0.0923 (+56%)
                           merge     cosine    0.0208 0.1412 0.0825 0.0227 (+70%)  0.1538 (+75%)  0.1175 (+99%)
                           merge     deep-rank 0.0262 0.1625 0.0935 0.0282 (+111%) 0.1760 (+100%) 0.1302 (+120%)
                Doc2Vec    merge     cosine    0.0085 0.0805 0.0438 0.0092         0.0883         0.0600
                           mean      cosine    0.0126 0.0821 0.0436 0.0133         0.0880         0.0590
                BM25                           0.0251 0.1602 0.0921 0.0273         0.1710         0.1290
                ALS MF                         0.0027 0.0163 0.044 0.0063          0.0204         0.0609


7     ACKNOWLEDGMENTS                                                                       [15] Yifan Hu, Yehuda Koren, and Chris Volinsky. 2008. Collaborative filtering for
                                                                                                 implicit feedback datasets. In In IEEE International Conference on Data Mining
The author Denis Parra has been funded by the Millennium Institute                               (ICDM 2008. 263–272.
for Foundational Research on Data (IMFD) and by the Chilean                                 [16] Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2017. Billion-scale similarity
                                                                                                 search with GPUs. arXiv preprint arXiv:1702.08734 (2017).
research agency ANID, FONDECYT grant 1191791.                                               [17] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochas-
                                                                                                 tic Optimization. In 3rd International Conference on Learning Representations,
REFERENCES                                                                                       ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
                                                                                                 http://arxiv.org/abs/1412.6980
 [1] Linas Baltrunas, Tadas Makcinskas, and Francesco Ricci. 2010. Group Recom-             [18] Thomas N. Kipf and Max Welling. 2016. Semi-Supervised Classification with
     mendations with Rank Aggregation and Collaborative Filtering. In Proceedings                Graph Convolutional Networks. CoRR abs/1609.02907 (2016).
     of the Fourth ACM Conference on Recommender Systems (RecSys ’10). ACM, New             [19] Quoc V. Le and Tomas Mikolov. 2014. Distributed Representations of Sentences
     York, NY, USA, 119–126. https://doi.org/10.1145/1864708.1864733                             and Documents. CoRR abs/1405.4053 (2014). arXiv:1405.4053 http://arxiv.org/
 [2] Yoshua Bengio, Aaron Courville, and Pascal Vincent. 2013. Representation                    abs/1405.4053
     learning: A review and new perspectives. IEEE transactions on pattern analysis         [20] Tie-Yan Liu. 2009. Learning to Rank for Information Retrieval. Found. Trends Inf.
     and machine intelligence 35, 8 (2013), 1798–1828.                                           Retr. 3, 3 (March 2009), 225–331. https://doi.org/10.1561/1500000016
 [3] J Bobadilla, Francisco Serradilla, and J Bernal. 2010. A new collaborative filtering   [21] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient
     metric that improves the behavior of recommender systems. Knowledge-Based                   Estimation of Word Representations in Vector Space. CoRR abs/1301.3781 (2013).
     Systems 23 (08 2010), 520–528. https://doi.org/10.1016/j.knosys.2010.03.009                 arXiv:1301.3781 http://arxiv.org/abs/1301.3781
 [4] Yuri M. Brovman, Marie Jacob, Natraj Srinivasan, Stephen Neola, Daniel Galron,         [22] Jonathan T Morgan, Siko Bouterse, Heather Walls, and Sarah Stierch. 2013. Tea
     Ryan Snyder, and Paul Wang. 2016. Optimizing Similar Item Recommendations                   and sympathy: crafting positive new user experiences on wikipedia. In Pro-
     in a Semi-structured Marketplace to Maximize Conversion. In Proceedings of the              ceedings of the 2013 conference on Computer supported cooperative work. ACM,
     10th ACM Conference on Recommender Systems (RecSys ’16). ACM, New York, NY,                 839–848.
     USA, 199–202. https://doi.org/10.1145/2959100.2959166                                  [23] Denis Parra and Peter Brusilovsky. 2009. Collaborative filtering for social tagging
 [5] Boreum Choi, Kira Alexander, Robert E Kraut, and John M Levine. 2010. So-                   systems: an experiment with CiteULike. In Proceedings of the third ACM conference
     cialization tactics in wikipedia and their effects. In Proceedings of the 2010 ACM          on Recommender systems. ACM, 237–240.
     conference on Computer supported cooperative work. ACM, 107–116.                       [24] Tiago P. Peixoto. 2014. The graph-tool python library. figshare (2014). https:
 [6] Cristian Consonni, David Laniado, and Alberto Montresor. 2019. WikiLinkGraphs:              //doi.org/10.6084/m9.figshare.1164194
     A complete, longitudinal and multi-language dataset of the Wikipedia link net-         [25] Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove:
     works. arXiv preprint arXiv:1902.04298 (2019).                                              Global vectors for word representation. In Proceedings of the 2014 conference on
 [7] Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep Neural Networks                      empirical methods in natural language processing (EMNLP). 1532–1543.
     for YouTube Recommendations. In Proceedings of the 10th ACM Conference on              [26] Tiziano Piccardi, Michele Catasta, Leila Zia, and Robert West. 2018. Struc-
     Recommender Systems (RecSys ’16). ACM, New York, NY, USA, 191–198. https:                   turing Wikipedia Articles with Section Recommendations. arXiv preprint
     //doi.org/10.1145/2959100.2959190                                                           arXiv:1804.05995 (2018).
 [8] Paolo Cremonesi, Yehuda Koren, and Roberto Turrin. 2010. Performance of                [27] Radim Řehůřek and Petr Sojka. 2010. Software Framework for Topic Modelling
     recommender algorithms on top-n recommendation tasks. In Proceedings of the                 with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges
     fourth ACM conference on Recommender systems. ACM, 39–46.                                   for NLP Frameworks. ELRA, Valletta, Malta, 45–50. http://is.muni.cz/publication/
 [9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert:               884893/en.
     Pre-training of deep bidirectional transformers for language understanding. arXiv      [28] Stephen Robertson and Hugo Zaragoza. 2009. The Probabilistic Relevance Frame-
     preprint arXiv:1810.04805 (2018).                                                           work: BM25 and Beyond. Found. Trends Inf. Retr. 3, 4 (April 2009), 333–389.
[10] David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bom-                         https://doi.org/10.1561/1500000019
     barell, Timothy Hirzel, Alan Aspuru-Guzik, and Ryan P Adams. 2015.                     [29] Gerard Salton, Anita Wong, and Chung-Shu Yang. 1975. A vector space model
     Convolutional Networks on Graphs for Learning Molecular Finger-                             for automatic indexing. Commun. ACM 18, 11 (1975), 613–620.
     prints.       In Advances in Neural Information Processing Systems 28,                 [30] Andrew I Schein, Alexandrin Popescul, Lyle H Ungar, and David M Pennock.
     C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett                           2002. Methods and metrics for cold-start recommendations. In Proceedings of the
     (Eds.). Curran Associates, Inc., 2224–2232.            http://papers.nips.cc/paper/         25th annual international ACM SIGIR conference on Research and development in
     5954-convolutional-networks-on-graphs-for-learning-molecular-fingerprints.                  information retrieval. ACM, 253–260.
     pdf                                                                                    [31] Daniel Valcarce, Alejandro Bellogín, Javier Parapar, and Pablo Castells. 2018. On
[11] Wikimedia Foundation. 2018. Wikimedia Downloads. https://dumps.wikimedia.                   the robustness and discriminative power of information retrieval metrics for top-
     org [Online; accessed 14. Oct. 2019].                                                       N recommendation. In Proceedings of the 12th ACM Conference on Recommender
[12] Wikimedia Foundation. 2019. Wikimedia Statistics - All wikis. https://stats.                Systems. ACM, 260–268.
     wikimedia.org/v2/#/all-projects [Online; accessed 13. Oct. 2019].                      [32] Ellery Wulczyn, Robert West, Leila Zia, and Jure Leskovec. 2016. Growing
[13] Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable Feature Learning                  wikipedia across languages via recommendation. In Proceedings of the 25th Inter-
     for Networks. CoRR abs/1607.00653 (2016). http://dblp.uni-trier.de/db/journals/             national Conference on World Wide Web. International World Wide Web Confer-
     corr/corr1607.html#GroverL16                                                                ences Steering Committee, 975–985.
[14] William L. Hamilton, Rex Ying, and Jure Leskovec. 2017. Inductive Representation
     Learning on Large Graphs. In NIPS.
Scalable Recommendation of Wikipedia Articles to Editors


[33] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov,        Recommender Systems. In KDD.
     and Quoc V Le. 2019. XLNet: Generalized Autoregressive Pretraining for Lan-    [35] Dongyan Zhou, Songjie Niu, and Shimin Chen. 2018. Efficient Graph Computation
     guage Understanding. arXiv preprint arXiv:1906.08237 (2019).                        for Node2Vec. CoRR abs/1805.00280 (2018). http://dblp.uni-trier.de/db/journals/
[34] Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L. Hamilton,         corr/corr1805.html#abs-1805-00280
     and Jure Leskovec. 2018. Graph Convolutional Neural Networks for Web-Scale