Scalable Recommendation of Wikipedia Articles to Editors Using Representation Learning Oleksii Moskalenko Denis Parra Diego Saez-Trumper Ukrainian Catholic University Pontificia Universidad Catolica de Wikimedia Foundation Lviv, Ukraine Chile & IMFD San Francisco, USA Santiago, Chile ABSTRACT monthly just in the English Wikipedia) and articles [12] ii) having Wikipedia is edited by volunteer editors around the world. Consid- good coverage of articles beyond just the most popular articles, and, ering the large amount of existing content (e.g. over 5M articles in iii) being able to provide good recommendations for newcomers, English Wikipedia), deciding what to edit next can be difficult, both facing the classical user cold-start problem [30]. for experienced users that usually have a huge backlog of articles to To address these problems, we have created an efficient and prioritize, as well as for newcomers who that might need guidance scalable implementation of a state-of-art convolutional graph em- in selecting the next article to contribute. Therefore, helping edi- bedding algorithm [14] that is able to deal with the large Wikipedia tors to find relevant articles should improve their performance and article graph. We combine this with a document embedding model help in the retention of new editors. In this paper, we address the that allows us to learn representations for articles and editors, and problem of recommending relevant articles to editors. To do this, does not require retraining when new users are added in the system. we develop a scalable system on top of Graph Convolutional Net- With only a few edited articles then, the system is able to produce works and Doc2Vec, learning how to represent Wikipedia articles personalized recommendations, similar to Youtube deep recom- and deliver personalized recommendations for editors. We test our mendation model [7]. We test our algorithm in English Wikipedia model on editors’ histories, predicting their most recent edits based (the largest one with almost 6 million articlesshowing that we can on their prior edits. We outperform competitive implicit-feedback overcame well-established content-based filtering methods as well collaborative-filtering methods such as WMRF based on ALS, as as collaborative filtering approaches. Moreover, we evaluate our well as a traditional IR-method such as content-based filtering based recommendation measuring the top-100 items, to support a robust on BM25. All of the data used on this paper is publicly available, evaluation against popularity bias [31]. including graph embeddings for Wikipedia articles, and we release In summary, the main contributions of this paper are: (i) Intro- our code to support replication of our experiments. Moreover, we duce a model which learns representations (graph and content- contribute with a scalable implementation of a state-of-art graph based) of Wikipedia articles and makes personalized recommenda- embedding algorithm as current ones cannot efficiently handle the tions to editors; (ii) Evaluate it with a large corpus, comparing with sheer size of the Wikipedia graph. competitive baselines (iii) and release a scalable implementation of GraphSage, a state-of-art graph embedding system, that in pre- KEYWORDS vious implementations was unable to deal with the large graph of Wikipedia page2 . Wikipedia, RecSys, Graph Convolutional Neural Network, Repre- sentation Learning 2 RELATED WORK 1 INTRODUCTION There are several projects trying to solve the task of recommending Wikipedia is edited by hundreds of thousands of volunteers around items to users at real-world scales of millions of users and mil- the world. While the level of expertise, motivations, and time ded- lions of items. For instance, Ying et al. for Pinterest [34] created icated to that task varies among users, most of them experience an extension of GraphSAGE [14], a type of Graph Convolutional challenges in deciding which articles to edit next. For example, Network [18] (GCN); researchers at YouTube [7] built a system many experienced users have huge backlogs1 of work with a large based on regular deep neural networks that jointly learns users’ number of articles to improve or review. Prioritizing the articles in and items’ representations from users’ previous history of views. this backlog, as via a personalized article ranking system, would However, in both examples, the model learns in a supervised setup, potentially be of great help for these editors. On the other hand, whereas we lack a sufficiently comprehensive dataset of previous newcomers might experience difficulties deciding what do to af- interactions because 94% of Wikipedia contributors are associated ter their first contribution, as evidenced by the many efforts to with less that 10 interactions in last 3 years [12]. eBay’s recommen- understand how to improve the retention of newcomers [5, 22]. dation system covers a similar gap by using TF-IDF for similar item While previous work on recommendations in Wikipedia has search, which does not require training [4]. focused on finding articles for translation [32] or content to be On the document representation task, we can highlight several added to existing articles [26], there are still important unsolved approaches: Doc2Vec[19] is method for obtaining content-based problems: i) creating a scalable recommender system that can deal efficiently with the large number of Wikipedia editors (over 400K 1 https://en.wikipedia.org/wiki/Wikipedia:Backlog 2 https://github.com/digitalTranshumant/WikiRecNet-ComplexRec2020 Copyright (c) 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). Oleksii Moskalenko, Denis Parra, and Diego Saez-Trumper Figure 1: Flow of Candidate Generation for Wikipedia articles recommendation: Doc2Vec embeddings are trained on Wiki Text Corpus and then passed as input into GraphSAGE model. Received articles’ representations are then used in Nearest Neighbors Search to produce candidates representations of paragraph or longer text in vector space. How- by the articles that they edited, then we generate a list of candidates ever, one main advantage of our dataset is the availability of struc- from the article pool by comparing that user representation with tural knowledge [6] - i.e. links among articles that could potentially the article representations. Next, we sort the article candidates tell more about the article beyond its content. Those links can be accordingly to the user preferences and generate a list of top-n best represented as a graph, where nodes are articles and edges are links candidates recommendation. between them. Thus, the task of learning document representa- tions can be transformed into learning the representation of a node 3.1 Article and user representation in the graph. Node2vec [13] is a recent approach to learn such a The primary challenge for our system is producing good user and representation. However, its scalability is still limited [35] and the article representations. It is an especially big problem for user rep- main drawback for our use case is the necessity of full retraining resentation since most of Wikipedia contributors do not fill any after changes in the structure of the graph. Node2vec also omit the additional information about themselves except their login creden- content part of articles (node features), which is a huge part of our tials 3 , and around 28% of all revisions in our English Wikipedia dataset. dataset, are done by anonymous users [12]. The only useful infor- GCN [10, 18] are a recent approach to solve many machine mation that could uniquely characterize the user is the history of learning tasks like classification or clustering of a graph’s nodes via his editions. Hence, most of our efforts were dedicated to learning a message-passing architecture that uses shared filters on each pass. articles’ representations, and then representing the user based on It combines initial node features and structural knowledge to learn the articles edited. One effective approaches to construct good user comprehensive representations of the nodes. However, the original and item representations is to learn them with recommendation su- GCN architecture is still not applicable to large-scale graphs because pervision [7, 34]. However, it is not possible to follow this approach it implies operations with a full adjacency matrix of the graph. To due to the lack of the required comprehensive-enough dataset of tackle these limitations, GraphSAGE model was introduced [14] in previous interactions. History of users editions in Wikipedia is far a way that only some fixed-sized sample of neighbors is utilized on from exhaustive (88% of users of English Wikipedia have done less the convolutional Layer. Because of the fixed-size samples, we also than 5 major editions [12]) and too sparse, in a way that it is hard have fixed-sized weights that are generalized and could be applied to model user’s area of interest. Therefore, the additional challenge to a new, unknown part of the graph or even completely different is to conduct representation learning [2] in an unsupervised way graph. Thus, with inductive learning, we can train the model on a in relation to our final task. sub-graph, which means less computation resources are required, and evaluate generalization on the full graph. 3.2 Candidate Generation Similar to YouTube’s deep learning recommender [7], WikiRec- 3 WIKIRECNET DESCRIPTION Net first generates candidates for a final personalized ranking in a Here we introduce WikiRecNet, a scalable system for providing second stage. To generate candidates we first calculate representa- personalized article recommendations in Wikipedia, built on top tion vectors (content-based and graph-based) for all articles in our of GCN and Doc2Vec. The design of our solution is inspired by a classic Information Retrieval architecture. First we represent users 3 https://en.wikipedia.org/wiki/Wikipedia:Wikipedia_is_anonymous Scalable Recommendation of Wikipedia Articles to Editors Nearest Neighbors search with user representation as a query in the articles’ representation database, a procedure we call Candidate Generation and which is conducted online, as shown in Figure 2. Content-based articles representation: Doc2Vec. For learning the content-based article representation, text features are needed to be extracted first. This can be conducted with traditional document vector space model [29] or by using word embeddings such as Word2vec [21] and GloVe [25] and performing an additional step of aggregation. Another option is using directly a full text embedding model and with that goal we use Doc2Vec [19]. There are two distinct approaches for learning document embedding with this model. One is Paragraph Vector Distributed Bag-of-Words model (PV-DBOW) model, which is based on word2vec’s Continuous Bag-of-Words approach [21] but instead of word input it accepts paragraph vector and predicts context words for this paragraph. In the second approach, Distributed Memory (PV-DM), which is based on word2vec skip-gram model, the model predicts middle word based on context and paragraph vector given as input. Later on this paper (Section 5) we show that PV-DBOW is the best fit for our task. We train Doc2Vec-DBOW model on the Corpus of all Wikipedia articles in a given language. Output vectors of Doc2Vec are being passed as input features to the GNC. Graph-based article representation: GraphSAGE. GraphSAGE Figure 2: Candidate Ranking: user history along with can- has been used as GCN due to its ability to learn with an inductive didate are passed through articles’ representation data- approach and construct embeddings for unseen nodes. During the base (Embedding Layer) and then through several fully- pre-processing of the input dataset –snapshot of Wikipedia Dataset– connected layers to train in the log-regression setup. we create a graph 𝐺 (𝑉 , 𝐸) where 𝑉 denotes the set of articles, and 𝐸 the set of links between them. GraphSAGE utilizes structural Table 1: Performance of different algorithms for K-NN knowledge from graph 𝐺 and produces new vectors that preserve search. All tests were conducted with English Wikipedia ar- both text and structural representations. Due to the inductive nature ticles (|𝑉 | = 5, 251, 875). Setup is measured in seconds. Secs. of GraphSAGE architecture, we do not need to retrain the model req. means seconds per request. every time after adding a new article into the database, this is very important for applying WikiRecNet in real scenarios, where new Algorithm Setup Secs./req. Recall MRR articles are constantly added [12]. Exact search 3.91 0.81 0.224 0.0220 After producing the document vector and updating the Graph 𝐺 . IVF 207.02 0.07 0.206 0.0212 structure, we can run GraphSAGE model as is, with already trained HNSW 232.68 0.04 0.224 0.0220 weights. GCN is a multi-layer network, where each layer can be LSH 472.31 0.15 0.215 0.0219 formulated as: 1 1 𝐻 (𝑙+1) = 𝜎 (𝐷˜ − 2 𝐴˜𝐷˜ − 2 𝐻 (𝑙) 𝑊 (𝑙+1) ) Table 2: Specifications of built Wikipedia Graph where 𝐴˜ = 𝐴 + 𝐼 is the adjacency matrix with self-connections (𝐼 ), 𝐷˜ = 𝑗 𝐴˜𝑖 𝑗 , 𝑊 are trainable weights and 𝐻 is the output of Í Specification English Wikipedia previous layer or 𝐻 (0) = 𝑋 is input, 𝑋 represents node features. Amount of vertices (|𝑉 |) 5,251,875 An intuitive explanation of this process is that each node collects Amount of Edges (|𝐸|) 458,867,626 features of its neighbors that were propagated through trainable Average Degree (𝑑𝑎𝑙𝑙 ) 174 filters (convolutions) so called, message passing. On each step node Median Degree (𝑑g 𝑎𝑙𝑙 ) 60 collects knowledge of its neighborhood and propagates its state Approx. Diameter (D) 23 further on the next step. Thus, properties of 1st, 2nd, ..., nth proxim- Amount of labeled nodes 4,652,604 ity are being incorporated into node’s state along with preserving original features of node’s community. Optimizing candidate retrieval. In serving time, recommenda- dataset, a process conducted off-line which is presented as Repre- tion candidates will be produced by applying K-Nearest Neighbors sentation Learning in Figure 1. Then, for every user we define her (K-NN) search to find the most similar articles to the user represen- representation as an aggregation of representation vectors of cor- tation vector in the pre-computed database of all articles’ represen- responding articles that were edited by this user. Next, we conduct tations. K-NN search is one of the main parts of candidate article Oleksii Moskalenko, Denis Parra, and Diego Saez-Trumper generation, since its performance in terms of time and resource con- parsed to organize this data. During pre-processing stage, all links sumption is very critical for online recommendation in a high-load to redirect pages5 were replaced by their actual destinations. "Cate- system. We conducted experiments with different optimizations for gory pages", that consists only of links to other pages and do not K-NN candidate search using FAISS library [16] : Locality-Sensitive have their own content, were detected and filtered out. We used Hashing (LSH), Inverted file with exact post-verification (IVF), Hi- Apache Spark and GraphX for parallel parsing of SQL dumps and erarchical Navigable Small World graph exploration (HNSW). Our discovering and cleaning Article Graph respectively. The output tests showed that HNSW gives the best speed along with exactly Graph was converted into binary format with graph-tool [24] to the same recall and MRR as exact search, so with no trade-off in achieve fast processing (see Table 2). For extracting the articles’ performance we achieved 20x times improvement in speed. Results texts we took the latest revision6 per each article from XML dump. of these experiments are shown in Table 1. We used Gensim [27] to tokenize and lemmatize text and prepare for the Doc2vec training. 3.3 Ranking of Candidate Articles To the facilitate the evaluation in an end-to-end fashion we After learning content and graph-based representations for Wikipedia reorganized the data into a revisions-per-user dataset. Only revisions articles, in the second part of our system we are trying to model that were created after Jan. 2015 were kept in this dataset, so our user preferences based on the previous edit history of Wikipedia recommendations that are based on the latest snapshot of article contributors. With given previous editions and articles, we pro- graph (Oct. 2018) will not recommend too many articles that did duce a relevant a list of candidates, ranked by its relevance for a not exists on that moment. given user. Our model is trained on binary labels - relevant / not We found that 88% of contributors are not regular users, since relevant (logistic regression), as shown in Figure 1, but on serving they edited fewer than 5 different articles for selected dates. We time it will produce probabilities of user interest, which are used also calculated diversity [3] of users’ contribution based on vector as a preference ranking score. representations obtained from Doc2vec. The set of contributors that This approach is inspired by Pointwise ranking [20] and is im- fits to our needs has mostly edited from 5 to 40 different articles, plemented in many similar recommender systems: YouTube[7], though diversity of those articles is rather high. That is the main eBay[4]. The model is shown on Figure 2 and consists of several cause our representations cannot be trained against this data like it fully-connected layers with Batch Normalization and ReLU acti- was done in previous work [7]. Unlike the work by Covington et vation after each layer except for the last layer, where sigmoid al. [7], our training dataset is small (around 60K users) and users’ activation is used. The final model’s architecture was selected as areas of interest are very diverse. follows: 1024 ReLU -> 512 ReLU -> 256 ReLU. As input model ac- cept a concatenated vector of user and candidate representations. 4.2 Training Preference score We define our preference ranking score as For all training experiments with GraphSAGE we generated a sam- the probability that a user 𝑢 finds a wikipedia article 𝑎𝑖 relevant: pled adjacency matrix based on Articles’ graph 𝐺. The width of 1 the adjacency matrix was determined by our experiments, based 𝑠𝑐𝑜𝑟𝑒 (𝑢, 𝑎𝑖 ) = 𝑃 (𝑎𝑖 = 𝑟𝑒𝑙𝑒𝑣𝑎𝑛𝑡 |𝑢) = (1) 1 + 𝑒 −Φ(𝑎𝑖 ,𝑢) available memory resources as well as on graph statistics (Table 2). where 𝑢 represents the user, 𝑎𝑖 a candidate wikipedia article, 𝐴 is We selected 128 as maximum amount of neighbors in this matrix. If the set of all articles to be ranked, and Φ(·) a weighted sum of the a node had more than that, then we used random subsampling. Our values in the last network layer of the Candidate ranking neural GCN architecture consists of two convolutional layers. On each network, shown in Figure 2. We train the model with a traditional convolutional step we picked a random sample of 25 neighbors loss for a binary logistic regression. from this adjacency matrix. This 25-neighbors sample is being re- sampled on each new batch. For better generalization we used a 4 EXPERIMENTS batch size of 512, since experiments with dropout between convo- lutional layers led to no improvement in generalization. We set the We worked with the English version of Wikipedia not only because size for all output vectors to 512 considering a balance between is the most popular one. In addition, is the most challenging in better resolution and available memory. terms of data processing, and if our system is able to deal with the Document representations from Doc2Vec-DBOW, trained with largest Wikipedia it would be easy to apply later in smaller projects. vector size 300 and window size 8, were passed as initial node states and graph edges played the role of labels when the model was 4.1 Dataset trying to predict those edges. We utilized max-margin loss [34], All data used has been downloaded from the official Wikimedia as target for training GraphSAGE in link-prediction setup. Model Dumps [11] which are snapshots of the full Wikipedia. Some of the parameters were tuned with the Adam optimizer[17]. objects in the dump (like articles’ links) are stored in SQL format, 𝐽 (𝑧𝑢 𝑧𝑖 ) = 𝐸 𝑣𝑛 ∼𝑃𝑛 (𝑢) 𝑚𝑎𝑥 {0, 𝑧𝑢 · 𝑧 𝑣𝑢 − 𝑧𝑢 · 𝑧𝑖 + Δ} (2) others, with deeper structure (like articles’ text) are stored in XML. First of all, for representation learning we built a graph 𝐺 (𝑉 , 𝐸), For training the ranking model, a dataset from users’ history where the set of nodes 𝑉 is the set of all Wikipedia pages belonging was constructed. As input this model takes 5 articles edited by a to article namespace4 and 𝐸 is the set of directed links between user ( representing users’ preferences) and 1 candidate that might them. The SQL dumps of page, pagelinks, and redirects tables were 5 https://en.wikipedia.org/wiki/Wikipedia:Redirect 4 https://en.wikipedia.org/wiki/Wikipedia:Namespace 6 Article’s revision is a specific version of article’s content after each modification Scalable Recommendation of Wikipedia Articles to Editors interest the user. The model tries to predict the probability of rele- model (merge, mean-pool, max-pool), as well as the method for vance of this candidate to the current user. Those 6 input articles ranking (cosine similarity and Deep-Rank). are passed through an Embedding Layer populated with represen- The results in Table 3 show that WikiRecNet, using merge aggre- tations received from GraphSAGE and then concatenated into one gation and Deep-Rank ranking, outperforms the other methods in vector. We chose positive candidates from actual user history and all metrics. We highlight the following the aspects in the evaluation: generated negative candidates with kNN search on constructed articles’ representations. Logistic (binary cross-entropy) regression • ALS implicit feedback collaborative filtering performs the with class-weights (due to high class imbalance) was used as loss worst among all methods. This result must be due to the function. extreme high sparsity of the dataset. • BM25, despite being a simple and traditional content-based filtering method, performs well and remains very competi- 4.3 Evaluation tive. To prepare the evaluation dataset, we subsampled windows of size • The simple K-NN based on Doc2Vec representation per- 10 from user’s history (from users that were not previously used forms better than ALS, and mean-pool reports better results for training or testing the Deep Ranking model). Our assumption if than merge but only at higher ranking positions (MAP@50, that the first 5 articles denoted users’ area of interest. To compute a nDCG@50, Recall@50). single user vector we took element-wise average of representations • Among the WikiRecNet variations, the max-pool aggrega- from the first 5 articles (GraphSAGE representations). We were tion seems to be the least helpful. In terms of nDCG@50 and trying to predict the rest 5. Algorithm can be expressed as follows: (i) nDCG@100 (the metric most robust to popularity bias [31]), take first 5 articles per user. Calculate average of their embeddings merge aggregation seems more effective than mean-pool, vectors, output this as the user vector representation, (ii) generate and then the combination with DeepRank produce the best candidates by nearest neighbors search of user representation, (iii) performance, with a 100% increase compared to the Doc2vec sort candidates according to ranking algorithm and select the top mean-pool reference method. 𝐾. In our evaluation we compare two ranking techniques: sort by cosine similarity, and sort by probability from Deep Ranking model, 6 CONCLUSION and (iv) compare Top-K recommendations with the 5 articles in the In this article we have introduced WikiRecNet, a neural-based model test set (from second half of the sampled window). which aims at recommending Wikipedia articles to editors, in order To measure the results we used several metrics: mean average to help them dealing with the sheer volume of potential articles that precision (MAP), normalized discounted cumulative gain (nDCG)[1] might need their attention. Our approach uses representation learn- and Recall@k [8]. We calculate these metrics at high k values, ing, i.e., finding alternative ways to represent the Wikipedia articles 𝑘 = 50 an 𝑘 = 100. Unlike traditional research on top-k recommen- in order to produce a useful recommendation without requiring dation systems usually focusing on small k values (k=10,20,30), we more information than the previous articles edited by targeted are specially interested in preventing popularity bias, i.e., having users. For this purpose, we used Doc2Vec [19] for a content-based WikiRecNet biased to recommend mostly popular items. Valcarce et representation and GraphSage [14], a graph convolutional network, al. [31] showed recently that usual top-k ranking metrics measured for a graph-based representation. at higher values of k (50, 100) are specially robust to popularity WikiRecNet architecture is composed of two networks, a candi- bias, and that is why we use them here. date generation network and a ranking network, and our imple- . mentation is able to deal with larges volumes of data, improving existing implementations that were not capable to work in such scenarios. Also, our approach does not need to be retrained when 5 RESULTS new items are added, facilitating its application in dynamic environ- Results of the evaluation are presented in Table 3. We first describe ments such as Wikipedia. To best of our knowledge, this is the first the competing methods: recommender system especially designed for Wikipedia editors Baselines. We used two well established methods. The first one that takes in account such applications constrains, and therefore, is BM25[28], a probabilistic method used on information retrieval can be applied in real world scenarios. but also applied for content-based filtering in the area of recommen- In order to contribute to the community, we provide our code dation [23]. A second baseline is implicit feedback collaborative and the graph embedding of each Wikipedia page used in this filtering optimized with Alternative Least Squares (ALS) [15]. experiment7 available in a public repository, as well as a working K-NN recommender. In addition, we implemented a simple demo that can be tested by the Wikipedia editors community8 . With K-NN recommender where the Wikipedia articles are represented respect to text embeddings, there have been important progresses by the Doc2vec vector embeddings. Each user 𝑢 is represented by in the latest years, so another idea for future work will be testing the articles she has edited, and we test two forms of aggregation to models like BERT [9] or XLNet [33]. represent the user model: merging the user-edited articles (merge) and calculating the mean at each dimension of the document (mean- pool). We rank recommended articles by cosine similarity. Aggregations. Finally,WikiRecNet is presented in 5 versions by 7 Embeddings in other languages would be also available under request. varying the type of aggregation of articles to represent the user 8 https://github.com/digitalTranshumant/WikiRecNet-ComplexRec2020 Oleksii Moskalenko, Denis Parra, and Diego Saez-Trumper Table 3: Offline evaluation of generated recommendations on the task of predicting next 5 articles edited by user with per- centage improvement over content-based model Doc2Vec (mean-pool) with cosine similarity. K=50 K=100 Model Aggregate Rank MAP nDCG Recall MAP nDCG Recall WikiRecNet mean cosine 0.0221 0.1361 0.0846 0.0238 (+78%) 0.1468 (+66%) 0.1179 (+99%) mean deep-rank 0.0228 0.1363 0.0841 0.0243 (+82%) 0.1493 (+70%) 0.1134 (+92%) max cosine 0.0192 0.1196 0.0672 0.0206 (+54 %) 0.1299 (+47%) 0.0923 (+56%) merge cosine 0.0208 0.1412 0.0825 0.0227 (+70%) 0.1538 (+75%) 0.1175 (+99%) merge deep-rank 0.0262 0.1625 0.0935 0.0282 (+111%) 0.1760 (+100%) 0.1302 (+120%) Doc2Vec merge cosine 0.0085 0.0805 0.0438 0.0092 0.0883 0.0600 mean cosine 0.0126 0.0821 0.0436 0.0133 0.0880 0.0590 BM25 0.0251 0.1602 0.0921 0.0273 0.1710 0.1290 ALS MF 0.0027 0.0163 0.044 0.0063 0.0204 0.0609 7 ACKNOWLEDGMENTS [15] Yifan Hu, Yehuda Koren, and Chris Volinsky. 2008. Collaborative filtering for implicit feedback datasets. In In IEEE International Conference on Data Mining The author Denis Parra has been funded by the Millennium Institute (ICDM 2008. 263–272. for Foundational Research on Data (IMFD) and by the Chilean [16] Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2017. Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734 (2017). research agency ANID, FONDECYT grant 1191791. [17] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochas- tic Optimization. In 3rd International Conference on Learning Representations, REFERENCES ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. http://arxiv.org/abs/1412.6980 [1] Linas Baltrunas, Tadas Makcinskas, and Francesco Ricci. 2010. Group Recom- [18] Thomas N. Kipf and Max Welling. 2016. Semi-Supervised Classification with mendations with Rank Aggregation and Collaborative Filtering. In Proceedings Graph Convolutional Networks. CoRR abs/1609.02907 (2016). of the Fourth ACM Conference on Recommender Systems (RecSys ’10). ACM, New [19] Quoc V. Le and Tomas Mikolov. 2014. Distributed Representations of Sentences York, NY, USA, 119–126. https://doi.org/10.1145/1864708.1864733 and Documents. CoRR abs/1405.4053 (2014). arXiv:1405.4053 http://arxiv.org/ [2] Yoshua Bengio, Aaron Courville, and Pascal Vincent. 2013. Representation abs/1405.4053 learning: A review and new perspectives. IEEE transactions on pattern analysis [20] Tie-Yan Liu. 2009. Learning to Rank for Information Retrieval. Found. Trends Inf. and machine intelligence 35, 8 (2013), 1798–1828. Retr. 3, 3 (March 2009), 225–331. https://doi.org/10.1561/1500000016 [3] J Bobadilla, Francisco Serradilla, and J Bernal. 2010. A new collaborative filtering [21] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient metric that improves the behavior of recommender systems. Knowledge-Based Estimation of Word Representations in Vector Space. CoRR abs/1301.3781 (2013). Systems 23 (08 2010), 520–528. https://doi.org/10.1016/j.knosys.2010.03.009 arXiv:1301.3781 http://arxiv.org/abs/1301.3781 [4] Yuri M. Brovman, Marie Jacob, Natraj Srinivasan, Stephen Neola, Daniel Galron, [22] Jonathan T Morgan, Siko Bouterse, Heather Walls, and Sarah Stierch. 2013. Tea Ryan Snyder, and Paul Wang. 2016. Optimizing Similar Item Recommendations and sympathy: crafting positive new user experiences on wikipedia. In Pro- in a Semi-structured Marketplace to Maximize Conversion. In Proceedings of the ceedings of the 2013 conference on Computer supported cooperative work. ACM, 10th ACM Conference on Recommender Systems (RecSys ’16). ACM, New York, NY, 839–848. USA, 199–202. https://doi.org/10.1145/2959100.2959166 [23] Denis Parra and Peter Brusilovsky. 2009. Collaborative filtering for social tagging [5] Boreum Choi, Kira Alexander, Robert E Kraut, and John M Levine. 2010. So- systems: an experiment with CiteULike. In Proceedings of the third ACM conference cialization tactics in wikipedia and their effects. In Proceedings of the 2010 ACM on Recommender systems. ACM, 237–240. conference on Computer supported cooperative work. ACM, 107–116. [24] Tiago P. Peixoto. 2014. The graph-tool python library. figshare (2014). https: [6] Cristian Consonni, David Laniado, and Alberto Montresor. 2019. WikiLinkGraphs: //doi.org/10.6084/m9.figshare.1164194 A complete, longitudinal and multi-language dataset of the Wikipedia link net- [25] Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: works. arXiv preprint arXiv:1902.04298 (2019). Global vectors for word representation. In Proceedings of the 2014 conference on [7] Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep Neural Networks empirical methods in natural language processing (EMNLP). 1532–1543. for YouTube Recommendations. In Proceedings of the 10th ACM Conference on [26] Tiziano Piccardi, Michele Catasta, Leila Zia, and Robert West. 2018. Struc- Recommender Systems (RecSys ’16). ACM, New York, NY, USA, 191–198. https: turing Wikipedia Articles with Section Recommendations. arXiv preprint //doi.org/10.1145/2959100.2959190 arXiv:1804.05995 (2018). [8] Paolo Cremonesi, Yehuda Koren, and Roberto Turrin. 2010. Performance of [27] Radim Řehůřek and Petr Sojka. 2010. Software Framework for Topic Modelling recommender algorithms on top-n recommendation tasks. In Proceedings of the with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges fourth ACM conference on Recommender systems. ACM, 39–46. for NLP Frameworks. ELRA, Valletta, Malta, 45–50. http://is.muni.cz/publication/ [9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: 884893/en. Pre-training of deep bidirectional transformers for language understanding. arXiv [28] Stephen Robertson and Hugo Zaragoza. 2009. The Probabilistic Relevance Frame- preprint arXiv:1810.04805 (2018). work: BM25 and Beyond. Found. Trends Inf. Retr. 3, 4 (April 2009), 333–389. [10] David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bom- https://doi.org/10.1561/1500000019 barell, Timothy Hirzel, Alan Aspuru-Guzik, and Ryan P Adams. 2015. [29] Gerard Salton, Anita Wong, and Chung-Shu Yang. 1975. A vector space model Convolutional Networks on Graphs for Learning Molecular Finger- for automatic indexing. Commun. ACM 18, 11 (1975), 613–620. prints. In Advances in Neural Information Processing Systems 28, [30] Andrew I Schein, Alexandrin Popescul, Lyle H Ungar, and David M Pennock. C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett 2002. Methods and metrics for cold-start recommendations. In Proceedings of the (Eds.). Curran Associates, Inc., 2224–2232. http://papers.nips.cc/paper/ 25th annual international ACM SIGIR conference on Research and development in 5954-convolutional-networks-on-graphs-for-learning-molecular-fingerprints. information retrieval. ACM, 253–260. pdf [31] Daniel Valcarce, Alejandro Bellogín, Javier Parapar, and Pablo Castells. 2018. On [11] Wikimedia Foundation. 2018. Wikimedia Downloads. https://dumps.wikimedia. the robustness and discriminative power of information retrieval metrics for top- org [Online; accessed 14. Oct. 2019]. N recommendation. In Proceedings of the 12th ACM Conference on Recommender [12] Wikimedia Foundation. 2019. Wikimedia Statistics - All wikis. https://stats. Systems. ACM, 260–268. wikimedia.org/v2/#/all-projects [Online; accessed 13. Oct. 2019]. [32] Ellery Wulczyn, Robert West, Leila Zia, and Jure Leskovec. 2016. Growing [13] Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable Feature Learning wikipedia across languages via recommendation. In Proceedings of the 25th Inter- for Networks. CoRR abs/1607.00653 (2016). http://dblp.uni-trier.de/db/journals/ national Conference on World Wide Web. International World Wide Web Confer- corr/corr1607.html#GroverL16 ences Steering Committee, 975–985. [14] William L. Hamilton, Rex Ying, and Jure Leskovec. 2017. Inductive Representation Learning on Large Graphs. In NIPS. Scalable Recommendation of Wikipedia Articles to Editors [33] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Recommender Systems. In KDD. and Quoc V Le. 2019. XLNet: Generalized Autoregressive Pretraining for Lan- [35] Dongyan Zhou, Songjie Niu, and Shimin Chen. 2018. Efficient Graph Computation guage Understanding. arXiv preprint arXiv:1906.08237 (2019). for Node2Vec. CoRR abs/1805.00280 (2018). http://dblp.uni-trier.de/db/journals/ [34] Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L. Hamilton, corr/corr1805.html#abs-1805-00280 and Jure Leskovec. 2018. Graph Convolutional Neural Networks for Web-Scale