Using an Ensemble of Features for Personalized Recommendations of Scientific Publications

Discussion Paper

Paolo Tenti1, James Thomas2, Rafael Peñaloza1 and Gabriella Pasi1
1 IKR3 Lab, University of Milano-Bicocca, Milan, Italy
2 EPPI Centre, UCL Social Research Institute, University College London

IIR 2021 – 11th Italian Information Retrieval Workshop, September 13–15, 2021, Bari, Italy
Email: p.tenti1@campus.unimib.it (P. Tenti); james.thomas@ucl.ac.uk (J. Thomas); rafael.penaloza@unimib.it (R. Peñaloza); gabriella.pasi@unimib.it (G. Pasi)
Web: https://ikr3.disco.unimib.it/people/paolo-tenti/ (P. Tenti); https://iris.ucl.ac.uk/iris/browse/profile?upi=JTHOA32 (J. Thomas); https://rpenalozan.github.io/ (R. Peñaloza); https://ikr3.disco.unimib.it/people/gabriella-pasi/ (G. Pasi)
ORCID: 0000-0003-2432-3018 (P. Tenti); 0000-0003-4805-4190 (J. Thomas); 0000-0002-2693-5790 (R. Peñaloza); 0000-0002-6080-8170 (G. Pasi)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
Keeping reviews of scientific publications up to date as soon as new relevant publications become available is a typical challenge for many research communities. We address this challenge as a content-based recommendation problem, in which the publications already selected for a review drive the recommendation of new publications. In addition, resources such as domain databases, ontologies and academic graphs provide structured information about publications (e.g., authors, journals, conferences). Our experiments show that a simple model that uses this structured information to represent publications achieves high precision and recall, and outperforms models that use more sophisticated representations based on embeddings.

Keywords
Content-based recommendation, Scientific paper recommendation, Text classification

Introduction
A common issue for many research communities is building and maintaining meaningful collections of references to scientific publications related to a specific research topic; to this aim reviews are compiled, which report such references in an organized way. In this context, a first challenge is discovering the initial collection of publications to be considered for compiling a review; this can be done by searching across several scientific databases and journals and going through several passes of filtering and refinement. A second challenge is maintaining reviews, that is, capturing new relevant publications as soon as they become available after a review has been constructed, with the aim of keeping the review up to date. Both processes are partially manual, long-running, and error-prone. We argue that the problem of maintaining existing reviews can be cast as a content-based recommendation problem: publications in existing reviews can be used to automatically find and recommend new publications to the reviews’ owners.

Problem statement
Let ℘ be the domain of publications and ℛ the domain of reviews, such that a review 𝑅 ∈ ℛ is a finite set of scientific publications, i.e., 𝑅 = {𝑝1, ..., 𝑝𝑘 | 𝑝𝑖 ∈ ℘}. A set 𝐑 ⊆ ℛ of reviews and a set 𝐏 ⊆ ℘ of brand-new publications are given. For any review 𝑅 ∈ 𝐑, the problem is to retrieve the set ℘𝑅 ⊆ 𝐏 of publications that are relevant to 𝑅.
To define the notion of relevance, we first define a similarity function Φ : ℘ × ℛ → [0, 1] that, given a publication 𝑝 ∈ ℘ and a review 𝑅 ∈ ℛ, returns their similarity score. Relevance can then be modeled as a binary function Ξ𝜆 : ℘ × ℛ → {True, False} that, for a given threshold 𝜆, returns True if and only if Φ(𝑝, 𝑅) > 𝜆. For any review 𝑅 ∈ 𝐑, the set of relevant publications ℘𝑅 = {𝑝 ∈ 𝐏 | Ξ𝜆(𝑝, 𝑅)} will be recommended.
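As an illustration of the decision rule above, the following Python sketch selects the publications to recommend by thresholding Φ. It assumes, purely for illustration, that publications are identified by strings, that a review is a frozenset of such identifiers, and that a similarity function is already available as a callable; the names, types and default threshold are hypothetical and not part of the formulation above. A possible sketch of the similarity function itself is given after the methodological framework below.

```python
from typing import Callable, Iterable, Set

def is_relevant(phi: Callable[[str, frozenset], float],
                p: str, R: frozenset, lam: float) -> bool:
    """Xi_lambda(p, R): True iff Phi(p, R) > lambda."""
    return phi(p, R) > lam

def recommend(phi: Callable[[str, frozenset], float],
              new_pubs: Iterable[str], R: frozenset,
              lam: float = 0.5) -> Set[str]:
    """Return the set of new publications whose similarity to review R exceeds lambda."""
    # The default threshold is arbitrary; in practice it would be tuned on held-out data.
    return {p for p in new_pubs if is_relevant(phi, p, R, lam)}
```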
Methodological framework
Resources such as scientific databases (e.g., PubMed, https://pubmed.ncbi.nlm.nih.gov), ontologies (e.g., the Unified Medical Language System, https://www.nlm.nih.gov/research/umls/index.html, and PICO, https://linkeddata.cochrane.org/pico-ontology) and academic graphs (e.g., the Microsoft Academic Graph [1]) provide relational information about publications, such as title, abstract, authors, citations, references, journals, and conferences. That information can be used to enrich publications with descriptive features. Hence, we study the opportunity of using these features to construct multiple vector representations of publications, to address the problem of recommending scientific publications described above.
Given a set of publication features ℱ = {𝜋1, ..., 𝜋𝑛}, for a publication 𝑝 ∈ ℘ we define 𝜐(𝑝) = {𝜐𝜋1(𝑝), ..., 𝜐𝜋𝑛(𝑝)} as the set of vectors that represent 𝑝, where 𝜐𝜋(𝑝) stands for the vector representation of 𝑝 with respect to the feature 𝜋 ∈ ℱ.
We define a methodological framework for using the available features to compute a compound similarity function between a publication and a review. This framework is simple yet extensible. First, we define a method to construct the vector 𝜐𝜋(𝑝) for a given publication 𝑝 and feature 𝜋. In addition, we define the representation of a review 𝑅 ∈ ℛ with respect to a feature 𝜋 as 𝜐𝜋(𝑅) = 𝜐*({𝜐𝜋(𝑝) | 𝑝 ∈ 𝑅}), where 𝜐* is an aggregation function over the set of publication representations with respect to the same feature 𝜋. We further define Φ𝜋(𝑝, 𝑅) as the function that calculates the similarity between a publication 𝑝 and a review 𝑅 with respect to a feature 𝜋. Finally, we define the similarity between 𝑝 and 𝑅 as Φ(𝑝, 𝑅) = Φ*({Φ𝜋(𝑝, 𝑅) | 𝜋 ∈ ℱ}), where Φ* is an aggregation function over the feature-specific similarity scores.
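To make the framework concrete, here is a minimal sketch in Python, assuming (as illustrative choices, not prescribed by the framework) that 𝜐𝜋(𝑝) is a dense non-negative vector, that 𝜐* is the centroid of the publication vectors, that Φ𝜋 is cosine similarity, and that Φ* is a plain mean over the feature-specific scores.

```python
import numpy as np
from typing import Dict, Set

# upsilon[pi][p]: vector representation of publication p with respect to feature pi.
FeatureVectors = Dict[str, Dict[str, np.ndarray]]

def review_vector(upsilon_pi: Dict[str, np.ndarray], review: Set[str]) -> np.ndarray:
    """upsilon*: aggregate the vectors of a review's publications (here, their centroid)."""
    return np.mean([upsilon_pi[p] for p in review], axis=0)

def phi_pi(vec_p: np.ndarray, vec_R: np.ndarray) -> float:
    """Phi_pi(p, R): cosine similarity; lies in [0, 1] for non-negative vectors."""
    denom = float(np.linalg.norm(vec_p) * np.linalg.norm(vec_R))
    return float(vec_p @ vec_R) / denom if denom > 0.0 else 0.0

def phi(upsilon: FeatureVectors, p: str, review: Set[str]) -> float:
    """Phi(p, R) = Phi*({Phi_pi(p, R) | pi in F}); here Phi* is a plain mean."""
    scores = [phi_pi(upsilon[pi][p], review_vector(upsilon[pi], review))
              for pi in upsilon]
    return float(np.mean(scores))

# Toy usage with two features ('title', 'authors') and illustrative dense vectors.
upsilon = {
    "title":   {"p1": np.array([0.9, 0.1]), "p2": np.array([0.8, 0.2]), "q": np.array([0.7, 0.3])},
    "authors": {"p1": np.array([1.0, 0.0]), "p2": np.array([1.0, 0.0]), "q": np.array([0.0, 1.0])},
}
print(round(phi(upsilon, "q", {"p1", "p2"}), 3))  # compound similarity of q to review {p1, p2}
```

Swapping the centroid for another aggregation, or the mean for a weighted combination, only requires replacing review_vector or the final np.mean call, which is what makes the framework extensible.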
Evaluation
We conducted a series of experiments on a manually labeled dataset of domain-specific reviews and a few tens of thousands of publications. Note that the research domain is homogeneous, and thus the reviews are quite similar to each other. We evaluated our method as a multi-class, multi-label classification problem: the classes are the reviews, and each publication may be relevant to multiple reviews. The problem is thus to predict all relevant labels (i.e., reviews) for a given, previously unseen publication.
We considered title, abstract, citation network, authors, journals, conferences, topics and ontological categories as features, and represented them as either Tf-Idf vectors or binary vectors (a minimal sketch of such representations is given at the end of this section). For topics we considered the Fields of Study extracted from the Microsoft Academic Graph. We extracted ontological categories from PICO, a domain-specific ontology. The best performing model achieved a precision of 97.7% and a recall of 99.2%. Similar results were confirmed by experiments on a different dataset. We observed the following:
• Ensembles of features outperformed textual features alone, whether the latter used Tf-Idf based representations or embeddings.
• For text-based models, Tf-Idf based representations considerably outperformed embeddings in precision. Our interpretation is that, in a context where all reviews come from the same domain, key phrases are better suited to capture the differences between them.
• Titles generally performed better than abstracts when used within ensembles of features, and they are computationally cheaper to process.
• Representing publications by means of simple binary or Tf-Idf based vectors suffices to achieve good performance, in contrast to more sophisticated solutions such as representations based on embeddings.
• Many approaches are possible to implement 𝜐*. Our experiments suggest that the best performing ones capture the most representative properties of a review, rather than every single property regardless of its importance.
• To implement Φ*, our experiments show that simple mathematical operators suffice to achieve good results while keeping the model computationally efficient and highly explainable. Among the benefits of explainability, for any given review it is easy to identify which features are responsible for good recommendation performance.
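As referenced above, the following sketch illustrates how such Tf-Idf and binary feature representations could be built with scikit-learn; the toy records, field names and vectorizer settings are assumptions for illustration and do not reproduce the exact experimental configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy records; real publications would come from sources such as
# PubMed or the Microsoft Academic Graph.
pubs = [
    {"title": "Deep learning for systematic reviews",
     "authors": ["A. Rossi", "B. Bianchi"]},
    {"title": "Screening publications with active learning",
     "authors": ["B. Bianchi", "C. Verdi"]},
]

# Tf-Idf representation of the 'title' feature (unigrams and bigrams, assumed).
title_vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
title_vectors = title_vectorizer.fit_transform([p["title"] for p in pubs])

# Binary representation of the 'authors' feature: one column per distinct author.
author_binarizer = MultiLabelBinarizer()
author_vectors = author_binarizer.fit_transform([p["authors"] for p in pubs])

print(title_vectors.shape, author_vectors.shape)
```

The resulting matrices could play the role of the per-feature vectors 𝜐𝜋(𝑝) in the framework sketched earlier (after converting rows to dense arrays, or by using a sparse-aware cosine similarity), with simple aggregation operators completing the model.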
Contributions and open challenges
Several Web resources (i.e., academic graphs and domain-specific scientific databases) provide structured information about scientific publications, such as title, abstract, authors, citations and, potentially, ontological categories. Our work presents a methodological framework that can make use of such structured information to achieve high precision and high recall in the downstream task of domain-specific, personalized recommendation of scientific publications. In addition, our work shows that representing publications with ensembles of features outperforms representations based on embedding vectors [2, 3, 4]. Our interpretation is that, to discriminate the membership of publications in reviews that belong to the same domain, the signals coming from key phrases and ontological categories are more relevant than those from more general semantic representations, such as the ones obtained with text embeddings.
Constructing more sophisticated embeddings that capture both the content and the relational properties of publications [5] might achieve comparable or better performance in a more domain-independent and task-agnostic way. However, we argue that using a simple mathematical similarity model based on easy-to-capture features is computationally inexpensive, easier to train, more interpretable and still highly generalizable.
Finally, we believe that there are still open challenges to address. Our experiments show the importance of features such as topical and ontological categories. However, the availability of such features might be domain-dependent, or they might be hard to extract. On the one hand, it would be worth studying the impact of using text embeddings over titles and abstracts in synergy with standard features (i.e., n-grams over titles and abstracts, the citation network and co-authorship), to see whether they could compensate for more sophisticated and domain-dependent features (i.e., ontological categories and fields of study). On the other hand, it would be worth studying generalizable methods for extracting ontological features from publications [6].

References
[1] A. Sinha, Z. Shen, Y. Song, H. Ma, D. Eide, B.-J. P. Hsu, K. Wang, An overview of Microsoft Academic Service (MAS) and applications, in: Proceedings of the 24th International Conference on World Wide Web, 2015, pp. 243–246.
[2] D. Cer, Y. Yang, S.-y. Kong, N. Hua, N. Limtiaco, R. St. John, N. Constant, M. Guajardo-Céspedes, S. Yuan, C. Tar, Y.-h. Sung, B. Strope, R. Kurzweil, Universal sentence encoder, arXiv preprint arXiv:1803.11175 (2018).
[3] C. W. Schmidt, Improving a tf-idf weighted document vector embedding, arXiv preprint arXiv:1902.09875 (2019).
[4] S. Arora, Y. Liang, T. Ma, A simple but tough-to-beat baseline for sentence embeddings, in: ICLR 2017: International Conference on Learning Representations, 2017.
[5] D. Nozza, E. Fersini, E. Messina, CAGE: Constrained deep attributed graph embedding, Information Sciences 518 (2020) 56–70.
[6] V. Gutiérrez-Basulto, S. Schockaert, From knowledge graph embedding to ontology embedding? An analysis of the compatibility between vector space representations and rules, in: KR, 2018, pp. 379–388.