
Using an Ensemble of Features for Personalized Recommendations of Scientific Publications

Discussion Paper

Paolo Tenti

p.tenti1@campus.unimib.it 1

James Thomas

james.thomas@ucl.ac.uk 0

Rafael Peñaloza

rafael.penaloza@unimib.it 1

Gabriella Pasi

gabriella.pasi@unimib.it 1

0 EPPI Centre, UCL Social Research Institute, University College London

1 IKR3 Lab, University of Milano-Bicocca, Milan, Italy

Keeping reviews of scientific publications up to date as soon as new relevant publications become available is a common challenge for many research communities. We address this challenge as a content-based recommendation problem, where the publications already selected for a review drive the recommendation of new publications. In addition, resources such as domain databases, ontologies and academic graphs provide structured information about publications (e.g., authors, journals, conferences). Our experiments show that a simple model that uses this structured information to represent publications achieves high precision and recall, and outperforms models that use more sophisticated representations based on embeddings.

Content-based recommendation, Scientific papers recommendations, Text classification

Introduction

A common issue for many research communities is building and maintaining meaningful collections of references to scientific publications related to a specific research topic; to this end, reviews are compiled, which report such references in an organized way. In this context, a first challenge is discovering the initial collection of publications to be considered for compiling a review; this can be done by searching across several scientific databases and journals and going through several rounds of filtering and refinement. A second challenge is maintaining reviews; that is, capturing new relevant publications as soon as they become available after a review has been constructed, with the aim of keeping the review up to date. Both processes are partially manual, long-running, and error-prone. We argue that the problem of maintaining existing reviews can be framed as a content-based recommendation problem: publications in existing reviews can be used to automatically find and recommend new publications to the reviews’ owners.

Problem statement

Let $\wp$ be the domain of publications and $\mathcal{R}$ the domain of reviews, such that a review $r \in \mathcal{R}$ is a finite set of scientific publications, i.e., $r = \{p_1, \ldots, p_n \mid p_i \in \wp\}$.

A set $R \subseteq \mathcal{R}$ of reviews and a set of brand-new publications $P \subseteq \wp$ are given. For any review $r \in R$, the problem is to retrieve a set $P_r \subseteq P$ of publications that are relevant to $r$.

To define the notion of relevance, we first define a similarity function $\Phi : \mathcal{R} \times \wp \to [0, 1]$ that, given a review $r \in \mathcal{R}$ and a publication $p \in \wp$, returns their similarity score. Relevance can be modeled in this context as a binary function $\Xi : \mathcal{R} \times \wp \to \{\mathit{true}, \mathit{false}\}$ that, given a threshold $\tau$, returns true if and only if $\Phi(r, p) > \tau$. For any review $r \in R$, the set of relevant publications $P_r = \{p \mid p \in P \wedge \Xi(r, p)\}$ will be recommended.
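The threshold rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the names `is_relevant`, `recommend` and `toy_jaccard`, the representation of a publication as a set of terms, and the threshold values are all hypothetical. The only thing taken from the text is that a publication is recommended when its similarity to the review strictly exceeds the threshold.

```python
from typing import Callable, FrozenSet, Iterable, List

Publication = FrozenSet[str]   # toy stand-in: a publication as a set of terms
Review = List[Publication]     # a review is a finite collection of publications
Similarity = Callable[[Review, Publication], float]  # plays the role of Phi


def is_relevant(phi: Similarity, review: Review, pub: Publication,
                threshold: float) -> bool:
    """Binary relevance (Xi): true iff the similarity strictly exceeds the threshold."""
    return phi(review, pub) > threshold


def recommend(phi: Similarity, review: Review,
              new_pubs: Iterable[Publication], threshold: float) -> List[Publication]:
    """Return the new publications judged relevant to the review."""
    return [p for p in new_pubs if is_relevant(phi, review, p, threshold)]


def toy_jaccard(review: Review, pub: Publication) -> float:
    """Hypothetical similarity: Jaccard overlap between the publication's terms
    and the union of all terms in the review. Values lie in [0, 1]."""
    review_terms = frozenset().union(*review) if review else frozenset()
    union = review_terms | pub
    return len(review_terms & pub) / len(union) if union else 0.0


# Toy data (invented for illustration only).
review = [frozenset({"statin", "trial"}), frozenset({"statin", "cholesterol"})]
pub_a = frozenset({"statin", "cholesterol", "cohort"})  # shares 2 of 4 terms
pub_b = frozenset({"graph", "embedding"})               # shares no terms

print(recommend(toy_jaccard, review, [pub_a, pub_b], threshold=0.4))
```

Because the comparison is strict, raising the threshold to exactly the similarity score of `pub_a` would exclude it from the recommendations.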

Methodological framework

Resources such as scientific databases (e.g., PubMed 1), ontologies (e.g., Unified Medical Language System 2, PICO 3) and academic graphs (e.g., Microsoft Academic Graph [1]) provide relational information about publications, such as title, abstract, authors, citations, references, journals, and conferences. This information can be used to enrich publications with descriptive features. We therefore study how these features can be used to construct multiple vector representations of publications, in order to address the recommendation problem described above.

Given a set of publication features $\mathcal{F} = \{f_1, \ldots, f_n\}$, for a publication $p \in \wp$ we define $V(p) = \{v_{f_1}(p), \ldots, v_{f_n}(p) \mid f_i \in \mathcal{F}\}$ as the set of vectors that represent $p$, where $v_f(p)$ stands for the vector representation of $p$ with respect to the feature $f$.

We define a methodological framework for using the available features to compute a compound similarity function between a publication and a review. This framework is simple yet extensible.

First, we define a method to construct the vectors $v_f(p)$ for a given publication $p$ and feature $f$. In addition, we define the representation of a review $r \in \mathcal{R}$ with respect to a feature $f$ as $v_f(r) = v^{*}(\{v_f(p) \mid p \in r\})$, where $v^{*}$ is an aggregation function over the set of publication representations with respect to the same feature $f$.
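As an illustration of the review-level aggregation, the sketch below averages the per-feature vectors of a review's publications into a centroid and can optionally keep only the strongest components. Both choices are assumptions made here for illustration; the text does not specify which aggregation function was used. The name `aggregate_centroid`, the `top_k` parameter and the sparse dictionary representation are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional

SparseVector = Dict[str, float]  # component name -> weight


def aggregate_centroid(vectors: Iterable[SparseVector],
                       top_k: Optional[int] = None) -> SparseVector:
    """Aggregate publication vectors for one feature into a review vector.

    Assumption: the aggregation is the component-wise mean (centroid).
    If top_k is given, only the top_k heaviest components are kept, as one
    possible way of retaining the most representative properties.
    """
    vectors = list(vectors)
    if not vectors:
        return {}
    sums: Dict[str, float] = defaultdict(float)
    for vec in vectors:
        for key, weight in vec.items():
            sums[key] += weight
    centroid = {key: total / len(vectors) for key, total in sums.items()}
    if top_k is not None:
        # Sort by descending weight, breaking ties alphabetically for determinism.
        strongest = sorted(centroid.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
        centroid = dict(strongest)
    return centroid


# Toy binary author vectors for the two publications of a review (invented data).
author_vectors = [{"smith": 1.0, "rossi": 1.0}, {"smith": 1.0}]

review_authors = aggregate_centroid(author_vectors)
review_authors_top1 = aggregate_centroid(author_vectors, top_k=1)
print(review_authors, review_authors_top1)
```

In this toy case an author who appears in every publication of the review keeps full weight, while an author who appears in only half of them is down-weighted and is dropped entirely when `top_k=1`.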

We further define $\Phi_f(r, p)$ as the function to calculate the similarity between a publication $p$ and a review $r$ with respect to a feature $f$. Finally, we define the function to calculate the similarity between $p$ and $r$ as $\Phi(r, p) = \Phi^{*}(\{\Phi_f(r, p) \mid f \in \mathcal{F}\})$, where $\Phi^{*}$ is an aggregation function over feature-specific similarity scores.
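The compound similarity can be sketched as follows. This is a minimal example under assumptions that are not stated in the text: the feature-specific similarity is taken to be cosine similarity over sparse non-negative vectors, and the aggregation over features is taken to be the arithmetic mean. The function names and the toy feature vectors are hypothetical.

```python
import math
from typing import Dict

SparseVector = Dict[str, float]
FeatureVectors = Dict[str, SparseVector]  # feature name -> vector for that feature


def cosine(u: SparseVector, v: SparseVector) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either one is empty."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def feature_similarities(review_vecs: FeatureVectors,
                         pub_vecs: FeatureVectors) -> Dict[str, float]:
    """One similarity score per feature (assumed here to be cosine)."""
    return {f: cosine(vec, pub_vecs.get(f, {})) for f, vec in review_vecs.items()}


def compound_similarity(review_vecs: FeatureVectors,
                        pub_vecs: FeatureVectors) -> float:
    """Aggregate feature-specific scores (assumed here: arithmetic mean)."""
    scores = feature_similarities(review_vecs, pub_vecs)
    return sum(scores.values()) / len(scores) if scores else 0.0


# Toy vectors (invented): identical titles, disjoint author sets.
review_vecs = {"title": {"statin": 1.0, "trial": 1.0}, "authors": {"smith": 1.0}}
pub_vecs = {"title": {"statin": 1.0, "trial": 1.0}, "authors": {"rossi": 1.0}}

per_feature = feature_similarities(review_vecs, pub_vecs)
overall = compound_similarity(review_vecs, pub_vecs)
print(per_feature, overall)
```

Keeping the per-feature scores alongside the aggregate makes it easy to inspect which features drove a given recommendation.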

Evaluation

We conducted a series of experiments on a manually labeled dataset of domain-specific reviews and a few tens of thousands of publications. Note that the research domain is homogeneous, and thus the reviews are quite similar to each other.

1 https://pubmed.ncbi.nlm.nih.gov

2 https://www.nlm.nih.gov/research/umls/index.html

3 https://linkeddata.cochrane.org/pico-ontology

We evaluated our method as a multi-class, multi-label classification problem. Specifically, classes are the reviews, and each publication might be relevant to multiple reviews. Thus, the problem is to predict all relevant labels (i.e., reviews) for a certain, previously unseen, publication.
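To make the multi-label setting concrete, the sketch below computes precision and recall over (publication, review) label pairs. It assumes micro-averaging, which is a choice made here for illustration; the text does not say how the reported figures were averaged. The function name and the toy labels are hypothetical.

```python
from typing import Dict, Set, Tuple

Labels = Dict[str, Set[str]]  # publication id -> set of review ids


def micro_precision_recall(predicted: Labels, gold: Labels) -> Tuple[float, float]:
    """Micro-averaged precision and recall over all (publication, review) pairs."""
    tp = fp = fn = 0
    for pub_id in set(predicted) | set(gold):
        pred = predicted.get(pub_id, set())
        true = gold.get(pub_id, set())
        tp += len(pred & true)   # reviews correctly predicted
        fp += len(pred - true)   # reviews predicted but not relevant
        fn += len(true - pred)   # relevant reviews that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy labels (invented): p1 belongs to two reviews, p3 to none.
gold = {"p1": {"r1", "r2"}, "p2": {"r2"}, "p3": set()}
predicted = {"p1": {"r1"}, "p2": {"r2", "r3"}, "p3": set()}

precision, recall = micro_precision_recall(predicted, gold)
print(precision, recall)
```

In this toy case there are two correct labels, one spurious label and one missed label, so both precision and recall come out at two thirds.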

We considered title, abstract, citation network, authors, journals, conferences, topics and ontological categories as features, represented as either Tf-Idf vectors or binary vectors. For topics, we used the Fields of Study extracted from the Microsoft Academic Graph. We extracted ontological categories from PICO, a domain-specific ontology.
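The two kinds of representation mentioned above can be sketched with the Python standard library alone. The Tf-Idf weighting used here (raw term count times the natural logarithm of N over document frequency) is one common variant chosen for illustration; the text does not say which variant or tokenization was used. The function names and the pre-tokenized toy titles are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List


def binary_vector(items: Iterable[str]) -> Dict[str, float]:
    """Binary representation: 1.0 for every item present (e.g., each author)."""
    return {item: 1.0 for item in items}


def tfidf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Tf-Idf vectors for tokenized documents.

    Assumed variant: weight = term count * ln(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({term: count * math.log(n_docs / doc_freq[term])
                        for term, count in counts.items()})
    return vectors


# Toy, pre-tokenized titles and an author list (invented data).
titles = [["statin", "trial"], ["statin", "cohort"]]
title_vecs = tfidf_vectors(titles)
author_vec = binary_vector(["smith", "rossi"])
print(title_vecs, author_vec)
```

With this variant, a term that occurs in every document receives zero weight, so only the terms that distinguish one title from another contribute to the vector.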

The best performing model achieved a precision of 97.7% with a recall of 99.2%. Similar results were confirmed by experiments on a different dataset. We observed the following:
• Ensembles of features outperformed textual features alone, whether the text was represented with Tf-Idf or with embeddings.
• For text-based models, Tf-Idf based representations considerably beat embeddings on precision. Our interpretation was that, in a context where reviews come from the same domain, key phrases are better suited to capture the differences between them.
• Titles generally performed better than abstracts when using ensembles of features, and are computationally more efficient.
• Representing publications by means of simple binary or Tf-Idf based vectors suffices to achieve good performance, in contrast to more sophisticated solutions, such as representations based on embeddings.
• Many approaches are possible for implementing the review-level aggregation function $v^{*}$. Our experiments suggest that the best performing ones capture the most representative properties of reviews, rather than all properties regardless of their importance.
• To implement $\Phi^{*}$, our experiments show that simple mathematical operators suffice to achieve good results and keep the model computationally efficient and highly explainable. Among the benefits of explainability, note that for any given review it is easy to identify which features matter for good recommendation performance.

Contributions and open challenges

Several Web resources (e.g., academic graphs, domain-specific scientific databases) provide structured information about scientific publications, such as title, abstract, authors, citations and potentially ontological categories. Our work presents a methodological framework that can make use of such structured information to achieve high precision and high recall in the downstream task of domain-specific, personalized recommendation of scientific publications.

In addition, our work shows that representing publications with ensembles of features outperforms representations based on embedding vectors [2, 3, 4]. Our interpretation is that, to discriminate the membership of publications in reviews that belong to the same domain, the signals coming from key phrases and ontological categories are more relevant than those from more general semantic representations, like the ones obtained with text embeddings.

Constructing more sophisticated embeddings to represent publications that capture both their content and relational properties [5] might achieve comparable or better performance in a more domain-independent and task-agnostic way. However, we argue that using a simple mathematical similarity model based on easy-to-capture features is computationally inexpensive, easier to train, more interpretable and still highly generalizable.

Finally, we believe that there are still open challenges to address. Our experiments show the importance of some features, such as topical and ontological categories. However, such features might be available only in some domains, or be hard to extract. On the one hand, it would be worth studying the impact of using text embeddings over titles and abstracts in synergy with standard features (i.e., n-grams over titles and abstracts, citation network and co-authoring), to see if they could compensate for more sophisticated and domain-dependent features (i.e., ontological categories, fields of study). On the other hand, it would be worth studying generalizable methods for extracting ontological features from publications [6].

[1] Sinha, Shen, Song, Ma, Eide, B.-J. P. Hsu, Wang, An overview of Microsoft Academic Service (MAS) and applications, in: Proceedings of the 24th International Conference on World Wide Web, 2015, pp. 243-246.

[2] Cer, Yang, S. yi Kong, Hua, N. L. U. Limtiaco, R. S. John, N. Constant, M. Guajardo-Céspedes, S. Yuan, Tar, Y. hsuan Sung, B. Strope, R. Kurzweil, Universal sentence encoder, arXiv preprint arXiv:1803.11175 (2018).

[3] C. W. Schmidt, Improving a tf-idf weighted document vector embedding, arXiv preprint arXiv:1902.09875 (2019).

[4] Arora, Liang, T. Ma, A simple but tough-to-beat baseline for sentence embeddings, in: ICLR 2017: International Conference on Learning Representations 2017, 2017.

[5] Nozza, Fersini, E. Messina, Cage: Constrained deep attributed graph embedding, Information Sciences 518 (2020) 56-70.

[6] Gutiérrez-Basulto, Schockaert, From knowledge graph embedding to ontology embedding? An analysis of the compatibility between vector space representations and rules, in: KR, 2018, pp. 379-388.