<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>International Journal on Digital Libraries</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>BERT, ELMo, USE and InferSent Sentence Encoders: The Panacea for Research-Paper Recommendation?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hebatallah A. Mohamed Hassan</string-name>
          <email>hebatallah.mohamed@uniroma3.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabio Gasparetti</string-name>
          <email>gaspare@dia.uniroma3.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Joeran Beel</string-name>
          <email>beelj@tcd.ie</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuseppe Sansonetti</string-name>
          <email>gsansone@dia.uniroma3.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Micarelli</string-name>
          <email>micarel@dia.uniroma3.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Roma Tre University, Department of Engineering</institution>
          ,
          <addr-line>Rome</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Trinity College Dublin, School of Computer Science and Statistics, ADAPT Centre</institution>
          ,
          <addr-line>Dublin</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <volume>4</volume>
      <issue>2016</issue>
      <abstract>
        <p>Content-based approaches to research paper recommendation are important when user feedback is sparse or not available. The task of content-based matching is challenging, mainly due to the problem of determining the semantic similarity of texts. Nowadays, there exist many sentence embedding models that learn deep semantic representations by being trained on huge corpora, aiming to provide transfer learning to a wide variety of natural language processing tasks. In this work, we present a comparative evaluation among five well-known pre-trained sentence encoders deployed in the pipeline of title-based research paper recommendation. The experimented encoders are USE, BERT, InferSent, ELMo, and SciBERT. For our study, we propose a methodology for evaluating such models in reranking BM25-based recommendations. The experimental results show that the sole consideration of semantic information from these encoders does not lead to improved recommendation performance over the traditional BM25 technique, while their integration enables the retrieval of a set of relevant papers that may not be retrieved by the BM25 ranking function.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>INTRODUCTION</title>
      <p>Keywords: pre-trained sentence embeddings, semantic similarity, reranking, research paper recommendation</p>
      <p>Sentence encoders such as Google’s BERT and USE, Facebook’s InferSent, and AllenAI’s SciBERT and ELMo have received significant attention in recent years. These pre-trained machine learning models encode a sentence into deep contextualized embeddings. They have been reported to outperform previous state-of-the-art approaches, such as traditional word embeddings, on many natural language processing tasks [2, 3, 5, 7, 8]. Such tasks include the calculation of semantic similarity and relatedness, which is key to developing effective research-paper recommender systems. If sentence encoders calculated the relatedness of research articles as well as they perform on other tasks, this would be a great advancement for research-paper recommender systems.</p>
      <p>There is some work on using document embeddings to rank research papers by semantic relatedness (e.g., [4]), but, as far as we know, no work on exploiting pre-trained sentence embeddings for the same task. Existing work on sentence encoders has focused on different domains, such as social media posts [10], news [6, 10], and web pages [6]. Our goal is to find out how well some of the most common sentence encoders perform at calculating document relatedness in the scenario of a research-paper recommender system. To the best of our knowledge, we are the first to conduct such an evaluation in the field of research paper recommendation.</p>
    </sec>
    <sec id="sec-3">
      <title>METHODOLOGY</title>
      <p>We focus on the task of related-article recommendation, where a recommender system receives one paper as input and returns a list of related papers. In our experiments, research papers are represented by their title only. While this may not be ideal for leveraging the full potential of sentence encoders, using only the title is a realistic scenario, as many research-paper recommender systems do not use full texts but only the title (and sometimes the abstract) [1].</p>
    </sec>
    <sec id="sec-4">
      <title>Sentence Embeddings, Baseline, and Hybrid Approaches</title>
      <p>We experiment with five pre-trained sentence encoders to transform the input paper and the candidate papers in the corpus into sentence embeddings. The implementations used are USE (DAN) (https://tfhub.dev/google/universal-sentence-encoder/2), USE (Transformer) (https://tfhub.dev/google/universal-sentence-encoder-large/3), InferSent (https://github.com/facebookresearch/InferSent), ELMo (https://tfhub.dev/google/elmo/2), BERT (https://github.com/google-research/bert) served via bert-as-service (https://github.com/hanxiao/bert-as-service), and SciBERT (https://github.com/allenai/scibert). The encoders are as follows:</p>
      <sec id="sec-4-5">
        <list list-type="bullet">
          <list-item>
            <p>USE. Two models are available for download from TensorFlow Hub: the former trained with a Deep Averaging Network (DAN), the latter with a Transformer network. The models are trained on a variety of web sources, such as Wikipedia, web news, and web question-answer pages, as well as supervised data from the Stanford Natural Language Inference (SNLI) corpus. Both models return vectors of 512 dimensions as output [3].</p>
          </list-item>
          <list-item>
            <p>InferSent. It adopts a bidirectional Long Short-Term Memory (LSTM) network with a max-pooling operator as sentence encoder and is trained on the SNLI corpus. We experimented with two models of InferSent: the former trained using GloVe word embeddings, the latter using fastText word embeddings. The output is an embedding of 4096 dimensions [5].</p>
          </list-item>
          <list-item>
            <p>ELMo. It uses a bidirectional LSTM to compute contextualized character-based word representations. We used the TensorFlow Hub implementation of ELMo, trained on the 1 Billion Word Benchmark. It returns a representation of 1024 dimensions [8].</p>
          </list-item>
          <list-item>
            <p>BERT. It is a sentence embedding model that learns vector representations by training a deep bidirectional Transformer network. We used the uncased BERT-Base model trained on Wikipedia, through bert-as-service, to obtain a vector of 768 dimensions [7].</p>
          </list-item>
          <list-item>
            <p>SciBERT. It is a BERT model trained on a corpus of 1.14M scientific papers [2]. We used the recommended uncased scibert-scivocab version. As with BERT, we obtain a vector of 768 dimensions using bert-as-service.</p>
          </list-item>
        </list>
        <p>As a strong baseline, we used BM25, a common approach to document ranking. Today’s sentence encoders have a major drawback: high execution time on large corpora. Using sentence embeddings on large corpora seems hardly feasible in a production recommender system, which needs to return recommendations within a few seconds or less. Hence, we not only compare the embeddings with BM25, but additionally experiment with hybrid approaches in which we first use Apache Lucene’s BM25 to retrieve a list of the top-20, 50, or 100 recommendation candidates, and then rerank that list with the sentence encoders. Figure 1 shows the overall architecture of the proposed approach. In the reranking step, we calculate the cosine similarity between the sentence embedding of the input paper title and the embeddings of all the candidate paper titles; this metric expresses their semantic similarity. We then linearly combine the normalized BM25 scores with the semantic similarity scores from the sentence embeddings, summing them with uniform weights of 0.5, to generate the final ranked recommendations.</p>
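        <p>As a minimal sketch (not the authors’ code), the reranking step can be illustrated as follows. The embedding vectors stand in for the encoder outputs, and the use of min-max normalization for the BM25 scores is an assumption:</p>

```python
import math


def cosine_similarity(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def minmax_normalize(scores):
    # Scale raw BM25 scores into [0, 1] so they are comparable
    # to cosine similarities before the linear combination.
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def hybrid_rerank(input_emb, candidates, weight=0.5):
    """Rerank BM25 candidates by a linear combination of the
    normalized BM25 score and the embedding cosine similarity
    (uniform weights of 0.5, as in the paper).

    candidates: list of (paper_id, bm25_score, title_embedding) tuples.
    Returns (paper_id, combined_score) pairs, best first.
    """
    bm25_norm = minmax_normalize([c[1] for c in candidates])
    scored = []
    for (paper_id, _, emb), bm25 in zip(candidates, bm25_norm):
        semantic = cosine_similarity(input_emb, emb)
        scored.append((paper_id, weight * bm25 + (1 - weight) * semantic))
    return sorted(scored, key=lambda x: x[1], reverse=True)
```

        <p>In a real pipeline, the candidate list and BM25 scores would come from the Lucene index, and the embeddings from one of the encoders above.</p>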
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Dataset</title>
      <p>For the evaluation, we used the CiteULike dataset [9]. It contains the paper collections of 5,551 researchers, i.e., lists of the documents that each researcher added to their personal document collection. Table 1 shows the statistics of the dataset.</p>
    </sec>
    <sec id="sec-6">
      <title>Evaluation</title>
      <p>Using 5-fold cross-validation, we split the data by randomly selecting one of the paper titles in a user’s library as the input paper; all the remaining paper titles are used to evaluate whether the recommended papers were actually in the user’s library. As evaluation metrics, we calculated Recall, Precision, and Mean Average Precision (MAP) at rank 10. We also measured the execution time. Our tests were performed on a computer with an Intel Core i7-6700 CPU and 16 GB of RAM.</p>
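      <p>For reference, the rank-10 metrics can be computed as in the following generic sketch (an illustration of the standard definitions, not the evaluation code used in the paper); <monospace>relevant</monospace> is the set of papers actually in the user’s library:</p>

```python
def precision_at_k(ranked, relevant, k=10):
    # Fraction of the top-k recommendations that are relevant.
    return sum(1 for r in ranked[:k] if r in relevant) / k


def recall_at_k(ranked, relevant, k=10):
    # Fraction of all relevant papers that appear in the top-k.
    return sum(1 for r in ranked[:k] if r in relevant) / len(relevant)


def average_precision_at_k(ranked, relevant, k=10):
    # Mean of the precision values at each rank where a relevant
    # paper occurs, normalized by the number of retrievable relevants.
    hits, score = 0, 0.0
    for i, r in enumerate(ranked[:k]):
        if r in relevant:
            hits += 1
            score += hits / (i + 1)
    return score / min(len(relevant), k) if relevant else 0.0


def map_at_k(ranked_lists, relevant_sets, k=10):
    # MAP@k: average precision averaged over all users/queries.
    aps = [average_precision_at_k(r, rel, k)
           for r, rel in zip(ranked_lists, relevant_sets)]
    return sum(aps) / len(aps)
```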
    </sec>
    <sec id="sec-7">
      <title>RESULTS AND DISCUSSION</title>
      <p>In Table 2, we report the performance of (i) BM25 without any reranking, as the baseline; (ii) the sentence embeddings alone; and (iii) the BM25 scores combined with the sentence embedding similarity scores for reranking. The results show that none of the sentence embedding models alone is able to outperform BM25. Furthermore, we observe that USE, BERT, and SciBERT outperform ELMo and InferSent on average. One possible reason is that USE, BERT, and SciBERT are trained on corpora that contain technical and scientific terms (i.e., USE and BERT on Wikipedia, SciBERT on scientific papers), whereas ELMo and InferSent are trained on a news crawl and a natural language inference corpus, respectively.</p>
      <p>On the other hand, the hybrid sentence embeddings + BM25 ranking outperforms all single approaches. Apparently, in some cases BM25 fails to assign the right ranking scores to papers, while sentence embeddings can capture the semantic similarity between them. In this case, the ranking performance increases with the number of top-N papers retrieved by BM25, which means that more relevant papers can be found. BM25 + USE (Transformer) performs best. Compared to BM25, it relatively increases MAP@10 by +5.29% when reranking 20 titles, by +6.47% when reranking 50 titles, and by +7.35% when reranking 100 titles.</p>
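      <p>The relative (not absolute) increases reported above follow the standard formula, sketched here with illustrative numbers rather than the paper’s actual MAP values:</p>

```python
def relative_improvement(new_score, baseline):
    # Relative percentage increase of new_score over baseline,
    # e.g. a MAP@10 rising from b to n gives (n - b) / b * 100.
    return (new_score - baseline) / baseline * 100.0
```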
      <p>In our experiments, BM25 queries took around 5 milliseconds to retrieve up to 100 results. The extra time taken to calculate embeddings and rerank 20, 50, and 100 titles with the different models is shown in Figure 2. USE (DAN) is the fastest, taking around 0.02 seconds to rerank 20 or 50 titles and 0.03 seconds to rerank 100 titles. ELMo is the slowest at reranking 20 or 50 titles. Finally, BERT and SciBERT using bert-as-service are the slowest at reranking 100 titles, taking around 4.0 seconds. This means that they could not be used for real-time reranking of recommendations unless greater computing resources (e.g., a GPU or TPU) were provided.</p>
      <p>In conclusion, our results show that the sentence encoders – which perform so well in other domains – do not perform well in the domain of research paper recommendation. When combined with BM25 in a hybrid approach, they perform better than BM25 alone or any of the encoders alone. However, the improvement of up to 7.35% is still small compared to the exceptional results sentence encoders achieve in other domains. A limitation of our research is that we only worked with the titles of papers. Presumably, the encoders work better with longer texts. In future research, we will use at least the abstracts. However, this will probably further increase the running time, and hence decrease the applicability of sentence encoders in systems that require real-time recommendations.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>