<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <article-categories>
        <subj-group subj-group-type="heading">
          <subject>Discussion Paper</subject>
        </subj-group>
      </article-categories>
      <title-group>
        <article-title>Using an Ensemble of Features for Personalized Recommendations of Scientific Publications</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Paolo Tenti</string-name>
          <email>p.tenti1@campus.unimib.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>James Thomas</string-name>
          <email>james.thomas@ucl.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rafael Peñaloza</string-name>
          <email>rafael.penaloza@unimib.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gabriella Pasi</string-name>
          <email>gabriella.pasi@unimib.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>EPPI Centre, UCL Social Research Institute, University College London</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>IKR3 Lab, University of Milano-Bicocca</institution>
          ,
          <addr-line>Milan</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Keeping reviews of scientific publications up to date as new relevant publications become available is a common challenge for many research communities. We address this challenge as a content-based recommendation problem, where the publications already selected for a review drive the recommendation of new publications. In addition, resources such as domain databases, ontologies and academic graphs provide structured information about publications (e.g., authors, journals, conferences). Our experiments show that a simple model that represents publications using this structured information achieves high precision and recall, and outperforms models that use more sophisticated representations based on embeddings.</p>
      </abstract>
      <kwd-group>
        <kwd>Content-based recommendation</kwd>
        <kwd>Scientific papers recommendations</kwd>
        <kwd>Text classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>A common issue for many research communities is building and maintaining meaningful
collections of references to scientific publications related to a specific research topic; to this end,
reviews are compiled, which report such references in an organized way. In this context, a first
challenge is discovering the initial collection of publications to be considered for compiling
a review; this can be done by searching across several scientific databases and journals and
going through several rounds of filtering and refinement. A second challenge is maintaining
reviews; that is, capturing new relevant publications as soon as they become available after a review
has been constructed, with the aim of keeping the review up to date. Both processes are partially
manual, long-running, and error-prone. We argue that the problem of maintaining existing
reviews can be framed as a content-based recommendation problem: publications in existing reviews
can be used to automatically find and recommend new publications to the reviews’ owners.</p>
    </sec>
    <sec id="sec-2">
      <title>Problem statement</title>
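      <p>Let <inline-formula><tex-math>\wp</tex-math></inline-formula> be the domain of publications and <inline-formula><tex-math>\mathcal{R}</tex-math></inline-formula> the domain of reviews, such that a review <inline-formula><tex-math>r \in \mathcal{R}</tex-math></inline-formula> is
a finite set of scientific publications, i.e., <inline-formula><tex-math>r = \{p_1, \ldots, p_n \mid p_i \in \wp\}</tex-math></inline-formula>.</p>
      <p>A set <inline-formula><tex-math>R \subseteq \mathcal{R}</tex-math></inline-formula> of reviews and a set <inline-formula><tex-math>\wp_{\mathrm{new}} \subseteq \wp</tex-math></inline-formula> of brand-new publications are given. For any review
<inline-formula><tex-math>r \in R</tex-math></inline-formula>, the problem is to retrieve a set <inline-formula><tex-math>\wp_r \subseteq \wp_{\mathrm{new}}</tex-math></inline-formula> of publications that are relevant to <inline-formula><tex-math>r</tex-math></inline-formula>.</p>
      <p>
        To define the notion of relevance, we first define a similarity function <inline-formula><tex-math>\Phi : \mathcal{R} \times \wp \to [0, 1]</tex-math></inline-formula>
that, given a review <inline-formula><tex-math>r \in \mathcal{R}</tex-math></inline-formula> and a publication <inline-formula><tex-math>p \in \wp</tex-math></inline-formula>, returns their similarity score. Relevance
can be modeled in this context as a binary function <inline-formula><tex-math>\Xi : \mathcal{R} \times \wp \to \{\mathit{true}, \mathit{false}\}</tex-math></inline-formula> that, given
a threshold <inline-formula><tex-math>\tau</tex-math></inline-formula>, returns true if and only if <inline-formula><tex-math>\Phi(r, p) &gt; \tau</tex-math></inline-formula>. For any review <inline-formula><tex-math>r \in R</tex-math></inline-formula>, the set of relevant
publications <inline-formula><tex-math>\wp_r = \{p \mid p \in \wp_{\mathrm{new}} \land \Xi(r, p)\}</tex-math></inline-formula> will be recommended.
      </p>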
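      <p>As a minimal sketch of the definitions above, the following Python fragment recommends every candidate publication whose similarity to a review exceeds a threshold. The names <monospace>recommend</monospace> and <monospace>toy_phi</monospace>, the word-overlap similarity, and the example titles are illustrative assumptions; no concrete similarity function Φ is prescribed at this point of the paper.</p>
      <preformat><![CDATA[
```python
from typing import Callable, FrozenSet, Iterable, Set

Publication = str                 # assumption: a publication is identified by its title
Review = FrozenSet[Publication]   # a review is a finite set of publications


def toy_phi(review: Review, publication: Publication) -> float:
    """Hypothetical similarity in [0, 1]: share of the publication's title
    words that also occur in some title of the review."""
    review_words = {w for title in review for w in title.lower().split()}
    pub_words = set(publication.lower().split())
    if not pub_words:
        return 0.0
    return len(pub_words & review_words) / len(pub_words)


def recommend(review: Review,
              candidates: Iterable[Publication],
              phi: Callable[[Review, Publication], float],
              tau: float) -> Set[Publication]:
    """Keep the candidates for which the relevance test phi(r, p) > tau holds."""
    return {p for p in candidates if phi(review, p) > tau}


# Made-up example data, not taken from the paper's dataset.
review = frozenset({
    "deep learning for text classification",
    "neural text classification models",
})
new_publications = [
    "text classification with neural networks",
    "soil erosion in alpine valleys",
]
recommended = recommend(review, new_publications, toy_phi, tau=0.5)
```
]]></preformat>
      <p>The strict inequality mirrors the definition: a publication whose score equals the threshold is not recommended.</p>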
    </sec>
    <sec id="sec-3">
      <title>Methodological framework</title>
      <p>
        Resources such as scientific databases (e.g., PubMed<sup>1</sup>), ontologies (e.g., Unified Medical
Language System<sup>2</sup>, PICO<sup>3</sup>) and academic graphs (e.g., Microsoft Academic Graph [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]) provide
relational information about publications, such as title, abstract, authors, citations, references,
journals, and conferences. That information can be used to enrich publications with descriptive
features. Hence, we investigate using these features to construct multiple vector
representations of publications, to address the problem of recommending scientific publications
as described above.
      </p>
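      <p>Given a set of publication features <inline-formula><tex-math>\mathcal{F} = \{f_1, \ldots, f_m\}</tex-math></inline-formula>, for a publication <inline-formula><tex-math>p \in \wp</tex-math></inline-formula> we define
<inline-formula><tex-math>V(p) = \{v_{f_1}(p), \ldots, v_{f_m}(p) \mid f_i \in \mathcal{F}\}</tex-math></inline-formula> as the set of vectors that represent <inline-formula><tex-math>p</tex-math></inline-formula>, where <inline-formula><tex-math>v_f(p)</tex-math></inline-formula> stands
for the vector representation of <inline-formula><tex-math>p</tex-math></inline-formula> with respect to the feature <inline-formula><tex-math>f</tex-math></inline-formula>.</p>
      <p>We define a methodological framework for using the available features to compute a
compound similarity function between a publication and a review. This framework is simple yet
extensible.</p>
      <p>First, we define a method to construct the vectors <inline-formula><tex-math>v_f(p)</tex-math></inline-formula> for a given publication <inline-formula><tex-math>p</tex-math></inline-formula> and feature
<inline-formula><tex-math>f</tex-math></inline-formula>. In addition, we define the representation of a review <inline-formula><tex-math>r \in \mathcal{R}</tex-math></inline-formula> with respect to a feature <inline-formula><tex-math>f</tex-math></inline-formula> as
<inline-formula><tex-math>v_f(r) = v^{*}(\{v_f(p) \mid p \in r\})</tex-math></inline-formula>, where <inline-formula><tex-math>v^{*}</tex-math></inline-formula> is an aggregation function over the set of publication
representations with respect to the same feature <inline-formula><tex-math>f</tex-math></inline-formula>.</p>
      <p>We further define <inline-formula><tex-math>\Phi_f(r, p)</tex-math></inline-formula> as the function to calculate the similarity between a publication
<inline-formula><tex-math>p</tex-math></inline-formula> and a review <inline-formula><tex-math>r</tex-math></inline-formula> with respect to a feature <inline-formula><tex-math>f</tex-math></inline-formula>. Finally, we define the function to calculate the
similarity between <inline-formula><tex-math>r</tex-math></inline-formula> and <inline-formula><tex-math>p</tex-math></inline-formula> as <inline-formula><tex-math>\Phi(r, p) = \Phi^{*}(\{\Phi_f(r, p) \mid f \in \mathcal{F}\})</tex-math></inline-formula>, where <inline-formula><tex-math>\Phi^{*}</tex-math></inline-formula> is an aggregation
function over feature-specific similarity scores.</p>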
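      <p>The framework above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: publications are dictionaries mapping a feature name to a sparse vector, the aggregation over a review's publications is taken to be the component-wise mean, the per-feature similarity is cosine similarity, and Φ* is the arithmetic mean of the per-feature scores. None of these concrete choices is fixed by the framework, and all function names and example values are hypothetical.</p>
      <preformat><![CDATA[
```python
import math
from typing import Dict, List

Vector = Dict[str, float]               # sparse vector: dimension name -> weight
PublicationVectors = Dict[str, Vector]  # feature name -> vector of that feature


def cosine(u: Vector, v: Vector) -> float:
    """Assumed per-feature similarity: cosine between two sparse vectors."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def aggregate_mean(vectors: List[Vector]) -> Vector:
    """Assumed aggregation over publications: the component-wise mean."""
    total: Vector = {}
    for vec in vectors:
        for k, w in vec.items():
            total[k] = total.get(k, 0.0) + w
    n = len(vectors)
    return {k: w / n for k, w in total.items()} if n else {}


def review_vectors(review: List[PublicationVectors]) -> Dict[str, Vector]:
    """One aggregated vector per feature for the whole review."""
    features = {f for pub in review for f in pub}
    return {f: aggregate_mean([pub[f] for pub in review if f in pub])
            for f in features}


def similarity(review: List[PublicationVectors],
               publication: PublicationVectors) -> float:
    """Compound similarity: mean of the per-feature cosine scores."""
    r_vecs = review_vectors(review)
    scores = [cosine(r_vecs[f], publication[f])
              for f in r_vecs if f in publication]
    return sum(scores) / len(scores) if scores else 0.0


# Made-up binary vectors for two features; author names are hypothetical.
review = [
    {"title": {"graph": 1.0, "embedding": 1.0}, "authors": {"a.rossi": 1.0}},
    {"title": {"graph": 1.0, "ontology": 1.0},
     "authors": {"a.rossi": 1.0, "b.chen": 1.0}},
]
candidate = {"title": {"graph": 1.0}, "authors": {"a.rossi": 1.0}}
unrelated = {"title": {"soil": 1.0}, "authors": {"c.lopez": 1.0}}

score_candidate = similarity(review, candidate)
score_unrelated = similarity(review, unrelated)
```
]]></preformat>
      <p>Swapping <monospace>aggregate_mean</monospace> or the final mean for another operator changes only one function, which is the kind of extensibility the framework aims for.</p>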
    </sec>
    <sec id="sec-4">
      <title>Evaluation</title>
      <p>We conducted a series of experiments on a manually labeled dataset of domain-specific reviews
and a few tens of thousands of publications. Note that the research domain is homogeneous,
and thus the reviews are quite similar to each other.
        <fn><label>1</label><p>https://pubmed.ncbi.nlm.nih.gov</p></fn>
        <fn><label>2</label><p>https://www.nlm.nih.gov/research/umls/index.html</p></fn>
        <fn><label>3</label><p>https://linkeddata.cochrane.org/pico-ontology</p></fn>
      </p>
      <p>We framed the evaluation of our method as a multi-class, multi-label classification problem. Specifically,
the classes are the reviews, and each publication might be relevant to multiple reviews. Thus,
the problem is to predict all relevant labels (i.e., reviews) for a given, previously unseen
publication.</p>
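      <p>To make the multi-label setting concrete, the sketch below computes precision and recall over (publication, review) label pairs. The paper does not state how the two measures were averaged, so micro-averaging is an assumption here, and the function name and the toy labels are hypothetical.</p>
      <preformat><![CDATA[
```python
from typing import Dict, Set, Tuple

Labels = Dict[str, Set[str]]  # publication id -> set of review ids


def micro_precision_recall(gold: Labels, predicted: Labels) -> Tuple[float, float]:
    """Micro-averaged precision and recall over (publication, review) pairs."""
    tp = fp = fn = 0
    for pub in gold.keys() | predicted.keys():
        g = gold.get(pub, set())
        p = predicted.get(pub, set())
        tp += len(g & p)   # reviews correctly predicted for this publication
        fp += len(p - g)   # reviews predicted but not relevant
        fn += len(g - p)   # relevant reviews that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy labels: p1 belongs to two reviews, p3 to none.
gold = {"p1": {"r1", "r2"}, "p2": {"r2"}, "p3": set()}
predicted = {"p1": {"r1"}, "p2": {"r2", "r3"}, "p3": set()}
precision, recall = micro_precision_recall(gold, predicted)
```
]]></preformat>
      <p>In this toy case two of the three predicted pairs are correct and two of the three gold pairs are found, so both measures equal 2/3.</p>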
      <p>We considered title, abstract, citation network, authors, journals, conferences, topics and
ontological categories as features, and represented them as either Tf-Idf vectors or binary
vectors. For topics we considered the Fields of Study extracted from the Microsoft Academic
Graph. We extracted ontological categories from PICO, a domain-specific ontology.</p>
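      <p>The two kinds of representation mentioned above can be illustrated as follows. The sketch assumes whitespace tokenization and one common Tf-Idf variant (relative term frequency multiplied by log(N/df)); the exact weighting used in the experiments is not specified in the text, and the function names and sample data are hypothetical.</p>
      <preformat><![CDATA[
```python
import math
from collections import Counter
from typing import Dict, List


def binary_vector(values: List[str]) -> Dict[str, float]:
    """Binary representation for categorical features (authors, journals, ...)."""
    return {v: 1.0 for v in values}


def tfidf_vectors(documents: List[str]) -> List[Dict[str, float]]:
    """Tf-Idf representation for textual features (titles, abstracts).

    Assumed weighting: (count / document length) * log(N / document frequency).
    A term that occurs in every document therefore gets weight 0.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Made-up titles and author names, for illustration only.
titles = [
    "graph embedding methods",
    "ontology embedding methods",
    "graph kernels",
]
title_vectors = tfidf_vectors(titles)
author_vector = binary_vector(["a.rossi", "b.chen"])
```
]]></preformat>
      <p>In the second title, the rare term "ontology" receives a higher weight than "embedding", which also appears in another title; this is the key-phrase effect that the Tf-Idf representation relies on.</p>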
      <p>The best-performing model achieved a precision of 97.7% with a recall of 99.2%. Similar results
were confirmed by experiments on a different dataset. We observed the following:</p>
      <list list-type="bullet">
        <list-item><p>Ensembles of features outperformed textual features alone, either with Tf-Idf based
representations or with embeddings.</p></list-item>
        <list-item><p>For text-based models, Tf-Idf based representations considerably beat embeddings on
precision. Our interpretation was that, in a context where reviews come from the same
domain, key phrases are better suited to capture the differences between them.</p></list-item>
        <list-item><p>Titles generally performed better than abstracts when using ensembles of features and
are computationally more efficient.</p></list-item>
        <list-item><p>Representing publications by means of simple binary or Tf-Idf based vectors suffices to
achieve good performance, in contrast to more sophisticated solutions, such as
representations based on embeddings.</p></list-item>
        <list-item><p>To implement the aggregation function <inline-formula><tex-math>v^{*}</tex-math></inline-formula> over publication representations, many approaches are possible. Our experiments suggest that the best
performing ones capture the most representative properties of reviews, rather than any
single property regardless of its importance.</p></list-item>
        <list-item><p>To implement <inline-formula><tex-math>\Phi^{*}</tex-math></inline-formula>, our experiments show that simple mathematical operators suffice to
achieve good results and keep the model computationally efficient and highly explainable.
Among the benefits of explainability, note that for any given review it is easy to identify
the features that are relevant to achieving good recommendation performance.</p></list-item>
      </list>
    </sec>
    <sec id="sec-5">
      <title>Contributions and open challenges</title>
      <p>Several Web resources (e.g., academic graphs and domain-specific scientific databases) provide
structured information about scientific publications, such as title, abstract, authors, citations and
potentially ontological categories. Our work presents a methodological framework that can make
use of such structured information to achieve high precision and high recall in the downstream
task of domain-specific, personalized recommendation of scientific publications.</p>
      <p>
        In addition, our work shows that representing publications with ensembles of features
outperforms representations based on embedding vectors [
        <xref ref-type="bibr" rid="ref2 ref3 ref4">2, 3, 4</xref>
        ]. Our interpretation is that, to
discriminate the membership of publications in reviews that belong to the same domain, the signals
coming from key phrases and ontological categories are more relevant than those from more
general semantic representations, such as the ones obtained with text embeddings.
      </p>
      <p>
        Constructing more sophisticated embeddings to represent publications that capture both
their content and relational properties [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] might achieve comparable or better performance in
a more domain-independent and task-agnostic way. However, we argue that using a simple
mathematical similarity model based on easy-to-capture features is computationally inexpensive,
easier to train, more interpretable and still highly generalizable.
      </p>
      <p>
        Finally, we believe that there are still open challenges to address. Our experiments show the
importance of some features, such as topical and ontological categories. However, such features
might be available only in some domains, or be hard to extract. On the one hand, it would be worth
studying the impact of using text embeddings over titles and abstracts in synergy with standard
features (i.e., n-grams over titles and abstracts, citation network and co-authorship), to see if
they could compensate for more sophisticated and domain-dependent features (i.e., ontological
categories, fields of study). On the other hand, it would be worth studying generalizable methods
for extracting ontological features from publications [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Sinha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Eide</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.-J. P.</given-names>
            <surname>Hsu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>An overview of Microsoft Academic Service (MAS) and applications</article-title>
          ,
          <source>in: Proceedings of the 24th International Conference on World Wide Web</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>243</fpage>
          -
          <lpage>246</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Cer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-Y.</given-names>
            <surname>Kong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Hua</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. L. U.</given-names>
            <surname>Limtiaco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>St. John</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Constant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Guajardo-Céspedes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-H.</given-names>
            <surname>Sung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Strope</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kurzweil</surname>
          </string-name>
          ,
          <article-title>Universal sentence encoder</article-title>
          , arXiv preprint arXiv:1803.11175 (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C. W.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <article-title>Improving a tf-idf weighted document vector embedding</article-title>
          , arXiv preprint arXiv:1902.09875 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Arora</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <article-title>A simple but tough-to-beat baseline for sentence embeddings</article-title>
          ,
          <source>in: International Conference on Learning Representations (ICLR)</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>Nozza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Fersini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Messina</surname>
          </string-name>
          ,
          <article-title>CAGE: Constrained deep attributed graph embedding</article-title>
          ,
          <source>Information Sciences</source>
          <volume>518</volume>
          (
          <year>2020</year>
          )
          <fpage>56</fpage>
          -
          <lpage>70</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>V.</given-names>
            <surname>Gutiérrez-Basulto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schockaert</surname>
          </string-name>
          ,
          <article-title>From knowledge graph embedding to ontology embedding? An analysis of the compatibility between vector space representations and rules</article-title>
          ,
          <source>in: KR</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>379</fpage>
          -
          <lpage>388</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>