<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>OntoClue, a framework to compare vector-based approaches for document relatedness using the RELISH corpus</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rohitha Ravinder</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tim Fellerhoff</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vishnu Dadi</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lukas Geist</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Guillermo Rocamora</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Muhammad Talha</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dietrich Rebholz-Schuhmann</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leyla Jael Castro</string-name>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Bonn-Aachen International Centre for Information Technology (B-IT), University of Bonn</institution>
          ,
          <addr-line>Friedrich-Hirzebruch-Allee 6, Bonn, 53115</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Heinrich-Heine University Düsseldorf</institution>
          ,
          <addr-line>Universitätsstraße 1, Düsseldorf, 40225</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Hochschule Bonn-Rhein-Sieg</institution>
          ,
          <addr-line>Grantham-Allee 20, Sankt Augustin, 53757</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Universidad de Murcia</institution>
          ,
          <addr-line>Avda. Teniente Flomesta 5, Murcia, 30003</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>University of Cologne</institution>
          ,
          <addr-line>Albertus-Magnus-Platz, Cologne, 50923</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>ZB MED Information Centre for Life Sciences</institution>
          ,
          <addr-line>Gleueler Str. 60, Cologne, 50931</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The continuous growth of biomedical scholarly publications makes it challenging to build document recommendation algorithms that help researchers navigate the literature and keep up with relevant publications. Understanding semantic relatedness and similarity between two documents could improve document recommendations. The objective of this study is to perform a comparative analysis of vector-based approaches to assess document similarity in the RELISH corpus. Here we present our approach to compare five different techniques to generate vectors representing the text in the documents. These techniques employ a combination of Natural Language Processing frameworks such as Word2Vec, Doc2Vec and dictionary-based Named Entity Recognition, as well as state-of-the-art models based on BERT.</p>
      </abstract>
      <kwd-group>
        <kwd>Document similarity</kwd>
        <kwd>Word embeddings</kwd>
        <kwd>Named Entity Recognition</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Recommendation systems are a successful method to cope with information overload with respect to scientific
publications [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. For biomedical publications, PubMed Related Articles (PMRA) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] is still considered
the de facto standard; however, advances in Natural Language Processing (NLP), including
word embeddings, offer alternative paths to improve the state of the art and explore further similarity,
relatedness and relevance. The RELISH [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] dataset corresponds to a document-to-document relevance
assessment (definitely relevant, partially relevant, non-relevant) that can be used for comparing,
improving and translating newly developed literature search techniques, including recommendation
systems. Here we present OntoClue, a framework to compare different approaches to generate vectors
for articles in the RELISH corpus.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. OntoClue framework</title>
      <p>OntoClue can be summarized in the following steps: (i) retrieve the title and abstract of the RELISH
articles in XML, recording those that cannot be retrieved; (ii) trim the RELISH corpus so that it includes
only retrieved articles; (iii) reduce the RELISH corpus so that only relevance assessments for which there is
a clear consensus are kept; (iv) connect the approaches to be compared by OntoClue in a workflow
fashion; (v) optimize the approaches using an Area Under the Curve (AUC) approach; (vi) evaluate
precision and cumulative gain for each approach using the optimal parameters; and (vii) provide
comparison tables for the different approaches. The hyperparameter optimization follows a
multi-class classification approach: Cosine Similarity is split into intervals from 0 to 1 in increments of 0.1, and
the number of definitely relevant, partially relevant and non-relevant RELISH pairs is counted for each
interval. For each participating approach, the optimization selects the hyperparameter
combination with the best AUC score. The optimization can also be simplified to two classes
by combining definitely and partially relevant into a single class “relevant”.</p>
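      <p>As a minimal sketch of the interval counting and the two-class simplification described above, the following Python code bins cosine similarities into ten intervals of width 0.1, counts relevance labels per interval, and selects the setting with the highest two-class AUC. The function names, the label encoding (2 = definitely relevant, 1 = partially relevant, 0 = non-relevant), the toy similarity values, the setting names and the pairwise AUC computation are illustrative assumptions and not the OntoClue implementation; the multi-class AUC variant is not specified in the text and is not shown.</p>
      <preformat><![CDATA[
```python
import math
from collections import Counter

# Assumed label encoding (not from the source): 2 = definitely relevant,
# 1 = partially relevant, 0 = non-relevant.


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def interval_index(similarity, n_bins=10):
    """Map a similarity to one of n_bins equal intervals on [0, 1].

    Values outside [0, 1] are clipped, and 1.0 falls into the last interval.
    """
    clipped = min(max(similarity, 0.0), 1.0)
    # The small epsilon guards against products such as 2.9999999999999996.
    return min(int(clipped * n_bins + 1e-9), n_bins - 1)


def count_labels_per_interval(similarities, labels, n_bins=10):
    """Return one Counter of relevance labels per similarity interval."""
    counts = [Counter() for _ in range(n_bins)]
    for sim, label in zip(similarities, labels):
        counts[interval_index(sim, n_bins)][label] += 1
    return counts


def binary_auc(similarities, labels):
    """Two-class ROC AUC: relevant (label 1 or 2) versus non-relevant (label 0).

    Computed as the fraction of (relevant, non-relevant) combinations in which
    the relevant pair has the higher similarity; ties count as one half.
    """
    positives = [s for s, y in zip(similarities, labels) if y != 0]
    negatives = [s for s, y in zip(similarities, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one relevant and one non-relevant pair")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def best_setting(similarities_by_setting, labels):
    """Return the name of the hyperparameter setting with the highest AUC."""
    return max(
        similarities_by_setting,
        key=lambda name: binary_auc(similarities_by_setting[name], labels),
    )


# Toy data: six document pairs scored under two hypothetical settings.
labels = [2, 2, 1, 0, 0, 0]
similarities_by_setting = {
    "setting_a": [0.95, 0.82, 0.55, 0.35, 0.61, 0.12],
    "setting_b": [0.20, 0.30, 0.40, 0.50, 0.60, 0.70],
}

counts = count_labels_per_interval(similarities_by_setting["setting_a"], labels)
for i, counter in enumerate(counts):
    print(f"[{i / 10:.1f}, {(i + 1) / 10:.1f}): {dict(counter)}")
print("best setting:", best_setting(similarities_by_setting, labels))
```
]]></preformat>
      <p>In this toy example, setting_a places all but one of the relevant pairs above every non-relevant pair, whereas setting_b ranks them in reverse, so best_setting returns setting_a.</p>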
      <p>
        We are testing and tuning our OntoClue framework with five approaches: (i) Doc2Vec [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ],
an existing approach for document vectors; (ii) word2doc2vec, an in-house approach to document vectors;
(iii) whatizit-dictionary, using Whatizit [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], a dictionary-based named entity recognition approach;
(iv) hybrid-doc2vec, a combination of Doc2Vec and Whatizit; and (v) a BERT-based approach using
pre-trained BERT models (all other approaches are trained on the RELISH articles only).
      </p>
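      <p>Step (vi) of the framework evaluates precision and cumulative gain for each approach. A minimal sketch of both measures for a single ranked recommendation list follows. It assumes graded gains of 2, 1 and 0 for definitely relevant, partially relevant and non-relevant documents, and it uses the common log2-discounted form of cumulative gain (DCG and its normalized variant nDCG). The text does not state which variant OntoClue uses, and all function names and the example ranking are illustrative.</p>
      <preformat><![CDATA[
```python
import math


def precision_at_k(ranked_labels, k, relevant_threshold=1):
    """Fraction of the top-k items whose label is at least relevant_threshold.

    If the list holds fewer than k items, the fraction is taken over the
    items that are present; an empty list yields 0.0.
    """
    top = ranked_labels[:k]
    if not top:
        return 0.0
    return sum(1 for label in top if label >= relevant_threshold) / len(top)


def dcg_at_k(ranked_labels, k):
    """Discounted cumulative gain: sum of label / log2(rank + 1) over the top k."""
    return sum(
        label / math.log2(rank + 1)
        for rank, label in enumerate(ranked_labels[:k], start=1)
    )


def ndcg_at_k(ranked_labels, k):
    """DCG divided by the DCG of the ideal (descending) ordering of the same labels."""
    ideal = dcg_at_k(sorted(ranked_labels, reverse=True), k)
    if ideal == 0:
        return 0.0
    return dcg_at_k(ranked_labels, k) / ideal


# Hypothetical ranking returned by one approach for one query document,
# expressed as relevance labels in ranked order.
ranked = [2, 0, 1, 2, 0]

print("precision@3:", precision_at_k(ranked, 3))
print("DCG@3:", dcg_at_k(ranked, 3))
print("nDCG@3:", ndcg_at_k(ranked, 3))
```
]]></preformat>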
    </sec>
    <sec id="sec-3">
      <title>3. Future Work</title>
      <p>We plan to use our OntoClue framework to compare the five approaches mentioned above and select
the best one as the basis for a new recommendation system. This system should cover not only the biomedical
domain but also the agricultural one, as both are part of our use case LIVIVO, the ZB MED
literature portal. The recommendation system should also support multilingualism, as LIVIVO
contains publications in English, German, French, Portuguese and Spanish. In addition, we want to
cover non-traditional research outputs, e.g., data and software, as well as publications other than peer-reviewed journal
articles, e.g., conference papers and preprints.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Acknowledgements</title>
      <p>This work was partially supported by the STELLA project funded by DFG (project no. 407518790),
the NFDI4DataScience project funded by GWK and DFG (no. NFDI 34/1), and the BMBF-funded
de.NBI Cloud within the German Network for Bioinformatics Infrastructure (de.NBI) (031A532B,
031A533A, 031A533B, 031A534A, 031A535A, 031A537A, 031A537B, 031A537C, 031A537D,
031A538A).</p>
    </sec>
    <sec id="sec-5">
      <title>5. References</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Zhu</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Patra</surname>
            <given-names>BG</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yaseen</surname>
            <given-names>A</given-names>
          </string-name>
          .
          <article-title>Recommender system of scholarly papers using public datasets</article-title>
          .
          <source>AMIA Jt Summits Transl Sci Proc</source>
          .
          <year>2021</year>
          May 17;
          <volume>2021</volume>
          :
          <fpage>672</fpage>
          -
          <lpage>679</lpage>
          . PMID: 34457183; PMCID: PMC8378599.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Lin</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wilbur</surname>
            <given-names>WJ</given-names>
          </string-name>
          .
          <article-title>PubMed related articles: a probabilistic topic-based model for content similarity</article-title>
          .
          <source>BMC Bioinformatics</source>
          .
          <year>2007</year>
          Oct 30;
          <volume>8</volume>
          :
          <fpage>423</fpage>
          . doi: 10.1186/1471-2105-8-423. PMID: 17971238; PMCID: PMC2212667.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Brown</surname>
            <given-names>P</given-names>
          </string-name>
          ; RELISH Consortium, Zhou Y.
          <article-title>Large expert-curated database for benchmarking document similarity detection in biomedical literature search</article-title>
          .
          <source>Database (Oxford)</source>
          .
          <year>2019</year>
          Jan 1;
          <volume>2019</volume>
          :baz085. doi: 10.1093/database/baz085. PMID: 33326193; PMCID: PMC7291946.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Mikolov</surname>
            <given-names>T</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            <given-names>I</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            <given-names>K</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            <given-names>GS</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            <given-names>J</given-names>
          </string-name>
          .
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          .
          <source>In: Advances in Neural Information Processing Systems</source>
          .
          <year>2013</year>
          :
          <fpage>3111</fpage>
          -
          <lpage>3119</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Dietrich</given-names>
            <surname>Rebholz-Schuhmann</surname>
          </string-name>
          , Miguel Arregui, Sylvain Gaudan, Harald Kirsch, Antonio Jimeno.
          <article-title>Text processing through Web services: calling Whatizit</article-title>
          ,
          <source>Bioinformatics</source>
          , Volume
          <volume>24</volume>
          , Issue
          <issue>2</issue>
          , 15 January
          <year>2008</year>
          , Pages
          <fpage>296</fpage>
          -
          <lpage>298</lpage>
          , https://doi.org/10.1093/bioinformatics/btm557.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>