<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Comparative analysis of protein function text-based embeddings and their applicability to prediction tasks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rohitha Ravinder</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leyla Jael Castro</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Hofmann-Apitius</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dietrich Rebholz-Schuhmann</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Bonn-Aachen International Centre for Information Technology (B-IT), University of Bonn</institution>
          ,
          <addr-line>Friedrich-Hirzebruch-Allee 6, Bonn, 53115</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Fraunhofer Institute for Algorithms and Scientific Computing (SCAI)</institution>
          ,
          <addr-line>Schloss Birlinghoven 1, Sankt Augustin, 53757</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Cologne</institution>
          ,
          <addr-line>Albertus-Magnus-Platz, Cologne, 50923</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>ZB MED Information Centre for Life Sciences</institution>
          ,
          <addr-line>Gleueler Str. 60, Cologne, 50931</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Predicting protein function is a difficult problem in bioinformatics. Many recent techniques employ embeddings to learn representations of protein sequences and infer function from them; however, no study so far has used embeddings of protein function texts to predict protein function. Here, we propose to learn and explore text-based embedding representations of the protein function comment sections kept as part of the Swiss-Prot entries, and to understand how the resulting data can be used to enhance protein function annotations. Our comparative study is based on protein function text embeddings derived from two approaches, which combine natural language processing frameworks such as Word2Vec and Doc2Vec with dictionary-based Named Entity Recognition, and serves as a preliminary assessment against direct propagation techniques such as sequence similarity and by-similarity prediction.</p>
      </abstract>
      <kwd-group>
        <kwd>Protein function prediction</kwd>
        <kwd>Word embeddings</kwd>
        <kwd>Named Entity Recognition</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Understanding the role of proteins is crucial to understanding life. However, the function of only a
small subset of proteins is well characterized, which makes protein function prediction a fundamental
task in the field of bioinformatics. Numerous techniques have been developed for protein function
prediction using sequence embeddings, protein structures or protein-protein interactions [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
Nevertheless, to the best of our knowledge, no research has yet evaluated text-based embeddings of
protein function descriptions for protein function prediction tasks. In this study, our goal is to gain a
better understanding of how information on protein function can be exploited through embeddings so
that the resulting representations can be used to improve protein function annotations.
      </p>
      <p>Our work is based on the hypothesis that there is a direct correlation between sequence similarity
(corresponding to the BLAST identity score) and similar biological function (as expressed in the
protein function comment). The idea is to capture this correlation with the help of the corresponding
protein embeddings. We consider text-based embeddings derived from the protein function comment
sections of the reviewed UniProtKB entries, i.e., the Swiss-Prot entries. Specifically, we aim to learn
and compare two embedding models that map protein function texts to vector representations such
that two proteins with a similar function, as stated in the function comment section, appear closer in
the embedding vector space. A secondary objective is to analyze the vocabulary that emerges from
the text-based embeddings.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methods</title>
      <p>
        Methods of representing textual data range from traditional term-frequency-based methods to
embeddings. In our study, we generate embeddings using two methods: word2doc2vec and
hybrid-word2doc2vec. word2doc2vec is an approach that makes use of the Word2Vec framework [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. This framework is a two-layer neural network trained to reconstruct the linguistic contexts of
words, assigning each unique word a corresponding vector using one of two architectures: skip-gram
or continuous bag-of-words. We generate document embeddings from these word embeddings by
calculating the centroid of all word embeddings in a function text.
      </p>
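      <p>The centroid strategy described above can be sketched in a few lines of Python. This is a minimal illustration with invented toy word vectors (the tokens and values below are hypothetical; a trained Word2Vec model would supply the real vectors learned from the Swiss-Prot function texts):</p>

```python
from math import sqrt

# Toy stand-ins for trained Word2Vec word vectors (hypothetical values).
word_vectors = {
    "catalyzes": [0.9, 0.1, 0.0],
    "hydrolysis": [0.8, 0.2, 0.1],
    "binds": [0.1, 0.9, 0.2],
    "dna": [0.0, 0.8, 0.3],
}

def doc_embedding(tokens, vectors):
    """Document embedding as the centroid of all in-vocabulary word vectors."""
    hits = [vectors[t] for t in tokens if t in vectors]
    if not hits:
        return None
    return [sum(v[i] for v in hits) / len(hits) for i in range(len(hits[0]))]

def cosine(a, b):
    """Cosine similarity between two embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

# Each function text collapses to a single vector; similar texts land closer.
enzyme = doc_embedding(["catalyzes", "hydrolysis"], word_vectors)
binder = doc_embedding(["binds", "dna"], word_vectors)
```

      <p>Under this scheme, the cosine similarity between the two documents is low because their function texts share no vocabulary, which is the property the embedding space is expected to exhibit.</p>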
      <p>
        Our second method is a hybrid approach that combines dictionary-based Named Entity
Recognition (NER) using the Whatizit tool [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] with the word2doc2vec framework. The dictionary-based NER approach aims to identify and
index biomedical terms and to account for additional information such as Gene Ontology
annotations, e.g., Molecular Function (MF) and Biological Process (BP). To do so, we make use of
Whatizit, a text processing system based on MONQjfa, a Java library for nondeterministic and
deterministic finite automata. Whatizit takes as input a dictionary to recognize entities in a text and
normalizes them against a controlled vocabulary. We use two ontologies, Medical Subject Headings
(MeSH) and Gene Ontology (GO), as controlled vocabularies for the annotation process. Both
methods ultimately treat each function text as a document and represent each word in the function
text as a word embedding.
      </p>
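      <p>A minimal sketch of the dictionary-based NER step, assuming a tiny hand-made dictionary (the surface forms and identifiers below are illustrative, not drawn from the full MeSH/GO dictionaries, and Whatizit itself relies on finite automata rather than this naive scan):</p>

```python
# Hypothetical mini-dictionary mapping surface forms to controlled-vocabulary
# identifiers in a GO/MeSH style.
dictionary = {
    "atp binding": "GO:0005524",
    "kinase activity": "GO:0016301",
    "apoptosis": "MeSH:D017209",
}

def annotate(text, dictionary):
    """Greedy longest-match dictionary NER over whitespace tokens:
    returns (surface form, identifier) pairs for every dictionary hit."""
    tokens = text.lower().split()
    n = len(tokens)
    hits = []
    i = 0
    while i != n:
        for j in range(n, i, -1):  # try the longest span first
            span = " ".join(tokens[i:j])
            if span in dictionary:
                hits.append((span, dictionary[span]))
                i = j
                break
        else:
            i += 1  # no entity starts at this token
    return hits

print(annotate("This protein shows kinase activity and regulates apoptosis", dictionary))
# → [('kinase activity', 'GO:0016301'), ('apoptosis', 'MeSH:D017209')]
```

      <p>The recognized spans are normalized to ontology identifiers, which can then be carried along as extra tokens when the word2doc2vec embeddings are built.</p>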
    </sec>
    <sec id="sec-3">
      <title>3. Future Work</title>
      <p>To analyze the potential offered by these embeddings, we intend to perform a clustering-based
visualization of both our approaches, judge the clusters from the protein embeddings against UniRef
clusters (accounting for the direct propagation technique based on sequence similarity), perform a
basic analysis of the emerging protein text vocabulary, and test the prediction potential. To test the
prediction potential, we intend to narrow the results down to two criteria: (i) high sequence similarity
but low embedding similarity and (ii) low sequence similarity but high embedding similarity. The
ultimate purpose is to define and implement a strategy for how embeddings could propagate function
annotations, and to estimate their scope for curation work.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Acknowledgements</title>
      <p>This work was partially supported by the BMBF-funded de.NBI Cloud within the German Network
for Bioinformatics Infrastructure (de.NBI) (031A532B, 031A533A, 031A533B, 031A534A,
031A535A, 031A537A, 031A537B, 031A537C, 031A537D, 031A538A).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Villegas-Morcillo</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Makrodimitris</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van Ham</surname>
            <given-names>RCHJ</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gomez</surname>
            <given-names>AM</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sanchez</surname>
            <given-names>V</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reinders</surname>
            <given-names>MJT</given-names>
          </string-name>
          .
          <article-title>Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function</article-title>
          .
          <source>Bioinformatics</source>
          . 2021 Apr 19;
          <volume>37</volume>
          (
          <issue>2</issue>
          ):
          <fpage>162</fpage>
          -
          <lpage>170</lpage>
          . doi: 10.1093/bioinformatics/btaa701.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Mikolov</surname>
            <given-names>T.</given-names>
          </string-name>
          et al.
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          . In: Burges C.J.C. et al. (eds)
          <source>Advances in Neural Information Processing Systems</source>
          , Lake Tahoe, Nevada, Vol.
          <volume>26</volume>
          , pp.
          <fpage>3111</fpage>
          -
          <lpage>3119</lpage>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Rebholz-Schuhmann</surname>
            <given-names>D</given-names>
          </string-name>
          , Arregui M, Gaudan S, Kirsch H, Jimeno A.
          <article-title>Text processing through Web services: calling Whatizit</article-title>
          .
          <source>Bioinformatics</source>
          , Volume
          <volume>24</volume>
          , Issue
          <issue>2</issue>
          , 15
          <year>January 2008</year>
          , Pages
          <fpage>296</fpage>
          -
          <lpage>298</lpage>
          . doi: 10.1093/bioinformatics/btm557.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>