<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ProtSTonKGs: A Sophisticated Transformer Trained on Protein Sequences, Text, and Knowledge Graphs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Helena Balabin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Charles Tapley Hoyt</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Benjamin M. Gyori</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>John Bachman</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alpha Tom Kodamullil</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Bonn-Rhein-Sieg University of Applied Sciences</institution>
          ,
          <addr-line>53757, Sankt Augustin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing</institution>
          ,
          <addr-line>53757 Sankt Augustin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Enveda Biosciences</institution>
          ,
          <addr-line>Boulder, CO, 80301</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Fraunhofer Center for Machine Learning</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Laboratory of Systems Pharmacology, Harvard Medical School</institution>
          ,
          <addr-line>Boston, MA 02115</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <abstract>
        <p>While most approaches individually exploit unstructured data from the biomedical literature or structured data from biomedical knowledge graphs, their union can better exploit the advantages of both, ultimately improving representations of biology. Using multimodal transformers for such purposes can improve performance on context-dependent classification tasks, as demonstrated by our previous model, the Sophisticated Transformer Trained on Biomedical Text and Knowledge Graphs (STonKGs). In this work, we introduce ProtSTonKGs, a transformer aimed at learning all-encompassing representations of protein-protein interactions. ProtSTonKGs extends our previous work by adding textual protein descriptions and amino acid sequences (i.e., structural information) to the text- and knowledge graph-based input sequence used in STonKGs. We benchmark ProtSTonKGs against STonKGs, observing F1 score improvements of up to 0.066 (i.e., from 0.204 to 0.270) in several tasks, such as predicting protein interactions in different biological contexts. Our work demonstrates how multimodal transformers can integrate heterogeneous sources of information, laying the foundation for future approaches that use multiple modalities for biomedical applications.</p>
      </abstract>
      <kwd-group>
        <kwd>Natural Language Processing</kwd>
        <kwd>Knowledge Graphs</kwd>
        <kwd>Transformers</kwd>
        <kwd>Bioinformatics</kwd>
        <kwd>Machine Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        While machine learning approaches have recently been applied to biomedical
problems such as drug discovery and protein structure prediction, they tend
to be tailored toward specific applications, and the resulting models often do
not generalize well. Thus, transfer learning approaches have enormous
potential, since information from one generic setting can be exploited to improve
generalization in another specific application. Language models used in natural
language processing leverage the transfer learning paradigm to learn general
representations of unstructured text data. For instance, the Bidirectional Encoder
Representations from Transformers for biomedical text mining (BioBERT) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]
model has been pre-trained on millions of articles from PubMed to represent
biomedical knowledge. A complementary approach that represents knowledge in
a structured way is the use of knowledge graphs (KGs), which aggregate facts (in
the form of (source, relation, target) triples) from heterogeneous data sources.
For example, a KG can be constructed to represent all protein-protein
interactions (i.e., an interactome). The goal of such KGs is to model biology at the
protein level in order to better understand the underlying processes regulating
the cell. By combining the advantages of both approaches (text and KG), we
can better represent biology by modelling the interdependencies between the
information contained in unstructured text (e.g., amino acid sequences or protein
descriptions) and structured KG data (e.g., known protein interactions).
      </p>
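      <p>To make the triple representation concrete, the following minimal Python sketch builds a toy interactome from a handful of (source, relation, target) triples; the proteins and relations shown are hypothetical examples, not data from this work:</p>
      <preformat>
# A minimal sketch: an interactome as (source, relation, target) triples.
# The triples below are illustrative examples only.
import networkx as nx

triples = [
    ("TP53", "increases_amount", "MDM2"),
    ("MDM2", "decreases_activity", "TP53"),
    ("EGFR", "increases_activity", "STAT3"),
]

kg = nx.MultiDiGraph()
for source, relation, target in triples:
    kg.add_edge(source, target, relation=relation)

print(kg.number_of_nodes(), "proteins and", kg.number_of_edges(), "interactions")
      </preformat>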
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Building on the success of the Bidirectional Encoder Representations from Transformers (BERT)
model introduced by Devlin et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], several natural language processing
approaches have extended the original transformer model architecture through
auxiliary information from KGs. However, most approaches are either restricted to
the general domain [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], or they require explicit alignments between text and
KG entities [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Multimodal transformers generalize the transformer architecture to the
incorporation of multiple modalities (e.g., text, image, and video data).
Inspired by the cross encoder presented in
the Modulated Detection Transformer (MDETR) model [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], we previously
introduced STonKGs, a Sophisticated Transformer trained on biomedical text and
Knowledge Graphs [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In more detail, STonKGs uses concatenated embedding
sequences derived from unstructured text data from biomedical text corpora as
well as from structured information from KGs (referred to as text-triple pairs) as
input to a joint transformer. However, the concept of multimodal transformers
can be extended to further modalities to incorporate additional biological data
sources.
      </p>
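      <p>As a minimal sketch of this input construction (with illustrative dimensions, not the ones used in STonKGs), a text-triple pair can be formed by concatenating the text-token embeddings and the KG-node embeddings along the sequence axis before they enter the joint encoder:</p>
      <preformat>
# Sketch of a "text-triple pair" input: token embeddings from the text
# evidence are concatenated with node embeddings from the KG triple into
# one sequence for the joint (cross) encoder. All shapes are illustrative.
import numpy as np

hidden_size = 768
text_embeddings = np.random.rand(32, hidden_size)  # 32 text tokens
kg_embeddings = np.random.rand(2, hidden_size)     # source and target nodes

joint_input = np.concatenate([text_embeddings, kg_embeddings], axis=0)
print(joint_input.shape)  # (34, 768): a single joint input sequence
      </preformat>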
    </sec>
    <sec id="sec-3">
      <title>Approach</title>
      <p>
        In this work, we present ProtSTonKGs, a protein-specific extension of the STonKGs
model architecture with an additional modality representing protein sequences
as well as further textual information. Given the focus of the model on
proteins, we generated a subset of the statements from the Integrated Network and
Dynamical Reasoning Assembler (INDRA) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] used for pre-training STonKGs
by filtering for text-triple pairs in which both the source and the target nodes
represent proteins. For these text-triple pairs, we augmented the text evidence
with textual node descriptions for source and target nodes obtained from Entrez
Gene [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and the respective amino acid sequences from UniProt [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], resulting
in the overall input sequence format shown in Figure 1. In total, we employed
666,334 protein-specific multimodal inputs, based on statements for which
complete information could be obtained for all modalities.
      </p>
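      <p>A simplified sketch of this corpus construction is given below; the helper names and the separator token are hypothetical, and the exact layout of the resulting input sequence follows Figure 1:</p>
      <preformat>
# Sketch of assembling protein-specific multimodal inputs (hypothetical
# helpers; the actual input format is shown in Figure 1).
SEP = " [SEP] "

def is_protein(node):
    # Placeholder check: in practice, the node's namespace decides this.
    return node.startswith("HGNC:")

def protein_pairs(text_triple_pairs):
    """Keep only text-triple pairs whose source and target are proteins."""
    return [
        (evidence, source, target)
        for evidence, source, target in text_triple_pairs
        if is_protein(source) and is_protein(target)
    ]

def build_input(evidence, source, target, descriptions, sequences):
    """One multimodal input: text evidence, Entrez Gene descriptions, and
    UniProt amino acid sequences for both interaction partners."""
    return SEP.join([
        evidence,
        descriptions[source], descriptions[target],
        sequences[source], sequences[target],
    ])
      </preformat>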
      <p>
        Since the inclusion of complete amino acid sequences results in input
sequences that exceed the maximum input length of BERT [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], we used BigBird
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] as a basis for the cross encoder in ProtSTonKGs instead, as this model is
particularly well-suited for handling longer sequence lengths. As in the
original STonKGs model, the initial embeddings for text and KG nodes were
derived from BioBERT [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and node2vec [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] (i.e., a node2vec model re-trained on the
protein-specific subgraph of the INDRA KG), respectively. Moreover, the initial
embeddings for the protein sequences were generated using ProtBERT [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
ProtSTonKGs was pre-trained for n = 15,000 training steps on the protein-specific
multimodal inputs with a batch size of b = 256, and the remaining
hyperparameters were kept equivalent to those used in STonKGs. We evaluated
and compared ProtSTonKGs against STonKGs with the same pre-training and fine-tuning
procedures introduced in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] using weighted F1 scores. However, given the
increased computational cost of the longer input sequence lengths used in
ProtSTonKGs, we used a single (80/20) train-test split instead of cross-validation
for evaluating the models. We created protein-specific subsets of the
benchmark datasets used in STonKGs based on text-triple pairs consisting of
proteins with textual descriptions from Entrez Gene [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] as well as amino acid
sequences from UniProt [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Finally, the implementation and the pre-trained
ProtSTonKGs model are available at https://github.com/stonkgs/stonkgs and
https://huggingface.co/stonkgs/protstonkgs.
      </p>
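      <p>The evaluation protocol can be sketched as follows, with placeholder inputs, labels, and predictions standing in for the fine-tuning datasets and model outputs:</p>
      <preformat>
# Sketch of the evaluation: a single 80/20 train-test split scored with
# the weighted F1 metric. All data below are placeholders.
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

examples = list(range(100))          # stand-ins for text-triple pairs
labels = [i % 2 for i in examples]   # stand-ins for task labels

x_train, x_test, y_train, y_test = train_test_split(
    examples, labels, test_size=0.2, random_state=42
)

# ... fine-tune on (x_train, y_train) and predict on x_test ...
y_pred = y_test                      # placeholder predictions
print(f1_score(y_test, y_pred, average="weighted"))
      </preformat>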
    </sec>
    <sec id="sec-4">
      <title>Experimental Results</title>
      <p>As shown in Table 1, ProtSTonKGs outperformed STonKGs on three out of
eight classification tasks and achieved equal F1 scores on two additional tasks.
While ProtSTonKGs resulted in only a minor improvement on task 5 (i.e., a
relative performance gain of 1.06%), it led to considerable improvements on tasks
3 and 4 (i.e., relative performance gains of 10.09% and 32.25%, respectively).
The improvement of ProtSTonKGs on these three context classification tasks
indicates the potential benefit of including protein-specific information for the
disambiguation of various biological contexts of a given text-triple pair. On the
two relation type tasks (tasks 1 and 2), as well as the species task (task 6), the
original STonKGs300k performed better than ProtSTonKGs; however, its margin
over ProtSTonKGs on these tasks was small (a relative difference in performance
of less than 5%). Moreover, there is no difference between STonKGs and
ProtSTonKGs on the two annotation error tasks (tasks 7 and 8), which is expected
because the protein-specific information added in ProtSTonKGs carries no
additional informative value for predicting whether a text-triple pair was
(in)correctly extracted.</p>
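      <p>For reference, the relative gains quoted above are the absolute F1 differences divided by the STonKGs baseline scores; the snippet below illustrates the calculation with the F1 scores reported in the abstract (small deviations from the quoted percentages stem from rounding of the underlying scores):</p>
      <preformat>
# Relative performance gain, illustrated with the abstract's F1 scores.
stonkgs_f1 = 0.204
protstonkgs_f1 = 0.270

absolute_gain = protstonkgs_f1 - stonkgs_f1         # 0.066
relative_gain = 100.0 * absolute_gain / stonkgs_f1  # roughly 32 percent
print(round(absolute_gain, 3), round(relative_gain, 2))
      </preformat>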
    </sec>
    <sec id="sec-5">
      <title>Conclusion and Future Work</title>
      <p>We have presented ProtSTonKGs, an extension of our previous STonKGs model
focused on proteins, which incorporates another modality (i.e., protein sequences)
as well as additional text data (i.e., textual node descriptions). While this is
one of the first efforts towards generating multimodal single-stream
transformers with more than two modalities in the biomedical field, we envision several
possibilities to expand the presented work. For instance, we plan to incorporate
other biological entities in the future (e.g., chemicals with node descriptions and
simplified molecular-input line-entry system (SMILES) sequences). Furthermore,
the pre-trained or fine-tuned models can be used to predict the role of novel
proteins in a specific context. Finally, the same multimodal cross encoder can be
further pre-trained on other data sources.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Rolf</given-names>
            <surname>Apweiler</surname>
          </string-name>
          et al.
          <article-title>UniProt: the Universal Protein knowledgebase</article-title>
          .
          <source>In: Nucleic Acids Research</source>
          (
          <year>2004</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Helena</given-names>
            <surname>Balabin</surname>
          </string-name>
          et al.
          <article-title>STonKGs: A Sophisticated Transformer Trained on Biomedical Text and Knowledge Graphs</article-title>
          .
          <source>In: bioRxiv</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Jacob</given-names>
            <surname>Devlin</surname>
          </string-name>
          et al.
          <article-title>BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</article-title>
          .
          <source>In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          .
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Ahmed</given-names>
            <surname>Elnaggar</surname>
          </string-name>
          et al.
          <article-title>ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Learning</article-title>
          .
          <source>In: bioRxiv</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Aditya</given-names>
            <surname>Grover</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jure</given-names>
            <surname>Leskovec</surname>
          </string-name>
          .
          <article-title>node2vec: Scalable Feature Learning for Networks</article-title>
          .
          <source>In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          .
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Benjamin M.</given-names>
            <surname>Gyori</surname>
          </string-name>
          et al.
          <article-title>From word models to executable models of signaling networks using automated assembly</article-title>
          .
          <source>In: Molecular Systems Biology</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Bin</given-names>
            <surname>He</surname>
          </string-name>
          et al.
          <article-title>BERT-MK: Integrating Graph Contextualized Knowledge into Pre-trained Language Models</article-title>
          .
          <source>In: Findings of the Association for Computational Linguistics: EMNLP 2020</source>
          .
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Aishwarya</given-names>
            <surname>Kamath</surname>
          </string-name>
          et al.
          <article-title>MDETR - Modulated Detection for End-to-End Multi-Modal Understanding</article-title>
          .
          <source>In: arXiv</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Jinhyuk</given-names>
            <surname>Lee</surname>
          </string-name>
          et al.
          <article-title>BioBERT: a pre-trained biomedical language representation model for biomedical text mining</article-title>
          .
          <source>In: Bioinformatics</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Donna</given-names>
            <surname>Maglott</surname>
          </string-name>
          et al.
          <article-title>Entrez Gene: gene-centered information at NCBI</article-title>
          .
          <source>In: Nucleic Acids Research</source>
          (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Manzil</given-names>
            <surname>Zaheer</surname>
          </string-name>
          et al.
          <article-title>Big Bird: Transformers for Longer Sequences</article-title>
          .
          <source>In: Advances in Neural Information Processing Systems</source>
          <volume>33</volume>
          .
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Zhengyan</given-names>
            <surname>Zhang</surname>
          </string-name>
          et al.
          <article-title>ERNIE: Enhanced Language Representation with Informative Entities</article-title>
          .
          <source>In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          .
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>