ProtSTonKGs: A Sophisticated Transformer Trained on Protein Sequences, Text, and Knowledge Graphs

Helena Balabin1,2[0000−0002−6392−9306], Charles Tapley Hoyt3[0000−0003−4423−4370], Benjamin M. Gyori3[0000−0001−9439−5346], John Bachman3[0000−0001−6095−2466], Alpha Tom Kodamullil1[0000−0001−9896−3531], Martin Hofmann-Apitius1[0000−0001−9012−6720], and Daniel Domingo-Fernández1,4,5[0000−0002−2046−6145]

1 Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing, Sankt Augustin 53757, Germany
2 Bonn-Rhein-Sieg University of Applied Sciences, Sankt Augustin 53757, Germany
3 Laboratory of Systems Pharmacology, Harvard Medical School, Boston, MA 02115, USA
4 Fraunhofer Center for Machine Learning, Germany
5 Enveda Biosciences, Boulder, CO 80301, USA

Copyright © 2022 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. While most approaches individually exploit unstructured data from the biomedical literature or structured data from biomedical knowledge graphs, combining the two can better exploit the advantages of each, ultimately improving representations of biology. Multimodal transformers built for this purpose can improve performance on context-dependent classification tasks, as demonstrated by our previous model, the Sophisticated Transformer Trained on Biomedical Text and Knowledge Graphs (STonKGs). In this work, we introduce ProtSTonKGs, a transformer aimed at learning all-encompassing representations of protein-protein interactions. ProtSTonKGs extends our previous work by adding textual protein descriptions and amino acid sequences (i.e., structural information) to the text- and knowledge graph-based input sequence used in STonKGs. We benchmark ProtSTonKGs against STonKGs, observing F1 scores improved by up to 0.066 (i.e., from 0.204 to 0.270) on several tasks, such as predicting protein interactions in several biological contexts. Our work demonstrates how multimodal transformers can be used to integrate heterogeneous sources of information, laying the foundation for future approaches that use multiple modalities for biomedical applications.

Keywords: Natural Language Processing · Knowledge Graphs · Transformers · Bioinformatics · Machine Learning

1 Introduction

While machine learning approaches have recently been applied in biomedical applications such as drug discovery and protein structure prediction, they tend to be tailored toward specific applications, and the resulting models often do not generalize well. Thus, transfer learning approaches have enormous potential, since information from one generic setting can be exploited to improve generalization in another, specific application. Language models used in natural language processing leverage the transfer learning paradigm to learn general representations of unstructured text data. For instance, the Bidirectional Encoder Representations from Transformers for biomedical text mining (BioBERT) [9] model has been pre-trained on millions of articles from PubMed to represent biomedical knowledge. A complementary approach that represents knowledge in a structured way is the use of knowledge graphs (KGs), which aggregate facts (in the form of (source, relation, target) triples) from heterogeneous data sources. For example, a KG can be constructed to represent all protein-protein interactions (i.e., an interactome). The goal of such KGs is to model biology at the protein level in order to better understand the underlying processes regulating the cell. By combining the advantages of both approaches (text and KG), we can better represent biology by modelling the interdependencies between the information contained in unstructured text (e.g., amino acid sequences or protein descriptions) and structured KG data (e.g., known protein interactions).
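To make the triple representation concrete, the following minimal sketch stores a few protein-protein interactions as (source, relation, target) triples; the gene symbols and relation labels are illustrative examples only and are not drawn from the KG used in this work.

    from typing import List, NamedTuple

    class Triple(NamedTuple):
        source: str    # protein acting as the subject of the statement
        relation: str  # interaction type
        target: str    # protein acted upon

    # A toy interactome fragment expressed as (source, relation, target) triples.
    interactome: List[Triple] = [
        Triple("MAP2K1", "phosphorylation", "MAPK1"),
        Triple("TP53", "activation", "CDKN1A"),
        Triple("AKT1", "inhibition", "GSK3B"),
    ]

    # Each triple encodes one directed protein-protein interaction.
    for source, relation, target in interactome:
        print(f"{source} -[{relation}]-> {target}")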
2 Related Work

Building on the success of the Bidirectional Encoder Representations from Transformers (BERT) model introduced by Devlin et al. [3], several natural language processing approaches have extended the original transformer architecture with auxiliary information from KGs. However, most approaches are either restricted to the general domain [12], or they require explicit alignments between text and KG entities [7]. Multimodal transformers generalize the transformer architecture to the incorporation of multiple modalities (e.g., text, image, and video data). Inspired by the cross encoder presented in the Modulated Detection Transformer (MDETR) model [8], we previously introduced STonKGs, a Sophisticated Transformer trained on biomedical text and Knowledge Graphs [2]. In more detail, STonKGs uses concatenated embedding sequences derived from unstructured text data from biomedical text corpora as well as from structured information from KGs (referred to as text-triple pairs) as input to a joint transformer. However, the concept of multimodal transformers can be extended to further modalities to incorporate additional biological data sources.

3 Approach

In this work, we present ProtSTonKGs, a protein-specific extension of the STonKGs model architecture with an additional modality representing protein sequences as well as further textual information. Given the model's focus on proteins, we generated a subset of the statements from the Integrated Network and Dynamical Reasoning Assembler (INDRA) [6] used for pre-training STonKGs by filtering for text-triple pairs in which both the source and the target nodes represent proteins. For these text-triple pairs, we augmented the text evidence with textual node descriptions for source and target nodes obtained from Entrez Gene [10] and the respective amino acid sequences from UniProt [1], resulting in the overall input sequence format shown in Figure 1. In total, we employed 666,334 protein-specific multimodal inputs, based on statements for which complete information could be obtained for all modalities.

Fig. 1. Cross-modal attention between text, KG, and protein data. Each input is a concatenation of a token, triple, and protein sequence. In pre-training, three different heads are used to convert text-, KG-, and protein-based inputs to probabilities over the respective vocabularies. These three vocabularies represent the total set of possible text tokens, nodes in the KG, and amino acids.
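As a rough illustration of the input format sketched in Figure 1 (and not the exact preprocessing used by ProtSTonKGs), the hypothetical helper below concatenates the three parts of one multimodal input: the tokenized text evidence plus node descriptions, the source and target node identifiers of the triple, and the two amino acid sequences. The separator tokens, identifier scheme, and ordering are assumptions made purely for illustration.

    from typing import List

    def build_multimodal_input(
        text_tokens: List[str],  # word-piece tokens of evidence and node descriptions
        source_node: str,        # KG identifier of the source protein
        target_node: str,        # KG identifier of the target protein
        source_seq: str,         # amino acid sequence of the source protein
        target_seq: str,         # amino acid sequence of the target protein
    ) -> List[str]:
        # Illustrative layout only: [CLS] text [SEP] KG nodes [SEP] amino acids.
        kg_tokens = [source_node, target_node]
        protein_tokens = list(source_seq) + ["[SEP]"] + list(target_seq)
        return ["[CLS]"] + text_tokens + ["[SEP]"] + kg_tokens + ["[SEP]"] + protein_tokens

    example = build_multimodal_input(
        text_tokens=["BRAF", "phospho", "##rylates", "MAP2K1"],
        source_node="hgnc:1097",  # hypothetical identifier for the source protein
        target_node="hgnc:6840",  # hypothetical identifier for the target protein
        source_seq="MAALSGG",     # truncated for illustration
        target_seq="MPKKKPT",     # truncated for illustration
    )
    print(example[:12])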
Since the inclusion of complete amino acid sequences results in input sequences that exceed the maximum input length of BERT [3], we instead used BigBird [11] as the basis for the cross encoder in ProtSTonKGs, as this model is particularly well-suited for handling longer sequences. In parallel to the original STonKGs model, the initial embeddings for text and KG nodes were derived from BioBERT [9] and node2vec [5] (i.e., a model re-trained on the protein-specific subgraph of the INDRA KG), respectively. Moreover, the initial embeddings for the protein sequences were generated using ProtBERT [4]. ProtSTonKGs was pre-trained for n = 15,000 training steps on the protein-specific multimodal inputs with a batch size of b = 256; the remaining hyperparameters are equivalent to those used in STonKGs.

We evaluated ProtSTonKGs and compared it against STonKGs using the same pre-training and fine-tuning procedures introduced in [2], measuring performance with weighted F1 scores. However, given the increased computational cost of the longer input sequences used in ProtSTonKGs, we used a single (80/20) train-test split instead of cross-validation for evaluating the models. We created protein-specific subsets of the benchmark datasets used in STonKGs based on text-triple pairs consisting of proteins with textual descriptions from Entrez Gene [10] as well as amino acid sequences from UniProt [1]. Finally, the implementation and the pre-trained ProtSTonKGs model are available at https://github.com/stonkgs/stonkgs and https://huggingface.co/stonkgs/protstonkgs.
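To make the evaluation protocol explicit, the sketch below computes a weighted F1 score on a single (80/20) train-test split with scikit-learn; the feature matrix, labels, and classifier are placeholders and do not reproduce the actual fine-tuning pipeline.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import f1_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(42)
    X = rng.normal(size=(1000, 768))    # stand-in for pooled transformer features
    y = rng.integers(0, 3, size=1000)   # stand-in for task labels (e.g., context classes)

    # Single 80/20 train-test split, as used for the benchmark evaluation.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    weighted_f1 = f1_score(y_test, clf.predict(X_test), average="weighted")
    print(f"Weighted F1 on the held-out 20%: {weighted_f1:.3f}")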
4 Experimental Results

Table 1. Benchmark performances of STonKGs300k and ProtSTonKGs. The reported scores are weighted F1 scores on the test partition of each protein-specific fine-tuning dataset, using a single (80/20) train-test split. Additionally, both absolute and relative ((ProtSTonKGs − STonKGs300k) / STonKGs300k × 100) performance gains are reported.

As shown in Table 1, ProtSTonKGs outperformed STonKGs on three out of eight classification tasks and achieved equal F1 scores on two additional tasks. While ProtSTonKGs resulted in only a minor improvement on task 5 (i.e., a relative performance gain of 1.06%), it led to considerable improvements on tasks 3 and 4 (i.e., relative performance gains of 10.09% and 32.25%, respectively). The improvement of ProtSTonKGs on these three context classification tasks indicates the potential benefit of including protein-specific information for the disambiguation of various biological contexts of a given text-triple pair. On the two relation type tasks (tasks 1 and 2), as well as the species task (task 6), the original STonKGs300k performed better than ProtSTonKGs. However, STonKGs outperformed ProtSTonKGs by a smaller margin (a relative difference in performance of less than 5%) on these tasks. Moreover, there is no difference between STonKGs and ProtSTonKGs on the two annotation error tasks (tasks 7 and 8), which is expected due to the lack of additional informative value (with regard to the prediction of (in)correctly extracted text-triple pairs) of the protein-specific information added in ProtSTonKGs.

5 Conclusion and Future Work

We have presented ProtSTonKGs, a protein-focused extension of our previous STonKGs model that incorporates another modality (i.e., protein sequences) as well as additional text data (i.e., textual node descriptions). While this is one of the first efforts towards generating multimodal single-stream transformers with more than two modalities in the biomedical field, we envision several possibilities for expanding the presented work. For instance, we plan to incorporate other biological entities in the future (e.g., chemicals with node descriptions and simplified molecular-input line-entry system (SMILES) sequences). Furthermore, the pre-trained or fine-tuned models can be used to predict the role of novel proteins in a specific context. Finally, the same multimodal cross encoder can be further pre-trained on other data sources.

References

[1] Rolf Apweiler et al. “UniProt: the Universal Protein knowledgebase”. In: Nucleic Acids Research (2004).
[2] Helena Balabin et al. “STonKGs: A Sophisticated Transformer Trained on Biomedical Text and Knowledge Graphs”. In: bioRxiv (2021).
[3] Jacob Devlin et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2019.
[4] Ahmed Elnaggar et al. “ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Learning”. In: bioRxiv (2021).
[5] Aditya Grover and Jure Leskovec. “node2vec: Scalable Feature Learning for Networks”. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016.
[6] Benjamin Mate Gyori et al. “From word models to executable models of signaling networks using automated assembly”. In: Molecular Systems Biology (2017).
[7] Bin He et al. “BERT-MK: Integrating Graph Contextualized Knowledge into Pre-trained Language Models”. In: Findings of the Association for Computational Linguistics: EMNLP 2020. 2020.
[8] Aishwarya Kamath et al. “MDETR - Modulated Detection for End-to-End Multi-Modal Understanding”. In: arXiv (2021).
[9] Jinhyuk Lee et al. “BioBERT: a pre-trained biomedical language representation model for biomedical text mining”. In: Bioinformatics (2020).
[10] Donna Maglott et al. “Entrez Gene: gene-centered information at NCBI”. In: Nucleic Acids Research (2011).
[11] Manzil Zaheer et al. “Big Bird: Transformers for Longer Sequences”. In: Advances in Neural Information Processing Systems 33. 2020.
[12] Zhengyan Zhang et al. “ERNIE: Enhanced Language Representation with Informative Entities”. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019.