=Paper=
{{Paper
|id=Vol-2831/paper13
|storemode=property
|title=Understood in Translation: Transformers for Domain Understanding
|pdfUrl=https://ceur-ws.org/Vol-2831/paper13.pdf
|volume=Vol-2831
|authors=Dimitrios Christofidellis,Matteo Manica,Leonidas Georgopoulos,Hans Vandierendonck
|dblpUrl=https://dblp.org/rec/conf/aaai/Christofidellis21
}}
==Understood in Translation: Transformers for Domain Understanding==
Understood in Translation: Transformers for Domain Understanding
Dimitrios Christofidellis,1,2 Matteo Manica,1 Leonidas Georgopoulos,1 Hans Vandierendonck2
1 IBM Research Europe, 2 Queen’s University Belfast
dic@zurich.ibm.com, tte@zurich.ibm.com, leg@zurich.ibm.com, h.vandierendonck@qub.ac.uk

Abstract
Knowledge acquisition is the essential first step of any Knowledge Graph (KG) application. This knowledge can be extracted from a given corpus (KG generation process) or specified from an existing KG (KG specification process). Focusing on domain-specific solutions, knowledge acquisition is a labor-intensive task usually orchestrated and supervised by subject matter experts. Specifically, the domain of interest is usually manually defined and then the needed generation or extraction tools are utilized to produce the KG. Herein, we propose a supervised machine learning method, based on Transformers, for domain definition of a corpus. We argue why such automated definition of the domain’s structure is beneficial both in terms of construction time and quality of the generated graph. The proposed method is extensively validated on three public datasets (WebNLG, NYT and DocRED) by comparing it with two reference methods based on CNN and RNN models. The evaluation shows the efficiency of our model in this task. Focusing on scientific document understanding, we present a new health domain dataset based on publications extracted from PubMed and we successfully utilize our method on this. Lastly, we demonstrate how this work lays the foundation for fully automated and unsupervised KG generation.

Introduction
Knowledge Graphs (KGs) are among the most popular data management paradigms and their application is
The knowledge acquisition process is either referred to as KG construction, where the KG is built from scratch using a specific corpus, or KG specification, where a subgraph of interest is extracted from an existing KG. In both cases, the acquisition process can follow a bottom-up or top-down approach (Zhao, Han, and So 2018). In a bottom-up approach, all the entities and their connections are extracted as a first step of the process. Then, the underlying hierarchy and structure of the domain can be inferred from the entities and their connections. Conversely, a top-down approach starts with the definition of the domain’s schema that is then used to guide the extraction of the needed entities and connections. For general KG generation, a bottom-up approach is usually preferred as we typically wish to include all entities and relations that we can extract from the given corpus. Contrarily, a top-down approach better suits domain-specific KG generation or KG specification, where entities and relations are strongly linked to the domain of interest. The structures of typical bottom-up and top-down pipelines, focusing on the case of KG generation, are presented in Figures 1a and 1b respectively.
Herein, we focus on domain-specific, i.e., top-down, acquisition for two main reasons. Firstly, the acquisition process can be faster and more accurate in this way. By specifying the schema of the domain of interest, we only need to select the proper and needed tools (i.e. pretrained models) for the actual entity and relation extraction.
Secondly, widespread across different fields, e.g., recommendation such approach minimizes the presence of irrelevant data and systems, question-answering tools and knowledge discovery restricts queries and graph operations to a carefully tailored applications. This is due to the fact that KGs share simul- KG. This generally improves the accuracy of KG applica- taneously several advantages of databases (information re- tions (Lalithsena, Kapanipathi, and Sheth 2016). Further- trieval via structured queries), graphs (representing loosely more, the graph’s size is significantly reduced by excluding or irregularly structured data) and knowledge bases (repre- irrelevant content. Thus, execution time of queries can be senting semantic relationship among the data). KG research reduced by more than one order of magnitude (Lalithsena, can be divided in two main streams (Ji et al. 2020): knowl- Kapanipathi, and Sheth 2016). edge representation learning, which investigates the repre- The domain definition is usually performed by subject sentation of KG into vector representations (KG embed- matter experts. Yet, knowledge acquisition by expert cu- dings), and knowledge acquisition, which considers the KG ration can be extremely slow as the process is essentially generation process. The latter being a fundamental aspect manual. Moreover, human error may affect the data qual- since a malformed graph will not be able to serve reliably ity and lead to malformed KGs. In this work, we propose any kind of downstream task. to overcome these issues by introducing an automated ma- Copyright ©2021 for this paper by its authors. Use permitted under chine learning-based approach to understand the domain of Creative Commons License Attribution 4.0 International (CC BY a collection of text snippets. Specifically, given sample in- 4.0). put texts, we infer the schema of the domain to which they (a) Bottom-up pipeline. (b) Top-down pipeline. Figure 1: Typical pipelines of bottom-up and top-down KG generation. belong. This task can be incorporated into both domain- relations in a text given its content and information about specific KG generation and KG specification process, where the position of entities in it. The positional information of the domain definition is the essential first step. For the KG the entities is typically extracted in a previous step of KG generation, the input texts can be samples from the corpus generation using NER methods (Nadeau and Sekine 2007). of interest, while for the KG specification, these text snip- Lately, there is a high interest of methods that can combine pets can express possible questions that need to be answered the NER and relation extraction tasks into a single model from the specified KG. We introduce a seq2seq-based model (Zheng et al. 2017; Zeng et al. 2018; Fu, Li, and Ma 2019). relying on transformer architecture to infer the relation types While our work is linked to relation extraction it has two characterizing the domain of interest. Such model lets us major differences. Firstly, we focus on the relation type and to define the domain’s schema including all the needed en- the entity types that compose a relation rather than the actual tity and relation types. The model can be trained using any triplet. Secondly, the training process differs and requires available previous schema (i.e., schema of a general KG like coarser annotations. 
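To make this contrast concrete, the following minimal illustration shows the kind of coarse training pairs meant here: each text is paired only with the relation types it expresses, with no entity positions. The examples are written in the style of the PubMed-DU instances listed in the appendix and are not an excerpt of the actual training files.
<pre>
# Coarse training pairs for domain understanding: text plus relation types only.
training_pairs = [
    {
        "text": "tocilizumab plus glucocorticoids in severe and critically covid-19 patients.",
        "relation_types": ["Chemical.to.Species"],
    },
    {
        "text": "ace2 and tmprss2 variants and expression as candidates to covid-19 severity.",
        "relation_types": ["Gene.to.Gene", "Disease.to.Gene"],
    },
]

# A standard relation extraction dataset would instead annotate grounded triplets,
# e.g. ("tocilizumab", "Chemical.to.Species", "patients"), together with the
# character offsets of both entity mentions in the text.
</pre>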
We solely provide texts and the respec- DBpedia) and respective text examples for each possible re- tive existing sequence of relation types. Contrarily in a typ- lation type. We show that our proposed model outperforms ical relation extraction training process, information about other baseline approaches, it can be successfully utilized for the position of the entities in the text is also needed. Here, scientific documents and it has interesting potential exten- we propose to improve knowledge acquisition by perform- sion in the field of automated KG generation. ing a data-driven domain definition providing an approach that is currently unexplored in KG research. Related work Seq2seq-based model for domain At the best of our knowledge, our method is the first attempt to introduce a supervised machine learning based domain understanding understanding tool that can be incorporated into domain- The domain understanding task attempts to uncover the specific KG generation and specification pipelines. Cur- structured knowledge underlying a dataset. In order to de- rently, the main research interest related to KG generation pict this structure we can leverage the so called domain’s workflows is associated with attempts to improve the named metagraph. A domain’s metagraph is a graph that has as entity recognition (NER) and the relation extraction tasks or vertices all the entity types and as edges all their connec- provide end-to-end pipelines for general or domain-specific tions/relations in the context of this domain. The generation KG generation (Ji et al. 2020). The majority of such work of such a metagraph entails obtaining all the entity types and focuses on the actual generation step and rely solely on man- their relations. Assuming that each of entity types that are ual identification of the domain definition (Luan et al. 2018; presented in the domain has at least one interaction with an- Manica et al. 2019; Wang et al. 2020). other entity type, the metagraph of this domain can be pro- As it concerns the KG specification field, the subgraph duced by inferring all the possible relation types as all the extraction is usually based on graph traversals or more so- entity types are included in at least one of them. Thus, our phisticated heuristics techniques and some providing initial approach aims to build an accurate model to detect a do- entities or entity types (Lalithsena, Kapanipathi, and Sheth main’s relation types, and leverages this model to extract 2016). Such approaches are effective, yet a significant en- those relations from a given corpus. Aggregating all ex- gineering effort is required to tune the heuristics for each tracted relations yields the domain’s metagraph. different case. Let alone, the crucial task of proper selec- tion of the initial entities or entity types is mainly performed Seq2seq model for domain’s relation types manually. extraction The relation extraction task is also related to our work. Sequence to sequence models (seq2seq) (Bahdanau, Cho, It aims at the extraction of triplets of the form of (subject, and Bengio 2014; Cho et al. 2014; Sutskever, Vinyals, and relation, object) from the texts. The neural network based Le 2014; Jozefowicz et al. 2016) attempt to learn the map- methods, such as Nguyen and Grishman; Zhou et al.; Zhang ping from an input X to its corresponding target Y where et al., dominate the field. These methods are CNN (Zeng et both of them are represented by sequences. To achieve this al. 2014; Nguyen and Grishman 2015) or LSTM (Zhou et al. 
they follow an encoder-decoder based approach. Encoders 2016; Zhang et al. 2017) models, which attempt to identify and decoders can be recurrent neural networks (Cho et al. Figure 2: Architecture of our utilized Transformer model for domain understanding. 2014) or convolutional based neural networks (Gehring et entities and their relations. al. 2017). In addition, an attention mechanism can also be Based on the given training set, we consider that the incorporated into the encoder (Bahdanau, Cho, and Bengio model is aware of a general domain anatomy, i.e., the sets 2014; Luong, Pham, and Manning 2015) for further boosting of possible entity types and relation types are known, and of the model’s performance. Lately, Transformer architec- we would like to identify which of them are depicted in a tures (Vaswani et al. 2017; Devlin et al. 2018; Liu et al. 2019; given corpus. In both cases of EntityPairOverlap and Radford et al. 2018), a family of models whose components SingleEntityOverlap type text snippets, there is one are entirely made up of attention layers, linear layers and main entity type from which all the other entity types can be batch normalization layers, have established themselves as found by performing only one hop traversal in the general the state of the art for sequence modeling, outperforming domain’s metagraph. The class of Normal text snippets is a the typically recurrent based components. Seq2seq models broader case in which one can identify heterogeneous con- have been successfully utilized for various tasks such as neu- nectivity patterns among the entity types represented. Yet, ral machine translation (Bahdanau, Cho, and Bengio 2014) a sentence typically describes facts that are expected to be and natural language generation (Pust et al. 2015). Recently, connected somehow, thus the entity types included in such their scope has also been extended beyond language process- texts usually are not more than 1 or 2 hops away from each ing in fields such as chemical reaction prediction (Schwaller other in the general metagraph. In the light of the consider- et al. 2019). ations above, we propose to sort the relations in a breadth- We consider the domain’s relation type extraction task as first-search (BFS) order starting from a specific node (entity a specific version of machine translation from the language type) in the general metagraph. In this way, we confine the of the corpus to the “relation” language that includes all the output in a much lower dimensional space by adhering to a different relations between the entity types of the domain. semantically meaningful order. A relation type R which connects the entity type i to j is Inspired by state-of-the-art approaches in the field of represented as “i.R.j” in the “relation” language. In the case neural machine translation, our model architecture is a of undirected connections, “i.R.j” is the same as “j.R.i” and multi-layer bidirectional Transformer. We follow the lead for simplicity we can discard one of them. of Vaswani et al. in implementing the architecture, with the Seq2seq models have been designed to address tasks only difference that we adopt a learned positional encod- where both the input and the output sequences are ordered. ing instead of a static one (see Appendix for further details In our case the target “relation” language does not have any on the positional encoding). 
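To illustrate the BFS-based ordering of the “relation” vocabulary described above, the sketch below derives such an ordering from a small toy metagraph; the graph, the relation labels and the choice of starting entity type are illustrative and not taken from the paper’s datasets.
<pre>
from collections import deque

# Toy general-domain metagraph: entity types as vertices, relation types as
# labelled undirected edges (illustrative only).
metagraph = {
    "Disease":  [("to", "Chemical"), ("to", "Gene"), ("to", "Species")],
    "Chemical": [("to", "Disease"), ("to", "Species")],
    "Gene":     [("to", "Disease"), ("to", "Gene")],
    "Species":  [("to", "Disease"), ("to", "Chemical")],
}

def bfs_relation_vocab(graph, start):
    """Return 'i.R.j' tokens in breadth-first order starting from entity type `start`.

    Undirected relations are kept once, so 'i.R.j' and 'j.R.i' are identified.
    """
    order, seen_edges, seen_nodes = [], set(), {start}
    queue = deque([start])
    while queue:
        i = queue.popleft()
        for rel, j in graph.get(i, []):
            key = (frozenset((i, j)), rel)
            if key not in seen_edges:
                seen_edges.add(key)
                order.append(f"{i}.{rel}.{j}")
            if j not in seen_nodes:
                seen_nodes.add(j)
                queue.append(j)
    return order

vocab = bfs_relation_vocab(metagraph, start="Disease")
# ['Disease.to.Chemical', 'Disease.to.Gene', 'Disease.to.Species',
#  'Chemical.to.Species', 'Gene.to.Gene']
rank = {token: position for position, token in enumerate(vocab)}

# An unordered set of relation types (a training target or a prediction, written in
# the same canonical orientation as the vocabulary) is then emitted in this order.
predicted = {"Chemical.to.Species", "Disease.to.Gene"}
sequence = sorted(predicted, key=rank.get)
</pre>
Different starting entity types yield different orderings, which is what the ensemble variant discussed next exploits.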
As the overall architecture of defined ordering as per definition the edges of a graph do the encoder and the decoder are otherwise the same as in not have any ordering. In theory the order does not mat- Vaswani et al., we omit an in-depth description of the Trans- ter, yet in practice unordered sequences will lead to slower fomer model and refer readers to original paper. convergence of the model and requirements of more train- To boost the model’s performance, we also propose an en- ing data to achieve our goal (Vinyals, Bengio, and Kudlur semble approach exploiting different Transformers and ag- 2015). To overcome this issue, we propose a specific order- gregating their results to construct the domain’s metagraph. ing of the “relation” language influenced from the semantic Each of the Transformers differs in the selected ordering of context that the majority of the text snippets hold. the “relation” vocabulary. The selection of different starting According to (Zeng et al. 2018), in the context entity type for the breadth-first-search will lead to different of relation extraction, text snippets can be divided orderings. We expect that multiple orderings could facilitate into three types: Normal, EntityPairOverlap and the prediction of different connection patterns that can not SingleEntityOverlap. A text snippet is categorized be easily detected using a single ordering. The sequence of as Normal if none of its triplets have overlapping entities. steps for an ensemble domain understanding is the follow- If some of its triplets express a relation on the same pair ing: Firstly, train k Transformers using different orderings. of entities, then it belongs to the EntityPairOverlap Secondly, given a set of text snippets, predict sequences of category and if some of its triplets have one entity in relations using all the Transformers. Finally, use late fusion common but no overlapped pairs, then it belongs to to aggregate the results and form the final predictions. SingleEntityOverlap class. These three categories It is worth mentioning that in the last step, we omit the un- are also relevant in the metagraph case, even if we are work- derlying ordering that we follow in each model and we per- ing with entity types and relation types rather than the actual form a relation-based aggregation. We examine each relation Dataset # instances size of “relation” language mean #relation types per instance WebNLG 23794 70 2.15 NYT 70029 31 1.30 DocRED 30289 511 2.67 PubMed-DU 58761 15 1.22 Table 1: Datasets’s statistics separately in order to include it or not in the final metagraph. Tator (Wei et al. 2019). The available entity types are Gene, For the aggregation step, we use the standard Wisdom of Mutation, Chemical, Disease and Species. The respective re- Crowds (WOC) (Marbach et al. 2012) consensus technique, lation types are in form x.to.y where x and y are two of the yet other consensus methods can also be leveraged for the possible entity types. We assume that the relations are sym- task. The overall structure of our approach is summarized in metric. For text annotation, the following rule was used: a Figure 2. text has the relation x.to.y if two entities with types x and y co-occurred in the text and the syntax path between them Experiments contains at least one keyword of this relation type. 
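As an illustration of this weak-labelling rule, the sketch below checks one candidate relation type using spaCy for the dependency parse. The keyword list is a hypothetical placeholder and the single-token entity matching is a simplification; in the paper the entities come from PubTator and the keyword lists were curated manually.
<pre>
import spacy

nlp = spacy.load("en_core_web_sm")   # the appendix reports using en_core_web_lg

# Hypothetical keyword list for a single relation type.
KEYWORDS = {"Chemical.to.Disease": {"treat", "inhibit", "reduce", "target"}}

def syntax_path(tok_a, tok_b):
    """Tokens on the dependency path tok_a -> lowest common ancestor -> tok_b."""
    up_a = [tok_a] + list(tok_a.ancestors)
    up_b = [tok_b] + list(tok_b.ancestors)
    idx_b = {t.i: pos for pos, t in enumerate(up_b)}
    for pos_a, t in enumerate(up_a):
        if t.i in idx_b:                       # lowest common ancestor found
            return up_a[: pos_a + 1] + list(reversed(up_b[: idx_b[t.i]]))
    return []                                  # e.g. tokens in different sentences

def has_relation(text, mention_a, mention_b, relation_type):
    """Label `text` with `relation_type` if both mentions co-occur and a keyword
    of that relation type lies on the syntax path between them."""
    doc = nlp(text)
    tok_a = next((t for t in doc if t.text.lower() == mention_a.lower()), None)
    tok_b = next((t for t in doc if t.text.lower() == mention_b.lower()), None)
    if tok_a is None or tok_b is None:
        return False
    path = syntax_path(tok_a, tok_b)
    return any(t.lemma_.lower() in KEYWORDS[relation_type] for t in path)

# has_relation("tocilizumab reduces inflammation in covid-19 patients.",
#              "tocilizumab", "inflammation", "Chemical.to.Disease")
</pre>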
These keywords have been manually identified based on the pro- We evaluate our Transformer-based approach against three vided instances and are words, mainly verbs, related to the baselines on a selection of datasets representing different do- relation. Table 1 depicts the statistics of all four utilised mains. As baselines, we use CNN and RNN based methods datasets. influenced by (Nguyen and Grishman 2015) and (Zhou et al. 2016) respectively. For the CNN based method, we slightly For all datasets, we use the same model parameters. modified the architecture to exclude the component which Specifically, we use Adam (Kingma and Ba 2014) optimizer provides information about the position of the entities in the with a learning rate of 0.0005. The gradients norm is clipped text snippet, as we do not have such information available in to 1.0 and dropout (LeCun, Bengio, and Hinton 2015) is set our task. Additionally, we also include a Transformer-based to 0.1. Both encoder and decoder consist of 2 layers with model without applying any ordering in the target sequences 10 attention heads each, the positional feed-forward hidden as an extra baseline. dimension is 512. Lastly, we utilize the token embedding To our knowledge, there is no standard dataset available layers using GloVe pretrained word embeddings (Penning- for the relation type extraction task in the literature. How- ton, Socher, and Manning 2014) which have dimensional- ever there is a plethora of published datasets for the stan- ity of m=300. Our code and the datasets are available at dard task of relation extraction that can be utilized for our https://github.com/christofid/DomainUnderstanding. case with limited effort. For our task, the leveraged datasets should contain tuples of texts and their respective sets of re- The evaluation of the models is performed both at instance lation types. We use WebNLG (Gardent et al. 2017), NYT and graph level. In the instance level, we examine the ability (Riedel, Yao, and McCallum 2010) and DocRED (Yao et al. of the model to predict the relation types that exist in a given 2019), three of the most popular datasets for relation extrac- text. To investigate this, we use F1-score and accuracy. F1- tion. Both NYT and DocRED datasets provide the needed score is the harmonic mean of model’s precision and recall. information such as entity types and relation type for the Accuracy is computed at an instance level and it measures triplets of each instance. Thus their transformation for our for how many of the testing texts, the model manage to infer task can be conducted by just converting these triplets to the correctly the whole set of their relation types. For the meta- relation type format, for instance the triplet (x,y,z) will be graph level evaluation, we use our model to predict the meta- transformed as type(x).type(y).type(z). On the other hand, graph of a domain and we examine how close to the actual WebNLG doesn’t share such information for the entity types metagraph is. For this comparison, we utilize F1-score for and thus manual curation is needed. Therefore, all the possi- both edges and nodes of the metagraph as well as the simi- ble entities are examined and replaced with the proper entity larity of the distribution of the degree and eigenvector cen- type. For the WebNLG dataset, we avoid including rare en- trality (Zaki and Meira 2014) of the two metagraphs. 
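Summarising the training setup listed above, a minimal PyTorch sketch of a comparable encoder-decoder Transformer and optimisation step follows. It is a simplified stand-in for the released code: vocabulary sizes are placeholders, the GloVe initialisation is only indicated in a comment, and attention masks are omitted.
<pre>
import torch
import torch.nn as nn

EMB_DIM = 300                                       # GloVe embedding dimensionality
SRC_VOCAB, TGT_VOCAB, MAX_LEN = 30_000, 100, 512    # placeholder sizes

# 2 encoder and 2 decoder layers, 10 attention heads, feed-forward size 512, dropout 0.1
model = nn.Transformer(d_model=EMB_DIM, nhead=10,
                       num_encoder_layers=2, num_decoder_layers=2,
                       dim_feedforward=512, dropout=0.1, batch_first=True)
src_emb = nn.Embedding(SRC_VOCAB, EMB_DIM)   # would be initialised from GloVe vectors
tgt_emb = nn.Embedding(TGT_VOCAB, EMB_DIM)
pos_emb = nn.Embedding(MAX_LEN, EMB_DIM)     # learned positional embedding (see Appendix)

params = (list(model.parameters()) + list(src_emb.parameters())
          + list(tgt_emb.parameters()) + list(pos_emb.parameters()))
optimizer = torch.optim.Adam(params, lr=5e-4)        # learning rate 0.0005
criterion = nn.CrossEntropyLoss()

def train_step(src_ids, tgt_ids):
    """One teacher-forced step; src_ids: (batch, src_len), tgt_ids: (batch, tgt_len)."""
    pos = lambda n: torch.arange(n).unsqueeze(0)
    src = src_emb(src_ids) + pos_emb(pos(src_ids.size(1)))
    tgt = tgt_emb(tgt_ids[:, :-1]) + pos_emb(pos(tgt_ids.size(1) - 1))
    out = model(src, tgt)                            # causal masks omitted for brevity
    logits = out @ tgt_emb.weight.T                  # tie output projection to embeddings
    loss = criterion(logits.reshape(-1, TGT_VOCAB), tgt_ids[:, 1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(params, 1.0)            # gradient norm clipped to 1.0
    optimizer.step()
    return loss.item()
</pre>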
For the tity and relation types which are occurred less than 10 times comparison of the centralities distribution, we construct the in the dataset. We either omit them or replace them with sim- histogram of the centralities for each graph using 10 fixed ilar or more general types that exists in it. size bins and we utilize Jensen-Shannon Divergence (JSD) To emphasize the application of such model in the sci- metric (Endres and Schindelin 2003) to examine the similar- entific document understanding, we produce a new task- ity of the two distributions (see Appendix for the definition specific dataset called PubMed-DU related to the general of JSD). We have selected degree and eigenvector centrali- health domain. We download paper abstracts from PubMed ties as the former gives as localized structure information as focusing on work related to 4 specific health subdomains: measure the importance of a node based on the direct con- Covid-19, mental health, breast cancer and coronary heart nections of it and the latter gives as a broader structure infor- disease. We split the abstracts into sentences. The entities mation as measure the importance of a node based on infinite and their types for each text have been extracted using Pub- walks. Dataset Model Accuracy F1 score CNN (Nguyen and Grishman 2015)* 0.8156 ± 0.0071 0.9459 ± 0.0021 RNN (Zhou et al. 2016)* 0.8517 ±0.0058 0.9543 ± 0.0021 WebNLG Transformer - unordered 0.8798 ±0.0053 0.9646 ± 0.0018 Transformer - BFSrecord label 0.9000 ± 0.0046 0.9699 ± 0.0013 Transformer - WOC k=20 0.9235 ± 0.0014 0.9780 ± 0.0003 CNN (Nguyen and Grishman 2015)* 0.7341 ± 0.0035 0.8385 ± 0.0025 RNN (Zhou et al. 2016)* 0.7520 ± 0.0027 0.8353 ± 0.0029 NYT Transformer - unordered 0.7426 ± 0.0061 0.8009 ± 0.0057 Transformer - BFSperson 0.7491 ± 0.0048 0.8049 ± 0.0073 Transformer - WOC k=8 0.7669 ± 0.0011 0.8307 ± 0.0006 CNN (Nguyen and Grishman 2015)* 0.1096 ± 0.0073 0.4434 ± 0.0133 RNN (Zhou et al. 2016)* 0.2178± 0.0088 0.6192 ± 0.0093 DocRED Transformer - unordered 0.4869 ±0.0069 0.7081 ±0.0032 Transformer - BFSORG 0.5252 ± 0.0048 0.7133 ± 0.0049 Transformer - WOC k=6 0.5722 ± 0.0001 0.7607 ± 0.0001 CNN (Nguyen and Grishman 2015)* 0.5573 ± 0.0030 0.7063 ± 0.0048 RNN (Zhou et al. 2016)* 0.5772 ± 0.0048 0.7234 ± 0.0032 PubMed-DU Transformer - unordered 0.5499 ± 0.0059 0.6725 ± 0.0056 Transformer - BFSSpecies 0.5691 ± 0.0109 0.6752 ± 0.0045 Transformer - WOC k=5 0.5946 ± 0.0001 0.7132 ± 0.0004 Table 2: Comparison of CNN model, RNN model and Transformer-based methods on WebNLG, NYT, DocRED and PubMed- DU datasets. *The architecture of the CNN and RNN models has been modified to exclude the component which provides information about the position of the entities in the text snippet. Instance level evaluation of the models sure that the produced metagraphs are meaningful. Firstly, To study the performance of our model, we perform 10 in- each subdomain should have a connected metagraph and dependent runs each with different random splitting of the secondly each existing relation type is appeared at least two datasets into training, validation and testing set. Table 2 de- times in the provided instances. For the PubMed-DU dataset, picts the median value and the standard error of the baselines we already know the existence of 4 subdomains in it, so we and our method for the two metrics. Our method is better in focus on the inference of them. 
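A small sketch of this metagraph-level comparison, assuming networkx and scipy, is given below; the toy graphs at the end are illustrative only.
<pre>
import networkx as nx
import numpy as np
from scipy.spatial.distance import jensenshannon

def centrality_jsd(g_true, g_pred, kind="degree", bins=10):
    """Jensen-Shannon distance between centrality histograms of two metagraphs."""
    if kind == "degree":
        c_true = list(nx.degree_centrality(g_true).values())
        c_pred = list(nx.degree_centrality(g_pred).values())
    else:
        c_true = list(nx.eigenvector_centrality(g_true, max_iter=1000).values())
        c_pred = list(nx.eigenvector_centrality(g_pred, max_iter=1000).values())
    edges = np.linspace(min(c_true + c_pred), max(c_true + c_pred), bins + 1)  # 10 fixed bins
    h_true, _ = np.histogram(c_true, bins=edges)
    h_pred, _ = np.histogram(c_pred, bins=edges)
    # scipy returns the square root of the divergence; base 2 keeps it bounded by 1
    return jensenshannon(h_true / h_true.sum(), h_pred / h_pred.sum(), base=2)

def set_f1(true_items, pred_items):
    """F1-score over sets, used for both the edge and the node sets of the metagraphs."""
    tp = len(true_items & pred_items)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred_items), tp / len(true_items)
    return 2 * precision * recall / (precision + recall)

# Toy example: entity types as nodes, relation types as undirected edges.
g_true = nx.Graph([("Disease", "Chemical"), ("Disease", "Gene"), ("Disease", "Species")])
g_pred = nx.Graph([("Disease", "Chemical"), ("Disease", "Gene")])
edge_f1 = set_f1({frozenset(e) for e in g_true.edges()}, {frozenset(e) for e in g_pred.edges()})
node_f1 = set_f1(set(g_true.nodes()), set(g_pred.nodes()))
degree_jsd = centrality_jsd(g_true, g_pred, kind="degree")
eigen_jsd = centrality_jsd(g_true, g_pred, kind="eigenvector")
</pre>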
For each subdomain, we se- terms of accuracy for all the four datasets and in terms of F1- lect randomly 100 instances from the testing set that belong score for the WebNLG and DocRED datasets. For the NYT to this subdomain and we attempt to produce the domain and PubMed-DU datasets, the F1-score of CNN and RNN based on them. We infer the relation types for each instance models outperform our approach. We observed that the base- and then we generate the domain’s metagraph by including line models profit from the fact that, in these datasets, the all the relation types that were found in the instances. Then, majority of the instances depict only one relation and many we compare how close the actual domain’s metagraph and of the relations appear in a limited number of instances. In the predicted metagraph are. general, there is lack of sequences of relations that hinders Table 3 presents the results of the evaluation of the pre- the Transformer’s ability to learn the underlying distribution dicted versus the actual domain’s metagraph for 10 subdo- in these two cases(see Appendix). Lastly, the decreased per- mains extracted from the testing set of the WebNLG, NYT formances of all the models in the DocRED dataset is due and DocRED datasets. All the presented values for these to the long tail characteristic that this dataset shows as 66% datasets are the mean over all the 10 subdomains. For the of the relations appeared in no more than 50 instances (see PubMed-DU dataset, we include only the Covid-19 subdo- Appendix). main case. Results for the remaining subdomains of this dataset can be found in the Appendix. Our approach using Metagraph level evaluation of the models Transformer + BFS based ordering outperforms or is close The above comparisons focus only on the ability of the to the baselines for all cases in terms of edges and nodes F1- model to predict the relation types given a text snippet. Since score. Furthermore, the degree and eigenvector centralities our ultimate goal is to infer the domain’s metagraph from a distribution of the generated metagraphs using our method given corpus, we divide the testing sets of the datasets into are closer to the groundtruth in comparison to other meth- small corpora and we attempt to define their domain using ods in all cases. This indicates that the graphs produced with our model. For WebNLG, NYT and DocRED dataset, 10 ar- our method are both element-wise and structurally closer to tificial corpora and their respective domains have been cre- the actual ones. More detailed comparisons of the different ated by selecting randomly 10 instances from each of the methods at both instance and metagraph level have been in- testing sets. We set two constraints into this selection to as- cluded in the Appendix. Edges Nodes Eigenvector Dataset Model Degree JSD F1-score F1-score JSD CNN (Nguyen and Grishman 2015)* 0.9747 0.9879 0.1836 0.2059 RNN (Zhou et al. 2016)* 0.9639 0.9735 0.2708 0.2364 WebNLG Transformer - unordered 0.9598 0.9775 0.2380 0.2280 Transformer - BFSrecord label 0.9806 0.9772 0.1923 0.1593 Transformer - WOC k=5 0.9808 0.9772 0.1765 0.1261 CNN (Nguyen and Grishman 2015)* 0.9059 0.9800 0.0564 0.0832 RNN (Zhou et al. 2016)* 0.9205 1 0 0 NYT Transformer - unordered 0.8184 0.9800 0.0967 0.1396 Transformer - BFSperson 0.8806 0.9666 0 0 Transformer - WOC k=8 0.8672 1 0 0 CNN (Nguyen and Grishman 2015)* 0.4819 0.9019 0.5717 0.6965 RNN (Zhou et al. 
2016)* 0.6823 0.9714 0.5187 0.6954 DocRED Transformer - unordered 0.7530 1 0.2950 0.5997 Transformer - BFSPER 0.7830 0.9777 0.2892 0.4267 Transformer - WOC k=6 0.8045 1 0.2349 0.3688 CNN (Nguyen and Grishman 2015)* 0.9140 0.9888 0.4175 0.4791 RNN (Zhou et al. 2016)* 0.9736 0.9888 0.1002 0.1579 PubMed-DU Transformer - unordered 0.9631 1 0.3253 0.3987 Covid-19 domain Transformer - BFSChemical 0.9583 0.9777 0.2880 0.4136 Transformer - WOC k=5 0.9789 0.9888 0.2048 0.2299 Table 3: Evaluation of metagraph’s reconstruction on the four datasets using CNN, RNN and Transformer-based models. For the PubMed-DU dataset, we focus only on the COVID-19 domain. *The architecture of the CNN and RNN models has been modified to exclude the component which provides information about the position of the entities in the text snippet. The ensemble variant of our approach, based on the WOC we extract 24 text instances from the PubMed-DU dataset consensus strategy, outperforms the simple Transformer + related to the COVID-19 domain. After generating the do- BFS ordering in all cases. Based on the evaluation at both main’s metagraph, we analyze the attention to triples to build instance and metagraph level, our ensemble variant seems a KG. We rely on the syntax dependencies to propagate the to be the most reliable approach for the task of domain’s attention weights throughout the connected tokens and we relation type extraction as it achieves some of the best scores examine the noun chunks to extract the entities of interest for any dataset and metric. based on their accumulated attention weight (see Appendix for further details). We select the head which achieves the Towards automated KG generation best accuracy in order to generate the KG. Figure 3 depicts the generated metagraph and the KG. Using the aforemen- The proposed domain understanding method enables the in- tioned attention analysis, we manage to achieve 82% and ference of the domain of interest and its components. This 64% accuracy in the entity extraction and the relation extrac- enables a partial automation and a speed up of the KG gen- tion respectively. These values might not be able to compete eration process as, without manual intervention, we are able the state of the art respective models and the investigation to identify the metagraph, and inherently the needed mod- is limited in only few instances. Yet it indicates that a com- els for the entity and relation extraction in the context of the pletely unsupervised generation based on attention analysis domain of interest. To achieve this, we adopt a Transformer- is possible and deserves further investigation. based approach that relies heavily on attention mechanisms. Recent efforts are focusing on the analysis of such atten- tion mechanisms to explain and interpret the predictions Conclusion and the quality of the models (Vig and Belinkov 2019; Herein, we proposed a method to speed up the knowledge Hoover, Strobelt, and Gehrmann 2019). Interestingly, it has acquisition process of any domain specific KG application been shown how the analysis of the attention pattern can elu- by defining the domain of interest in an automated manner. cidate complex relations between the entities fed as input to This is achieved by using a Transformer-based approach to the Transformer, e.g., mapping atoms in chemical reactions estimate the metagraph representing the schema of the do- with no supervision (Schwaller et al. 2020). Even if it is out main. 
Such schema can indicate the proper and needed tools of the scope of our current work, we observe that a similar for the actual entity and relation extraction. Thus our method analysis of the attention patterns in our model can identify can be considering as the stepping stone in any KG gen- not only parts of text in which relations exist but directly eration pipeline. The evaluation and the comparison over the entities of the respective triplets. To illustrate this and different datasets against state-of-the-art methods indicates emphasize its application in the domain understanding field, that our approach produces accurately the metagraph. Espe- Figure 3: KG extracted from 24 text snippets related to the COVID-19 domain using our model and the respective attention analysis. Green color means that the respective node/edge exists in both actual and predicted graph, while pink color means that this element exists in the actual but not in the predicted graph. Entities in bold indicate the path that has been extracted from the text “ace2 and tmprss2 variants and expression as candidates to sex and country differences in covid-19 severity in italy.”. cially, in datasets where text instances contain multiple re- Bert: Pre-training of deep bidirectional transformers for lan- lation types our model outperforms the baselines. This is an guage understanding. arXiv preprint arXiv:1810.04805. important observation as text describing multiple relations is Endres, D. M., and Schindelin, J. E. 2003. A new metric for the most common scenario. Based on that and relying on the probability distributions. IEEE Transactions on Information capability of the transformers to catch longer dependencies, theory 49(7):1858–1860. future investigation of how our model performs in larger pieces of texts, like full paragraphs, could be interesting and Fu, T.-J.; Li, P.-H.; and Ma, W.-Y. 2019. Graphrel: Mod- indicate a clearer advantage of our work. The needed defi- eling text as relational graphs for joint entity and relation nition of a general domain for the training phase might be a extraction. In Proceedings of the 57th Annual Meeting of limitation of this method. However, schema and data from the Association for Computational Linguistics, 1409–1418. already existing KGs can be utilized for training purposes. Gardent, C.; Shimorina, A.; Narayan, S.; and Perez- Unsupervised or semi-supervised extension of this work can Beltrachini, L. 2017. Creating training corpora for nlg also be explored in the future to mitigate the issue. micro-planning. In 55th annual meeting of the Association Our work paves the way towards an automated knowl- for Computational Linguistics (ACL). edge acquisition, as our model minimizes the need of hu- Gehring, J.; Auli, M.; Grangier, D.; Yarats, D.; and Dauphin, man intervention in the process. So in the near future the Y. N. 2017. Convolutional sequence to sequence learning. currently needed manual curation can be avoided and lead In Proceedings of the 34th International Conference on Ma- to faster and more accurate knowledge acquisition. Interest- chine Learning-Volume 70, 1243–1252. JMLR. org. ingly, using the PubMed-DU dataset, we underline that our method can be utilized for scientific documents. The infer- Guerini, M.; Magnolini, S.; Balaraman, V.; and Magnini, B. ence of their domain can assist both in their general under- 2018. Toward zero-shot entity recognition in task-oriented standing but also lead to more robust knowledge acquisition conversational agents. 
In Proceedings of the 19th Annual from them. As a side effect, it is also important to notice that, SIGdial Meeting on Discourse and Dialogue, 317–326. such attention-based model can be directly applied to triplet Hoover, B.; Strobelt, H.; and Gehrmann, S. 2019. exbert: extraction from the text without retraining and without su- A visual analysis tool to explore learned representations in pervision. Triplet extraction in an unsupervised way repre- transformers models. arXiv preprint arXiv:1910.05276. sents a breakthrough, especially if combined with most re- Ji, S.; Pan, S.; Cambria, E.; Marttinen, P.; and Yu, P. S. 2020. cent advances in zero-shot learning for NER (Li et al. 2020; A survey on knowledge graphs: Representation, acquisition Pasupat and Liang 2014; Guerini et al. 2018). Further anal- and applications. arXiv preprint arXiv:2002.00388. ysis of our Transformer-based approach could give a better insight into these capabilities. Jozefowicz, R.; Vinyals, O.; Schuster, M.; Shazeer, N.; and Wu, Y. 2016. Exploring the limits of language modeling. References arXiv preprint arXiv:1602.02410. Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural ma- Kingma, D. P., and Ba, J. 2014. Adam: A method for chine translation by jointly learning to align and translate. stochastic optimization. arXiv preprint arXiv:1412.6980. arXiv preprint arXiv:1409.0473. Lalithsena, S.; Kapanipathi, P.; and Sheth, A. 2016. Har- Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; nessing relationships for domain-specific subgraph extrac- Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning tion: A recommendation use case. In 2016 IEEE Interna- phrase representations using rnn encoder-decoder for statis- tional Conference on Big Data (Big Data), 706–715. IEEE. tical machine translation. arXiv preprint arXiv:1406.1078. LeCun, Y.; Bengio, Y.; and Hinton, G. 2015. Deep learning. Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. nature 521(7553):436–444. Li, J.; Sun, A.; Han, J.; and Li, C. 2020. A survey on deep Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence learning for named entity recognition. IEEE Transactions to sequence learning with neural networks. In Advances in on Knowledge and Data Engineering. neural information processing systems, 3104–3112. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019. L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. At- Roberta: A robustly optimized bert pretraining approach. tention is all you need. In Advances in neural information arXiv preprint arXiv:1907.11692. processing systems, 5998–6008. Luan, Y.; He, L.; Ostendorf, M.; and Hajishirzi, H. 2018. Vig, J., and Belinkov, Y. 2019. Analyzing the structure of Multi-task identification of entities, relations, and corefer- attention in a transformer language model. arXiv preprint ence for scientific knowledge graph construction. arXiv arXiv:1906.04284. preprint arXiv:1808.09602. Vinyals, O.; Bengio, S.; and Kudlur, M. 2015. Order Luong, M.-T.; Pham, H.; and Manning, C. D. 2015. Effec- matters: Sequence to sequence for sets. arXiv preprint tive approaches to attention-based neural machine transla- arXiv:1511.06391. tion. arXiv preprint arXiv:1508.04025. Wang, Q.; Li, M.; Wang, X.; Parulian, N.; Han, G.; Ma, J.; Manica, M.; Auer, C.; Weber, V.; Zipoli, F.; Dolfi, M.; Staar, Tu, J.; Lin, Y.; Zhang, H.; Liu, W.; et al. 2020. 
Covid-19 lit- P.; Laino, T.; Bekas, C.; Fujita, A.; Toda, H.; et al. 2019. erature knowledge graph construction and drug repurposing An information extraction and knowledge graph platform report generation. arXiv preprint arXiv:2007.00576. for accelerating biochemical discoveries. arXiv preprint arXiv:1907.08400. Wei, C.-H.; Allot, A.; Leaman, R.; and Lu, Z. 2019. Pubta- tor central: automated concept annotation for biomedical full Marbach, D.; Costello, J. C.; Küffner, R.; Vega, N. M.; Prill, text articles. Nucleic acids research 47(W1):W587–W593. R. J.; Camacho, D. M.; Allison, K. R.; Kellis, M.; Collins, J. J.; and Stolovitzky, G. 2012. Wisdom of crowds for robust Yao, Y.; Ye, D.; Li, P.; Han, X.; Lin, Y.; Liu, Z.; Liu, gene network inference. Nature methods 9(8):796–804. Z.; Huang, L.; Zhou, J.; and Sun, M. 2019. Docred: A Nadeau, D., and Sekine, S. 2007. A survey of named entity large-scale document-level relation extraction dataset. arXiv recognition and classification. Lingvisticae Investigationes preprint arXiv:1906.06127. 30(1):3–26. Zaki, M. J., and Meira, W. 2014. Data mining and analysis: Nguyen, T. H., and Grishman, R. 2015. Relation extrac- fundamental concepts and algorithms. Cambridge Univer- tion: Perspective from convolutional neural networks. In sity Press. Proceedings of the 1st Workshop on Vector Space Modeling Zeng, D.; Liu, K.; Lai, S.; Zhou, G.; and Zhao, J. 2014. for Natural Language Processing, 39–48. Relation classification via convolutional deep neural net- Pasupat, P., and Liang, P. 2014. Zero-shot entity extraction work. In Proceedings of COLING 2014, the 25th Interna- from web pages. In Proceedings of the 52nd Annual Meeting tional Conference on Computational Linguistics: Technical of the Association for Computational Linguistics (Volume 1: Papers, 2335–2344. Dublin, Ireland: Dublin City University Long Papers), 391–401. and Association for Computational Linguistics. Pennington, J.; Socher, R.; and Manning, C. D. 2014. Glove: Zeng, X.; Zeng, D.; He, S.; Liu, K.; and Zhao, J. 2018. Ex- Global vectors for word representation. In Empirical Meth- tracting relational facts by an end-to-end neural model with ods in Natural Language Processing (EMNLP), 1532–1543. copy mechanism. In Proceedings of the 56th Annual Meet- Pust, M.; Hermjakob, U.; Knight, K.; Marcu, D.; and May, J. ing of the Association for Computational Linguistics (Vol- 2015. Parsing english into abstract meaning representation ume 1: Long Papers), 506–514. using syntax-based machine translation. In Proceedings of Zhang, Y.; Zhong, V.; Chen, D.; Angeli, G.; and Manning, the 2015 Conference on Empirical Methods in Natural Lan- C. D. 2017. Position-aware attention and supervised data guage Processing, 1143–1154. improve slot filling. In Proceedings of the 2017 Conference Radford, A.; Narasimhan, K.; Salimans, T.; and Sutskever, on Empirical Methods in Natural Language Processing, 35– I. 2018. Improving language understanding by generative 45. pre-training. Zhao, Z.; Han, S.-K.; and So, I.-M. 2018. Architecture Riedel, S.; Yao, L.; and McCallum, A. 2010. Modeling of knowledge graph construction techniques. International relations and their mentions without labeled text. In Joint Journal of Pure and Applied Mathematics 118(19):1869– European Conference on Machine Learning and Knowledge 1883. Discovery in Databases, 148–163. Springer. Zheng, S.; Wang, F.; Bao, H.; Hao, Y.; Zhou, P.; and Xu, B. Schwaller, P.; Laino, T.; Gaudin, T.; Bolgar, P.; Hunter, 2017. 
Joint extraction of entities and relations based on a C. A.; Bekas, C.; and Lee, A. A. 2019. Molecular trans- novel tagging scheme. arXiv preprint arXiv:1706.05075. former: A model for uncertainty-calibrated chemical reac- Zhou, P.; Shi, W.; Tian, J.; Qi, Z.; Li, B.; Hao, H.; and Xu, B. tion prediction. ACS Central Science 5(9):1572–1583. 2016. Attention-based bidirectional long short-term mem- Schwaller, P.; Hoover, B.; Reymond, J.-L.; Strobelt, H.; ory networks for relation classification. In Proceedings of and Laino, T. 2020. Unsupervised attention-guided atom- the 54th Annual Meeting of the Association for Computa- mapping. tional Linguistics (Volume 2: Short Papers), 207–212. Appendix weights of a token based on the attention weights of it- Learned positional encoding self and its neighbors in the syntax dependencies graph. Let ar be the attention vector of a predefined model’s attention In our model, we adopt a learned positional encoding instead head, which contains all the attention weights related to the of a static one. Specifically, the tokens are passed through a relation type r. The final attention weight w of the token i standard embedding layer as a first step in the encoder. The for the relation r is defined as: model has no recurrent layers and therefore it has no idea about the order of the tokens within the sequence. To over- X wir = 2 ∗ ari + arj come this, we utilize a second embedding layer called a po- j∈neigi sitional embedding layer. This is a standard embedding layer where the input is not the token itself but the position of the where neigi is the set containing all the neighbors of i in token within the sequence, starting with the first token, the the syntax dependencies graph. Then for each noun chunk k(start of sequence) token, in position 0. The position (nk ) of the text we compute its total attention weight for the embedding has a ”vocabulary” size equal to the maximum relation type r as: length of the input sequence. The token embedding and po- X sitional embedding are element-wise summed together to get nrk = f (wjr ) the final token embedding which contains information about j∈nck both the token and its position within the sequence. This fi- where nck is the set of tokens which belong to the nk and nal token embedding is then provided as input in the stack f is a function defined as of attention layers of the encoder. r wj if j is stop-word Dataset characteristics f (wjr ) = 2 ∗ wjr if j is not stop-word For a better understanding of the datasets, we analyzed the distribution of occurrences for all relation types. These dis- Finally, we extract as entities which are connected via the tributions are depicted in Figure 4. A percentage of relation relation type r the two noun chunks with the highest weight types with number of appearances close or less to 10 is ob- nr . As this work is a proof of concept rather than an ac- served for all datasets. The lack of many examples can pose tual method, the selection of the attention head is based on problems in the learning process for these specific relation whichever gives as the best outcome. Yet, in actual scenarios types. This is highlighted especially in the DocRED case, as it is recommended the use of a training set, based on which we attributed the decreased performances of all the models the optimal head will be identified. For the creation of the in this dataset in its long tail characteristic that it holds. 
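The attention-propagation and noun-chunk scoring defined above can be sketched as follows, assuming spaCy for the syntax dependencies and noun chunks; the per-token attention weights are supplied by the caller (in the paper they come from a selected head of the trained Transformer).
<pre>
import spacy

nlp = spacy.load("en_core_web_sm")   # the paper uses the en_core_web_lg model

def extract_pair(text, attention):
    """Return the two noun chunks with the highest accumulated attention weight.

    `attention` maps token index -> a_i^r, the attention weight of token i for the
    predicted relation type r, taken from one attention head of the model.
    """
    doc = nlp(text)
    # w_i^r = 2 * a_i^r + sum of a_j^r over the syntactic neighbours of i
    w = {}
    for tok in doc:
        neighbours = list(tok.children) + ([tok.head] if tok.head.i != tok.i else [])
        w[tok.i] = 2 * attention.get(tok.i, 0.0) + sum(attention.get(nb.i, 0.0)
                                                       for nb in neighbours)
    # n_k^r = sum over the chunk's tokens of f(w_j^r), doubling non-stop-words
    scored = []
    for chunk in doc.noun_chunks:
        score = sum((1 if tok.is_stop else 2) * w[tok.i] for tok in chunk)
        scored.append((score, chunk.text))
    scored.sort(reverse=True)
    return [chunk_text for _, chunk_text in scored[:2]]

# Example (dummy attention weights, aligned to spaCy token indices):
# extract_pair("ace2 and tmprss2 variants and expression in covid-19 severity.",
#              {0: 0.40, 2: 0.35, 8: 0.10})
</pre>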
Es- syntax dependencies graph and the extraction of the noun pecially for the DocRED, almost the 50% of the relations chunks of the text we use spacy1 and its en core web lg appeared in no more than 10 instances and the 66% of the pretrained model. Table 4 includes all the texts that have relations appeared in no more than 50 instances (4). used in the proof of concept that is presented in the main paper and the respective predicted triplets for each of them. Jensen-Shannon distance Models’s comparison The Jensen-Shannon divergence metric between two proba- bility vectors p and q is defined as: Table 5 presents a detailed evaluation of our approach and the baselines models. We have included the 3 best BFS or- dering variants (in terms of accuracy) and 3 consensus vari- r D(p k m) + D(q k m) ants. To cover the range of all the available values of k, [1, 2 number of entity types], we select a case with just a few where m is the pointwise mean of p and q and D is the Transformers, one with a value close to half of the total Kullback-Leibler divergence. number of entity types and one close to the total number The Kullback-Leibler divergence for two probability vec- of entity types. For each different k value, we utilize the top tors p and q of length n is defined as: k best orderings based on their accuracy. In addition to the n per instance accuracy and the per relation F1-score, the table X pi also includes the per relation precision and the recall of each D(p k q) = pi log2 ( ) i=1 qi model. Our proposed method, especially its ensemble vari- ant, produces the best outcome in all datasets apart from the The Jensen–Shannon metric is bounded by 1, given that NYT case where the CNN and RNN models manage to be we use the base 2 logarithm. more precise. This is attributed to the characteristics of NYT Triplets extraction based on attention analysis dataset, where there are many one-only relation instances. Similarly, in tables 6 and 7 we perform an in-depth meta- In this section, the procedure of automated triplets extrac- graph level evaluation of the models. For all three datasets, tion based on the predicted relation types and the respective our proposed method and its ensemble extension produce attention weights is described . We generate an undirected the best or one of the top-3 best outcomes. graph that connects the tokens of the sentence based on their syntax dependencies for each instance. Then for each dif- 1 ferent predicted relation type, we define the final attention https://spacy.io/ (a) WebNLG (b) NYT (c) DocRED (d) PubMed-DU Figure 4: Appearances distribution for the relation types of all utilized datasets. Text Predicted triplets acute deep vein thrombosis in covid-19 hospitalized patients. (thrombosis, Disease.to.Disease, COVID 19) (ace-2) receptor for its attachment similar to sars-cov-1, which is followed by priming of spike protein by transmembrane protease serine 2 (spike, Gene.to.Gene, Transmembrane protease serine 2) (tmprss2) which can be targeted by a proven inhibitor of tmprss2, camostat. temporal trends in decompensated heart failure and outcomes during covid-19: (heart failure, Disease.to.Disease, COVID-19) a multisite report from heart failure referral centres in london. prevalence, risk factors and clinical correlates of depression in quarantined (depression, Disease.to.Disease, COVID-19) population during the covid-19 outbreak. tocilizumab plus glucocorticoids in severe and critically covid-19 patients. 
(Tocilizumab, Chemical.to.Species, patients) effects of progressive muscle relaxation on anxiety and sleep quality in (anxiety, Disease.to.Disease, COVID-19) patients with covid-19. special section: covid-19 among people living with hiv. (people, Disease.to.Species, HIV) ace2 and tmprss2 variants and expression as candidates to sex and country (ACE2, Gene.to.Gene, TMPRSS2) differences in covid-19 severity in italy. (COVID-19, Disease.to.Gene, TMPRSS2) did the covid-19 pandemic cause a delay in the diagnosis of acute appendicitis? (COVID-19, Disease.to.Disease, Acute Appendicitis) practice observed in managing gynaecological problems in post-menopausal (COVID-19, Disease.to.Species, women) women during the covid-19 pandemic. hypogammaglobulinemia causing pneumonia: an overlooked curable entity (hypogammaglobulinemia, Disease.to.Disease, COVID-19) in the chaotic covid-19 pandemic. chemokine receptor gene polymorphisms and covid-19: could knowledge (AIDS, Disease.to.Gene, Chemokine receptor) gained from hiv/aids be important? serotonin syndrome in two covid-19 patients treated with lopinavir/ritonavir. (lopinavir/ritonavir, Chemical.to.Species, patients) mortality rate and predictors of mortality in hospitalized covid-19 (COVID-19, Disease.to.Disease, Diabetes) patients with diabetes. in conclusion, self-reported depression occurred at an early stage in convalescent covid-19 patients, and changes in immune function were apparent during (depression, Disease.to.Disease, COVID-19) short-term follow-up of these patients after discharge. gb-2 inhibits ace2 and tmprss2 expression: in vivo and in vitro studies. (ACE2, Gene.to.Gene, TMPRSS2) preemptive interleukin-6 blockade in patients with covid-19. (COVID-19, Disease.to.Gene, interleukin-6) response to: ’clinical course of covid-19 in patients with systemic lupus (hydroxychloroquine, Chemical.to.Species, patients) erythematosus under long-term treatment with hydroxychloroquine’ by carbillon et al. obsessive-compulsive disorder during the covid-19 pandemic (Obsessive-compulsive disorder, Disease.to.Disease, COVID-19) repeated monitoring of ferritin, interleukin-6, c-reactive protein, lactic acid dehydrogenase, and erythrocyte sedimentation rate during covid-19 treatment may assist the prediction of (COVID-19, Disease.to.Gene, interleukin-6) disease severity and evaluation of treatment effects. respiratory and pulmonary complications in head and neck cancer patients: evidence-based (head and neck cancer, Disease.to.Disease, COVID-19) review for the covid-19 era. targeting the immune system for pulmonary inflammation and cardiovascular complications (Cardiovascular Complications, Disease.to.Disease, COVID-19) in covid-19 patients. risk of peripheral arterial thrombosis in covid-19. (Thrombosis, Disease.to.Disease, COVID-19) preadmission diabetes-specific risk factors for mortality in hospitalized patients with (Diabetes, Disease.to.Disease, COVID-19) diabetes and coronavirus disease 2019. Table 4: Utilized texts from PubMed-DU dataset and their respective extracted triplets based on attention analysis. Dataset Model Accuracy Precision Recall F1 score CNN (Nguyen and Grishman 2015)* 0.8156 ± 0.0071 0.9550 ± 0.0029 0.9370 ± 0.0049 0.9459 ± 0.0021 RNN (Zhou et al. 
2016)* 0.8517 ±0.0058 0.9614 ±0.0022 0.9472±0.0043 0.9543 ±0.0021 Transformer - unordered 0.8798 ±0.0053 0.9678 ±0.0032 0.9614 ±0.0042 0.9646 ±0.0018 Transformer - BFSoccupation 0.8987±0.0068 0.9693±0.0030 0.9705 ± 0.0035 0.9699 ± 0.0018 WebNLG Transformer - BFSmusic genre 0.8983 ± 0.0053 0.9694 ±0.0039 0.9703 ±0.0030 0.9699 ± 0.0015 Transformer - BFSrecord label 0.9000 ± 0.0046 0.9707 ± 0.0031 0.9691 ±0.0028 0.9699 ±0.0013 Transformer - WOC k=5 0.9210 ± 0.0017 0.9789 ±0.0016 0.9755 ± 0.0013 0.9772 ± 0.0004 Transformer - WOC k=20 0.9235 ± 0.0014 0.9786 ±0.0006 0.9774 ± 0.0004 0.9780 ±0.0003 Transformer - WOC k=45 0.9235 ± 0.0002 0.9795 ±0.0001 0.9767 ±0.0001 0.9781 ±0.0001 CNN (Nguyen and Grishman 2015)* 0.7341 ± 0.0035 0.8873 ± 0.0032 0.7948 ± 0.0053 0.8385 ± 0.0025 RNN (Zhou et al. 2016)* 0.7520 ± 0.0027 0.8530 ±0.0041 0.8183 ±0.0047 0.8353 ±0.0029 Transformer - unordered 0.7426 ± 0.0061 0.8059 ±0.0079 0.7960 ± 0.0070 0.8009 ±0.0057 Transformer - BFSlocation 0.7461 ± 0.0053 0.8093 ±0.0067 0.7988 ±0.0067 0.8040 ±0.0043 NYT Transformer - BFSperson 0.7491 ±0.0048 0.8119 ± 0.0036 0.8022 ±0.0032 0.8049 ± 0.0073 Transformer - BFScompany 0.7461 ±0.0081 0.8056 ±0.0088 0.8043 ±0.0117 0.8049 ±0.0073 Transformer - WOC k=4 0.7547 ±0.0038 0.8162 ± 0.0047 0.8355 ± 0.0022 0.8257 ±0.0023 Transformer - WOC k=8 0.7669 ±0.0011 0.8325 ± 0.0025 0.8289 ±0.0018 0.8307 ± 0.0006 Transformer - WOC k=12 0.7698 ± 0.0007 0.8381 ±0.0017 0.8296 ±0.0009 0.8320 ±0.0004 CNN (Nguyen and Grishman 2015)* 0.1096 ± 0.0073 0.7838 ± 0.0081 0.3094 ± 0.0140 0.4434 ± 0.0133 RNN (Zhou et al. 2016)* 0.2178± 0.0088 0.7716 ±0.0100 0.5173 ±0.0143 0.6192 ± 0.0093 Transformer - unordered 0.4869 ±0.0069 0.7365 ±0.0156 0.6822 ±0.0127 0.7081 ±0.0032 Transformer - BFSLOC 0.5235 ± 0.0049 0.7145 ± 0.0112 0.7053 ±0.0076 0.7098 ± 0.0040 DocRED Transformer - BFSPER 0.5234 ± 0.0077 0.7234 ± 0.0179 0.7029 ± 0.0091 0.7128 ±0.0060 Transformer - BFSORG 0.5252 ± 0.0048 0.7216 ± 0.0068 0.7053 ± 0.0069 0.7133 ± 0.0049 Transfomer - WOC k=4 0.5678 ±0.0037 0.7939 ±0.0137 0.7227 ±0.0103 0.7564 ± 0.0032 Transformer - WOC k=5 0.5697 ± 0.0016 0.8012 ±0.0112 0.7210 ±0.0053 0.7589 ±0.0025 Transformer - WOC k=6 0.5722 ± 0.0001 0.7970 ± 0.0035 0.7276 ± 0.0022 0.7607 ± 0.0001 CNN (Nguyen and Grishman 2015)* 0.5573 ± 0.0030 0.7856 ± 0.0063 0.6417 ± 0.0100 0.7063 ± 0.0048 RNN (Zhou et al. 2016)* 0.5772 ± 0.0041 0.7401 ± 0.0051 0.7075 ± 0.0051 0.7234 ± 0.0032 Transformer - unordered 0.5693 ± 0.0075 0.6920 ± 0.0099 0.6837 ± 0.0111 0.6877 ± 0.0047 Transformer - BFSSpecies 0.5707 ± 0.0060 0.6878 ± 0.0084 0.6921 ± 0.0097 0.6898 ± 0.0040 PubMed-DU Transformer - BFSChemical 0.5703 ± 0.0048 0.6864 ± 0.0056 0.6942 ± 0.0057 0.6903 ± 0.0048 Transformer - BFSGene 0.5718 ± 0.0068 0.6872 ± 0.0058 0.6906 ± 0.0110 0.6889 ± 0.0065 Transformer - WOC k=3 0.5909 ± 0.0070 0.7261 ± 0.0162 0.6958 ± 0.0031 0.7106 ± 0.0094 Transformer - WOC k=4 0.5685 ± 0.0069 0.6864 ± 0.0109 0.7418 ± 0.0033 0.7130 ± 0.0074 Transformer - WOC k=5 0.5960 ± 0.0001 0.7346 ± 0.0028 0.6974 ± 0.0025 0.7155 ± 0.0004 Table 5: Comparison of CNN, RNN and Transformer-based methods on WebNLG, NYT, DocRED and PubMed-DU datasets for the relation type extractiont task. *The architecture of the CNN and RNN models has been modified to exclude the component which provides information about the position of the entities in the text snippet. Edges Nodes Eigenvector Dataset Model Degree JSD F1-score F1-score JSD CNN (Nguyen and Grishman 2015)* 0.9747 0.9879 0.1836 0.2059 RNN (Zhou et al. 
2016)* 0.9639 0.9735 0.2708 0.2364 Transformer - unordered 0.9598 0.9775 0.2380 0.2280 Transformer - BFSoccupation 0.9831 0.9805 0.1743 0.1840 WebNLG Transformer - BFSmusic genre 0.9755 0.9746 0.1131 0.1306 Transformer - BFSrecord label 0.9806 0.9772 0.1923 0.1593 Transformer - WOC k=5 0.9808 0.9772 0.1765 0.1261 Transformer - WOC k=20 0.9864 0.9840 0.1456 0.070 Transformer - WOC k=45 0.9930 0.9916 0.1313 0.0891 CNN (Nguyen and Grishman 2015)* 0.9059 0.9800 0.0564 0.0832 RNN (Zhou et al. 2016)* 0.9205 1 0 0 Transformer - unordered 0.8184 0.9800 0.0967 0.1396 Transformer - BFSlocation 0.8141 0.9800 0.0832 0.0832 NYT Transformer - BFSperson 0.8806 0.9666 0 0 Transformer - BFScompany 0.8305 0.9657 0 0 Transformer - WOC k=4 0.8442 1 0 0 Transformer - WOC k=8 0.8672 1 0 0 Transformer - WOC k=12 0.8666 1 0 0 CNN (Nguyen and Grishman 2015)* 0.4819 0.9019 0.5717 0.6965 RNN (Zhou et al. 2016)* 0.6823 0.9714 0.5187 0.6954 Transformer - unordered 0.7530 1 0.2950 0.5997 Transformer - BFSLOC 0.7710 1 0.2744 0.5140 DocRED Transformer - BFSPER 0.7830 0.9777 0.2892 0.4267 Transformer - BFSORG 0.7243 0.9777 0.2892 0.4267 Transformer - WOC k=4 0.7997 1 0.2714 0.4787 Transformer - WOC k=5 0.8090 1 0.2673 0.4433 Transformer - WOC k=6 0.8045 1 0.2349 0.3688 Table 6: Evaluation of metagraph’s reconstruction on WebNLG, NYT and DocRED datasets using CNN, RNN and Transformer- based models. *The architecture of the CNN and RNN models has been modified to exclude the component which provides information about the position of the entities in the text snippet. Edges Nodes Eigenvector Domain Model Degree JSD F1-score F1-score JSD CNN (Nguyen and Grishman 2015)* 0.9140 0.9888 0.4175 0.4791 RNN (Zhou et al. 2016)* 0.9736 0.9888 0.1002 0.1579 Transformer - unordered 0.9631 1 0.3253 0.3987 Transformer - BFSSpecies 0.9583 0.9777 0.2880 0.4136 Covid-19 Transformer - BFSChemical 0.9583 0.9777 0.2880 0.4136 Transformer - BFSGene 0.9525 0.9777 0.3344 0.2919 Transformer - WOC k=3 0.9531 0.9777 0.4271 0.4096 Transformer - WOC k=4 0.9642 0.9777 0.4271 0.4096 Transformer - WOC k=5 0.9789 0.9888 0.2048 0.2299 CNN (Nguyen and Grishman 2015)* 0.9135 1 0.3575 0.3987 RNN (Zhou et al. 2016)* 0.9730 1 0.1962 0.2594 Transformer - unordered 0.9367 0.9777 0.4534 0.4879 Transformer - BFSSpecies 0.9531 0.9777 0.4453 0.4879 Breast cancer Transformer - BFSChemical 0.9531 0.9777 0.4453 0.4879 Transformer - BFSGene 0.9379 0.9666 0.5385 0.5343 Transformer - WOC k=3 0.9525 0.9777 0.4089 0.4068 Transformer - WOC k=4 0.9584 0.9777 0.3999 0.4414 Transformer - WOC k=5 0.9514 0.9888 0.3821 0.3987 CNN (Nguyen and Grishman 2015)* 0.8902 0.9777 0.4349 0.5865 RNN (Zhou et al. 2016)* 0.9498 0.9666 0.3661 0.3391 Transformer - unordered 0.9484 0.9777 0.4025 0.5153 Transformer - BFSSpecies 0.9703 0.9777 0.2586 0.2558 Coronary heart diseases Transformer - BFSChemical 0.9703 0.9777 0.2586 0.2558 Transformer - BFSGene 0.9756 0.9777 0.1659 0.1571 Transformer - WOC k=3 0.9644 0.9777 0.2705 0.3023 Transformer - WOC k=4 0.9703 0.9777 0.2705 0.3023 Transformer - WOC k=5 0.9644 0.9777 0.2705 0.3023 CNN (Nguyen and Grishman 2015)* 0.9419 1 0.2477 0.4634 RNN (Zhou et al. 
2016)* 0.9924 1 0.067 0.0663 Transformer - unordered 0.9697 1 0.2028 0.2601 Transformer - BFSSpecies 0.9765 1 0.1503 0.3568 Mental health Transformer - BFSChemical 0.9765 1 0.1503 0.3568 Transformer - BFSGene 0.9849 1 0.1333 0.1852 Transformer - WOC k=3 0.9765 1 0.1503 0.3042 Transformer - WOC k=4 0.9765 1 0.1503 0.3568 Transformer - WOC k=5 0.9840 1 0.0840 0.2378 Table 7: Evaluation of metagraph’s reconstruction on the 4 predefined subdomains of PubMed-DU dataset using CNN, RNN and Transformer-based models. *The architecture of the CNN and RNN models has been modified to exclude the component which provides information about the position of the entities in the text snippet.