=Paper= {{Paper |id=Vol-2831/paper13 |storemode=property |title=Understood in Translation: Transformers for Domain Understanding |pdfUrl=https://ceur-ws.org/Vol-2831/paper13.pdf |volume=Vol-2831 |authors=Dimitrios Christofidellis,Matteo Manica,Leonidas Georgopoulos,Hans Vandierendonck |dblpUrl=https://dblp.org/rec/conf/aaai/Christofidellis21 }} ==Understood in Translation: Transformers for Domain Understanding== https://ceur-ws.org/Vol-2831/paper13.pdf
           Understood in Translation: Transformers for Domain Understanding

   Dimitrios Christofidellis,1, 2 Matteo Manica,1 Leonidas Georgopoulos,1 Hans Vandierendonck 2
                                                           1
                                                  IBM Research Europe
                                                       2
                                                 Queen’s University Belfast
                dic@zurich.ibm.com, tte@zurich.ibm.com, leg@zurich.ibm.com, h.vandierendonck@qub.ac.uk



                            Abstract                                     The knowledge acquisition process is either referred to
                                                                     KG construction, where the KG is built from scratch us-
  Knowledge acquisition is the essential first step of any
  Knowledge Graph (KG) application. This knowledge can be            ing a specific corpus, or KG specification, where a subgraph
  extracted from a given corpus (KG generation process) or           of interest is extracted from an existing KG. In both cases,
  specified from an existing KG (KG specification process). Fo-      the acquisition process can follow a bottom-up or top-down
  cusing on domain specific solutions, knowledge acquisition         approach (Zhao, Han, and So 2018). In a bottom-up ap-
  is a labor intensive task usually orchestrated and supervised      proach, all the entities and their connections are extracted
  by subject matter experts. Specifically, the domain of interest    as a first step of the process. Then, the underlying hierarchy
  is usually manually defined and then the needed generation         and structure of the domain can be inferred from the enti-
  or extraction tools are utilized to produce the KG. Herein,        ties and their connections. Conversely, a top-down approach
  we propose a supervised machine learning method, based on          starts with the definition of the domain’s schema that is then
  Transformers, for domain definition of a corpus. We argue
                                                                     used to guide the extraction of the needed entities and con-
  why such automated definition of the domain’s structure is
  beneficial both in terms of construction time and quality of       nections. For general KG generation, a bottom-up approach
  the generated graph. The proposed method is extensively vali-      is usually preferred as we typically wish to include all en-
  dated on three public datasets (WebNLG, NYT and DocRED)            tities and relations that we can extract from the given cor-
  by comparing it with two reference methods based on CNNs           pus. Contrarily, a top-down approach better suits a domain-
  and RNNs models. The evaluation shows the efficiency of            specific KG generation or KG specification, where entities
  our model in this task. Focusing on scientific document un-        and relations are strongly linked to the domain of interest.
  derstanding, we present a new health domain dataset based on       The structure of typical bottom-up and top-down pipelines,
  publications extracted from PubMed and we successfully uti-        focusing on the case of KG generation, are presented in fig-
  lize our method on this. Lastly, we demonstrate how this work      ures 1a and 1b respectively.
  lays the foundation for fully automated and unsupervised KG
  generation.                                                            Herein, we focus on domain-specific, i.e., top-down, ac-
                                                                     quisition for two main reasons. Firstly, the acquisition pro-
                                                                     cess can be faster and more accurate in this way. By specify-
                        Introduction                                 ing the schema of the domain of interest, then we only need
Knowledge Graphs (KGs) are among the most popu-                      to select the proper and needed tools (i.e. pretrained mod-
lar data management paradigms and their application is               els) for the actual entity and relation extraction. Secondly,
widespread across different fields, e.g., recommendation             such approach minimizes the presence of irrelevant data and
systems, question-answering tools and knowledge discovery            restricts queries and graph operations to a carefully tailored
applications. This is due to the fact that KGs share simul-          KG. This generally improves the accuracy of KG applica-
taneously several advantages of databases (information re-           tions (Lalithsena, Kapanipathi, and Sheth 2016). Further-
trieval via structured queries), graphs (representing loosely        more, the graph’s size is significantly reduced by excluding
or irregularly structured data) and knowledge bases (repre-          irrelevant content. Thus, execution time of queries can be
senting semantic relationship among the data). KG research           reduced by more than one order of magnitude (Lalithsena,
can be divided in two main streams (Ji et al. 2020): knowl-          Kapanipathi, and Sheth 2016).
edge representation learning, which investigates the repre-              The domain definition is usually performed by subject
sentation of KG into vector representations (KG embed-               matter experts. Yet, knowledge acquisition by expert cu-
dings), and knowledge acquisition, which considers the KG            ration can be extremely slow as the process is essentially
generation process. The latter being a fundamental aspect            manual. Moreover, human error may affect the data qual-
since a malformed graph will not be able to serve reliably           ity and lead to malformed KGs. In this work, we propose
any kind of downstream task.                                         to overcome these issues by introducing an automated ma-
Copyright ©2021 for this paper by its authors. Use permitted under   chine learning-based approach to understand the domain of
Creative Commons License Attribution 4.0 International (CC BY        a collection of text snippets. Specifically, given sample in-
4.0).                                                                put texts, we infer the schema of the domain to which they
                              (a) Bottom-up pipeline.                         (b) Top-down pipeline.

                             Figure 1: Typical pipelines of bottom-up and top-down KG generation.


belong. This task can be incorporated into both domain-            relations in a text given its content and information about
specific KG generation and KG specification process, where         the position of entities in it. The positional information of
the domain definition is the essential first step. For the KG      the entities is typically extracted in a previous step of KG
generation, the input texts can be samples from the corpus         generation using NER methods (Nadeau and Sekine 2007).
of interest, while for the KG specification, these text snip-      Lately, there is a high interest of methods that can combine
pets can express possible questions that need to be answered       the NER and relation extraction tasks into a single model
from the specified KG. We introduce a seq2seq-based model          (Zheng et al. 2017; Zeng et al. 2018; Fu, Li, and Ma 2019).
relying on transformer architecture to infer the relation types       While our work is linked to relation extraction it has two
characterizing the domain of interest. Such model lets us          major differences. Firstly, we focus on the relation type and
to define the domain’s schema including all the needed en-         the entity types that compose a relation rather than the actual
tity and relation types. The model can be trained using any        triplet. Secondly, the training process differs and requires
available previous schema (i.e., schema of a general KG like       coarser annotations. We solely provide texts and the respec-
DBpedia) and respective text examples for each possible re-        tive existing sequence of relation types. Contrarily in a typ-
lation type. We show that our proposed model outperforms           ical relation extraction training process, information about
other baseline approaches, it can be successfully utilized for     the position of the entities in the text is also needed. Here,
scientific documents and it has interesting potential exten-       we propose to improve knowledge acquisition by perform-
sion in the field of automated KG generation.                      ing a data-driven domain definition providing an approach
                                                                   that is currently unexplored in KG research.
                      Related work
                                                                            Seq2seq-based model for domain
At the best of our knowledge, our method is the first attempt
to introduce a supervised machine learning based domain
                                                                                    understanding
understanding tool that can be incorporated into domain-           The domain understanding task attempts to uncover the
specific KG generation and specification pipelines. Cur-           structured knowledge underlying a dataset. In order to de-
rently, the main research interest related to KG generation        pict this structure we can leverage the so called domain’s
workflows is associated with attempts to improve the named         metagraph. A domain’s metagraph is a graph that has as
entity recognition (NER) and the relation extraction tasks or      vertices all the entity types and as edges all their connec-
provide end-to-end pipelines for general or domain-specific        tions/relations in the context of this domain. The generation
KG generation (Ji et al. 2020). The majority of such work          of such a metagraph entails obtaining all the entity types and
focuses on the actual generation step and rely solely on man-      their relations. Assuming that each of entity types that are
ual identification of the domain definition (Luan et al. 2018;     presented in the domain has at least one interaction with an-
Manica et al. 2019; Wang et al. 2020).                             other entity type, the metagraph of this domain can be pro-
   As it concerns the KG specification field, the subgraph         duced by inferring all the possible relation types as all the
extraction is usually based on graph traversals or more so-        entity types are included in at least one of them. Thus, our
phisticated heuristics techniques and some providing initial       approach aims to build an accurate model to detect a do-
entities or entity types (Lalithsena, Kapanipathi, and Sheth       main’s relation types, and leverages this model to extract
2016). Such approaches are effective, yet a significant en-        those relations from a given corpus. Aggregating all ex-
gineering effort is required to tune the heuristics for each       tracted relations yields the domain’s metagraph.
different case. Let alone, the crucial task of proper selec-
tion of the initial entities or entity types is mainly performed   Seq2seq model for domain’s relation types
manually.                                                          extraction
   The relation extraction task is also related to our work.       Sequence to sequence models (seq2seq) (Bahdanau, Cho,
It aims at the extraction of triplets of the form of (subject,     and Bengio 2014; Cho et al. 2014; Sutskever, Vinyals, and
relation, object) from the texts. The neural network based         Le 2014; Jozefowicz et al. 2016) attempt to learn the map-
methods, such as Nguyen and Grishman; Zhou et al.; Zhang           ping from an input X to its corresponding target Y where
et al., dominate the field. These methods are CNN (Zeng et         both of them are represented by sequences. To achieve this
al. 2014; Nguyen and Grishman 2015) or LSTM (Zhou et al.           they follow an encoder-decoder based approach. Encoders
2016; Zhang et al. 2017) models, which attempt to identify         and decoders can be recurrent neural networks (Cho et al.
                      Figure 2: Architecture of our utilized Transformer model for domain understanding.


2014) or convolutional based neural networks (Gehring et           entities and their relations.
al. 2017). In addition, an attention mechanism can also be            Based on the given training set, we consider that the
incorporated into the encoder (Bahdanau, Cho, and Bengio           model is aware of a general domain anatomy, i.e., the sets
2014; Luong, Pham, and Manning 2015) for further boosting          of possible entity types and relation types are known, and
of the model’s performance. Lately, Transformer architec-          we would like to identify which of them are depicted in a
tures (Vaswani et al. 2017; Devlin et al. 2018; Liu et al. 2019;   given corpus. In both cases of EntityPairOverlap and
Radford et al. 2018), a family of models whose components          SingleEntityOverlap type text snippets, there is one
are entirely made up of attention layers, linear layers and        main entity type from which all the other entity types can be
batch normalization layers, have established themselves as         found by performing only one hop traversal in the general
the state of the art for sequence modeling, outperforming          domain’s metagraph. The class of Normal text snippets is a
the typically recurrent based components. Seq2seq models           broader case in which one can identify heterogeneous con-
have been successfully utilized for various tasks such as neu-     nectivity patterns among the entity types represented. Yet,
ral machine translation (Bahdanau, Cho, and Bengio 2014)           a sentence typically describes facts that are expected to be
and natural language generation (Pust et al. 2015). Recently,      connected somehow, thus the entity types included in such
their scope has also been extended beyond language process-        texts usually are not more than 1 or 2 hops away from each
ing in fields such as chemical reaction prediction (Schwaller      other in the general metagraph. In the light of the consider-
et al. 2019).                                                      ations above, we propose to sort the relations in a breadth-
   We consider the domain’s relation type extraction task as       first-search (BFS) order starting from a specific node (entity
a specific version of machine translation from the language        type) in the general metagraph. In this way, we confine the
of the corpus to the “relation” language that includes all the     output in a much lower dimensional space by adhering to a
different relations between the entity types of the domain.        semantically meaningful order.
A relation type R which connects the entity type i to j is            Inspired by state-of-the-art approaches in the field of
represented as “i.R.j” in the “relation” language. In the case     neural machine translation, our model architecture is a
of undirected connections, “i.R.j” is the same as “j.R.i” and      multi-layer bidirectional Transformer. We follow the lead
for simplicity we can discard one of them.                         of Vaswani et al. in implementing the architecture, with the
   Seq2seq models have been designed to address tasks              only difference that we adopt a learned positional encod-
where both the input and the output sequences are ordered.         ing instead of a static one (see Appendix for further details
In our case the target “relation” language does not have any       on the positional encoding). As the overall architecture of
defined ordering as per definition the edges of a graph do         the encoder and the decoder are otherwise the same as in
not have any ordering. In theory the order does not mat-           Vaswani et al., we omit an in-depth description of the Trans-
ter, yet in practice unordered sequences will lead to slower       fomer model and refer readers to original paper.
convergence of the model and requirements of more train-              To boost the model’s performance, we also propose an en-
ing data to achieve our goal (Vinyals, Bengio, and Kudlur          semble approach exploiting different Transformers and ag-
2015). To overcome this issue, we propose a specific order-        gregating their results to construct the domain’s metagraph.
ing of the “relation” language influenced from the semantic        Each of the Transformers differs in the selected ordering of
context that the majority of the text snippets hold.               the “relation” vocabulary. The selection of different starting
   According to (Zeng et al. 2018), in the context                 entity type for the breadth-first-search will lead to different
of relation extraction, text snippets can be divided               orderings. We expect that multiple orderings could facilitate
into three types: Normal, EntityPairOverlap and                    the prediction of different connection patterns that can not
SingleEntityOverlap. A text snippet is categorized                 be easily detected using a single ordering. The sequence of
as Normal if none of its triplets have overlapping entities.       steps for an ensemble domain understanding is the follow-
If some of its triplets express a relation on the same pair        ing: Firstly, train k Transformers using different orderings.
of entities, then it belongs to the EntityPairOverlap              Secondly, given a set of text snippets, predict sequences of
category and if some of its triplets have one entity in            relations using all the Transformers. Finally, use late fusion
common but no overlapped pairs, then it belongs to                 to aggregate the results and form the final predictions.
SingleEntityOverlap class. These three categories                     It is worth mentioning that in the last step, we omit the un-
are also relevant in the metagraph case, even if we are work-      derlying ordering that we follow in each model and we per-
ing with entity types and relation types rather than the actual    form a relation-based aggregation. We examine each relation
                 Dataset         # instances     size of “relation” language     mean #relation types per instance
                WebNLG              23794                     70                               2.15
                  NYT               70029                     31                               1.30
                DocRED              30289                    511                               2.67
               PubMed-DU            58761                     15                               1.22

                                                    Table 1: Datasets’s statistics


separately in order to include it or not in the final metagraph.      Tator (Wei et al. 2019). The available entity types are Gene,
For the aggregation step, we use the standard Wisdom of               Mutation, Chemical, Disease and Species. The respective re-
Crowds (WOC) (Marbach et al. 2012) consensus technique,               lation types are in form x.to.y where x and y are two of the
yet other consensus methods can also be leveraged for the             possible entity types. We assume that the relations are sym-
task. The overall structure of our approach is summarized in          metric. For text annotation, the following rule was used: a
Figure 2.                                                             text has the relation x.to.y if two entities with types x and
                                                                      y co-occurred in the text and the syntax path between them
                       Experiments                                    contains at least one keyword of this relation type. These
                                                                      keywords have been manually identified based on the pro-
We evaluate our Transformer-based approach against three              vided instances and are words, mainly verbs, related to the
baselines on a selection of datasets representing different do-       relation. Table 1 depicts the statistics of all four utilised
mains. As baselines, we use CNN and RNN based methods                 datasets.
influenced by (Nguyen and Grishman 2015) and (Zhou et al.
2016) respectively. For the CNN based method, we slightly                For all datasets, we use the same model parameters.
modified the architecture to exclude the component which              Specifically, we use Adam (Kingma and Ba 2014) optimizer
provides information about the position of the entities in the        with a learning rate of 0.0005. The gradients norm is clipped
text snippet, as we do not have such information available in         to 1.0 and dropout (LeCun, Bengio, and Hinton 2015) is set
our task. Additionally, we also include a Transformer-based           to 0.1. Both encoder and decoder consist of 2 layers with
model without applying any ordering in the target sequences           10 attention heads each, the positional feed-forward hidden
as an extra baseline.                                                 dimension is 512. Lastly, we utilize the token embedding
   To our knowledge, there is no standard dataset available           layers using GloVe pretrained word embeddings (Penning-
for the relation type extraction task in the literature. How-         ton, Socher, and Manning 2014) which have dimensional-
ever there is a plethora of published datasets for the stan-          ity of m=300. Our code and the datasets are available at
dard task of relation extraction that can be utilized for our         https://github.com/christofid/DomainUnderstanding.
case with limited effort. For our task, the leveraged datasets
should contain tuples of texts and their respective sets of re-          The evaluation of the models is performed both at instance
lation types. We use WebNLG (Gardent et al. 2017), NYT                and graph level. In the instance level, we examine the ability
(Riedel, Yao, and McCallum 2010) and DocRED (Yao et al.               of the model to predict the relation types that exist in a given
2019), three of the most popular datasets for relation extrac-        text. To investigate this, we use F1-score and accuracy. F1-
tion. Both NYT and DocRED datasets provide the needed                 score is the harmonic mean of model’s precision and recall.
information such as entity types and relation type for the            Accuracy is computed at an instance level and it measures
triplets of each instance. Thus their transformation for our          for how many of the testing texts, the model manage to infer
task can be conducted by just converting these triplets to the        correctly the whole set of their relation types. For the meta-
relation type format, for instance the triplet (x,y,z) will be        graph level evaluation, we use our model to predict the meta-
transformed as type(x).type(y).type(z). On the other hand,            graph of a domain and we examine how close to the actual
WebNLG doesn’t share such information for the entity types            metagraph is. For this comparison, we utilize F1-score for
and thus manual curation is needed. Therefore, all the possi-         both edges and nodes of the metagraph as well as the simi-
ble entities are examined and replaced with the proper entity         larity of the distribution of the degree and eigenvector cen-
type. For the WebNLG dataset, we avoid including rare en-             trality (Zaki and Meira 2014) of the two metagraphs. For the
tity and relation types which are occurred less than 10 times         comparison of the centralities distribution, we construct the
in the dataset. We either omit them or replace them with sim-         histogram of the centralities for each graph using 10 fixed
ilar or more general types that exists in it.                         size bins and we utilize Jensen-Shannon Divergence (JSD)
   To emphasize the application of such model in the sci-             metric (Endres and Schindelin 2003) to examine the similar-
entific document understanding, we produce a new task-                ity of the two distributions (see Appendix for the definition
specific dataset called PubMed-DU related to the general              of JSD). We have selected degree and eigenvector centrali-
health domain. We download paper abstracts from PubMed                ties as the former gives as localized structure information as
focusing on work related to 4 specific health subdomains:             measure the importance of a node based on the direct con-
Covid-19, mental health, breast cancer and coronary heart             nections of it and the latter gives as a broader structure infor-
disease. We split the abstracts into sentences. The entities          mation as measure the importance of a node based on infinite
and their types for each text have been extracted using Pub-          walks.
                    Dataset                     Model                      Accuracy              F1 score
                                  CNN (Nguyen and Grishman 2015)*       0.8156 ± 0.0071      0.9459 ± 0.0021
                                      RNN (Zhou et al. 2016)*           0.8517 ±0.0058       0.9543 ± 0.0021
                   WebNLG              Transformer - unordered          0.8798 ±0.0053       0.9646 ± 0.0018
                                     Transformer - BFSrecord label      0.9000 ± 0.0046      0.9699 ± 0.0013
                                      Transformer - WOC k=20            0.9235 ± 0.0014      0.9780 ± 0.0003
                                  CNN (Nguyen and Grishman 2015)*       0.7341 ± 0.0035      0.8385 ± 0.0025
                                      RNN (Zhou et al. 2016)*           0.7520 ± 0.0027      0.8353 ± 0.0029
                     NYT               Transformer - unordered          0.7426 ± 0.0061      0.8009 ± 0.0057
                                       Transformer - BFSperson          0.7491 ± 0.0048      0.8049 ± 0.0073
                                      Transformer - WOC k=8             0.7669 ± 0.0011      0.8307 ± 0.0006
                                  CNN (Nguyen and Grishman 2015)*       0.1096 ± 0.0073      0.4434 ± 0.0133
                                      RNN (Zhou et al. 2016)*           0.2178± 0.0088       0.6192 ± 0.0093
                   DocRED              Transformer - unordered          0.4869 ±0.0069       0.7081 ±0.0032
                                        Transformer - BFSORG            0.5252 ± 0.0048      0.7133 ± 0.0049
                                      Transformer - WOC k=6             0.5722 ± 0.0001      0.7607 ± 0.0001
                                  CNN (Nguyen and Grishman 2015)*       0.5573 ± 0.0030      0.7063 ± 0.0048
                                      RNN (Zhou et al. 2016)*           0.5772 ± 0.0048      0.7234 ± 0.0032
                 PubMed-DU             Transformer - unordered          0.5499 ± 0.0059      0.6725 ± 0.0056
                                       Transformer - BFSSpecies         0.5691 ± 0.0109      0.6752 ± 0.0045
                                      Transformer - WOC k=5             0.5946 ± 0.0001      0.7132 ± 0.0004

Table 2: Comparison of CNN model, RNN model and Transformer-based methods on WebNLG, NYT, DocRED and PubMed-
DU datasets. *The architecture of the CNN and RNN models has been modified to exclude the component which provides
information about the position of the entities in the text snippet.


Instance level evaluation of the models                           sure that the produced metagraphs are meaningful. Firstly,
To study the performance of our model, we perform 10 in-          each subdomain should have a connected metagraph and
dependent runs each with different random splitting of the        secondly each existing relation type is appeared at least two
datasets into training, validation and testing set. Table 2 de-   times in the provided instances. For the PubMed-DU dataset,
picts the median value and the standard error of the baselines    we already know the existence of 4 subdomains in it, so we
and our method for the two metrics. Our method is better in       focus on the inference of them. For each subdomain, we se-
terms of accuracy for all the four datasets and in terms of F1-   lect randomly 100 instances from the testing set that belong
score for the WebNLG and DocRED datasets. For the NYT             to this subdomain and we attempt to produce the domain
and PubMed-DU datasets, the F1-score of CNN and RNN               based on them. We infer the relation types for each instance
models outperform our approach. We observed that the base-        and then we generate the domain’s metagraph by including
line models profit from the fact that, in these datasets, the     all the relation types that were found in the instances. Then,
majority of the instances depict only one relation and many       we compare how close the actual domain’s metagraph and
of the relations appear in a limited number of instances. In      the predicted metagraph are.
general, there is lack of sequences of relations that hinders        Table 3 presents the results of the evaluation of the pre-
the Transformer’s ability to learn the underlying distribution    dicted versus the actual domain’s metagraph for 10 subdo-
in these two cases(see Appendix). Lastly, the decreased per-      mains extracted from the testing set of the WebNLG, NYT
formances of all the models in the DocRED dataset is due          and DocRED datasets. All the presented values for these
to the long tail characteristic that this dataset shows as 66%    datasets are the mean over all the 10 subdomains. For the
of the relations appeared in no more than 50 instances (see       PubMed-DU dataset, we include only the Covid-19 subdo-
Appendix).                                                        main case. Results for the remaining subdomains of this
                                                                  dataset can be found in the Appendix. Our approach using
Metagraph level evaluation of the models                          Transformer + BFS based ordering outperforms or is close
The above comparisons focus only on the ability of the            to the baselines for all cases in terms of edges and nodes F1-
model to predict the relation types given a text snippet. Since   score. Furthermore, the degree and eigenvector centralities
our ultimate goal is to infer the domain’s metagraph from a       distribution of the generated metagraphs using our method
given corpus, we divide the testing sets of the datasets into     are closer to the groundtruth in comparison to other meth-
small corpora and we attempt to define their domain using         ods in all cases. This indicates that the graphs produced with
our model. For WebNLG, NYT and DocRED dataset, 10 ar-             our method are both element-wise and structurally closer to
tificial corpora and their respective domains have been cre-      the actual ones. More detailed comparisons of the different
ated by selecting randomly 10 instances from each of the          methods at both instance and metagraph level have been in-
testing sets. We set two constraints into this selection to as-   cluded in the Appendix.
                                                                     Edges       Nodes                      Eigenvector
              Dataset                         Model                                         Degree JSD
                                                                    F1-score    F1-score                       JSD
                              CNN (Nguyen and Grishman 2015)*        0.9747      0.9879        0.1836         0.2059
                                  RNN (Zhou et al. 2016)*            0.9639      0.9735        0.2708         0.2364
             WebNLG               Transformer - unordered            0.9598      0.9775        0.2380         0.2280
                                 Transformer - BFSrecord label       0.9806      0.9772        0.1923         0.1593
                                  Transformer - WOC k=5              0.9808      0.9772        0.1765         0.1261
                              CNN (Nguyen and Grishman 2015)*        0.9059      0.9800        0.0564         0.0832
                                  RNN (Zhou et al. 2016)*            0.9205         1             0              0
               NYT                Transformer - unordered            0.8184      0.9800        0.0967         0.1396
                                   Transformer - BFSperson           0.8806      0.9666           0              0
                                  Transformer - WOC k=8              0.8672         1             0              0
                              CNN (Nguyen and Grishman 2015)*        0.4819      0.9019        0.5717         0.6965
                                  RNN (Zhou et al. 2016)*            0.6823      0.9714        0.5187         0.6954
             DocRED               Transformer - unordered            0.7530         1          0.2950         0.5997
                                    Transformer - BFSPER             0.7830     0.9777         0.2892         0.4267
                                  Transformer - WOC k=6              0.8045         1          0.2349         0.3688
                              CNN (Nguyen and Grishman 2015)*        0.9140      0.9888        0.4175         0.4791
                                  RNN (Zhou et al. 2016)*            0.9736      0.9888        0.1002         0.1579
           PubMed-DU
                                  Transformer - unordered            0.9631         1          0.3253         0.3987
         Covid-19 domain
                                  Transformer - BFSChemical          0.9583      0.9777        0.2880         0.4136
                                  Transformer - WOC k=5              0.9789      0.9888        0.2048         0.2299

Table 3: Evaluation of metagraph’s reconstruction on the four datasets using CNN, RNN and Transformer-based models. For
the PubMed-DU dataset, we focus only on the COVID-19 domain. *The architecture of the CNN and RNN models has been
modified to exclude the component which provides information about the position of the entities in the text snippet.


   The ensemble variant of our approach, based on the WOC          we extract 24 text instances from the PubMed-DU dataset
consensus strategy, outperforms the simple Transformer +           related to the COVID-19 domain. After generating the do-
BFS ordering in all cases. Based on the evaluation at both         main’s metagraph, we analyze the attention to triples to build
instance and metagraph level, our ensemble variant seems           a KG. We rely on the syntax dependencies to propagate the
to be the most reliable approach for the task of domain’s          attention weights throughout the connected tokens and we
relation type extraction as it achieves some of the best scores    examine the noun chunks to extract the entities of interest
for any dataset and metric.                                        based on their accumulated attention weight (see Appendix
                                                                   for further details). We select the head which achieves the
Towards automated KG generation                                    best accuracy in order to generate the KG. Figure 3 depicts
                                                                   the generated metagraph and the KG. Using the aforemen-
The proposed domain understanding method enables the in-           tioned attention analysis, we manage to achieve 82% and
ference of the domain of interest and its components. This         64% accuracy in the entity extraction and the relation extrac-
enables a partial automation and a speed up of the KG gen-         tion respectively. These values might not be able to compete
eration process as, without manual intervention, we are able       the state of the art respective models and the investigation
to identify the metagraph, and inherently the needed mod-          is limited in only few instances. Yet it indicates that a com-
els for the entity and relation extraction in the context of the   pletely unsupervised generation based on attention analysis
domain of interest. To achieve this, we adopt a Transformer-       is possible and deserves further investigation.
based approach that relies heavily on attention mechanisms.
Recent efforts are focusing on the analysis of such atten-
tion mechanisms to explain and interpret the predictions
                                                                                           Conclusion
and the quality of the models (Vig and Belinkov 2019;              Herein, we proposed a method to speed up the knowledge
Hoover, Strobelt, and Gehrmann 2019). Interestingly, it has        acquisition process of any domain specific KG application
been shown how the analysis of the attention pattern can elu-      by defining the domain of interest in an automated manner.
cidate complex relations between the entities fed as input to      This is achieved by using a Transformer-based approach to
the Transformer, e.g., mapping atoms in chemical reactions         estimate the metagraph representing the schema of the do-
with no supervision (Schwaller et al. 2020). Even if it is out     main. Such schema can indicate the proper and needed tools
of the scope of our current work, we observe that a similar        for the actual entity and relation extraction. Thus our method
analysis of the attention patterns in our model can identify       can be considering as the stepping stone in any KG gen-
not only parts of text in which relations exist but directly       eration pipeline. The evaluation and the comparison over
the entities of the respective triplets. To illustrate this and    different datasets against state-of-the-art methods indicates
emphasize its application in the domain understanding field,       that our approach produces accurately the metagraph. Espe-
Figure 3: KG extracted from 24 text snippets related to the COVID-19 domain using our model and the respective attention
analysis. Green color means that the respective node/edge exists in both actual and predicted graph, while pink color means that
this element exists in the actual but not in the predicted graph. Entities in bold indicate the path that has been extracted from
the text “ace2 and tmprss2 variants and expression as candidates to sex and country differences in covid-19 severity in italy.”.


cially, in datasets where text instances contain multiple re-       Bert: Pre-training of deep bidirectional transformers for lan-
lation types our model outperforms the baselines. This is an        guage understanding. arXiv preprint arXiv:1810.04805.
important observation as text describing multiple relations is      Endres, D. M., and Schindelin, J. E. 2003. A new metric for
the most common scenario. Based on that and relying on the          probability distributions. IEEE Transactions on Information
capability of the transformers to catch longer dependencies,        theory 49(7):1858–1860.
future investigation of how our model performs in larger
pieces of texts, like full paragraphs, could be interesting and     Fu, T.-J.; Li, P.-H.; and Ma, W.-Y. 2019. Graphrel: Mod-
indicate a clearer advantage of our work. The needed defi-          eling text as relational graphs for joint entity and relation
nition of a general domain for the training phase might be a        extraction. In Proceedings of the 57th Annual Meeting of
limitation of this method. However, schema and data from            the Association for Computational Linguistics, 1409–1418.
already existing KGs can be utilized for training purposes.         Gardent, C.; Shimorina, A.; Narayan, S.; and Perez-
Unsupervised or semi-supervised extension of this work can          Beltrachini, L. 2017. Creating training corpora for nlg
also be explored in the future to mitigate the issue.               micro-planning. In 55th annual meeting of the Association
   Our work paves the way towards an automated knowl-               for Computational Linguistics (ACL).
edge acquisition, as our model minimizes the need of hu-            Gehring, J.; Auli, M.; Grangier, D.; Yarats, D.; and Dauphin,
man intervention in the process. So in the near future the          Y. N. 2017. Convolutional sequence to sequence learning.
currently needed manual curation can be avoided and lead            In Proceedings of the 34th International Conference on Ma-
to faster and more accurate knowledge acquisition. Interest-        chine Learning-Volume 70, 1243–1252. JMLR. org.
ingly, using the PubMed-DU dataset, we underline that our
method can be utilized for scientific documents. The infer-         Guerini, M.; Magnolini, S.; Balaraman, V.; and Magnini, B.
ence of their domain can assist both in their general under-        2018. Toward zero-shot entity recognition in task-oriented
standing but also lead to more robust knowledge acquisition         conversational agents. In Proceedings of the 19th Annual
from them. As a side effect, it is also important to notice that,   SIGdial Meeting on Discourse and Dialogue, 317–326.
such attention-based model can be directly applied to triplet       Hoover, B.; Strobelt, H.; and Gehrmann, S. 2019. exbert:
extraction from the text without retraining and without su-         A visual analysis tool to explore learned representations in
pervision. Triplet extraction in an unsupervised way repre-         transformers models. arXiv preprint arXiv:1910.05276.
sents a breakthrough, especially if combined with most re-
                                                                    Ji, S.; Pan, S.; Cambria, E.; Marttinen, P.; and Yu, P. S. 2020.
cent advances in zero-shot learning for NER (Li et al. 2020;
                                                                    A survey on knowledge graphs: Representation, acquisition
Pasupat and Liang 2014; Guerini et al. 2018). Further anal-
                                                                    and applications. arXiv preprint arXiv:2002.00388.
ysis of our Transformer-based approach could give a better
insight into these capabilities.                                    Jozefowicz, R.; Vinyals, O.; Schuster, M.; Shazeer, N.; and
                                                                    Wu, Y. 2016. Exploring the limits of language modeling.
                        References                                  arXiv preprint arXiv:1602.02410.
Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural ma-              Kingma, D. P., and Ba, J. 2014. Adam: A method for
chine translation by jointly learning to align and translate.       stochastic optimization. arXiv preprint arXiv:1412.6980.
arXiv preprint arXiv:1409.0473.                                     Lalithsena, S.; Kapanipathi, P.; and Sheth, A. 2016. Har-
Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.;          nessing relationships for domain-specific subgraph extrac-
Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning            tion: A recommendation use case. In 2016 IEEE Interna-
phrase representations using rnn encoder-decoder for statis-        tional Conference on Big Data (Big Data), 706–715. IEEE.
tical machine translation. arXiv preprint arXiv:1406.1078.          LeCun, Y.; Bengio, Y.; and Hinton, G. 2015. Deep learning.
Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018.          nature 521(7553):436–444.
Li, J.; Sun, A.; Han, J.; and Li, C. 2020. A survey on deep       Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence
learning for named entity recognition. IEEE Transactions          to sequence learning with neural networks. In Advances in
on Knowledge and Data Engineering.                                neural information processing systems, 3104–3112.
Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy,   Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones,
O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019.            L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. At-
Roberta: A robustly optimized bert pretraining approach.          tention is all you need. In Advances in neural information
arXiv preprint arXiv:1907.11692.                                  processing systems, 5998–6008.
Luan, Y.; He, L.; Ostendorf, M.; and Hajishirzi, H. 2018.         Vig, J., and Belinkov, Y. 2019. Analyzing the structure of
Multi-task identification of entities, relations, and corefer-    attention in a transformer language model. arXiv preprint
ence for scientific knowledge graph construction. arXiv           arXiv:1906.04284.
preprint arXiv:1808.09602.
                                                                  Vinyals, O.; Bengio, S.; and Kudlur, M. 2015. Order
Luong, M.-T.; Pham, H.; and Manning, C. D. 2015. Effec-           matters: Sequence to sequence for sets. arXiv preprint
tive approaches to attention-based neural machine transla-        arXiv:1511.06391.
tion. arXiv preprint arXiv:1508.04025.
                                                                  Wang, Q.; Li, M.; Wang, X.; Parulian, N.; Han, G.; Ma, J.;
Manica, M.; Auer, C.; Weber, V.; Zipoli, F.; Dolfi, M.; Staar,    Tu, J.; Lin, Y.; Zhang, H.; Liu, W.; et al. 2020. Covid-19 lit-
P.; Laino, T.; Bekas, C.; Fujita, A.; Toda, H.; et al. 2019.      erature knowledge graph construction and drug repurposing
An information extraction and knowledge graph platform            report generation. arXiv preprint arXiv:2007.00576.
for accelerating biochemical discoveries. arXiv preprint
arXiv:1907.08400.                                                 Wei, C.-H.; Allot, A.; Leaman, R.; and Lu, Z. 2019. Pubta-
                                                                  tor central: automated concept annotation for biomedical full
Marbach, D.; Costello, J. C.; Küffner, R.; Vega, N. M.; Prill,
                                                                  text articles. Nucleic acids research 47(W1):W587–W593.
R. J.; Camacho, D. M.; Allison, K. R.; Kellis, M.; Collins,
J. J.; and Stolovitzky, G. 2012. Wisdom of crowds for robust      Yao, Y.; Ye, D.; Li, P.; Han, X.; Lin, Y.; Liu, Z.; Liu,
gene network inference. Nature methods 9(8):796–804.              Z.; Huang, L.; Zhou, J.; and Sun, M. 2019. Docred: A
Nadeau, D., and Sekine, S. 2007. A survey of named entity         large-scale document-level relation extraction dataset. arXiv
recognition and classification. Lingvisticae Investigationes      preprint arXiv:1906.06127.
30(1):3–26.                                                       Zaki, M. J., and Meira, W. 2014. Data mining and analysis:
Nguyen, T. H., and Grishman, R. 2015. Relation extrac-            fundamental concepts and algorithms. Cambridge Univer-
tion: Perspective from convolutional neural networks. In          sity Press.
Proceedings of the 1st Workshop on Vector Space Modeling          Zeng, D.; Liu, K.; Lai, S.; Zhou, G.; and Zhao, J. 2014.
for Natural Language Processing, 39–48.                           Relation classification via convolutional deep neural net-
Pasupat, P., and Liang, P. 2014. Zero-shot entity extraction      work. In Proceedings of COLING 2014, the 25th Interna-
from web pages. In Proceedings of the 52nd Annual Meeting         tional Conference on Computational Linguistics: Technical
of the Association for Computational Linguistics (Volume 1:       Papers, 2335–2344. Dublin, Ireland: Dublin City University
Long Papers), 391–401.                                            and Association for Computational Linguistics.
Pennington, J.; Socher, R.; and Manning, C. D. 2014. Glove:       Zeng, X.; Zeng, D.; He, S.; Liu, K.; and Zhao, J. 2018. Ex-
Global vectors for word representation. In Empirical Meth-        tracting relational facts by an end-to-end neural model with
ods in Natural Language Processing (EMNLP), 1532–1543.            copy mechanism. In Proceedings of the 56th Annual Meet-
Pust, M.; Hermjakob, U.; Knight, K.; Marcu, D.; and May, J.       ing of the Association for Computational Linguistics (Vol-
2015. Parsing english into abstract meaning representation        ume 1: Long Papers), 506–514.
using syntax-based machine translation. In Proceedings of         Zhang, Y.; Zhong, V.; Chen, D.; Angeli, G.; and Manning,
the 2015 Conference on Empirical Methods in Natural Lan-          C. D. 2017. Position-aware attention and supervised data
guage Processing, 1143–1154.                                      improve slot filling. In Proceedings of the 2017 Conference
Radford, A.; Narasimhan, K.; Salimans, T.; and Sutskever,         on Empirical Methods in Natural Language Processing, 35–
I. 2018. Improving language understanding by generative           45.
pre-training.                                                     Zhao, Z.; Han, S.-K.; and So, I.-M. 2018. Architecture
Riedel, S.; Yao, L.; and McCallum, A. 2010. Modeling              of knowledge graph construction techniques. International
relations and their mentions without labeled text. In Joint       Journal of Pure and Applied Mathematics 118(19):1869–
European Conference on Machine Learning and Knowledge             1883.
Discovery in Databases, 148–163. Springer.                        Zheng, S.; Wang, F.; Bao, H.; Hao, Y.; Zhou, P.; and Xu, B.
Schwaller, P.; Laino, T.; Gaudin, T.; Bolgar, P.; Hunter,         2017. Joint extraction of entities and relations based on a
C. A.; Bekas, C.; and Lee, A. A. 2019. Molecular trans-           novel tagging scheme. arXiv preprint arXiv:1706.05075.
former: A model for uncertainty-calibrated chemical reac-         Zhou, P.; Shi, W.; Tian, J.; Qi, Z.; Li, B.; Hao, H.; and Xu, B.
tion prediction. ACS Central Science 5(9):1572–1583.              2016. Attention-based bidirectional long short-term mem-
Schwaller, P.; Hoover, B.; Reymond, J.-L.; Strobelt, H.;          ory networks for relation classification. In Proceedings of
and Laino, T. 2020. Unsupervised attention-guided atom-           the 54th Annual Meeting of the Association for Computa-
mapping.                                                          tional Linguistics (Volume 2: Short Papers), 207–212.
                          Appendix                                   weights of a token based on the attention weights of it-
Learned positional encoding                                          self and its neighbors in the syntax dependencies graph. Let
                                                                     ar be the attention vector of a predefined model’s attention
In our model, we adopt a learned positional encoding instead         head, which contains all the attention weights related to the
of a static one. Specifically, the tokens are passed through a       relation type r. The final attention weight w of the token i
standard embedding layer as a first step in the encoder. The         for the relation r is defined as:
model has no recurrent layers and therefore it has no idea
about the order of the tokens within the sequence. To over-
                                                                                                         X
                                                                                        wir = 2 ∗ ari +       arj
come this, we utilize a second embedding layer called a po-
                                                                                                         j∈neigi
sitional embedding layer. This is a standard embedding layer
where the input is not the token itself but the position of the         where neigi is the set containing all the neighbors of i in
token within the sequence, starting with the first token, the        the syntax dependencies graph. Then for each noun chunk k
 (start of sequence) token, in position 0. The position         (nk ) of the text we compute its total attention weight for the
embedding has a ”vocabulary” size equal to the maximum               relation type r as:
length of the input sequence. The token embedding and po-                                        X
sitional embedding are element-wise summed together to get                                 nrk =       f (wjr )
the final token embedding which contains information about                                       j∈nck
both the token and its position within the sequence. This fi-           where nck is the set of tokens which belong to the nk and
nal token embedding is then provided as input in the stack           f is a function defined as
of attention layers of the encoder.                                                       r
                                                                                          wj         if j is stop-word
Dataset characteristics                                                       f (wjr ) =
                                                                                          2 ∗ wjr    if j is not stop-word
For a better understanding of the datasets, we analyzed the
distribution of occurrences for all relation types. These dis-           Finally, we extract as entities which are connected via the
tributions are depicted in Figure 4. A percentage of relation        relation type r the two noun chunks with the highest weight
types with number of appearances close or less to 10 is ob-          nr . As this work is a proof of concept rather than an ac-
served for all datasets. The lack of many examples can pose          tual method, the selection of the attention head is based on
problems in the learning process for these specific relation         whichever gives as the best outcome. Yet, in actual scenarios
types. This is highlighted especially in the DocRED case, as         it is recommended the use of a training set, based on which
we attributed the decreased performances of all the models           the optimal head will be identified. For the creation of the
in this dataset in its long tail characteristic that it holds. Es-   syntax dependencies graph and the extraction of the noun
pecially for the DocRED, almost the 50% of the relations             chunks of the text we use spacy1 and its en core web lg
appeared in no more than 10 instances and the 66% of the             pretrained model. Table 4 includes all the texts that have
relations appeared in no more than 50 instances (4).                 used in the proof of concept that is presented in the main
                                                                     paper and the respective predicted triplets for each of them.
Jensen-Shannon distance
                                                                     Models’s comparison
The Jensen-Shannon divergence metric between two proba-
bility vectors p and q is defined as:                                Table 5 presents a detailed evaluation of our approach and
                                                                     the baselines models. We have included the 3 best BFS or-
                                                                     dering variants (in terms of accuracy) and 3 consensus vari-
                  r
                    D(p k m) + D(q k m)
                                                                     ants. To cover the range of all the available values of k, [1,
                                 2                                   number of entity types], we select a case with just a few
   where m is the pointwise mean of p and q and D is the             Transformers, one with a value close to half of the total
Kullback-Leibler divergence.                                         number of entity types and one close to the total number
   The Kullback-Leibler divergence for two probability vec-          of entity types. For each different k value, we utilize the top
tors p and q of length n is defined as:                              k best orderings based on their accuracy. In addition to the
                               n                                     per instance accuracy and the per relation F1-score, the table
                               X                 pi                  also includes the per relation precision and the recall of each
                 D(p k q) =          pi log2 (      )
                               i=1
                                                 qi                  model. Our proposed method, especially its ensemble vari-
                                                                     ant, produces the best outcome in all datasets apart from the
  The Jensen–Shannon metric is bounded by 1, given that
                                                                     NYT case where the CNN and RNN models manage to be
we use the base 2 logarithm.
                                                                     more precise. This is attributed to the characteristics of NYT
Triplets extraction based on attention analysis                      dataset, where there are many one-only relation instances.
                                                                        Similarly, in tables 6 and 7 we perform an in-depth meta-
In this section, the procedure of automated triplets extrac-         graph level evaluation of the models. For all three datasets,
tion based on the predicted relation types and the respective        our proposed method and its ensemble extension produce
attention weights is described . We generate an undirected           the best or one of the top-3 best outcomes.
graph that connects the tokens of the sentence based on their
syntax dependencies for each instance. Then for each dif-
                                                                        1
ferent predicted relation type, we define the final attention               https://spacy.io/
(a) WebNLG                                                                   (b) NYT




(c) DocRED                                                               (d) PubMed-DU

  Figure 4: Appearances distribution for the relation types of all utilized datasets.
Text                                                                                             Predicted triplets
acute deep vein thrombosis in covid-19 hospitalized patients.                                    (thrombosis, Disease.to.Disease, COVID 19)
(ace-2) receptor for its attachment similar to sars-cov-1, which
is followed by priming of spike protein by transmembrane protease serine 2                       (spike, Gene.to.Gene, Transmembrane protease serine 2)
(tmprss2) which can be targeted by a proven inhibitor of tmprss2, camostat.
temporal trends in decompensated heart failure and outcomes during covid-19:
                                                                                                 (heart failure, Disease.to.Disease, COVID-19)
a multisite report from heart failure referral centres in london.
prevalence, risk factors and clinical correlates of depression in quarantined
                                                                                                 (depression, Disease.to.Disease, COVID-19)
population during the covid-19 outbreak.
tocilizumab plus glucocorticoids in severe and critically covid-19 patients.                     (Tocilizumab, Chemical.to.Species, patients)
effects of progressive muscle relaxation on anxiety and sleep quality in
                                                                                                 (anxiety, Disease.to.Disease, COVID-19)
patients with covid-19.
special section: covid-19 among people living with hiv.                                          (people, Disease.to.Species, HIV)
ace2 and tmprss2 variants and expression as candidates to sex and country                        (ACE2, Gene.to.Gene, TMPRSS2)
differences in covid-19 severity in italy.                                                       (COVID-19, Disease.to.Gene, TMPRSS2)
did the covid-19 pandemic cause a delay in the diagnosis of acute appendicitis?                  (COVID-19, Disease.to.Disease, Acute Appendicitis)
practice observed in managing gynaecological problems in post-menopausal
                                                                                                 (COVID-19, Disease.to.Species, women)
women during the covid-19 pandemic.
hypogammaglobulinemia causing pneumonia: an overlooked curable entity
                                                                                                 (hypogammaglobulinemia, Disease.to.Disease, COVID-19)
in the chaotic covid-19 pandemic.
chemokine receptor gene polymorphisms and covid-19: could knowledge
                                                                                                 (AIDS, Disease.to.Gene, Chemokine receptor)
gained from hiv/aids be important?
serotonin syndrome in two covid-19 patients treated with lopinavir/ritonavir.                    (lopinavir/ritonavir, Chemical.to.Species, patients)
mortality rate and predictors of mortality in hospitalized covid-19
                                                                                                 (COVID-19, Disease.to.Disease, Diabetes)
patients with diabetes.
in conclusion, self-reported depression occurred at an early stage in convalescent
covid-19 patients, and changes in immune function were apparent during                           (depression, Disease.to.Disease, COVID-19)
short-term follow-up of these patients after discharge.
gb-2 inhibits ace2 and tmprss2 expression: in vivo and in vitro studies.                         (ACE2, Gene.to.Gene, TMPRSS2)
preemptive interleukin-6 blockade in patients with covid-19.                                     (COVID-19, Disease.to.Gene, interleukin-6)
response to: ’clinical course of covid-19 in patients with systemic lupus
                                                                                                 (hydroxychloroquine, Chemical.to.Species, patients)
erythematosus under long-term treatment with hydroxychloroquine’ by carbillon et al.
obsessive-compulsive disorder during the covid-19 pandemic                                       (Obsessive-compulsive disorder, Disease.to.Disease, COVID-19)
repeated monitoring of ferritin, interleukin-6, c-reactive protein, lactic acid dehydrogenase,
and erythrocyte sedimentation rate during covid-19 treatment may assist the prediction of        (COVID-19, Disease.to.Gene, interleukin-6)
disease severity and evaluation of treatment effects.
respiratory and pulmonary complications in head and neck cancer patients: evidence-based
                                                                                                 (head and neck cancer, Disease.to.Disease, COVID-19)
review for the covid-19 era.
targeting the immune system for pulmonary inflammation and cardiovascular complications
                                                                                                 (Cardiovascular Complications, Disease.to.Disease, COVID-19)
in covid-19 patients.
risk of peripheral arterial thrombosis in covid-19.                                              (Thrombosis, Disease.to.Disease, COVID-19)
preadmission diabetes-specific risk factors for mortality in hospitalized patients with
                                                                                                 (Diabetes, Disease.to.Disease, COVID-19)
diabetes and coronavirus disease 2019.

     Table 4: Utilized texts from PubMed-DU dataset and their respective extracted triplets based on attention analysis.
    Dataset                   Model                    Accuracy          Precision          Recall           F1 score
                CNN (Nguyen and Grishman 2015)*    0.8156 ± 0.0071   0.9550 ± 0.0029   0.9370 ± 0.0049   0.9459 ± 0.0021
                    RNN (Zhou et al. 2016)*        0.8517 ±0.0058    0.9614 ±0.0022     0.9472±0.0043    0.9543 ±0.0021
                     Transformer - unordered       0.8798 ±0.0053    0.9678 ±0.0032    0.9614 ±0.0042    0.9646 ±0.0018
                   Transformer - BFSoccupation      0.8987±0.0068     0.9693±0.0030    0.9705 ± 0.0035   0.9699 ± 0.0018
   WebNLG          Transformer - BFSmusic genre    0.8983 ± 0.0053   0.9694 ±0.0039    0.9703 ±0.0030    0.9699 ± 0.0015
                   Transformer - BFSrecord label   0.9000 ± 0.0046   0.9707 ± 0.0031   0.9691 ±0.0028    0.9699 ±0.0013
                    Transformer - WOC k=5          0.9210 ± 0.0017   0.9789 ±0.0016    0.9755 ± 0.0013   0.9772 ± 0.0004
                    Transformer - WOC k=20         0.9235 ± 0.0014   0.9786 ±0.0006    0.9774 ± 0.0004   0.9780 ±0.0003
                    Transformer - WOC k=45         0.9235 ± 0.0002   0.9795 ±0.0001    0.9767 ±0.0001    0.9781 ±0.0001
                CNN (Nguyen and Grishman 2015)*    0.7341 ± 0.0035   0.8873 ± 0.0032   0.7948 ± 0.0053   0.8385 ± 0.0025
                    RNN (Zhou et al. 2016)*        0.7520 ± 0.0027   0.8530 ±0.0041    0.8183 ±0.0047    0.8353 ±0.0029
                     Transformer - unordered       0.7426 ± 0.0061   0.8059 ±0.0079    0.7960 ± 0.0070   0.8009 ±0.0057
                    Transformer - BFSlocation      0.7461 ± 0.0053   0.8093 ±0.0067    0.7988 ±0.0067    0.8040 ±0.0043
     NYT             Transformer - BFSperson       0.7491 ±0.0048    0.8119 ± 0.0036   0.8022 ±0.0032    0.8049 ± 0.0073
                    Transformer - BFScompany       0.7461 ±0.0081    0.8056 ±0.0088    0.8043 ±0.0117    0.8049 ±0.0073
                    Transformer - WOC k=4          0.7547 ±0.0038    0.8162 ± 0.0047   0.8355 ± 0.0022   0.8257 ±0.0023
                    Transformer - WOC k=8          0.7669 ±0.0011    0.8325 ± 0.0025   0.8289 ±0.0018    0.8307 ± 0.0006
                    Transformer - WOC k=12         0.7698 ± 0.0007   0.8381 ±0.0017    0.8296 ±0.0009    0.8320 ±0.0004
                CNN (Nguyen and Grishman 2015)*    0.1096 ± 0.0073   0.7838 ± 0.0081   0.3094 ± 0.0140   0.4434 ± 0.0133
                    RNN (Zhou et al. 2016)*        0.2178± 0.0088    0.7716 ±0.0100    0.5173 ±0.0143    0.6192 ± 0.0093
                     Transformer - unordered       0.4869 ±0.0069    0.7365 ±0.0156    0.6822 ±0.0127    0.7081 ±0.0032
                      Transformer - BFSLOC         0.5235 ± 0.0049   0.7145 ± 0.0112   0.7053 ±0.0076    0.7098 ± 0.0040
   DocRED             Transformer - BFSPER         0.5234 ± 0.0077   0.7234 ± 0.0179   0.7029 ± 0.0091   0.7128 ±0.0060
                      Transformer - BFSORG         0.5252 ± 0.0048   0.7216 ± 0.0068   0.7053 ± 0.0069   0.7133 ± 0.0049
                     Transfomer - WOC k=4          0.5678 ±0.0037    0.7939 ±0.0137    0.7227 ±0.0103    0.7564 ± 0.0032
                    Transformer - WOC k=5          0.5697 ± 0.0016   0.8012 ±0.0112    0.7210 ±0.0053    0.7589 ±0.0025
                    Transformer - WOC k=6          0.5722 ± 0.0001   0.7970 ± 0.0035   0.7276 ± 0.0022   0.7607 ± 0.0001
                CNN (Nguyen and Grishman 2015)*    0.5573 ± 0.0030   0.7856 ± 0.0063   0.6417 ± 0.0100   0.7063 ± 0.0048
                    RNN (Zhou et al. 2016)*        0.5772 ± 0.0041   0.7401 ± 0.0051   0.7075 ± 0.0051   0.7234 ± 0.0032
                     Transformer - unordered       0.5693 ± 0.0075   0.6920 ± 0.0099   0.6837 ± 0.0111   0.6877 ± 0.0047
                     Transformer - BFSSpecies      0.5707 ± 0.0060   0.6878 ± 0.0084   0.6921 ± 0.0097   0.6898 ± 0.0040
 PubMed-DU          Transformer - BFSChemical      0.5703 ± 0.0048   0.6864 ± 0.0056   0.6942 ± 0.0057   0.6903 ± 0.0048
                      Transformer - BFSGene        0.5718 ± 0.0068   0.6872 ± 0.0058   0.6906 ± 0.0110   0.6889 ± 0.0065
                    Transformer - WOC k=3          0.5909 ± 0.0070   0.7261 ± 0.0162   0.6958 ± 0.0031   0.7106 ± 0.0094
                    Transformer - WOC k=4          0.5685 ± 0.0069   0.6864 ± 0.0109   0.7418 ± 0.0033   0.7130 ± 0.0074
                    Transformer - WOC k=5          0.5960 ± 0.0001   0.7346 ± 0.0028   0.6974 ± 0.0025   0.7155 ± 0.0004

Table 5: Comparison of CNN, RNN and Transformer-based methods on WebNLG, NYT, DocRED and PubMed-DU datasets for
the relation type extractiont task. *The architecture of the CNN and RNN models has been modified to exclude the component
which provides information about the position of the entities in the text snippet.
                                                            Edges       Nodes                       Eigenvector
  Dataset                       Model                                               Degree JSD
                                                           F1-score    F1-score                         JSD
               CNN (Nguyen and Grishman 2015)*              0.9747      0.9879         0.1836         0.2059
                   RNN (Zhou et al. 2016)*                  0.9639      0.9735         0.2708         0.2364
                    Transformer - unordered                 0.9598      0.9775         0.2380         0.2280
                  Transformer - BFSoccupation               0.9831      0.9805         0.1743         0.1840
  WebNLG          Transformer - BFSmusic genre              0.9755      0.9746         0.1131         0.1306
                  Transformer - BFSrecord label             0.9806      0.9772         0.1923         0.1593
                   Transformer - WOC k=5                    0.9808      0.9772         0.1765         0.1261
                   Transformer - WOC k=20                   0.9864      0.9840         0.1456          0.070
                   Transformer - WOC k=45                   0.9930      0.9916         0.1313         0.0891
               CNN (Nguyen and Grishman 2015)*              0.9059      0.9800         0.0564         0.0832
                   RNN (Zhou et al. 2016)*                  0.9205         1              0              0
                    Transformer - unordered                 0.8184      0.9800         0.0967         0.1396
                   Transformer - BFSlocation                0.8141      0.9800         0.0832         0.0832
    NYT             Transformer - BFSperson                 0.8806      0.9666            0              0
                   Transformer - BFScompany                 0.8305      0.9657            0              0
                   Transformer - WOC k=4                    0.8442         1              0              0
                   Transformer - WOC k=8                    0.8672         1              0              0
                   Transformer - WOC k=12                   0.8666         1              0              0
               CNN (Nguyen and Grishman 2015)*              0.4819      0.9019         0.5717         0.6965
                   RNN (Zhou et al. 2016)*                  0.6823      0.9714         0.5187         0.6954
                    Transformer - unordered                 0.7530         1           0.2950         0.5997
                     Transformer - BFSLOC                   0.7710         1           0.2744         0.5140
  DocRED             Transformer - BFSPER                   0.7830      0.9777         0.2892         0.4267
                     Transformer - BFSORG                   0.7243      0.9777         0.2892         0.4267
                   Transformer - WOC k=4                    0.7997         1           0.2714         0.4787
                   Transformer - WOC k=5                    0.8090         1           0.2673         0.4433
                   Transformer - WOC k=6                    0.8045         1           0.2349         0.3688
Table 6: Evaluation of metagraph’s reconstruction on WebNLG, NYT and DocRED datasets using CNN, RNN and Transformer-
based models. *The architecture of the CNN and RNN models has been modified to exclude the component which provides
information about the position of the entities in the text snippet.
                                                                  Edges       Nodes                   Eigenvector
         Domain                           Model                                         Degree JSD
                                                                 F1-score    F1-score                    JSD
                            CNN (Nguyen and Grishman 2015)*       0.9140      0.9888      0.4175        0.4791
                                RNN (Zhou et al. 2016)*           0.9736      0.9888      0.1002        0.1579
                                Transformer - unordered           0.9631         1        0.3253        0.3987
                                 Transformer - BFSSpecies         0.9583      0.9777      0.2880        0.4136
        Covid-19                Transformer - BFSChemical         0.9583      0.9777      0.2880        0.4136
                                  Transformer - BFSGene           0.9525      0.9777      0.3344        0.2919
                                Transformer - WOC k=3             0.9531      0.9777      0.4271        0.4096
                                Transformer - WOC k=4             0.9642      0.9777      0.4271        0.4096
                                Transformer - WOC k=5             0.9789      0.9888      0.2048        0.2299
                            CNN (Nguyen and Grishman 2015)*       0.9135         1        0.3575        0.3987
                                RNN (Zhou et al. 2016)*           0.9730         1        0.1962        0.2594
                                Transformer - unordered           0.9367      0.9777      0.4534        0.4879
                                 Transformer - BFSSpecies         0.9531      0.9777      0.4453        0.4879
      Breast cancer             Transformer - BFSChemical         0.9531      0.9777      0.4453        0.4879
                                  Transformer - BFSGene           0.9379      0.9666      0.5385        0.5343
                                Transformer - WOC k=3             0.9525      0.9777      0.4089        0.4068
                                Transformer - WOC k=4             0.9584      0.9777      0.3999        0.4414
                                Transformer - WOC k=5             0.9514      0.9888      0.3821        0.3987
                            CNN (Nguyen and Grishman 2015)*       0.8902      0.9777      0.4349        0.5865
                                RNN (Zhou et al. 2016)*           0.9498      0.9666      0.3661        0.3391
                                Transformer - unordered           0.9484      0.9777      0.4025        0.5153
                                 Transformer - BFSSpecies         0.9703      0.9777      0.2586        0.2558
  Coronary heart diseases       Transformer - BFSChemical         0.9703      0.9777      0.2586        0.2558
                                  Transformer - BFSGene           0.9756      0.9777      0.1659        0.1571
                                Transformer - WOC k=3             0.9644      0.9777      0.2705        0.3023
                                Transformer - WOC k=4             0.9703      0.9777      0.2705        0.3023
                                Transformer - WOC k=5             0.9644      0.9777      0.2705        0.3023
                            CNN (Nguyen and Grishman 2015)*       0.9419         1        0.2477        0.4634
                                RNN (Zhou et al. 2016)*           0.9924         1        0.067         0.0663
                                Transformer - unordered           0.9697         1        0.2028        0.2601
                                 Transformer - BFSSpecies         0.9765         1        0.1503        0.3568
      Mental health             Transformer - BFSChemical         0.9765         1        0.1503        0.3568
                                  Transformer - BFSGene           0.9849         1        0.1333        0.1852
                                Transformer - WOC k=3             0.9765         1        0.1503        0.3042
                                Transformer - WOC k=4             0.9765         1        0.1503        0.3568
                                Transformer - WOC k=5             0.9840         1        0.0840        0.2378

Table 7: Evaluation of metagraph’s reconstruction on the 4 predefined subdomains of PubMed-DU dataset using CNN, RNN
and Transformer-based models. *The architecture of the CNN and RNN models has been modified to exclude the component
which provides information about the position of the entities in the text snippet.