Long Tailed Entity Extraction of Model Names using Distant Supervision

Swayatta Daw, Vikram Pudi
Data Sciences and Analytics Center, IIIT Hyderabad, India

Abstract
We introduce the task of long-tailed detection of model entities from scientific documents. We use distant supervision with an external Knowledge Base (KB) to generate synthetic training data, and a simple entity replacement technique that significantly improves performance by addressing the overfitting problem that supervised NER baselines face on small datasets. We introduce strong baselines for this task, evaluated on our annotated gold-standard dataset, and we also release the distantly supervised silver labels generated using the KB. We present this model as the starting point of an end-to-end automated framework that extracts relevant model names from research documents and links them with their respective cited papers. We believe this task will serve as an important starting point to map the research landscape in a scalable manner, with minimal human intervention.

Keywords
Long-Tailed Entity, Entity Extraction, NER, Information Extraction, Scientific Literature

1. Introduction

Long-tailed entities are named entities that rarely occur in text documents. For such entities, the task of Named Entity Recognition (NER) is non-trivial. Recent approaches address NER with supervised deep learning models. However, supervised learning requires a large amount of token-level labelled data, and annotating a large number of tokens is time-consuming, expensive and laborious. For real-life applications, the lack of labelled data has become a bottleneck to adopting deep learning models for NER. Most scientific named entities can be classified as long-tailed entities because of the rarity and domain-specificity of their occurrence.
Recent work on NER in scientific documents has concentrated on detecting biomedical named entities [1] or scientific entities such as tasks, methods and datasets [2, 3, 4]. Some papers, like [5], focus on detecting a single specific entity type (such as dataset names) from scientific documents. Although previous work has identified methods [2, 3] as named entities, what constitutes a method varies significantly across human annotators: the authors of [2] report a Kappa score of 76.9% for inter-annotator agreement on the SciERC dataset, which is widely used as a benchmark for scientific entity extraction.

BIR 2022: 12th International Workshop on Bibliometric-enhanced Information Retrieval at ECIR 2022, April 10, 2022, hybrid.
swayatta.daw@research.iiit.ac.in (S. Daw); vikram@iiit.ac.in (V. Pudi)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

NER has traditionally been treated as a sequence labelling problem, using CRFs [6] and HMMs [7]. Recent approaches use deep learning models [8], which require a large amount of labelled data to train. The high cost of labelling remains the main challenge for rare, long-tailed entity types, where labelled data is scarce. To address the label scarcity problem, methods such as Active Learning [9], Distant Supervision [10, 11, 12] and Reinforcement Learning-based Distant Supervision [13, 14] have been proposed. [5] focused on detecting dataset mentions in scientific text and used data augmentation to overcome label scarcity.
In this paper, we introduce the task of detecting model entity names in scientific documents. We leverage an external Knowledge Base and a large-scale unlabelled corpus for our distantly supervised approach, using a simple entity replacement technique to prevent overfitting. Papers with Code (PwC) is a community-driven corpus that lists models solving particular subtasks, with links to the research paper that introduced each model. Our aim is to build a similar but automated end-to-end pipeline that detects model names in scientific papers and benchmarks them against other models that solve the same task. We believe the task introduced in this paper (extraction of model names from scientific documents) is a significant step towards that pipeline. The task is non-trivial mainly due to the lack of token-level, high-quality labelled data required for training deep learning models, and the shortage of human-annotated gold-standard data for evaluation. To address these bottlenecks, we present a simple yet effective technique that leverages an external Knowledge Base and a large unlabelled corpus (both cheap and easy to obtain) to generate our training dataset. We believe this technique can easily be extended to other domains, given a domain-specific Knowledge Base and unlabelled text corpus. Using this training set, we establish a strong baseline for the task with a standard BERT-CRF model. To evaluate performance on this task, we present a high-quality, human-annotated gold-standard evaluation dataset. Using our trained models, we create an automated framework for detecting model names of related work in research papers.
We define related work as prior research by the scientific community on the same or a similar task as the one investigated by the original paper. Our pipeline contains two steps. First, we build a sentence intent classifier that decides whether a citation sentence contains information about related work. Then we extract model names from the positively labelled sentences using our trained NER model and link them to their respective citation mentions using a string-distance-based technique introduced by [15]. We believe this framework is a starting point for effectively mapping the entire research landscape in a scalable manner.

2. Annotation

To create the full set of gold labels for our evaluation test set, we randomly sample abstracts from a large set of arxiv research papers (https://www.kaggle.com/Cornell-University/arxiv). We also add randomly sampled papers from the DBLP citation dataset to increase the diversity of train and test-set selection. (The PwC data used as our Knowledge Base is available at https://github.com/paperswithcode/paperswithcode-data.)

Figure 1: A few example sentences with annotated model named entities highlighted in blue (entities in bold here):
- Our labeling model was built upon **SciBERT** (Beltagy et al., 2019), a pre-trained language model based on **BERT** (Devlin et al., 2019) but trained on a large corpus of scientific text.
- There are two models in the transformers, which can handle multilingual posts: **multilingual-BERT** [19] and **XLM-Roberta** [57].
- For example, **KG-BART** encoded the graph structure of KGs with knowledge embedding algorithms like **TransE** (Bordes et al., 2013), and then took the informative entity embeddings as auxiliary input (Liu et al., 2021).

We only consider strict span matching during detection. Considering our end goal of a high-precision automated framework for extracting related model names, and to minimise ambiguity, we consider only named models as model entities for this task.
A few examples are: BERT+BiLSTM+CRF, KG-BERT (with overlap), LSTM + Attention, DeepWalk. We consider both single named entities and combinations of multiple model named entities for annotation. A few example sentences with model entities are displayed in Figure 1. We consider strict span matches for model entity names and do not accept partial matches or synonyms, though we do accept plural variants of entity names as matches. To minimise ambiguity, we consider only model named entities that we can verify in Google Scholar and Semantic Scholar: we identify a candidate model name and review the existing Computer Science literature to verify whether it is used as a model name. A simple criterion we use is whether the model (or a variant of it) has been mentioned in a results table and compared with baselines or other related models in previous literature. Only after this thorough review do we annotate a named entity as a model name, and we discard any sentence containing a model named entity that does not meet this criterion. Hence, we believe the ambiguity is reduced sufficiently to allow a single annotator for the entire annotation process. All annotations were done by a graduate NLP researcher who is a co-author of this paper. The overall statistics of the training and test sets are provided in Table 1.

Table 1: Statistics of the train set and the annotated test set

        Sentences  Tokens   Entities  Unique    Avg # tokens  Avg # entities
                                      Entities  per sentence  per sentence
Train   7800       232600   19012     14748     29.82         2.44
Test    1000       22873    3647      1249      22.87         3.65
Total   8800       255473   22659     15672     29.03         2.57

3. Training Set Creation with Entity Replacement

For the unlabelled corpus, we use the arxiv dataset containing ~227,000 abstracts from various domains of Computer Science.
We use the Papers with Code (PwC) corpus as a reference Knowledge Base to obtain a total of 14,748 model entity names. Using this list of named entities, we create a set of distant silver labels by extracting from the arxiv dataset the sentences that contain one of these entity mentions. We require an exact match, while also allowing plural forms of the entity words. We obtain a total of 7800 sentences containing a model named entity mention. Figure 2 plots each model entity against its frequency of occurrence in the corpus of obtained sentences. The distribution is long-tailed, consistent with our hypothesis about scientific named entities in the Introduction: a few predominant, popular models occur most frequently, and the distribution then tapers into a long tail in which most entities occur far less often. This can be attributed to the widespread use of certain models (like CNN) in existing Computer Science research. However, such a skewed distribution is unsuitable for training supervised custom NER models: the models tend to memorise and overfit to particular named entities. Hence, we use a simple entity replacement technique to deal with this bottleneck. Specifically, we detect the span of the model named entity in each sentence, then replace it with another entity drawn from the full set of model entities obtained from the Knowledge Base. We cap the number of occurrences of each entity at 2 to maintain uniformity. After this process is completed, the training sentences contain a uniformly distributed set of entities.
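The replacement step above can be sketched in a few lines. This is a minimal illustration, not the paper's code: it assumes each silver-labelled sentence stores the character span of its matched entity, and the `counts` bookkeeping and function names are ours.

```python
import random

def replace_entity(sentence, span, kb_entities, counts, cap=2):
    """Swap the matched entity span for a KB entity used fewer than `cap` times."""
    start, end = span
    # Only pick replacements whose usage count is still below the cap,
    # so no single entity dominates the transformed training set.
    candidates = [e for e in kb_entities if counts.get(e, 0) < cap]
    new_entity = random.choice(candidates)
    counts[new_entity] = counts.get(new_entity, 0) + 1
    new_sentence = sentence[:start] + new_entity + sentence[end:]
    return new_sentence, (start, start + len(new_entity))

# Example: a silver-labelled sentence whose entity span points at "BERT".
sent = "We fine-tune BERT on the benchmark."
span = (13, 17)  # character offsets of "BERT"
kb = ["DeepWalk", "KG-BERT", "LSTM + Attention"]
counts = {}
new_sent, new_span = replace_entity(sent, span, kb, counts)
assert new_sent[new_span[0]:new_span[1]] in kb
```

The returned span keeps the token-level labels aligned with the swapped-in entity, so the BIO tags can be regenerated for the new surface form.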
Figure 2: Distribution of entity occurrence frequency in the training dataset, pre-replacement.

Table 2: Results on the evaluation dataset

Model                             Precision  Recall  F1
BiLSTM + CRF (w/o replacement)    0.205      0.519   0.294
BERT + CRF (w/o replacement)      0.389      0.310   0.345
SciBERT + CRF (w/o replacement)   0.391      0.312   0.346
BERT + CRF (with replacement)     0.575      0.563   0.569
BiLSTM + CRF (with replacement)   0.628      0.631   0.629
SciBERT + CRF (with replacement)  0.641      0.632   0.636

4. Distantly Supervised NER Model

We aim to classify each token into its candidate label among the BIO tags. We use pre-trained BERT-based contextualised embeddings to capture distributed representations of the token sequence. We aim to detect each entire entity span and classify it into a specific entity type. We formulate this as a sequence labelling task, classifying the sequence of tokens into a sequence of labels, and use the full set of training sentences as distantly labelled training data. We experiment with multiple standard sequence labelling baselines:

• BiLSTM + CRF: the BiLSTM captures contextual representations and encodes them into bidirectional hidden states. The CRF layer models the dependencies among the tags by scoring the probability distribution over entire label sequences.
• SciBERT + CRF: pre-trained SciBERT [16] embeddings, trained on a large scientific corpus, passed to a CRF layer that models the sequence-level label distribution.
• BERT + CRF: a pre-trained BERT model with a CRF layer to model sequence-level dependencies.

We evaluate the models on the gold test labels and find that SciBERT in combination with a CRF provides the best performance. We also find that the entity replacement technique is particularly effective when dealing with long-tailed entity distributions.
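The role of the CRF layer in the baselines above (scoring whole label sequences rather than tokens independently) can be illustrated with a minimal Viterbi decoder over the BIO tags. This is a generic sketch, not the paper's implementation; the emission and transition scores are made-up numbers.

```python
# Minimal Viterbi decoding over BIO tags: emission scores come from the
# encoder (BiLSTM/BERT), transition scores are the CRF parameters.
TAGS = ["O", "B-Model", "I-Model"]

def viterbi(emissions, transitions):
    """emissions: list of {tag: score}; transitions: {(prev, cur): score}."""
    n = len(emissions)
    best = [{t: emissions[0][t] for t in TAGS}]
    back = []
    for i in range(1, n):
        scores, ptr = {}, {}
        for cur in TAGS:
            prev = max(TAGS, key=lambda p: best[-1][p] + transitions[(p, cur)])
            scores[cur] = best[-1][prev] + transitions[(prev, cur)] + emissions[i][cur]
            ptr[cur] = prev
        best.append(scores)
        back.append(ptr)
    # Backtrack from the highest-scoring final tag.
    path = [max(TAGS, key=lambda t: best[-1][t])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

# A hard transition penalty forbids the invalid O -> I-Model step.
trans = {(p, c): 0.0 for p in TAGS for c in TAGS}
trans[("O", "I-Model")] = -1000.0
emis = [{"O": 0.1, "B-Model": 2.0, "I-Model": 0.0},
        {"O": 0.2, "B-Model": 0.0, "I-Model": 1.5},
        {"O": 1.8, "B-Model": 0.0, "I-Model": 0.1}]
print(viterbi(emis, trans))  # -> ['B-Model', 'I-Model', 'O']
```

The transition table is what lets the model reject label sequences that are locally plausible but globally invalid, which is exactly the benefit the CRF layer adds on top of the per-token encoder scores.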
We find that this simple technique offers a significant performance boost across all models: it counters the overfitting bottleneck and successfully prevents memorisation of named entities. We rely only on distant labels to obtain strong performance on the gold labels. Our best performing model is illustrated in Figure 3.

Figure 3: Our SciBERT-CRF model for sequence tagging. SciBERT embeddings produce per-token emission scores over the B-Model/I-Model/O tags, which a CRF layer decodes into the output tag sequence (example input: "We present SDP LSTM a novel network").

5. Sentence Intent Classifier

We train a classifier to detect whether a sentence contains relevant information about models that solve a task similar to the one addressed by the target research paper. For a target scientific document, we define a relevant model name as a model cited by the author that solves a task similar or relevant to the one the target paper is solving. To create an automatically labelled dataset, we iterate over all sentences in the research corpora. If a sentence contains the words 'Related Work', 'Previous Work' or 'Baseline', we take the 15 sentences occurring after it and assign positive labels to those containing model entity mentions, by referring to our KB. We consider the maximal span for entity matching between the unlabelled text and the KB. For negative samples, we randomly sample sentences that contain neither the above words nor any model entity mention. We keep an equal distribution of positive and negative labels. An example of a positive and a negative label is shown in Figure 4.

5.1. Training the classifier

The most commonly used approaches of averaging BERT embeddings or using the output of the first token (the [CLS] token) yield subpar sentence representations [17].
Hence, we choose Sentence-BERT [17], a modification of the pretrained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using the dot product. It takes a sentence as input and returns the corresponding sentence-level representation as output.

Figure 4: Sentence intent examples (positive labelled green, negative labelled red in the original):
- Positive: The authors have introduced a probabilistic framework based on Hidden Markov Random Fields (HMRFs) for semi-supervised clustering that combines the constraint-based and distance based approaches in a unified framework.
- Negative: All processing units perform the same computation, specified by equation (1), and are locally connected to their three neighbours.

We use Sentence-BERT to encode the sentences and train a Logistic Regression binary classifier on 15,518 labelled sentences containing both citation and non-citation sentences, with positive and negative samples equally distributed. The dataset size is kept small to avoid compromising label quality. With a 75-25 train-test split, the test-set accuracy (again over both citation and non-citation sentences) is 86.41%.

6. Entity Citation Linker

For entity-citation linking, we iterate over all combinations of extracted entities and citations and compute a closeness score: the string distance between an entity and the citation occurrence. We first take all citations and keep the closest entity per citation; then we take all entities and keep the closest citation per entity. This linking process accurately links most extracted entities with their closest citations, as demonstrated by [15].

7. Pipeline Formation

We show two end-to-end pipelines in this paper. First, we show the entire training process for both model entity extraction and sentence intent classification.
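Returning briefly to the Entity Citation Linker of Section 6, its two-pass closest-match heuristic can be sketched as follows. This is a minimal illustration using character offsets as the string distance; the function and variable names are ours, not from the paper's code.

```python
def link_entities_to_citations(entities, citations):
    """entities/citations: lists of (text, char_offset) within one sentence.
    Returns {entity_text: citation_text} links via the two-pass heuristic."""
    links = {}
    # Pass 1: for every citation, keep only its closest entity.
    closest_entity = {c: min(entities, key=lambda e: abs(e[1] - c[1]))
                      for c in citations}
    # Pass 2: for every entity, keep only its closest remaining citation.
    for ent in entities:
        cites = [c for c in citations if closest_entity[c] == ent]
        if cites:
            links[ent[0]] = min(cites, key=lambda c: abs(c[1] - ent[1]))[0]
    return links

# "The authors use CNN [1] layer on top of BERT [2] embeddings."
ents = [("CNN", 16), ("BERT", 40)]
cites = [("[1]", 20), ("[2]", 45)]
print(link_entities_to_citations(ents, cites))  # -> {'CNN': '[1]', 'BERT': '[2]'}
```

Making the match mutual in both passes prevents a single nearby entity from absorbing every citation marker in a dense citation sentence.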
We use the same unlabelled corpora and Knowledge Base (KB) for both training processes. The automatic data labelling process using the external KB, followed by entity replacement, is shown in Figure 5 (Training Pipeline for the Sentence Intent Classifier and NER model using the unlabelled corpora and KB). The transformed dataset is used as distantly supervised training labels for the SciBERT-CRF NER model. The figure also illustrates the sentence intent classifier approach, where the KB is used to obtain weakly binary-labelled sentences from 'Related Work' sections to train the classifier.

For the automated framework, we use a two-stage pipeline. It takes a scientific research paper as input and extracts its citation sentences. Using the trained sentence intent classifier, we segregate the sentences into positive and negative labels; only positively labelled sentences are passed to the next stage of the pipeline, where the trained SciBERT-CRF NER model extracts entity mentions from them.
The model mentions are then linked with their respective citations using our Entity-Citation Linker. The framework is illustrated in Figure 6 (the entire automated framework utilising the trained models: the input is a scientific document, and the output is a set of predicted model entities linked with their citations).

8. Error Analysis

We conduct error analysis for model entity extraction, sentence intent classification and entity-citation linking. Some precision error is introduced because the training set considers the maximum span of each entity, so the frequency of I-Model tags (tokens that lie inside a named entity) is high. In our evaluation dataset, however, B-Model tags are far more frequent, which leads the model to misclassify an O as an I in a few sentences. Also, because the evaluation dataset uses citation sentences, our model sometimes tags the citation marker occurring right after an entity as I-Model. In addition, most citation sentences in the evaluation dataset contain many named entities occurring adjacently, as is common in citation contexts; the model, which is trained only on sentences from abstracts, sometimes fails to recognise all of them as entities.
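Since entity spans are recovered from the predicted tag sequence, B/I confusions of the kind described above directly corrupt span boundaries. A generic BIO-to-span decoder (our sketch, not the paper's code) makes this concrete:

```python
def bio_to_spans(tokens, tags):
    """Recover (start, end) token spans of entities from a BIO tag sequence.
    An I-Model following O is treated as starting a new span (a common repair)."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B-Model" or (tag == "I-Model" and start is None):
            if start is not None:          # a B-Model closes any open span
                spans.append((start, i))
            start = i
        elif tag == "O" and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(tags)))
    return spans

toks = ["We", "use", "KG-BERT", "[12]", "for", "link", "prediction"]
tags = ["O", "O", "B-Model", "I-Model", "O", "O", "O"]
# A spurious I-Model on the citation marker stretches the span to cover "[12]".
print(bio_to_spans(toks, tags))  # -> [(2, 4)]
```

Under strict span matching, the stretched span (2, 4) counts as a miss against the gold span (2, 3), which is how the citation-marker error above costs precision.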
For the sentence intent classification, our classifier often assigns a positive label to sentences containing dataset names. This can be attributed to citation sentences that refer to datasets having a similar structure to those citing model names of prior work. Lastly, for the entity-citation linker, an entity associated with a citation marker sometimes occurs in the initial part of a sentence and is not the closest to its citation, which can lead to missed or incorrect links.

9. Implementation details

We use the PyTorch framework to implement our NER model. We use the pre-trained SciBERT tokenizer and embeddings as input to a dropout layer with dropout probability 0.5 to prevent overfitting. We use a learning rate of 1e-5 and train all models for 20 epochs. The output of the dropout layer is passed through a linear layer whose input dimension matches the hidden dimension of the SciBERT embeddings (768) and whose output dimension matches the number of labels (4). We train the BiLSTM-CRF model for 20 epochs. We annotate the evaluation dataset in the standard CoNLL BIO format. For Sentence-BERT, we use pretrained models available in PyTorch. We use the DBLP corpus of ~43K papers as the unlabelled research corpus from which we obtain the distant labels for training the classifier. For the Knowledge Base (KB), we use the PwC public data corpus. For the CRF layer, we use the allennlp-models library (https://github.com/allenai/allennlp-models). We use regular expressions to extract citation sentences from papers written in the Springer LNCS/LNAI format.

10. Conclusion and future work

We have introduced a novel task of long-tailed model entity recognition from scientific documents. We evaluate multiple baselines on our gold-standard evaluation set and find that a simple entity replacement strategy works well for distant supervision on small labelled datasets. We hope to extend this technique to other entity types with low labelled-data availability.
We integrate our model into the automated pipeline framework to extract model names from scientific research documents and link them to their respective citations. For future work, we aim to run this pipeline over a large research corpus to obtain a map of benchmarked model names linked with their respective papers at a much larger scale. We believe our work will serve as an important starting point for mapping the entire research landscape of computer science.

References

[1] V. Kocaman, D. Talby, Biomedical named entity recognition at scale, CoRR abs/2011.06315 (2020). URL: https://arxiv.org/abs/2011.06315. arXiv:2011.06315.
[2] Y. Luan, L. He, M. Ostendorf, H. Hajishirzi, Multi-task identification of entities, relations, and coreference for scientific knowledge graph construction, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 3219–3232. URL: https://aclanthology.org/D18-1360. doi:10.18653/v1/D18-1360.
[3] S. Jain, M. van Zuylen, H. Hajishirzi, I. Beltagy, SciREX: A challenge dataset for document-level information extraction, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020. arXiv:2005.00512.
[4] S. Mesbah, C. Lofi, M. V. Torre, A. Bozzon, G.-J. Houben, TSE-NER: An iterative approach for long-tail entity extraction in scientific publications, in: International Semantic Web Conference, Springer, 2018, pp. 127–143.
[5] Q. Liu, P. cheng Li, W. Lu, Q. Cheng, Long-tail dataset entity recognition based on data augmentation, in: EEKE@JCDL, 2020.
[6] J. Lafferty, A. McCallum, F. Pereira, Conditional random fields: Probabilistic models for segmenting and labeling sequence data, in: ICML, 2001.
[7] H. L. Chieu, H. Ng, Named entity recognition with a maximum entropy approach, in: CoNLL, 2003.
[8] J. Li, A. Sun, J. Han, C. Li, A survey on deep learning for named entity recognition, ArXiv abs/1812.09449 (2018).
[9] S. Goldberg, D. Z. Wang, C. Grant, A probabilistically integrated system for crowd-assisted text labeling and extraction, J. Data and Information Quality 8 (2017). URL: https://doi.org/10.1145/3012003. doi:10.1145/3012003.
[10] X. Wang, Y. Guan, Y. Zhang, Q. Li, J. Han, Pattern-enhanced named entity recognition with distant supervision, in: 2020 IEEE International Conference on Big Data (Big Data), 2020, pp. 818–827. doi:10.1109/BigData50022.2020.9378052.
[11] C. Liang, Y. Yu, H. Jiang, S. Er, R. Wang, T. Zhao, C. Zhang, BOND: BERT-assisted open-domain named entity recognition with distant supervision, CoRR abs/2006.15509 (2020). URL: https://arxiv.org/abs/2006.15509. arXiv:2006.15509.
[12] M. A. Hedderich, L. Lange, D. Klakow, ANEA: Distant supervision for low-resource named entity recognition, CoRR abs/2102.13129 (2021). URL: https://arxiv.org/abs/2102.13129. arXiv:2102.13129.
[13] F. Nooralahzadeh, J. T. Lønning, L. Øvrelid, Reinforcement-based denoising of distantly supervised NER with partial annotation, in: Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 225–233. URL: https://aclanthology.org/D19-6125. doi:10.18653/v1/D19-6125.
[14] Y. Yang, W. Chen, Z. Li, Z. He, M. Zhang, Distantly supervised NER with partial annotation learning and reinforcement learning, in: Proceedings of the 27th International Conference on Computational Linguistics, Association for Computational Linguistics, Santa Fe, New Mexico, USA, 2018, pp. 2159–2169. URL: https://aclanthology.org/C18-1183.
[15] S. Ganguly, V. Pudi, Competing algorithm detection from research papers, in: Proceedings of the 3rd IKDD Conference on Data Science (CODS '16), Association for Computing Machinery, New York, NY, USA, 2016. doi:10.1145/2888451.2888473.
[16] I. Beltagy, K. Lo, A. Cohan, SciBERT: A pretrained language model for scientific text, arXiv preprint arXiv:1903.10676 (2019). URL: https://www.aclweb.org/anthology/D19-1371/.
[17] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using Siamese BERT-networks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 3982–3992. URL: https://www.aclweb.org/anthology/D19-1410/.