<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Modeling using Masked Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Riley Capshaw</string-name>
          <email>riley.capshaw@liu.se</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eva Blomqvist</string-name>
          <email>eva.blomqvist@liu.se</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Linköping University</institution>
          ,
          <addr-line>Linköping</addr-line>
          ,
          <country country="SE">Sweden</country>
        </aff>
      </contrib-group>
      <abstract>
<p>We propose a methodology for leveraging aspects of ontology design principles to guide the use of a masked language model (MLM) as a query engine over raw text documents. By using targeted fill-in-the-blank-style prompts to define relations, we show how a domain expert could use BERT, a well-known MLM, to extract triples from unseen documents without any fine-tuning. We evaluate our proposed methodology using a modified document-level relation extraction task, highlighting early successes but also numerous areas that need improvement. Despite these shortcomings, we discuss why we are still hopeful that this work paves the way toward flexible text-based query engines over collections of unstructured documents.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        At the same time, advances in language modeling have yielded results which can provide (an
approximation of) the same services as the KB envisioned above with no logical modeling, i.e. no
actual KG or ontology. Some work even considers large language models (LLMs) as knowledge
bases in their own right [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
], with aspects such as the structured data and requirements being implicit in the learned patterns and
distributions. For instance, consider the recently released GPT-3 [
        <xref ref-type="bibr" rid="ref4">4</xref>
], which can, among other things, answer arbitrary queries concerning factual
statements. While impressive, these generative systems often rely on clever prompt engineering
and may still assert false claims, so-called “hallucinations,” requiring significant work on the
part of the user to ensure the quality of the generated text. Further, because these systems are
intended to be as general-purpose as possible, there is no obvious way to force them, out of
the box, to generate only text supported by a given document or corpus. This means that
potentially irrelevant or erroneous information present in the language model’s training data
could bias their output, hampering their use as KBs.
      </p>
<p>However, LLMs can still provide a natural language interface to knowledge, and this is what
we attempt to exploit in this work. Our goal is not to automatically generate a KG from raw
text, nor to use LLMs as KBs. Instead, we wish to treat text corpora as semi-structured KGs, i.e.
replacing the extraction step with virtual querying using an LLM, and to enable users to explore
the information within a text corpus through natural-language queries in a guided setting, e.g.
using a specific ontology, as if the knowledge were already represented in a KB.</p>
      <p>Another aspect of this work is the opportunity to change the ontology as we go along. By
incrementally defining the relations as text queries with logical restrictions, users simultaneously
build both a query and the ontology backing it. Hence, we also address the problem of viewing
the data through the lens of different ontologies or ontology patterns.</p>
      <sec id="sec-1-1">
        <title>1.1. Motivating example</title>
<p>Take, for example, the document in Table 1, which is part of the DocRED dataset [5], presented
later in this paper, in which texts are annotated with entities and relations originating in
WikiData [6]. Depending on the underlying ontology, a user might wish to extract the statement
“Skai TV is located in the country of Greece.” In a traditional KG, these facts are likely stored as
triples such that a simple SPARQL query could be used to retrieve them using a basic graph
pattern: ?x p17 ?y. In this case, the queries are essentially independent of the ontology; if p17
were undefined, there would simply be no matching triples returned. Further, the semantics
of the predicate p17 are disjoint from its use. Understanding the triple Skai_TV p17 Greece
requires looking into the ontology directly. Had we named the predicate country instead, we
would risk making wrong assumptions about the actual semantics of the predicate and when it applies.</p>
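<p>To make the contrast with the later text-based querying concrete, the basic graph pattern ?x p17 ?y can be sketched as a simple pattern match over a toy triple store. This is a minimal Python illustration; the triples and the match helper are hypothetical stand-ins for a real SPARQL engine.</p>

```python
# Toy triple store holding a few of the facts from Table 1.
TRIPLES = [
    ("Piraeus", "P17", "Greece"),
    ("Skai_TV", "P159", "Piraeus"),
    ("Skai_TV", "P17", "Greece"),
]

def match(pattern, triples):
    """Match a basic graph pattern against triples.

    Terms starting with '?' are variables; everything else must
    match exactly. Returns one binding dict per matching triple.
    """
    results = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                binding[term] = value
            elif term != value:
                break  # constant term does not match this triple
        else:
            results.append(binding)
    return results

print(match(("?x", "P17", "?y"), TRIPLES))
```

As in the text, the pattern itself carries no semantics: if p17 were undefined, the query would simply return no bindings.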
<p>Now consider the case where that same query is written in a way which more closely matches
the statement from before: “?x is located in the country of ?y.” This query makes a few more
aspects of the relation clear, such as the direction of the relationship, the fact that the relationship
is a spatial one, and the fact that ?y needs to be a country. These aspects carry over directly to the results
that the user gets. A user would also be able to see more clearly when incorrect statements are
retrieved. If the object of a statement ended up not being a country, the user could add a
range constraint and try again. Finally, assume that the query is resolved by somehow matching
it to information in text documents. Then, if the results simply seem incorrect overall, the
user could adjust the text of the query until the results were satisfactory. This encourages a
guided approach to querying for information within the document, in a sense building up the
logical structure alongside the query.</p>
<p>Table 1: An example document from DocRED with its annotated relations.
Document: “Skai TV is a Greek free-to-air television network based in Piraeus. It
is part of the Skai Group, one of the largest media groups in the country. It was
relaunched in its present form on 1st of April 2006 in the Athens metropolitan area,
and gradually spread its coverage nationwide. Besides digital terrestrial transmission,
it is available on the subscription-based encrypted services of Nova and Cosmote TV.
Skai TV is also a member of Digea, a consortium of private television networks
introducing digital terrestrial transmission in Greece. At launch, Skai TV opted
for dubbing all foreign language content into Greek, instead of using subtitles.
This is very uncommon in Greece for anything except documentaries (using voiceover
dubbing) and children’s programmes (using lip-synced dubbing), so after intense
criticism the station switched to using subtitles for almost all foreign shows.”
Relations: (‘Piraeus’, ‘P17’, ‘Greece’), (‘Skai Group’, ‘P17’, ‘Greece’),
(‘Athens’, ‘P17’, ‘Greece’), (‘Skai TV’, ‘P159’, ‘Piraeus’),
(‘Skai TV’, ‘P127’, ‘Skai Group’), (‘Skai TV’, ‘P159’, ‘Athens’),
(‘Skai TV’, ‘P17’, ‘Greece’).</p>
<p>The system which performs the querying could additionally keep track of the underlying
predicate a query relates to, as well as any other information that enhances the query, such as
constraints like domain and range. This would also allow for multiple different queries for
the same relation, each representing different aspects, potentially with their own constraints.
Relation P131 (“located in the administrative territorial entity”) is a particularly difficult one,
with WikiData listing 47 alternate names1. For that relation, it seems clear that no single text
query could describe all instances.</p>
      </sec>
      <sec id="sec-1-2">
        <title>1.2. Research Questions</title>
<p>The work in this paper is a starting point towards answering the following research questions:
RQ1: What are the limitations and challenges of using a masked language model to retrieve
facts from natural-language text using fill-in-the-blanks queries and a perplexity-based
scoring metric?
RQ2: How robust is such a system to small changes in the prompts and restrictions provided, and
where are the important points of future improvement?
While we cannot definitively answer these questions yet, our results are analyzed in Section 4
with these questions in mind, and they guide our discussion of future work in Section 5. Delimitations
of the work include that we specifically analyze the case where the model is BERT [7]. Other
details of the experimental setup are provided in Section 3.</p>
      </sec>
      <sec id="sec-1-3">
        <title>1.3. Contributions and Paper Outline</title>
<p>The main contributions of this work are (1) a method and technical setup for using a masked
language model (MLM), i.e. BERT, to answer fill-in-the-blanks questions representing
typical KB triples, based on a specific text, and (2) a set of experimental results, allowing us to
analyse and explore the limitations and challenges of using BERT for this task and to derive future
work directions.
1See https://www.wikidata.org/wiki/Property:P131 under “Also known as.” Accessed March 15, 2023.</p>
<p>The remainder of this paper is structured as follows. In Section 2 we briefly present some
areas of related work, before presenting our method and technical setup in Section 3. Results of
the experiments are then presented in Section 4. Finally, we conclude with a discussion and
an outline of future work in Section 5.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        There are approaches that try to use LMs as KBs in their own right, as exemplified by [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
and [
        <xref ref-type="bibr" rid="ref3">3</xref>
]. However, although these approaches use an LM as if it were a KB, they assume that the
LM already contains the facts the user is asking for, while we make a different assumption, namely
that the facts reside in some text corpus rather than in the LM itself. Nevertheless, the idea of
using natural language to access a KB and having the query interpreted by an LM is similar.
      </p>
<p>For automatically extracting a KG from text, on the other hand, the most prominent methods
currently apply to Open Information Extraction (OpenIE) data sets. Although quite effective in
many cases, OpenIE attempts to extract triples directly from text without any predetermined
schema or ontology, which causes several problems, as discussed e.g. in [8]. For instance, this
means that the resulting KG will have a diverse set of properties linking the entities, and no
obvious way for the user to understand their overlap and relations in order to formulate good
queries over the data. Hence, these techniques are not really suitable when the end-user
needs a uniform way to access the resulting KG, e.g. through an ontology.</p>
<p>In more detail, the restricted part of the problem we focus on in this paper's experiments
is related to document-level relation extraction. Many approaches have emerged in this area in
the last few years, e.g. [9]. However, it is not possible to directly compare our results to such
systems due to the different scoring system used in our method, which reflects the different aim
of our approach.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
<p>In this section we describe the overall method and experimental setup used to explore the
research questions. A diagram outlining this process is shown in Figure 1. First we
describe the dataset, and how we generated prompts and other input to the experiments. Next
we describe the LM used, the different experiments undertaken, and finally the evaluation
metrics used for assessing the results.</p>
      <sec id="sec-3-1">
        <title>3.1. Dataset &amp; Queries</title>
<p>We chose to use the Document-level Relation Extraction Dataset (DocRED) [5] for our
experiments. This dataset differs from OpenIE data sets in several ways. First, it is a slightly easier, more
restricted setting, since all relations, entities, and mentions are provided a priori, so
preprocessing steps are not needed and noise from those steps is minimized. This setting also
matches our intended task very well, since we assume that we already have an ontology
based on which we would like to query our virtual KG. Second, the focus of the dataset is on
facts whose support spans multiple sentences, thus requiring entire documents to be considered
in order for all facts to be extracted. The latter makes this an excellent dataset for experimenting
with fact extraction, due to the need for sub-tasks like coreference resolution to be performed
across sentence boundaries. This ensures that our results are relevant for realistic scenarios of
using a text corpus to back the virtual KG.</p>
<sec>
<title>3.1.1. Prompts</title>
<p>We complemented the DocRED dataset with custom prompts for every relation, as a
representation of typical queries that could be submitted to the virtual KG. At least one prompt was
written for each relation, with the ten most frequent relations receiving four prompts, each
written by a different person (where three of those persons had no prior knowledge of the dataset,
to ensure a realistic set of queries), to examine sensitivity to prompt variation. The prompts are
in English, with the expected subject and object replaced with variable markers. For instance,
the most frequent relation, P17 (“country”), had the following prompts written for it:
1. *?x is in the country of ?y.
2. ?x is located in ?y.
3. ?x is located in the country of ?y.
4. ?x is in country ?y.</p>
<p>The prompt marked with * is used in the primary experiments in Section 3.3, while the other
three are used in the sensitivity experiment described in Section 4.2.1. Table 2 gives details
about the ten most frequent relations in DocRED, including the main prompts used in the
experiments.</p>
</sec>
        <sec id="sec-3-1-1">
          <title>3.1.2. Domain and Range Restrictions</title>
<p>In order to get an idea of the effects of including logical restrictions in the prediction process,
we included domain and range restrictions on the relations in one of our experiments. Such
restrictions would typically be found in an ontology used for querying the virtual KG. For
example, the “country” relation clearly can only have a country as its object, so its range was
set to LOC, i.e. only location entities, during this specific experiment. These restrictions were
mined from the training portion of the DocRED data set. To account for labeling noise, we
excluded types if they did not appear in at least 5% of the relation instances.</p>
<p>Table 2: The ten most frequent relations in DocRED, with their WikiData names, descriptions, and main prompts.
P17 (country): sovereign state that this item is in (not to be used for human beings). Prompt: ?x is in the country of ?y.
P27 (country of citizenship): the object is a country that recognizes the subject as its citizen. Prompt: ?x is a citizen of ?y.
P131 (located in the administrative territorial entity): the item is located on the territory of the following administrative entity. Prompt: ?x is located in ?y.
P150 (contains administrative territorial entity): (list of) direct subdivisions of an administrative territorial entity. Prompt: ?x contains ?y within its borders.
P161 (cast member): actor in the subject production. Prompt: The cast of ?x includes ?y.
P175 (performer): actor, musician, band or other performer associated with this role or musical work. Prompt: ?x was performed by ?y.
P527 (has part): part of this subject. Prompt: One part of ?x is ?y.
P569 (date of birth): date on which the subject was born. Prompt: ?x was born on ?y.
P570 (date of death): date on which the subject died. Prompt: ?x died on ?y.
P577 (publication date): date or point in time when a work was first published or released. Prompt: ?x was published on ?y.</p>
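<p>The mining step described above can be sketched as follows. The 5% rule is from the text; the type labels, toy data, and helper name are illustrative.</p>

```python
from collections import Counter

def mine_range(instances, min_share=0.05):
    """Mine the range restriction for one relation.

    instances: list of (subject_type, object_type) pairs observed in
    training data. Object types seen in fewer than min_share of the
    instances are treated as labeling noise and excluded.
    """
    counts = Counter(obj_type for _, obj_type in instances)
    total = sum(counts.values())
    return {t for t, c in counts.items() if c / total >= min_share}

# Toy training instances for a "country"-like relation: the object is
# almost always a location, with a little annotation noise.
train = [("ORG", "LOC")] * 95 + [("ORG", "MISC")] * 4 + [("ORG", "PER")] * 1
print(mine_range(train))
```

Here the MISC and PER object types each fall below the 5% cut, so the mined range keeps only LOC, mirroring the example in the text. The domain restriction would be mined the same way over subject types.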
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Language Model</title>
<p>For our baseline, we use the large, cased variant of the Bidirectional Encoder Representations
from Transformers (BERT) model [7]. BERT is a pre-trained masked language model (MLM)
which, when originally published, obtained state-of-the-art scores on eleven benchmark tasks
through fairly minimal task-specific fine-tuning. As an MLM, BERT was trained on an infilling
task, where tokens in a sequence are masked out and need to be recovered. It was also trained
on a next-sentence prediction task, but we do not focus on that here. To accomplish the infilling
task, BERT learns contextualized vector representations for every token in the corpus that
encapsulate enough information such that nearby masked words can be recovered reliably.</p>
        <sec id="sec-3-2-1">
          <title>3.2.1. Token-Level Conditional Probabilities</title>
<p>Despite MLMs not being probabilistic models in a formal sense, they can produce a score
analogous to a conditional probability for every token in a sequence [10]. Given a token w_i at
position i in a sequence W, the score for w_i is the value</p>
<p>S_i (W) ∶= P_BERT (w_i ∣ W∖i) .    (1)</p>
<p>For ease of discussion, we adopt the language of related work and use conditional probability
notation for these scores. The value of S_i is calculated by masking out token w_i to yield a new
sequence W∖i, then using BERT to score replacing the masked-out token with the original token
w_i. Consider the string “Hello world!” as an example. The score for “world” is calculated (after
tokenization) as</p>
<p>S_1 (['hello', 'world', '!']) = P_BERT ('world' ∣ ['hello', [MASK], '!']) = 0.344.</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2.2. Pseudo-Log-Likelihood Sequence Ranking</title>
<p>Since BERT is a masked language model, it does not generate likelihood values for a given text,
making it difficult to use for generative tasks. To circumvent this limitation, Wang and Cho [11]
use BERT to calculate pseudo-log-likelihood (PLL) scores:</p>
<p>PLL (W) ∶= ∑_{i=0}^{|W|−1} log S_i (W) = ∑_{i=0}^{|W|−1} log P_BERT (w_i ∣ W∖i) .    (2)</p>
<p>With the same example as before, we can calculate the PLL of “Hello world!”:</p>
<p>PLL (['hello', 'world', '!']) = log (0.430) + log (0.344) + log (0.818) = −2.112, i.e. −0.704 when
normalised by the sequence length of three tokens.</p>
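<p>The scoring above can be sketched as follows, with the MLM abstracted behind a user-supplied scoring function. score_fn stands in for P_BERT applied to the masked sequence, and the toy probabilities are the ones from the running “Hello world!” example; a real implementation would call BERT once per masked position.</p>

```python
import math

def pll(tokens, score_fn):
    """Pseudo-log-likelihood (Equation 2): sum over positions of the log
    score of each token, computed with that token masked out."""
    total = 0.0
    for i, token in enumerate(tokens):
        masked = tokens[:i] + ["[MASK]"] + tokens[i + 1:]
        total += math.log(score_fn(token, masked))
    return total

# Toy stand-in for BERT's conditional scores from the example above.
toy_probs = {"hello": 0.430, "world": 0.344, "!": 0.818}
score = pll(["hello", "world", "!"], lambda tok, masked: toy_probs[tok])
print(round(score, 3), round(score / 3, 3))
```

With these toy probabilities the sum is −2.112, i.e. −0.704 per token.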
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Experimental Setup</title>
<p>The basic experiments are performed as follows. The DocRED dataset already provides certain
information which would otherwise need to be extracted separately, so we avoid any further
preprocessing of the text. In this sense, we are working under the assumption that similar
data could be generated using a pipeline that includes typed named-entity recognition,
entity-mention disambiguation, and identification of all possible relations. Then, one fill-in-the-blanks
prompt was written for every relation present in the dataset (see Table 2). Next, all possible
pairs of entity mentions are collected from each document, which are in turn used to populate
the prompts, generating all possible statements as described in Section 3.1.1. The statements
are then ranked by their PLL values using BERT, as in Equation 2 (larger is better).</p>
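<p>The candidate-generation step can be sketched as follows. The entities and prompt are illustrative, borrowed from the Table 1 example; in the experiments every pair of entity mentions in a document is substituted into every relation's prompt.</p>

```python
from itertools import permutations

def populate(prompt, entities):
    """Substitute every ordered pair of entities into a prompt template,
    yielding all candidate statements for one relation."""
    return [prompt.replace("?x", subj).replace("?y", obj)
            for subj, obj in permutations(entities, 2)]

statements = populate("?x is in the country of ?y.",
                      ["Skai TV", "Greece", "Piraeus"])
for s in statements:
    print(s)
```

Three entities yield six ordered pairs, so six candidate statements; each would then be scored by its PLL, with higher-ranked statements more likely to be kept.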
<p>For each relation, the training set is used to derive a threshold value which determines
whether a statement is kept (considered true/valid) or discarded. Standard metrics can then be
used for a quantitative analysis. However, that is not enough to fully evaluate whether such a
system is viable for use in the future. To assess that, we gauge the robustness of the approach
and the extent to which human input is necessary by performing the following experiments:
1. Logical constraints: What happens when we impose domain and range restrictions on
the possible responses?
2. Prompt stability: How do different user-generated prompts affect the scores?</p>
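<p>The per-relation threshold derivation can be sketched as follows. The selection criterion shown here, maximising F1 over the training scores, is an assumption for illustration; the text only states that a threshold is derived from the training set.</p>

```python
def best_threshold(scored):
    """Pick the decision threshold maximising F1 on training data.

    scored: list of (pll_score, is_true) pairs for one relation.
    Statements with a score >= threshold are kept.
    """
    best_t, best_f1 = None, -1.0
    for t, _ in sorted(scored):  # every observed score is a candidate
        tp = sum(1 for s, y in scored if s >= t and y)
        fp = sum(1 for s, y in scored if s >= t and not y)
        fn = sum(1 for s, y in scored if s < t and y)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t

train = [(-0.5, True), (-0.9, True), (-1.5, False), (-2.0, False)]
print(best_threshold(train))
```

On this toy data the threshold lands at −0.9, the lowest score that still separates the true statements from the false ones.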
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Evaluation Metrics</title>
<p>We evaluate our system using a variety of simple metrics, with the aim of exploring how well it
works in both automatic and expert-guided settings. For the automatic setting, we calculate a
per-relation prediction threshold, then use that to calculate precision, recall, and F1 scores for
every relation. For the expert-guided setting, we use top-k metrics, varying k between 1 and 5.
The top-1 metric measures how often the first returned result is correct, such as when a user
wants a single, quick answer. The top-5 metric instead measures how often an expert will be
able to locate at least one correct answer within the first 5 responses. As such, each top-k score
answers the question: What is the perceived quality of the system if an expert is willing to accept
only k − 1 incorrect responses?</p>
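<p>The top-k metric can be sketched as follows. The ranked lists are toy data; True marks a correct statement at that rank for one query.</p>

```python
def top_k(rankings, k):
    """Fraction of queries with at least one correct statement
    among the k highest-ranked candidates."""
    hits = sum(1 for ranked in rankings if any(ranked[:k]))
    return hits / len(rankings)

rankings = [
    [True, False, False],   # correct answer ranked first
    [False, False, True],   # correct answer ranked third
    [False, False, False],  # no correct answer in the top 3
]
print(top_k(rankings, 1), top_k(rankings, 3))
```

The score is monotone in k: top-1 counts only queries answered correctly on the first try, while top-5 credits any query whose answer an expert could spot within five responses.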
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results and Analysis</title>
<p>In this section, we present and analyze the scores of the various experiments, in relation to the
research questions and the overall aim of the work.</p>
      <sec id="sec-4-1">
<title>4.1. Automatic Relation Extraction</title>
<p>The results for the ten most common relations are presented in Table 3. As can be seen in the
table, scores vary widely, but are in general at a level not usable in practical applications. The
same remains true even when the results are enhanced with the domain and range constraints.
This means that, without further modifications, it is not reasonable to use our system setup as
an automatic KG extraction tool. Although these are negative results, they can be seen as
supporting our hypothesis that not actually extracting a KG from the text, but merely allowing
the user to query a virtual KG backed by the text, is a much more reasonable route to pursue. In
the next section we make a deeper analysis of one of the problems currently affecting these
results.</p>
        <sec id="sec-4-1-1">
          <title>4.1.1. Rare Token Bias</title>
<p>In many cases, statements which included long strings of non-English words were scored higher
than expected. This is likely due to how the BERT tokenizer works. Longer, uncommon words
are broken down into sub-word tokens. Instances of these tokens may be uncommon in the
overall data set, but extremely common when appearing together as a sequence. This means
that the scoring metric often disproportionately ranks a token due to its adjacent neighbors,
rather than due to the sentence as a whole. For example, take the clearly incorrect statement
“École nationale supérieure des Beaux - Arts was born in Paris.” All per-token scores related
to “École nationale supérieure des Beaux - Arts” are close to 1.0, which is
the highest possible. The only words that have low scores are “was,” which BERT suggests
replacing with “.”, and “born,” which BERT suggests replacing with “founded”. The fact that
these two key words have such a low score would imply to a human that the statement as a
whole is wrong, yet it still received a very high score of −0.54, due to the remaining scores that
were much higher. As a point of comparison, the highest threshold in Table 3 is only −1.08. This
means that, in order for the system to function in an automatic setting, a more robust scoring
method is needed which can handle these sorts of situations.</p>
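<p>The effect can be illustrated with hypothetical per-token probabilities, assuming a length-normalised PLL: the many near-certain sub-word tokens of the rare name outweigh the two low-scoring key words, so the averaged score stays comfortably above the decision threshold even though a human would reject the statement.</p>

```python
import math

# Invented per-token probabilities for the incorrect statement above:
# the sub-word tokens of the rare name all score near 1.0, while only
# "was" and "born" score low.
name_scores = [0.99] * 10      # sub-word tokens of the rare name
key_scores = [0.15, 0.12]      # "was", "born"
scores = name_scores + key_scores

mean_ll = sum(math.log(p) for p in scores) / len(scores)
print(round(mean_ll, 3))
```

Despite two tokens scoring below 0.2, the averaged log-likelihood here is about −0.343, well above the highest threshold of −1.08 from Table 3.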
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Expert-Guided Relation Extraction</title>
<p>Table 4 again presents results for the ten most frequent relations, this time for the top-k metrics
described in Section 3.4. These results are more positive, showing that in most cases a user
will be able to find at least one example of what they are looking for among the top 5 returned
statements. We can see that the simple act of restricting the results by domain and range yields
a nearly ten-fold increase in scores for a few relations. This supports the idea that a user
will need to guide the extraction, either by imposing restrictions based on an ontology,
adjusting the text of the query, or adding additional queries with their own restrictions.</p>
<p>(Table 4: top-k scores for k = 1 to 5, for relations P17, P27, P131, P150, P161, P175, P527, P569, P570, and P577.)</p>
        <sec id="sec-4-2-1">
          <title>4.2.1. Prompt Sensitivity</title>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussions and Future Work</title>
<p>In this section, we discuss the implications of our results, both with regard to the proposed
method and to outline directions for future research. While the negative results on the automatic
KG extraction task clearly show that this method is, at least for now, not suitable for such a
completely automated task, they also support our hypothesis that a virtual KG query system might
be a fruitful direction. This is further supported by the promising top-k results in
the expert-guided setting, by the contributions that logical constraints (such
as those commonly present in an ontology) can provide, and by the robustness to changes in the
prompts written by different persons. Below, we discuss each of these aspects in more detail.</p>
<p>(Table: top-k scores for k = 1 and k = 5, using the original prompts and the prompts written by
Participants 1, 2, and 3, for relations P17, P27, P131, P150, P161, P175, P527, P569, P570, and P577.)</p>
      <sec id="sec-5-1">
        <title>5.1. Logical Constraints</title>
        <p>The experiments in Section 3.3 only dealt with domain and range restrictions. While these
showed significant improvement in results when used, there are many more logical constraints
available which could be included. As an example, we know that P1365 (replaces) and P1366
(replaced by) each imply the other, and P3373 (sibling) is symmetric. To further bolster the
system as a guided way to build an ontology or knowledge graph, one could select any sort
of rules allowed by a particular family of description logics (as commonly used for ontologies,
such as in OWL) and use those to improve the filtering process.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Domain Adaptation</title>
<p>One clear shortcoming of this system is its inability to adapt to a given domain. The experiments
as they were set up may be too heavily biased by accidental background knowledge. That is, if
BERT was trained on any data from Wikipedia or WikiData pages, then there is a high likelihood
that statements similar to the correct prompts have already been seen by the system.
Countering this, however, would require the system to be fine-tuned in some way on a given document.
Salazar et al. [12] show how to use pseudo-perplexity (the MLM analogue of traditional LM
perplexity, based on PLL) to measure the domain adaptation of an MLM during fine-tuning. Such an
approach could be used to fine-tune an MLM on an unseen document to improve ranking, while
remaining a self-supervised approach.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Other Language Models</title>
<p>BERT is not the only MLM which could have been considered in this study. Repeating similar
experiments with other MLMs, such as RoBERTa [13], is still part of future work. Different
classes of LMs should also be studied in this setting. Such exploration would seek to answer
the question: How large is the impact of pretraining or model size on the quality of the results in a
zero-shot setting?</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
<p>This work was funded by the Swedish National Graduate School in Computer Science (CUGS).
Portions of this work were carried out using the AIOps/Stellar facilities funded by the Excellence
Center at Linköping–Lund in Information Technology (ELLIIT).</p>
<p>[5] Y. Yao, D. Ye, P. Li, X. Han, Y. Lin, Z. Liu, Z. Liu, L. Huang, J. Zhou, M. Sun, DocRED: A
large-scale document-level relation extraction dataset, in: Proceedings of the 57th Annual
Meeting of the Association for Computational Linguistics, 2019, pp. 764–777.
[6] D. Vrandečić, Wikidata: A new platform for collaborative data collection, in: Proceedings of
the 21st International Conference on World Wide Web, WWW ’12 Companion, Association
for Computing Machinery, New York, NY, USA, 2012, pp. 1063–1064. doi:10.1145/2187980.2188242.
[7] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional
transformers for language understanding, in: Proceedings of the 2019 Conference of
the North American Chapter of the Association for Computational Linguistics: Human
Language Technologies, Volume 1 (Long and Short Papers), Association for Computational
Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. doi:10.18653/v1/N19-1423.
[8] J. L. Martinez-Rodriguez, I. Lopez-Arevalo, A. B. Rios-Alvarado, OpenIE-based
approach for knowledge graph construction from text, Expert Systems with
Applications 113 (2018) 339–355. doi:10.1016/j.eswa.2018.07.017.
[9] G. Nan, Z. Guo, I. Sekulic, W. Lu, Reasoning with latent structure refinement for
document-level relation extraction, in: Proceedings of the 58th Annual Meeting of the Association
for Computational Linguistics, Association for Computational Linguistics, Online, 2020,
pp. 1546–1557. doi:10.18653/v1/2020.acl-main.141.
[10] J. Shin, Y. Lee, K. Jung, Effective sentence scoring method using BERT for speech
recognition, in: W. S. Lee, T. Suzuki (Eds.), Proceedings of The Eleventh Asian Conference on
Machine Learning, volume 101 of Proceedings of Machine Learning Research, PMLR, 2019,
pp. 1081–1093. URL: https://proceedings.mlr.press/v101/shin19a.html.
[11] A. Wang, K. Cho, BERT has a mouth, and it must speak: BERT as a Markov random field
language model, in: Proceedings of the Workshop on Methods for Optimizing and
Evaluating Neural Language Generation, 2019, pp. 30–36.
[12] J. Salazar, D. Liang, T. Q. Nguyen, K. Kirchhoff, Masked language model scoring, in:
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics,
2020, pp. 2699–2712.
[13] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer,
V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint
arXiv:1907.11692 (2019).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B.</given-names>
            <surname>AlKhamissi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Celikyilmaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Diab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ghazvininejad</surname>
          </string-name>
          ,
          <article-title>A review on language models as knowledge bases, 2022</article-title>
          . URL: https://arxiv.org/abs/2204.06031. doi:10.48550/ARXIV.2204.06031.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>F.</given-names>
            <surname>Petroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rocktäschel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bakhtin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <article-title>Language models as knowledge bases?</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>2463</fpage>
          -
          <lpage>2473</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B.</given-names>
            <surname>Heinzerling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Inui</surname>
          </string-name>
          ,
          <article-title>Language models as knowledge bases: On entity representations, storage capacity, and paraphrased queries</article-title>
          ,
          <source>in: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume</source>
          , Association for Computational Linguistics, Online,
          <year>2021</year>
          , pp.
          <fpage>1772</fpage>
          -
          <lpage>1791</lpage>
          . URL: https://aclanthology.org/2021.eacl-main.153. doi:10.18653/v1/2021.eacl-main.153.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Neelakantan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shyam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          , et al.,
          <article-title>Language models are few-shot learners</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>