<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>SEBD</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
<article-title>Semantic Containment in MLMs: A Prompt-Based Approach</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Discussion Paper</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vito Walter Anelli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro De Bellis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tommaso Di Noia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eugenio Di Sciascio</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Politecnico di Bari</institution>
          ,
          <addr-line>Via Orabona 4, Bari, 70125</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>33</volume>
      <fpage>16</fpage>
      <lpage>19</lpage>
      <abstract>
        <p>This research explores whether Masked Language Models (MLMs) can understand semantic containment relations, such as sub-class and instance-of relationships, which are crucial for Semantic Web applications. The study introduces PRONTO, a novel approach that leverages MLM predictions to discover semantic containment relations in unstructured text by translating the model's internal predictions into classification labels. The effectiveness, reliability, and interpretability of PRONTO are assessed through a comprehensive probing procedure. The findings demonstrate that MLMs can capture semantic containment relationships, which has significant implications for ontology construction and aligning text data with ontologies. For the sake of reproducibility, we make our code, datasets, and evaluation tools available at https://github.com/sisinflab/PRONTO.</p>
      </abstract>
      <kwd-group>
        <kwd>Masked Language Models</kwd>
        <kwd>Prompt Learning</kwd>
        <kwd>Ontologies</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Pre-trained Language Models (PLMs) have become essential in Natural Language Processing
(NLP) due to their ability to capture complex language patterns through extensive training
on large text datasets. Studies show PLMs effectively capture factual [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and ontological [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
knowledge from this pre-training [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. For example, when given a prompt like "Paris is a
[MASK]," a PLM is more likely to predict "capital." This suggests PLMs possess knowledge
modeling capabilities beyond simple word co-occurrence [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. However, this inherent knowledge
is rarely used in applications; instead, other types of structured knowledge are employed [
        <xref ref-type="bibr" rid="ref6 ref7 ref8">6, 7, 8</xref>
        ],
as these models are often fine-tuned to achieve competitive levels of performance in downstream
tasks. This research aims to understand whether bidirectional PLMs inherently recognize ontological
containment, which includes subclass and instance-of relationships. Ontological containment
reflects a hierarchical "is a" relationship between entities (see Figure 1).
      </p>
      <p>[Figure 1: Overview of PRONTO. A DBpedia fragment links the instance Brooklyn_Nets (rdf:type) to team classes such as dbo:BasketballTeam and dbo:SoccerTeam, connected through rdfs:subClassOf edges to dbo:SportsTeam, dbo:Organisation, and dbo:Agent. The rdfs:label values of instance and class are verbalized into the cloze prompt "Brooklyn Nets is [MASK] of sports team", which is fed to a frozen vanilla PLM encoder; the output of the MLM prediction head is mapped to a containment label by the containment verbalizer.]</p>
      <p>
        The study explores whether PLMs can
identify semantic containment when two entities are present in a prompt (e.g., "Paris [MASK]
city"), to determine if PLMs are zero-shot semantic containment learners. We propose PRONTO,
a novel procedure aimed at the extraction of semantic containment relations from bidirectional
PLMs based on the examination of their masked language modeling prediction head. Our key
contributions can be summarized as follows:
• We propose a general procedure to probe semantic containment knowledge from MLMs by
means of automatically learned verbalizers, i.e., mappings between an MLM prediction head
and a label.
• Through extensive analysis, we reveal how vanilla (i.e., not fine-tuned) MLMs exhibit an
inner awareness of semantic containment.</p>
      <p>To the best of our knowledge, this is the first attempt to use the knowledge stored in PLMs
to detect ontological containment through relation prediction with automatically extracted
verbalizers. Finally, we present practical applications in zero-shot entity typing.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <p>In this section, we formally introduce our containment prediction task, which we schematize
in Figure 1. Let C = {c_1, c_2, ..., c_n} represent the set of classes in a reference ontology O.
Each class c_i is a node within the ontology graph. Let E_sub be the set of edges representing
the subclass relations among these classes, where each edge (c_i, c_j) ∈ E_sub denotes that class
c_i is a subclass of class c_j. Let I = {i_1, i_2, ..., i_m} denote the set of instances of classes in C,
and E_type be the set of edges denoting the instance-of relation, where each edge (i_k, c_i) ∈ E_type
indicates that instance i_k is of type c_i, linking instances to their respective classes. We define the
semantic containment graph G as the union of the two sets of edges E_sub and E_type, combined
with their respective node sets C and I. Formally, G = ⟨C ∪ I, E_sub ∪ E_type⟩. For any two nodes
u, v ∈ G, we aim to determine whether there exists a path from u to v that signifies an
"is-a" relationship within the ontology O. This relationship is characterized by a sequence
of edges, each representing either a direct subclass relation between classes or an instance
belonging to a class, thereby forming a chain of semantic containment. Formally, we aim to
learn a model f_θ : (u, v) ↦ ŷ, with ŷ being 1 if there exists a path from u to v in G and 0
otherwise. The function f is parameterized by the parameters θ derived from a vanilla PLM
(e.g., BERT), and ŷ represents the predicted probability that a containment relationship exists
between the concepts u and v. Let us define a function T that constructs a prompt for a
pre-trained MLM, given two nodes u and v. The function obtains the verbalized forms of u and v
through V(·) and inserts a mask token [MASK] between them to form the prompt. Formally, the
prompt construction can be represented as

T(u, v) = V(u) ⊕ "[MASK] " ⊕ V(v),    (1)

where V(u) and V(v) are two natural language representations for the nodes u and v,
respectively. The symbol ⊕ stands for string concatenation, and V(u) is the rdfs:label
associated to u.</p>
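      <p>To make the prompting step concrete, the following is a minimal Python sketch of Equation (1) and the mask-filling pass, using the Hugging Face transformers library; the "is [MASK] of" template (as in Figure 1), the bert-base-uncased checkpoint, and the helper name are our illustrative assumptions, not prescriptions from the paper.</p>
      <preformat>
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()  # vanilla, frozen PLM: no fine-tuning

def mask_logits(u_label: str, v_label: str) -> torch.Tensor:
    # T(u, v) = V(u) ⊕ "[MASK] " ⊕ V(v) (Equation (1)); V(·) is the rdfs:label.
    prompt = f"{u_label} is {tokenizer.mask_token} of {v_label}"
    inputs = tokenizer(prompt, return_tensors="pt")
    pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero().item()
    with torch.no_grad():
        out = model(**inputs).logits
    return out[0, pos]  # prediction-head logits over the vocabulary at [MASK]

logits = mask_logits("Brooklyn Nets", "sports team")
      </preformat>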
      <p>
        Automatic Extraction of a Containment Verbalizer. Given the prompt T(u, v) as input
to a bidirectional PLM capable of mask-filling, the output of its MLM prediction head is
the predicted probability distribution over the tokens that could replace the [MASK]
(Fig. 1). We propose investigating whether these predicted probabilities can help determine the
existence of a containment relationship between u and v. Given a PLM capable of mask-filling,
trained on a vocabulary of size N, and a prompt function T(·, ·), we aim to create a mapping
between the prediction head output and a discrete label y. Prior work formulates the concept of a
verbalizer [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] as a discrete mapping between a subset of tokens V_y = {t_1, ..., t_k} and a label
y. Formally:

P(y | x) = (1 / k) Σ_{j=1}^{k} P([MASK] = t_j | x),    (2)

with k being the number of tokens in V_y and x being the prompt. The construction of V_y
is often done manually: for instance, if y = "city", a reasonable although simplistic verbalizer
construction could be V_y = {city, town}. In this work, we formulate the construction of a
verbalizer as a search problem over the whole vocabulary. This enables our verbalizer to
fully exploit the expressiveness of such a large vocabulary and possibly capture associations
between labels and tokens that may not be easily identifiable even for domain experts. We
want to design the verbalizer as a direct mapping function between the PLM prediction head and
a label. An implementation of such a verbalizer is the following:</p>
      <p>
        P(y | x) = Σ_{j=1}^{N} w_j P([MASK] = t_j | x) = Σ_{j=1}^{N} σ(a_j) P([MASK] = t_j | x),    (3)
      </p>
      <p>
        where w_j ∈ [0, 1] is a weighting factor that modulates the contribution of each token t_j
in the vocabulary to the probability of predicting y given x. The w_j weights can be learned
through an optimization process aiming to minimize a specified loss function. In fact, we learn
the w_j parameters jointly in our optimization procedure, constraining them to the range [0, 1] by
means of a sigmoid (w_j = σ(a_j), with a_j a free parameter). Ideally, we want the verbalizer to
satisfy two useful properties:
P1 Noise Resilience: Since we are dealing with large vocabularies, the significant tokens'
marginal probabilities in a PLM prediction head tend to be diluted by the presence of
many less relevant tokens. This dilution is linked to the softmax function's property of
distributing probabilities across all logits, diminishing the impact of pivotal tokens as the
vocabulary size expands.
      </p>
      <p>
        P2 Sparsity: We aim to enforce a sparsity constraint on the w_j weights to promote
interpretability. This constraint facilitates the identification of the most influential tokens while minimizing
the influence of less relevant ones. In fact, a PLM vocabulary is highly populated even for
smaller models (30,000+ tokens). Therefore, sparsity can aid interpretability for humans,
who can only realistically focus on a small set of informative tokens simultaneously.
To satisfy P1, the MLM prediction head logits pass through a weighted softmax [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]:

softmax(z, r) = ( r_1 exp(z_1) / Σ_{j=1}^{N} r_j exp(z_j), ..., r_N exp(z_N) / Σ_{j=1}^{N} r_j exp(z_j) ),    (4)

where the r_j are parameters learned jointly in the optimization process and constrained in the [0, 1]
range. To satisfy P2, we impose an L1 regularization term over the learned weights w_j in our loss
function. L1 regularization is known to promote sparsity better than alternative regularization
strategies, as well as to improve generalization. To investigate the potential benefits of
nonlinearity within our verbalization strategies, we draw inspiration from MAV (Mapping-Free
Automatic Verbalizer) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], in which the authors formulate a mapping-free verbalizer as a
nonlinear projection of an MLM prediction head into a latent vocabulary space. In our own adaptation,
we substitute the inner Tanh activation function with LayerNorm for numerical stability. This is
motivated by the observation that MLM logits can vary in unnormalized ranges, and the Tanh
function suppresses information associated with high activations:
      </p>
      <p>P(y | x) = σ(W_2 · tanh(W_1 · LN(logits_MLM))).    (5)</p>
      <p>In summary, we experiment with different verbalization strategies (two of which are sketched in code after this list):
• PRONTO-VF: a verbalizer-free baseline approach, where the hidden state of the [MASK]
token is fed into two fully connected layers with a final sigmoid activation, as in Equation (5);
• PRONTO-LIN: a naive linear direct-mapping approach, based on Equation (3);
• PRONTO-WS: a direct-mapping approach where logits are re-weighted before the Softmax
as in Equation (4), and the final label probability is obtained as in Equation (3);
• PRONTO-MAV: a mapping-free approach where logits are fed into two fully connected
layers as in Equation (5).</p>
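      <p>As an illustration, the following is a minimal PyTorch sketch of the PRONTO-WS and PRONTO-MAV verbalizers as we reconstruct them from Equations (3)-(5); the class names, hidden size, and zero initialization are our assumptions, not the released implementation (available at the repository linked in the abstract).</p>
      <preformat>
import torch
import torch.nn as nn

class ProntoWS(nn.Module):
    # Direct-mapping verbalizer: weighted softmax (Eq. 4) + weighted sum (Eq. 3).
    def __init__(self, vocab_size: int):
        super().__init__()
        self.a = nn.Parameter(torch.zeros(vocab_size))  # w_j = sigmoid(a_j) in [0, 1]
        self.b = nn.Parameter(torch.zeros(vocab_size))  # r_j = sigmoid(b_j) in [0, 1]

    def forward(self, mask_logits: torch.Tensor) -> torch.Tensor:
        r = torch.sigmoid(self.b)
        z = mask_logits - mask_logits.max(dim=-1, keepdim=True).values  # stability shift
        weighted = r * torch.exp(z)
        probs = weighted / weighted.sum(dim=-1, keepdim=True)           # Eq. (4)
        return (torch.sigmoid(self.a) * probs).sum(dim=-1)              # Eq. (3)

class ProntoMAV(nn.Module):
    # Mapping-free verbalizer: two FC layers over the logits with LayerNorm (Eq. 5).
    def __init__(self, vocab_size: int, hidden: int = 256):
        super().__init__()
        self.norm = nn.LayerNorm(vocab_size)  # replaces MAV's inner Tanh
        self.w1 = nn.Linear(vocab_size, hidden)
        self.w2 = nn.Linear(hidden, 1)

    def forward(self, mask_logits: torch.Tensor) -> torch.Tensor:
        h = torch.tanh(self.w1(self.norm(mask_logits)))
        return torch.sigmoid(self.w2(h)).squeeze(-1)
      </preformat>
      <p>Both modules take the [MASK] prediction-head logits of the frozen PLM as input and output the probability of a containment relation; for PRONTO-WS, an L1 penalty on sigmoid(a) would be added to the loss to enforce P2.</p>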
      <p>The direct-mapping verbalizers (PRONTO-WS, PRONTO-LIN) are inherently interpretable,
since each w_j measures the contribution of the j-th token to the final label prediction. On
the other hand, PRONTO-MAV and PRONTO-VF can give an indication of more subtle patterns
in the prediction heads that can only be captured by means of non-linearities.
Data Preparation. Given a semantic containment graph G = ⟨C ∪ I, E_sub ∪ E_type⟩, we denote
by Π+ the set of all the pairs of nodes (u, v) that can be found along a path of G. In other
words, we compute the transitive closure of each node in G. Since G does not contain negative
information, this leaves an important decision: how to extract useful negative pairs. This decision
is crucial since it impacts both the efficacy and generalizability of our learned verbalizers and
the reliability of our evaluation. Intuitively, we want our model to be capable of distinguishing
between semantically similar although disjoint classes (e.g., "city"/"region"). However, we
also want it to be able to distinguish among completely unrelated classes (e.g., "city"/"person").
Furthermore, we want it to correctly model a semantic containment relationship that is
noncommutative, instead of just discriminating based on word similarity. We devise three strategies
to build the set Π− of negative samples (a code sketch follows the list):
• Reverse negatives: given a positive pair (u, v), we obtain a negative pair by inverting subject
and object: (v, u);
• Soft negatives: given a positive pair (u, v), we replace v with a random class sampled based
on the class distribution in the data;
• Hard negatives: given a positive pair (u, v) ∈ Π+, we build the two sets C+(u, G) =
{w | (u, w) ∈ Π+} and Ĉ+(u, G) = {ŵ | (ŵ, w) ∈ Π+ and ŵ ∉ C+(u, G) and w ∈
C+(u, G)}. While C+(u, G) represents the set of nodes along a path starting from u in the
original graph G, namely all the nodes in a semantic containment relation with u, the set
Ĉ+(u, G) contains the nodes on the paths arriving in C+(u, G). These nodes are not in a
semantic containment relation with u but are semantically "close" to it. Given a node u, the
hard negatives are then built as (u, ŵ) with ŵ ∈ Ĉ+(u, G).</p>
      <p>
        Prompt Construction. Prior work has demonstrated the sensitivity of PLM outputs to prompt
selection [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. In order to provide a more extensive analysis, we choose to experiment over
different prompt templates. We report our prompt choices in Table 1. We design several
hard templates to capture different linguistic manifestations of the containment relationship.
Regardless of the prompt, both subject and object follow the same verbalization strategy, i.e., the
rdfs:label literal value. In addition to manually designed prompts, we explore the integration
of soft tokens [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], i.e., word vectors jointly fine-tuned during the optimization process.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
      <p>
        This section outlines the experimental setup to probe the ability of PLMs to understand
ontological containment relationships. We specifically focus on evaluating the inherent capacity
of vanilla pre-trained MLM prediction heads to recognize the hierarchical relation between
instances and classes. The experiments are structured around three core research questions:
RQ1: Do Masked Language Models (MLMs) capture semantic containment?
RQ2: How does contextual information influence MLMs in semantic containment prediction
tasks?
RQ3: Can MLMs generalize their semantic containment reasoning abilities to new data and
tasks?
Dataset. We base our study on the dataset introduced by Wu et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], a reputable dataset
from recent literature on probing. This dataset is based on a restriction of DBPedia, containing
783 classes and up to 20 instances per class, with 8753 unique instances. The restriction is
necessary because using the entire DBPedia is impractical due to resource limitations. Moreover,
multi-hop link extraction scales exponentially, on the order of entities × branching_factor^hops. To extract
positive and negative pairs, we follow the procedure described in Section 2. We construct the
set of negative pairs Π− as follows: for each pair in Π+, we sample two hard, one soft, and one
reverse negative. From the union of negative and positive samples, Π = Π+ ∪ Π−,
we extract training and evaluation splits with holdout. We find that the obtained evaluation
split contains a significant amount of soft and reverse negatives that could potentially inflate
performance. For this reason, we extract a more challenging evaluation dataset, which we refer
to as Eval (hard), by removing all the soft and reverse negatives from the original evaluation split.
We use Eval (hard) as the evaluation dataset in all our experiments.
      </p>
      <p>
        Probed PLMs. It is worth noticing that the proposed probing procedure is versatile and can be
readily applied to any bidirectional PLM with mask-filling capabilities. For this investigation,
we focus on two prominent encoder-only PLMs, BERT [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] and RoBERTa [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], that leverage a
masked language modeling objective during their pre-training stage. For all the adopted PLMs,
we employ the pre-trained checkpoints available at https://huggingface.co/.
      </p>
      <sec id="sec-3-1">
        <title>3.1. Semantic Containment Understanding in PLMs (RQ1)</title>
        <p>To evaluate the effectiveness of the probed PLMs in identifying semantic containment
relationships, we analyze the performance of various combinations of verbalization strategies,
templates, and PLMs (the interested reader may refer to Section 2 for further details).
We report the results in Table 2, presenting accuracy, precision, recall, and F1-score for each
combination. A decision threshold of 0.5 was used for all models.</p>
        <p>PLM Comparison. The analysis reveals several interesting trends. The first finding is that
the verbalization strategy matters. The Mapping-Free Automatic Verbalizer (MAV) consistently
outperforms those based on direct mapping (LIN and WS). This suggests token probabilities
likely contain complex relationships that direct mapping approaches might miss. The MAV
strategy seems to capture these more effectively. The RoBERTa-Large model generally achieves
better and more consistent results, particularly with direct-mapping verbalizations (LIN and WS).
For the MAV verbalizer, RoBERTa-Base outperforms the larger model with specific template
choices (h_4, h_2, s_1, and s_2). This suggests that prompt design plays a crucial role in
performance, even for larger models. There is no clear correlation between model size and the
          PLMs’ discriminative ability to distinguish containment relationships. MAV verbalizers show
similar performance across model sizes while direct-mapping variants tend to improve with
larger models. We hypothesize that smaller PLMs may exhibit more nuanced activation patterns
for containment, requiring a non-linear verbalizer like MAV to capture them. This result is in
line with previous works that reached conflicting conclusions on this matter: indeed, Petroni et
al. [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] showed overall better results for larger PLMs in ontological memorization capabilities,
while a more recent study [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] showed that model size does not have a significant impact on stored
ontological knowledge. The analysis suggests that vocabulary size might not be the primary
factor influencing performance in this task. Interestingly, BERT-Base, with a smaller vocabulary
compared to RoBERTa-Base (approximately 20,000 fewer tokens), outperforms RoBERTa-Base
for PRONTO-WS and PRONTO-MAV verbalizations across most prompts. This indicates that
other factors, potentially the specific tokenization strategies or the training data used for each
model, may play a more significant role in capturing semantic relationships.
        </p>
        <p>Figure 2 shows the Area Under the ROC Curve (AUC) scores for all verbalizer-prompt
combinations using the RoBERTa-Large PLM. These scores reflect the model’s ability to distinguish
between positive and negative containment pairs. While overall performance varies with prompt
choice for the same verbalizer, the results indicate some general trends. Although PRONTO-LIN
achieves the lowest accuracy and F1 scores for hard prompts, it exhibits good AUC scores,
particularly for the h_1 and h_2 prompts. This suggests that PRONTO-LIN might benefit from
optimizing the decision threshold used to classify positive and negative pairs. A potential
explanation lies in its underlying architecture. Indeed, PRONTO-LIN computes the label probability
as a linear sum of individual token probabilities. These token probabilities can be noisy and
potentially influenced by irrelevant factors, especially as vocabulary size increases. However,
adjusting the decision threshold could help mitigate the impact of this noise and potentially improve
PRONTO-LIN’s performance. The interested reader may find an additional comparison with
GPT-3.5 turbo in the extended version of this paper [<xref ref-type="bibr" rid="ref1">1</xref>].</p>
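        <p>One simple way to operationalize such threshold tuning, sketched below with scikit-learn, is to scan candidate cut-offs on held-out validation data and keep the F1-maximizing one; this is our illustration, not a step of the PRONTO pipeline.</p>
        <preformat>
import numpy as np
from sklearn.metrics import f1_score

def best_threshold(y_true, y_prob):
    # Scan candidate decision thresholds and keep the one maximizing F1.
    candidates = np.linspace(0.05, 0.95, 19)
    scores = [f1_score(y_true, y_prob >= t) for t in candidates]
    return float(candidates[int(np.argmax(scores))])
        </preformat>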
        <p>Additional analyses. In the extended version of this paper, the interested reader may find
the experiments regarding the sensitivity to the relative positioning of instances and classes to
determine if the models’ predictions were based on memorizing word co-occurrences rather
than understanding the underlying meaning of containment relationships. The evaluation set
consisted of positive and "reverse negative" examples, with the less specific concept appearing on
the right-hand side of the prompt, and the verbalizers performed better on the reverse negative
set. This suggests that the verbalizers could distinguish between the relative specificities of
concepts, with models sensitive to the order in which concepts are presented. Moreover, we
have performed an analysis of the PRONTO-WS verbalizer that revealed both interpretable
and less intuitive top tokens, suggesting the model captures nuanced patterns beyond human
comprehension. These findings support the idea that containment relationships are intricate
and that the model uses a wide range of cues within the vocabulary, highlighting the need to
explore the full vocabulary when developing effective verbalizers.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Enhancing Verbalizers with Knowledge Graph Descriptions: The Impact of Context (RQ2)</title>
        <p>Building upon the learned verbalizers, we investigate the feasibility of leveraging textual
descriptions from our knowledge graph (KG) to potentially improve their performance in
addressing RQ2. This exploration is rooted in the hypothesis that enhancing our prompts
with relevant context about the entities involved can reinforce the model’s understanding of
the underlying semantic relationships and lead to better discrimination between positive and
negative containment pairs.</p>
        <p>
          To address RQ2, we reformulate the original containment prediction task as a textual
entailment task [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. Here, we aim to infer whether a hypothesis H(u, v) holds true based on
a provided natural language premise P(u, v). The hypothesis is formulated using the same
prompt construction method detailed in Section 2. For the premise, we leverage the textual
descriptions associated with entities u and v from the KG. We construct the premise by
concatenating the textual descriptions of entities u and v. Specifically, we use the dbo:abstract
property for the instances (u) and the rdfs:comment property for the classes (v) from DBPedia's
Eval (hard) dataset, if available. Since PLMs have a maximum window size, we further process
the dataset by removing textual descriptions exceeding 50 tokens in length.
        </p>
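        <p>A minimal sketch of this premise-hypothesis construction follows; how premise and hypothesis are joined into a single input, and the helper names, are our assumptions.</p>
        <preformat>
def entailment_prompt(tokenizer, abstract_u, comment_v, u_label, v_label,
                      max_desc_tokens=50):
    # Drop pairs whose KG descriptions exceed the 50-token limit.
    for desc in (abstract_u, comment_v):
        if len(tokenizer.tokenize(desc)) > max_desc_tokens:
            return None
    premise = f"{abstract_u} {comment_v}"  # dbo:abstract then rdfs:comment
    hypothesis = f"{u_label} {tokenizer.mask_token} {v_label}"  # same T(u, v) as Eq. (1)
    return f"{premise} {hypothesis}"
        </preformat>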
        <p>Table 3 presents the final results on the Eval (hard) dataset after incorporating textual
descriptions from the knowledge graph (KG). The results reveal an interesting trend. Contrary to
expectations, adding context generally leads to a decrease in performance across most
verbalizer-prompt combinations. This suggests that the KG descriptions might be introducing noise rather
than providing beneficial information. This negative impact can be attributed to architectural
factors. The prediction heads of the PLMs used may be sensitive to variations in input data,
struggling to integrate the additional context effectively. Moreover, the verbalizers themselves
might be susceptible to changes in the input, hindering their ability to leverage the
supplementary information. Interestingly, direct-mapping verbalizers (like PRONTO-LIN) are less affected
by the inclusion of context, showing improvements for specific prompts (h_3, h_4, h_5). This
experiment highlights the need for further investigation into effective strategies for incorporating
contextual information from knowledge graphs.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Generalizability of Verbalizers (RQ3)</title>
        <p>Generalizability to Unseen Instances. To address RQ3, we examine how training data size
affects verbalizers' generalizability to unseen entities, simulating ontology completion. We
adopt an inductive setting, where the model predicts relationships for unseen entities. We
modify our training data by removing all training pairs containing randomly selected entities
from 80% of the Eval (hard) dataset. We retrain verbalizers on this split and report results in
Table 4. Reducing training size negatively impacts performance across metrics, though not
substantially. Interestingly, PRONTO-MAV outperforms its full-dataset counterpart in F1 score
and accuracy with the h_4 prompt, showing strong generalization.</p>
        <p>Zero-shot Entity Typing. We evaluate the generalizability of verbalizers through a zero-shot
entity typing task, assigning an entity type t to a mention m based on its context. Reformulating
this as a textual entailment task, we construct a cloze prompt for each candidate type t and select
the one with the highest probability. For experiments, we use the Few-NERD dataset [<xref ref-type="bibr" rid="ref17">17</xref>],
a manually annotated NER dataset with fine- and coarse-grained tags. Due to type overlaps
(e.g., "Living Thing" under "Person"), we focus on well-defined, disjoint categories: Person (7
types), Location (6 types), and Organization (9 types), excluding ambiguous types like MISC.
PRONTO-MAV, our top-performing model, is used without additional training, leveraging the
verbalizer from our containment prediction task.</p>
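        <p>Schematically, the zero-shot typing loop looks as follows; score_fn stands for the frozen PLM plus the trained PRONTO-MAV verbalizer, and the "is [MASK] of" cloze template is our illustrative choice among the hard templates of Table 1.</p>
        <preformat>
def zero_shot_type(mention, context, candidate_types, score_fn):
    # Build one cloze hypothesis per candidate type and keep the most probable.
    def score(t):
        prompt = f"{context} {mention} is [MASK] of {t}"
        return score_fn(prompt)  # containment probability from the verbalizer
    return max(candidate_types, key=score)
        </preformat>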
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>This study investigated the ability of pre-trained Masked Language Models (MLMs) to
understand hierarchical semantic relationships. The findings suggest that MLMs exhibit some grasp
of ontological containment, as evidenced by consistent patterns in the prediction heads. We
explored the generalizability of this approach, including learning specific verbalizers, inductive
containment prediction, and zero-shot entity typing. While non-linear verbalizers showed
remarkable performance, there is room for further exploration on developing more advanced
verbalization strategies to better integrate textual information with structured ontological
frameworks.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>The authors acknowledge partial support of the following projects: OVS: Fashion Retail Reloaded
and ePansa. We acknowledge the CINECA award under the ISCRA initiative for the availability
of high-performance computing resources and support.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A. D.</given-names>
            <surname>Bellis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. W.</given-names>
            <surname>Anelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. D.</given-names>
            <surname>Noia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. D.</given-names>
            <surname>Sciascio</surname>
          </string-name>
          ,
          <article-title>PRONTO: prompt-based detection of semantic containment patterns in mlms</article-title>
          , in: G. Demartini,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hose</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Acosta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Palmonari</surname>
          </string-name>
          , G. Cheng, H.
          <string-name>
            <surname>Skaf-Molli</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Ferranti</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Hernández</surname>
            ,
            <given-names>A</given-names>
          </string-name>
          . Hogan (Eds.),
          <source>The Semantic Web - ISWC 2024 - 23rd International Semantic Web Conference</source>
          , Baltimore,
          <string-name>
            <surname>MD</surname>
          </string-name>
          , USA, November
          <volume>11</volume>
          -
          <issue>15</issue>
          ,
          <year>2024</year>
          , Proceedings,
          <string-name>
            <surname>Part</surname>
            <given-names>II</given-names>
          </string-name>
          , volume
          <volume>15232</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2024</year>
          , pp.
          <fpage>227</fpage>
          -
          <lpage>246</lpage>
          . URL: https://doi.org/10.1007/978-3-
          <fpage>031</fpage>
          -77850-6_
          <fpage>13</fpage>
          . doi:
          <volume>10</volume>
          . 1007/978-3-
          <fpage>031</fpage>
          -77850-6\_
          <fpage>13</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Youssef</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Koraş</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schlötterer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Seifert</surname>
          </string-name>
          ,
          <article-title>Give me the facts! a survey on factual knowledge probing in pre-trained language models</article-title>
          , in: H.
          <string-name>
            <surname>Bouamor</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Pino</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          Bali (Eds.),
          <source>Findings of the Association for Computational Linguistics: EMNLP</source>
          <year>2023</year>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Singapore,
          <year>2023</year>
          , pp.
          <fpage>15588</fpage>
          -
          <lpage>15605</lpage>
          . URL: https://aclanthology. org/
          <year>2023</year>
          .findings-emnlp.
          <volume>1043</volume>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2023</year>
          .findings-emnlp.
          <volume>1043</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>W.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Tu</surname>
          </string-name>
          ,
          <article-title>Do PLMs know and understand ontological knowledge?</article-title>
          , in: A.
          <string-name>
            <surname>Rogers</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Boyd-Graber</surname>
          </string-name>
          , N. Okazaki (Eds.),
          <source>Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume</source>
          <volume>1</volume>
          :
          <string-name>
            <surname>Long</surname>
            <given-names>Papers)</given-names>
          </string-name>
          ,
          <source>Association for Computational Linguistics</source>
          , Toronto, Canada,
          <year>2023</year>
          , pp.
          <fpage>3080</fpage>
          -
          <lpage>3101</lpage>
          . URL: https://aclanthology.org/
          <year>2023</year>
          .
          <article-title>acl-long</article-title>
          .
          <volume>173</volume>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2023</year>
          .
          <article-title>acl-long</article-title>
          .
          <volume>173</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>F.</given-names>
            <surname>Petroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rocktäschel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bakhtin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <article-title>Language models as knowledge bases?</article-title>
          , in: K. Inui,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <string-name>
            <surname>X.</surname>
          </string-name>
          Wan (Eds.),
          <source>Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Hong Kong, China,
          <year>2019</year>
          , pp.
          <fpage>2463</fpage>
          -
          <lpage>2473</lpage>
          . URL: https://aclanthology.org/D19-1250. doi:
          <volume>10</volume>
          .18653/v1/
          <fpage>D19</fpage>
          -1250.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>V. W.</given-names>
            <surname>Anelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Biancofiore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. D.</given-names>
            <surname>Bellis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. D.</given-names>
            <surname>Noia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. D.</given-names>
            <surname>Sciascio</surname>
          </string-name>
          ,
          <article-title>Interpretability of BERT latent space through knowledge graphs, in: M. A</article-title>
          .
          <string-name>
            <surname>Hasan</surname>
          </string-name>
          , L. Xiong (Eds.),
          <source>Proceedings of the 31st ACM International Conference on Information &amp; Knowledge Management</source>
          , Atlanta,
          <string-name>
            <surname>GA</surname>
          </string-name>
          , USA, October
          <volume>17</volume>
          -
          <issue>21</issue>
          ,
          <year>2022</year>
          , ACM,
          <year>2022</year>
          , pp.
          <fpage>3806</fpage>
          -
          <lpage>3810</lpage>
          . URL: https://doi.org/10.1145/3511808.3557617. doi:
          <volume>10</volume>
          .1145/3511808.3557617.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>V. W.</given-names>
            <surname>Anelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. D.</given-names>
            <surname>Noia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lops</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. D.</given-names>
            <surname>Sciascio</surname>
          </string-name>
          ,
          <article-title>Feature factorization for top-n recommendation: From item rating to features relevance</article-title>
          , in: Y. Zheng,
          <string-name>
            <given-names>W.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Sahebi</surname>
          </string-name>
          , I. Fernández (Eds.),
          <source>Proceedings of the 1st Workshop on Intelligent Recommender Systems by Knowledge Transfer &amp; Learning co-located with ACM Conference on Recommender Systems (RecSys</source>
          <year>2017</year>
          ), Como, Italy,
          <year>August 27</year>
          ,
          <year>2017</year>
          , volume
          <volume>1887</volume>
          <source>of CEUR Workshop Proceedings, CEUR-WS.org</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>16</fpage>
          -
          <lpage>21</lpage>
          . URL: https://ceur-ws.
          <source>org/</source>
          Vol-1887/paper3.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>V. W.</given-names>
            <surname>Anelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. D.</given-names>
            <surname>Noia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. D.</given-names>
            <surname>Sciascio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ragone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Trotta</surname>
          </string-name>
          ,
          <article-title>Semantic interpretation of top-n recommendations</article-title>
          ,
          <source>IEEE Trans. Knowl. Data Eng</source>
          .
          <volume>34</volume>
          (
          <year>2022</year>
          )
          <fpage>2416</fpage>
          -
          <lpage>2428</lpage>
          . URL: https://doi.org/10.1109/TKDE.
          <year>2020</year>
          .
          <volume>3010215</volume>
          . doi:
          <volume>10</volume>
          .1109/TKDE.
          <year>2020</year>
          .
          <volume>3010215</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>V. W.</given-names>
            <surname>Anelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Bellini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. D.</given-names>
            <surname>Noia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. L.</given-names>
            <surname>Bruna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Tomeo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. D.</given-names>
            <surname>Sciascio</surname>
          </string-name>
          ,
          <article-title>An analysis on time- and session-aware diversification in recommender systems</article-title>
          , in: M.
          <string-name>
            <surname>Bieliková</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Herder</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Cena</surname>
          </string-name>
          , M. C. Desmarais (Eds.),
          <source>Proceedings of the 25th Conference on User Modeling, Adaptation and Personalization</source>
          ,
          <string-name>
            <surname>UMAP</surname>
          </string-name>
          <year>2017</year>
          , Bratislava, Slovakia,
          <source>July 09 - 12</source>
          ,
          <year>2017</year>
          , ACM,
          <year>2017</year>
          , pp.
          <fpage>270</fpage>
          -
          <lpage>274</lpage>
          . URL: https://doi.org/10.1145/3079628.3079703. doi:
          <volume>10</volume>
          . 1145/3079628.3079703.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>N.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          , X. Han,
          <string-name>
            <surname>G</surname>
          </string-name>
          . Xu,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          , H.-G. Kim,
          <article-title>Prompt-learning for fine-grained entity typing</article-title>
          , in: Y.
          <string-name>
            <surname>Goldberg</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Kozareva</surname>
          </string-name>
          , Y. Zhang (Eds.),
          <source>Findings of the Association for Computational Linguistics: EMNLP</source>
          <year>2022</year>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Abu Dhabi, United Arab Emirates,
          <year>2022</year>
          , pp.
          <fpage>6888</fpage>
          -
          <lpage>6901</lpage>
          . URL: https://aclanthology.org/
          <year>2022</year>
          .findings-emnlp.
          <volume>512</volume>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2022</year>
          . findings-emnlp.
          <volume>512</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>K.</given-names>
            <surname>Bałazy</surname>
          </string-name>
          , Łukasz Struski,
          <string-name>
            <given-names>M.</given-names>
            <surname>Śmieja</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. Tabor</surname>
          </string-name>
          , r-softmax:
          <article-title>Generalized softmax with controllable sparsity rate</article-title>
          ,
          <year>2023</year>
          . arXiv:
          <volume>2304</volume>
          .
          <fpage>05243</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <article-title>Boosting prompt-based self-training with mapping-free automatic verbalizer for multi-class classification</article-title>
          , in: H.
          <string-name>
            <surname>Bouamor</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Pino</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          Bali (Eds.),
          <source>Findings of the Association for Computational Linguistics: EMNLP</source>
          <year>2023</year>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Singapore,
          <year>2023</year>
          , pp.
          <fpage>13786</fpage>
          -
          <lpage>13800</lpage>
          . URL: https://aclanthology.org/
          <year>2023</year>
          . ifndings-emnlp.
          <volume>921</volume>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2023</year>
          .findings-emnlp.
          <volume>921</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Neelakantan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shyam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Herbert-Voss</surname>
          </string-name>
          , G. Krueger,
          <string-name>
            <given-names>T.</given-names>
            <surname>Henighan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Child</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ziegler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Winter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hesse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          , E. Sigler,
          <string-name>
            <given-names>M.</given-names>
            <surname>Litwin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chess</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Berner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>McCandlish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
          <article-title>Language models are few-shot learners</article-title>
          , in: H.
          <string-name>
            <surname>Larochelle</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Ranzato</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Hadsell</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Balcan</surname>
          </string-name>
          , H. Lin (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>33</volume>
          ,
          <string-name>
            <surname>Curran</surname>
            <given-names>Associates</given-names>
          </string-name>
          , Inc.,
          <year>2020</year>
          , pp.
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          . URL: https://proceedings.neurips.cc/paper_files/ paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>G.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Eisner</surname>
          </string-name>
          ,
          <article-title>Learning how to ask: Querying LMs with mixtures of soft prompts</article-title>
          , in: K.
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Rumshisky</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Zettlemoyer</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Hakkani-Tur</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <string-name>
            <surname>Beltagy</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Bethard</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Cotterell</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Chakraborty</surname>
          </string-name>
          , Y. Zhou (Eds.),
          <source>Proceedings of the</source>
          <year>2021</year>
          <article-title>Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics</article-title>
          , Online,
          <year>2021</year>
          , pp.
          <fpage>5203</fpage>
          -
          <lpage>5212</lpage>
          . URL: https://aclanthology.org/
          <year>2021</year>
          .naacl-main.
          <volume>410</volume>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2021</year>
          . naacl-main.
          <volume>410</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , BERT:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          , in: J.
          <string-name>
            <surname>Burstein</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Doran</surname>
          </string-name>
          , T. Solorio (Eds.),
          <source>Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers),
          <source>Association for Computational Linguistics</source>
          , Minneapolis, Minnesota,
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          . URL: https://aclanthology.org/N19-1423. doi:
          <volume>10</volume>
          .18653/v1/
          <fpage>N19</fpage>
          -1423.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>Roberta: A robustly optimized bert pretraining approach</article-title>
          ,
          <year>2019</year>
          . arXiv:
          <year>1907</year>
          .11692.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>García-Silva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Berrío</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Gómez-Pérez</surname>
          </string-name>
          ,
          <article-title>Textual entailment for efective triple validation in object prediction</article-title>
          , in: T. R.
          <string-name>
            <surname>Payne</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Presutti</surname>
            , G. Qi,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Poveda-Villalón</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>