<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>SEBD</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
<article-title>Semantic Containment in MLMs: A Prompt-Based Approach</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Discussion Paper</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vito Walter Anelli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro De Bellis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tommaso Di Noia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eugenio Di Sciascio</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Politecnico di Bari</institution>
          ,
          <addr-line>Via Orabona 4, Bari, 70125</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>33</volume>
      <fpage>16</fpage>
      <lpage>19</lpage>
      <abstract>
        <p>This research explores whether Masked Language Models (MLMs) can understand semantic containment relations, such as sub-class and instance-of relationships, which are crucial for Semantic Web applications. The study introduces PRONTO, a novel approach that leverages MLM predictions to discover semantic containment relations in unstructured text by translating the model's internal predictions into classification labels. The effectiveness, reliability, and interpretability of PRONTO are assessed through a comprehensive probing procedure. The findings demonstrate that MLMs can capture semantic containment relationships, which has significant implications for ontology construction and aligning text data with ontologies. For the sake of reproducibility, we make our code, datasets, and evaluation tools available at https://github.com/sisinflab/PRONTO.</p>
      </abstract>
      <kwd-group>
        <kwd>Masked Language Models</kwd>
        <kwd>Prompt Learning</kwd>
        <kwd>Ontologies</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Pre-trained Language Models (PLMs) have become essential in Natural Language Processing
(NLP) due to their ability to capture complex language patterns through extensive training
on large text datasets. Studies show PLMs effectively capture factual [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and ontological [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
knowledge from this pre-training [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. For example, when given a prompt like "Paris is a
[MASK]," a PLM is more likely to predict "capital." This suggests PLMs possess knowledge
modeling capabilities beyond simple word co-occurrence [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. However, this inherent knowledge
is rarely used in applications; instead, other types of structured knowledge are employed [
        <xref ref-type="bibr" rid="ref6 ref7 ref8">6, 7, 8</xref>
        ],
as these models are often fine-tuned to achieve competitive levels of performance in downstream
tasks. This research aims to understand whether bidirectional PLMs inherently recognize ontological
containment, which includes subclass and instance-of relationships. Ontological containment
reflects a hierarchical "is a" relationship between entities (see Figure 1).
      </p>
      <p>[Figure 1: Overview of PRONTO. A DBpedia fragment links the instance Brooklyn_Nets (rdf:type) to team classes such as dbo:BasketballTeam and dbo:SoccerTeam, connected through rdfs:subClassOf edges to dbo:SportsTeam, dbo:Organisation, and dbo:Agent. The rdfs:label values of instance and class are verbalized into the cloze prompt "Brooklyn Nets is [MASK] of sports team", which is fed to a frozen vanilla PLM encoder; the output of the MLM prediction head is mapped to a containment label by the containment verbalizer.]</p>
      <p>
        The study explores whether PLMs can
identify semantic containment when two entities are present in a prompt (e.g., "Paris [MASK]
city"), to determine if PLMs are zero-shot semantic containment learners. We propose PRONTO,
a novel procedure aimed at the extraction of semantic containment relations from bidirectional
PLMs based on the examination of their masked language modeling prediction head. Our key
contributions can be summarized as follows:
• We propose a general procedure to probe semantic containment knowledge from MLMs by
means of automatically learned verbalizers, i.e., mappings between an MLM prediction head
and a label.
• Through extensive analysis, we reveal how vanilla (i.e., not fine-tuned) MLMs exhibit an
inner awareness of semantic containment.</p>
      <p>To the best of our knowledge, this is the first attempt to use the knowledge stored in PLMs
to detect ontological containment through relation prediction with automatically extracted
verbalizers. Finally, we present practical applications in zero-shot entity typing.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <p>In this section, we formally introduce our containment prediction task, which we schematize
in Figure 1. Let C = {c_1, c_2, ..., c_n} represent the set of classes in a reference ontology O.
Each class c_i is a node within the ontology graph. Let E_sub be the set of edges representing
the subclass relations among these classes, where each edge (c_i, c_j) ∈ E_sub denotes that class
c_i is a subclass of class c_j. Let I = {i_1, i_2, ..., i_m} denote the set of instances of classes in C,
and E_type be the set of edges denoting the instance-of relation, where each edge (i_k, c_i) ∈ E_type
indicates that instance i_k is of type c_i, linking instances to their respective classes. We define the
semantic containment graph G as the union of the two sets of edges E_sub and E_type, combined
with their respective node sets C and I. Formally, G = ⟨C ∪ I, E_sub ∪ E_type⟩. For any two nodes
u, v ∈ G, we aim to determine whether there exists a path from u to v that signifies an
"is-a" relationship within the ontology O. This relationship is characterized by a sequence
of edges, each representing either a direct subclass relation between classes or an instance
belonging to a class, thereby forming a chain of semantic containment. Formally, we aim to
learn a model f_θ : (u, v) ↦ ŷ, with ŷ being 1 if there exists a path from u to v in G and 0
otherwise. The function f is parameterized by the parameters θ derived from a vanilla PLM
(e.g., BERT), and ŷ represents the predicted probability that a containment relationship exists
between the concepts u and v. Let us define a function T that constructs a prompt for a
pre-trained MLM, given two nodes u and v. The function obtains the verbalized forms of u and v
through V(·) and inserts a mask token [MASK] between them to form the prompt. Formally, the
prompt construction can be represented as

T(u, v) = V(u) ⊕ "[MASK] " ⊕ V(v),    (1)

where V(u) and V(v) are two natural language representations for the nodes u and v,
respectively. The symbol ⊕ stands for string concatenation, and V(u) is the rdfs:label
associated to u.</p>
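      <p>To make the prompting step concrete, the following is a minimal Python sketch of Equation (1) and the mask-filling pass, using the Hugging Face transformers library; the "is [MASK] of" template (as in Figure 1), the bert-base-uncased checkpoint, and the helper name are our illustrative assumptions, not prescriptions from the paper.</p>
      <preformat>
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()  # vanilla, frozen PLM: no fine-tuning

def mask_logits(u_label: str, v_label: str) -> torch.Tensor:
    # T(u, v) = V(u) ⊕ "[MASK] " ⊕ V(v) (Equation (1)); V(·) is the rdfs:label.
    prompt = f"{u_label} is {tokenizer.mask_token} of {v_label}"
    inputs = tokenizer(prompt, return_tensors="pt")
    pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero().item()
    with torch.no_grad():
        out = model(**inputs).logits
    return out[0, pos]  # prediction-head logits over the vocabulary at [MASK]

logits = mask_logits("Brooklyn Nets", "sports team")
      </preformat>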
      <p>
        Automatic Extraction of a Containment Verbalizer. Given the prompt T(u, v) as input
to a bidirectional PLM capable of mask-filling, the output of its MLM prediction head is
the predicted probability distribution over the tokens that could replace the [MASK]
(Fig. 1). We propose investigating whether these predicted probabilities can help determine the
existence of a containment relationship between u and v. Given a PLM capable of mask-filling,
trained on a vocabulary of size N, and a prompt function T(·, ·), we aim to create a mapping
between the prediction head output and a discrete label y. Prior work formulates the concept of a
verbalizer [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] as a discrete mapping between a subset of tokens V_y = {t_1, ..., t_k} and a label
y. Formally:

P(y | x) = (1 / k) Σ_{j=1}^{k} P([MASK] = t_j | x),    (2)

with k being the number of tokens in V_y and x being the prompt. The construction of V_y
is often done manually: for instance, if y = "city", a reasonable although simplistic verbalizer
construction could be V_y = {city, town}. In this work, we formulate the construction of a
verbalizer as a search problem over the whole vocabulary. This enables our verbalizer to
fully exploit the expressiveness of such a large vocabulary and possibly capture associations
between labels and tokens that may not be easily identifiable even for domain experts. We
want to design the verbalizer as a direct mapping function between the PLM prediction head and
a label. An implementation of such a verbalizer is the following:</p>
      <p>
        P(y | x) = Σ_{j=1}^{N} w_j P([MASK] = t_j | x) = Σ_{j=1}^{N} σ(a_j) P([MASK] = t_j | x),    (3)
      </p>
      <p>
        where w_j ∈ [0, 1] is a weighting factor that modulates the contribution of each token t_j
in the vocabulary to the probability of predicting y given x. The w_j weights can be learned
through an optimization process aiming to minimize a specified loss function. In fact, we learn
the w_j parameters jointly in our optimization procedure, constraining them to the range [0, 1] by
means of a sigmoid (w_j = σ(a_j), with a_j a free parameter). Ideally, we want the verbalizer to
satisfy two useful properties:
P1 Noise Resilience: Since we are dealing with large vocabularies, the significant tokens'
marginal probabilities in a PLM prediction head tend to be diluted by the presence of
many less relevant tokens. This dilution is linked to the softmax function's property of
distributing probabilities across all logits, diminishing the impact of pivotal tokens as the
vocabulary size expands.
      </p>
      <p>
        P2 Sparsity: We aim to enforce a sparsity constraint on the w_j weights to promote
interpretability. This constraint facilitates the identification of the most influential tokens while minimizing
the influence of less relevant ones. In fact, a PLM vocabulary is highly populated even for
smaller models (30,000+ tokens). Therefore, sparsity can aid interpretability for humans,
who can only realistically focus on a small set of informative tokens simultaneously.
To satisfy P1, the MLM prediction head logits pass through a weighted softmax [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]:

softmax(z, r) = ( r_1 exp(z_1) / Σ_{j=1}^{N} r_j exp(z_j), ..., r_N exp(z_N) / Σ_{j=1}^{N} r_j exp(z_j) ),    (4)

where the r_j are parameters learned jointly in the optimization process and constrained in the [0, 1]
range. To satisfy P2, we impose an L1 regularization term over the learned weights w_j in our loss
function. L1 regularization is known to promote sparsity better than alternative regularization
strategies, as well as to improve generalization. To investigate the potential benefits of
nonlinearity within our verbalization strategies, we draw inspiration from MAV (Mapping-Free
Automatic Verbalizer) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], in which the authors formulate a mapping-free verbalizer as a
nonlinear projection of an MLM prediction head into a latent vocabulary space. In our own adaptation,
we substitute the inner Tanh activation function with LayerNorm for numerical stability. This is
motivated by the observation that MLM logits can vary in unnormalized ranges, and the Tanh
function suppresses information associated with high activations:
      </p>
      <p>P(y | x) = σ(W_2 · tanh(W_1 · LN(logits_MLM))).    (5)</p>
      <p>In summary, we experiment with different verbalization strategies (two of which are sketched in code after this list):
• PRONTO-VF: a verbalizer-free baseline approach, where the hidden state of the [MASK]
token is fed into two fully connected layers with a final sigmoid activation, as in Equation (5);
• PRONTO-LIN: a naive linear direct-mapping approach, based on Equation (3);
• PRONTO-WS: a direct-mapping approach where logits are re-weighted before the Softmax
as in Equation (4), and the final label probability is obtained as in Equation (3);
• PRONTO-MAV: a mapping-free approach where logits are fed into two fully connected
layers as in Equation (5).</p>
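      <p>As an illustration, the following is a minimal PyTorch sketch of the PRONTO-WS and PRONTO-MAV verbalizers as we reconstruct them from Equations (3)-(5); the class names, hidden size, and zero initialization are our assumptions, not the released implementation (available at the repository linked in the abstract).</p>
      <preformat>
import torch
import torch.nn as nn

class ProntoWS(nn.Module):
    # Direct-mapping verbalizer: weighted softmax (Eq. 4) + weighted sum (Eq. 3).
    def __init__(self, vocab_size: int):
        super().__init__()
        self.a = nn.Parameter(torch.zeros(vocab_size))  # w_j = sigmoid(a_j) in [0, 1]
        self.b = nn.Parameter(torch.zeros(vocab_size))  # r_j = sigmoid(b_j) in [0, 1]

    def forward(self, mask_logits: torch.Tensor) -> torch.Tensor:
        r = torch.sigmoid(self.b)
        z = mask_logits - mask_logits.max(dim=-1, keepdim=True).values  # stability shift
        weighted = r * torch.exp(z)
        probs = weighted / weighted.sum(dim=-1, keepdim=True)           # Eq. (4)
        return (torch.sigmoid(self.a) * probs).sum(dim=-1)              # Eq. (3)

class ProntoMAV(nn.Module):
    # Mapping-free verbalizer: two FC layers over the logits with LayerNorm (Eq. 5).
    def __init__(self, vocab_size: int, hidden: int = 256):
        super().__init__()
        self.norm = nn.LayerNorm(vocab_size)  # replaces MAV's inner Tanh
        self.w1 = nn.Linear(vocab_size, hidden)
        self.w2 = nn.Linear(hidden, 1)

    def forward(self, mask_logits: torch.Tensor) -> torch.Tensor:
        h = torch.tanh(self.w1(self.norm(mask_logits)))
        return torch.sigmoid(self.w2(h)).squeeze(-1)
      </preformat>
      <p>Both modules take the [MASK] prediction-head logits of the frozen PLM as input and output the probability of a containment relation; for PRONTO-WS, an L1 penalty on sigmoid(a) would be added to the loss to enforce P2.</p>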
      <p>The direct-mapping verbalizers (PRONTO-WS, PRONTO-LIN) are inherently interpretable,
since each w_j measures the contribution of the j-th token to the final label prediction. On
the other hand, PRONTO-MAV and PRONTO-VF can give an indication of more subtle patterns
in the prediction heads that can only be captured by means of non-linearities.
Data Preparation. Given a semantic containment graph G = ⟨C ∪ I, E_sub ∪ E_type⟩, we denote
by Π+ the set of all the pairs of nodes (u, v) that can be found along a path of G. In other
words, we compute the transitive closure of each node in G. Since G does not contain negative
information, this leaves an important decision: how to extract useful negative pairs. This decision
is crucial since it impacts both the efficacy and generalizability of our learned verbalizers and
the reliability of our evaluation. Intuitively, we want our model to be capable of distinguishing
between semantically similar although disjoint classes (e.g., "city"/"region"). However, we
also want it to be able to distinguish among completely unrelated classes (e.g., "city"/"person").
Furthermore, we want it to correctly model a semantic containment relationship that is
noncommutative, instead of just discriminating based on word similarity. We devise three strategies
to build the set Π− of negative samples (a code sketch follows the list):
• Reverse negatives: given a positive pair (u, v), we obtain a negative pair by inverting subject
and object: (v, u);
• Soft negatives: given a positive pair (u, v), we replace v with a random class sampled based
on the class distribution in the data;
• Hard negatives: given a positive pair (u, v) ∈ Π+, we build the two sets C+(u, G) =
{w | (u, w) ∈ Π+} and Ĉ+(u, G) = {ŵ | (ŵ, w) ∈ Π+ and ŵ ∉ C+(u, G) and w ∈
C+(u, G)}. While C+(u, G) represents the set of nodes along a path starting from u in the
original graph G, namely all the nodes in a semantic containment relation with u, the set
Ĉ+(u, G) contains the nodes on the paths arriving in C+(u, G). These nodes are not in a
semantic containment relation with u but are semantically "close" to it. Given a node u, the
hard negatives are then built as (u, ŵ) with ŵ ∈ Ĉ+(u, G).</p>
      <p>
        Prompt Construction. Prior work has demonstrated the sensitivity of PLM outputs to prompt
selection [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. In order to provide a more extensive analysis, we choose to experiment over
different prompt templates. We report our prompt choices in Table 1. We design several
hard templates to capture different linguistic manifestations of the containment relationship.
Regardless of the prompt, both subject and object follow the same verbalization strategy, i.e., the
rdfs:label literal value. In addition to manually designed prompts, we explore the integration
of soft tokens [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], i.e., word vectors jointly fine-tuned during the optimization process.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
      <p>
        This section outlines the experimental setup to probe the ability of PLMs to understand
ontological containment relationships. We specifically focus on evaluating the inherent capacity
of vanilla pre-trained MLM prediction heads to recognize the hierarchical relation between
instances and classes. The experiments are structured around three core research questions:
RQ1: Do Masked Language Models (MLMs) capture semantic containment?
RQ2: How does contextual information influence MLMs in semantic containment prediction
tasks?
RQ3: Can MLMs generalize their semantic containment reasoning abilities to new data and
tasks?
Dataset. We base our study on the dataset introduced by Wu et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], a reputable dataset
from recent literature on probing. This dataset is based on a restriction of DBPedia, containing
783 classes and up to 20 instances per class, with 8753 unique instances. The restriction is
necessary because using the entire DBPedia is impractical due to resource limitations. Moreover,
multi-hop link extraction scales exponentially, on the order of entities × branching_factor^hops. To extract
positive and negative pairs, we follow the procedure described in Section 2. We construct the
set of negative pairs Π− as follows: for each pair in Π+, we sample two hard, one soft, and one
reverse negative. From the union of negative and positive samples, Π = Π+ ∪ Π−,
we extract training and evaluation splits with holdout. We find that the obtained evaluation
split contains a significant amount of soft and reverse negatives that could potentially inflate
performance. For this reason, we extract a more challenging evaluation dataset, which we refer
to as Eval (hard), by removing all the soft and reverse negatives from the original evaluation split.
We use Eval (hard) as the evaluation dataset in all our experiments.
      </p>
      <p>
        Probed PLMs. It is worth noticing that the proposed probing procedure is versatile and can be
readily applied to any bidirectional PLM with mask-filling capabilities. For this investigation,
we focus on two prominent encoder-only PLMs, BERT [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] and RoBERTa [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], that leverage a
masked language modeling objective during their pre-training stage. For all the adopted PLMs,
we employ the pre-trained checkpoints available at https://huggingface.co/.
      </p>
      <sec id="sec-3-1">
        <title>3.1. Semantic Containment Understanding in PLMs (RQ1)</title>
        <p>To evaluate the effectiveness of the probed PLMs in identifying semantic containment
relationships, we analyze the performance of various combinations of verbalization strategies,
templates, and PLMs (the interested reader may refer to Section 2 for further details).
We report the results in Table 2, presenting accuracy, precision, recall, and F1-score for each
combination. A decision threshold of 0.5 was used for all models.</p>
        <p>PLM Comparison. The analysis reveals several interesting trends. The first finding is that
the verbalization strategy matters. The Mapping-Free Automatic Verbalizer (MAV) consistently
outperforms those based on direct mapping (LIN and WS). This suggests token probabilities
likely contain complex relationships that direct mapping approaches might miss. The MAV
strategy seems to capture these more effectively. The RoBERTa-Large model generally achieves
better and more consistent results, particularly with direct-mapping verbalizations (LIN and WS).
For the MAV verbalizer, RoBERTa-Base outperforms the larger model with specific template
choices (h_4, h_2, s_1, and s_2). This suggests that prompt design plays a crucial role in
performance, even for larger models. There is no clear correlation between model size and the
          PLMs’ discriminative ability to distinguish containment relationships. MAV verbalizers show
similar performance across model sizes while direct-mapping variants tend to improve with
larger models. We hypothesize that smaller PLMs may exhibit more nuanced activation patterns
for containment, requiring a non-linear verbalizer like MAV to capture them. This result is in
line with previous works that reached conflicting conclusions on this matter: indeed, Petroni et
al. [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] showed overall better results for larger PLMs in ontological memorization capabilities,
while a more recent study [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] showed that model size does not have a significant impact on stored
ontological knowledge. The analysis suggests that vocabulary size might not be the primary
factor influencing performance in this task. Interestingly, BERT-Base, with a smaller vocabulary
compared to RoBERTa-Base (approximately 20,000 fewer tokens), outperforms RoBERTa-Base
for PRONTO-WS and PRONTO-MAV verbalizations across most prompts. This indicates that
other factors, potentially the specific tokenization strategies or the training data used for each
model, may play a more significant role in capturing semantic relationships.
        </p>
        <p>Figure 2 shows the Area Under the ROC Curve (AUC) scores for all verbalizer-prompt
combinations using the RoBERTa-Large PLM. These scores reflect the model’s ability to distinguish
between positive and negative containment pairs. While overall performance varies with prompt
choice for the same verbalizer, the results indicate some general trends. Although PRONTO-LIN
achieves the lowest accuracy and F1 scores for hard prompts, it exhibits good AUC scores,
particularly for the h_1 and h_2 prompts. This suggests that PRONTO-LIN might benefit from
optimizing the decision threshold used to classify positive and negative pairs. A potential
explanation lies in its underlying architecture. Indeed, PRONTO-LIN computes the label probability
as a linear sum of individual token probabilities. These token probabilities can be noisy and
potentially influenced by irrelevant factors, especially as vocabulary size increases. However,
adjusting the decision threshold could help mitigate the impact of this noise and potentially improve
PRONTO-LIN’s performance. The interested reader may find an additional comparison with
GPT-3.5 turbo in the extended version of this paper [<xref ref-type="bibr" rid="ref1">1</xref>].</p>
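        <p>One simple way to operationalize such threshold tuning, sketched below with scikit-learn, is to scan candidate cut-offs on held-out validation data and keep the F1-maximizing one; this is our illustration, not a step of the PRONTO pipeline.</p>
        <preformat>
import numpy as np
from sklearn.metrics import f1_score

def best_threshold(y_true, y_prob):
    # Scan candidate decision thresholds and keep the one maximizing F1.
    candidates = np.linspace(0.05, 0.95, 19)
    scores = [f1_score(y_true, y_prob >= t) for t in candidates]
    return float(candidates[int(np.argmax(scores))])
        </preformat>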
        <p>Additional analyses. In the extended version of this paper, the interested reader may find
the experiments regarding the sensitivity to the relative positioning of instances and classes to
determine if the models’ predictions were based on memorizing word co-occurrences rather
than understanding the underlying meaning of containment relationships. The evaluation set
consisted of positive and "reverse negative" examples, with the less specific concept appearing on
the right-hand side of the prompt, and the verbalizers performed better on the reverse negative
set. This suggests that the verbalizers could distinguish between the relative specificities of
concepts, with models sensitive to the order in which concepts are presented. Moreover, we
have performed an analysis of the PRONTO-WS verbalizer that revealed both interpretable
and less intuitive top tokens, suggesting the model captures nuanced patterns beyond human
comprehension. These findings support the idea that containment relationships are intricate
and that the model uses a wide range of cues within the vocabulary, highlighting the need to
explore the full vocabulary when developing effective verbalizers.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Enhancing Verbalizers with Knowledge Graph Descriptions: The Impact of Context (RQ2)</title>
        <p>Building upon the learned verbalizers, we investigate the feasibility of leveraging textual
descriptions from our knowledge graph (KG) to potentially improve their performance in
addressing RQ2. This exploration is rooted in the hypothesis that enhancing our prompts
with relevant context about the entities involved can reinforce the model’s understanding of
the underlying semantic relationships and lead to better discrimination between positive and
negative containment pairs.</p>
        <p>
          To address RQ2, we reformulate the original containment prediction task as a textual
entailment task [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. Here, we aim to infer whether a hypothesis H(u, v) holds true based on
a provided natural language premise P(u, v). The hypothesis is formulated using the same
prompt construction method detailed in Section 2. For the premise, we leverage the textual
descriptions associated with entities u and v from the KG. We construct the premise by
concatenating the textual descriptions of entities u and v. Specifically, we use the dbo:abstract
property for the instances (u) and the rdfs:comment property for the classes (v) from DBPedia's
Eval (hard) dataset, if available. Since PLMs have a maximum window size, we further process
the dataset by removing textual descriptions exceeding 50 tokens in length.
        </p>
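        <p>A minimal sketch of this premise-hypothesis construction follows; how premise and hypothesis are joined into a single input, and the helper names, are our assumptions.</p>
        <preformat>
def entailment_prompt(tokenizer, abstract_u, comment_v, u_label, v_label,
                      max_desc_tokens=50):
    # Drop pairs whose KG descriptions exceed the 50-token limit.
    for desc in (abstract_u, comment_v):
        if len(tokenizer.tokenize(desc)) > max_desc_tokens:
            return None
    premise = f"{abstract_u} {comment_v}"  # dbo:abstract then rdfs:comment
    hypothesis = f"{u_label} {tokenizer.mask_token} {v_label}"  # same T(u, v) as Eq. (1)
    return f"{premise} {hypothesis}"
        </preformat>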
        <p>Table 3 presents the final results on the Eval (hard) dataset after incorporating textual
descriptions from the knowledge graph (KG). The results reveal an interesting trend. Contrary to
expectations, adding context generally leads to a decrease in performance across most
verbalizer-prompt combinations. This suggests that the KG descriptions might be introducing noise rather
than providing beneficial information. This negative impact can be attributed to architectural
factors. The prediction heads of the PLMs used may be sensitive to variations in input data,
struggling to integrate the additional context effectively. Moreover, the verbalizers themselves
might be susceptible to changes in the input, hindering their ability to leverage the
supplementary information. Interestingly, direct-mapping verbalizers (like PRONTO-LIN) are less affected
by the inclusion of context, showing improvements for specific prompts (h_3, h_4, h_5). This
experiment highlights the need for further investigation into effective strategies for incorporating
contextual information from knowledge graphs.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Generalizability of Verbalizers (RQ3)</title>
        <p>Generalizability to Unseen Instances. To address RQ3, we examine how training data size
affects verbalizers' generalizability to unseen entities, simulating ontology completion. We
adopt an inductive setting, where the model predicts relationships for unseen entities. We
modify our training data by removing all training pairs containing randomly selected entities
from 80% of the Eval (hard) dataset. We retrain verbalizers on this split and report results in
Table 4. Reducing training size negatively impacts performance across metrics, though not
substantially. Interestingly, PRONTO-MAV outperforms its full-dataset counterpart in F1 score
and accuracy with the h_4 prompt, showing strong generalization.</p>
        <p>Zero-shot Entity Typing. We evaluate the generalizability of verbalizers through a zero-shot
entity typing task, assigning an entity type t to a mention m based on its context. Reformulating
this as a textual entailment task, we construct a cloze prompt for each candidate type t and select
the one with the highest probability. For experiments, we use the Few-NERD dataset [<xref ref-type="bibr" rid="ref17">17</xref>],
a manually annotated NER dataset with fine- and coarse-grained tags. Due to type overlaps
(e.g., "Living Thing" under "Person"), we focus on well-defined, disjoint categories: Person (7
types), Location (6 types), and Organization (9 types), excluding ambiguous types like MISC.
PRONTO-MAV, our top-performing model, is used without additional training, leveraging the
verbalizer from our containment prediction task.</p>
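        <p>Schematically, the zero-shot typing loop looks as follows; score_fn stands for the frozen PLM plus the trained PRONTO-MAV verbalizer, and the "is [MASK] of" cloze template is our illustrative choice among the hard templates of Table 1.</p>
        <preformat>
def zero_shot_type(mention, context, candidate_types, score_fn):
    # Build one cloze hypothesis per candidate type and keep the most probable.
    def score(t):
        prompt = f"{context} {mention} is [MASK] of {t}"
        return score_fn(prompt)  # containment probability from the verbalizer
    return max(candidate_types, key=score)
        </preformat>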
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>This study investigated the ability of pre-trained Masked Language Models (MLMs) to
understand hierarchical semantic relationships. The findings suggest that MLMs exhibit some grasp
of ontological containment, as evidenced by consistent patterns in the prediction heads. We
explored the generalizability of this approach, including learning specific verbalizers, inductive
containment prediction, and zero-shot entity typing. While non-linear verbalizers showed
remarkable performance, there is room for further exploration on developing more advanced
verbalization strategies to better integrate textual information with structured ontological
frameworks.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>The authors acknowledge partial support of the following projects: OVS: Fashion Retail Reloaded
and ePansa. We acknowledge the CINECA award under the ISCRA initiative for the availability
of high-performance computing resources and support.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A. D.</given-names>
            <surname>Bellis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. W.</given-names>
            <surname>Anelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. D.</given-names>
            <surname>Noia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. D.</given-names>
            <surname>Sciascio</surname>
          </string-name>
          ,
          <article-title>PRONTO: prompt-based detection of semantic containment patterns in mlms</article-title>
          , in: G. Demartini,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hose</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Acosta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Palmonari</surname>
          </string-name>
          , G. Cheng, H.
          <string-name>
            <surname>Skaf-Molli</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Ferranti</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Hernández</surname>
            ,
            <given-names>A</given-names>
          </string-name>
          . Hogan (Eds.),
          <source>The Semantic Web - ISWC 2024 - 23rd International Semantic Web Conference</source>
          , Baltimore,
          <string-name>
            <surname>MD</surname>
          </string-name>
          , USA, November
          <volume>11</volume>
          -
          <issue>15</issue>
          ,
          <year>2024</year>
          , Proceedings,
          <string-name>
            <surname>Part</surname>
            <given-names>II</given-names>
          </string-name>
          , volume
          <volume>15232</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2024</year>
          , pp.
          <fpage>227</fpage>
          -
          <lpage>246</lpage>
          . URL: https://doi.org/10.1007/978-3-
          <fpage>031</fpage>
          -77850-6_
          <fpage>13</fpage>
          . doi:
          <volume>10</volume>
          . 1007/978-3-
          <fpage>031</fpage>
          -77850-6\_
          <fpage>13</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Youssef</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Koraş</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schlötterer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Seifert</surname>
          </string-name>
          ,
          <article-title>Give me the facts! a survey on factual knowledge probing in pre-trained language models</article-title>
          , in: H.
          <string-name>
            <surname>Bouamor</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Pino</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          Bali (Eds.),
          <source>Findings of the Association for Computational Linguistics: EMNLP</source>
          <year>2023</year>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Singapore,
          <year>2023</year>
          , pp.
          <fpage>15588</fpage>
          -
          <lpage>15605</lpage>
          . URL: https://aclanthology. org/
          <year>2023</year>
          .findings-emnlp.
          <volume>1043</volume>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2023</year>
          .findings-emnlp.
          <volume>1043</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>W.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Tu</surname>
          </string-name>
          ,
          <article-title>Do PLMs know and understand ontological knowledge?</article-title>
          , in: A.
          <string-name>
            <surname>Rogers</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Boyd-Graber</surname>
          </string-name>
          , N. Okazaki (Eds.),
          <source>Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume</source>
          <volume>1</volume>
          :
          <string-name>
            <surname>Long</surname>
            <given-names>Papers)</given-names>
          </string-name>
          ,
          <source>Association for Computational Linguistics</source>
          , Toronto, Canada,
          <year>2023</year>
          , pp.
          <fpage>3080</fpage>
          -
          <lpage>3101</lpage>
          . URL: https://aclanthology.org/
          <year>2023</year>
          .
          <article-title>acl-long</article-title>
          .
          <volume>173</volume>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2023</year>
          .
          <article-title>acl-long</article-title>
          .
          <volume>173</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>F.</given-names>
            <surname>Petroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rocktäschel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bakhtin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <article-title>Language models as knowledge bases?</article-title>
          , in: K. Inui,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <string-name>
            <surname>X.</surname>
          </string-name>
          Wan (Eds.),
          <source>Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Hong Kong, China,
          <year>2019</year>
          , pp.
          <fpage>2463</fpage>
          -
          <lpage>2473</lpage>
          . URL: https://aclanthology.org/D19-1250. doi:
          <volume>10</volume>
          .18653/v1/
          <fpage>D19</fpage>
          -1250.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>V. W.</given-names>
            <surname>Anelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Biancofiore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. D.</given-names>
            <surname>Bellis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. D.</given-names>
            <surname>Noia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. D.</given-names>
            <surname>Sciascio</surname>
          </string-name>
          ,
          <article-title>Interpretability of BERT latent space through knowledge graphs, in: M. A</article-title>
          .
          <string-name>
            <surname>Hasan</surname>
          </string-name>
          , L. Xiong (Eds.),
          <source>Proceedings of the 31st ACM International Conference on Information &amp; Knowledge Management</source>
          , Atlanta,
          <string-name>
            <surname>GA</surname>
          </string-name>
          , USA, October
          <volume>17</volume>
          -
          <issue>21</issue>
          ,
          <year>2022</year>
          , ACM,
          <year>2022</year>
          , pp.
          <fpage>3806</fpage>
          -
          <lpage>3810</lpage>
          . URL: https://doi.org/10.1145/3511808.3557617. doi:
          <volume>10</volume>
          .1145/3511808.3557617.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>V. W.</given-names>
            <surname>Anelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. D.</given-names>
            <surname>Noia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lops</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. D.</given-names>
            <surname>Sciascio</surname>
          </string-name>
          ,
          <article-title>Feature factorization for top-n recommendation: From item rating to features relevance</article-title>
          , in: Y. Zheng,
          <string-name>
            <given-names>W.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Sahebi</surname>
          </string-name>
          , I. Fernández (Eds.),
          <source>Proceedings of the 1st Workshop on Intelligent Recommender Systems by Knowledge Transfer &amp; Learning co-located with ACM Conference on Recommender Systems (RecSys</source>
          <year>2017</year>
          ), Como, Italy,
          <year>August 27</year>
          ,
          <year>2017</year>
          , volume
          <volume>1887</volume>
          <source>of CEUR Workshop Proceedings, CEUR-WS.org</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>16</fpage>
          -
          <lpage>21</lpage>
          . URL: https://ceur-ws.
          <source>org/</source>
          Vol-1887/paper3.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>V. W.</given-names>
            <surname>Anelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. D.</given-names>
            <surname>Noia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. D.</given-names>
            <surname>Sciascio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ragone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Trotta</surname>
          </string-name>
          ,
          <article-title>Semantic interpretation of top-n recommendations</article-title>
          ,
          <source>IEEE Trans. Knowl. Data Eng</source>
          .
          <volume>34</volume>
          (
          <year>2022</year>
          )
          <fpage>2416</fpage>
          -
          <lpage>2428</lpage>
          . URL: https://doi.org/10.1109/TKDE.
          <year>2020</year>
          .
          <volume>3010215</volume>
          . doi:
          <volume>10</volume>
          .1109/TKDE.
          <year>2020</year>
          .
          <volume>3010215</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>V. W.</given-names>
            <surname>Anelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Bellini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. D.</given-names>
            <surname>Noia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. L.</given-names>
            <surname>Bruna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Tomeo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. D.</given-names>
            <surname>Sciascio</surname>
          </string-name>
          ,
          <article-title>An analysis on time- and session-aware diversification in recommender systems</article-title>
          , in: M.
          <string-name>
            <surname>Bieliková</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Herder</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Cena</surname>
          </string-name>
          , M. C. Desmarais (Eds.),
          <source>Proceedings of the 25th Conference on User Modeling, Adaptation and Personalization</source>
          ,
          <string-name>
            <surname>UMAP</surname>
          </string-name>
          <year>2017</year>
          , Bratislava, Slovakia,
          <source>July 09 - 12</source>
          ,
          <year>2017</year>
          , ACM,
          <year>2017</year>
          , pp.
          <fpage>270</fpage>
          -
          <lpage>274</lpage>
          . URL: https://doi.org/10.1145/3079628.3079703. doi:
          <volume>10</volume>
          . 1145/3079628.3079703.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>N.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          , X. Han,
          <string-name>
            <surname>G</surname>
          </string-name>
          . Xu,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          , H.-G. Kim,
          <article-title>Prompt-learning for fine-grained entity typing</article-title>
          , in: Y.
          <string-name>
            <surname>Goldberg</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Kozareva</surname>
          </string-name>
          , Y. Zhang (Eds.),
          <source>Findings of the Association for Computational Linguistics: EMNLP</source>
          <year>2022</year>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Abu Dhabi, United Arab Emirates,
          <year>2022</year>
          , pp.
          <fpage>6888</fpage>
          -
          <lpage>6901</lpage>
          . URL: https://aclanthology.org/
          <year>2022</year>
          .findings-emnlp.
          <volume>512</volume>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2022</year>
          . findings-emnlp.
          <volume>512</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>K.</given-names>
            <surname>Bałazy</surname>
          </string-name>
          , Łukasz Struski,
          <string-name>
            <given-names>M.</given-names>
            <surname>Śmieja</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. Tabor</surname>
          </string-name>
          , r-softmax:
          <article-title>Generalized softmax with controllable sparsity rate</article-title>
          ,
          <year>2023</year>
          . arXiv:
          <volume>2304</volume>
          .
          <fpage>05243</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <article-title>Boosting prompt-based self-training with mapping-free automatic verbalizer for multi-class classification</article-title>
          , in: H.
          <string-name>
            <surname>Bouamor</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Pino</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          Bali (Eds.),
          <source>Findings of the Association for Computational Linguistics: EMNLP</source>
          <year>2023</year>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Singapore,
          <year>2023</year>
          , pp.
          <fpage>13786</fpage>
          -
          <lpage>13800</lpage>
          . URL: https://aclanthology.org/
          <year>2023</year>
          . ifndings-emnlp.
          <volume>921</volume>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2023</year>
          .findings-emnlp.
          <volume>921</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Neelakantan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shyam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Herbert-Voss</surname>
          </string-name>
          , G. Krueger,
          <string-name>
            <given-names>T.</given-names>
            <surname>Henighan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Child</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ziegler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Winter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hesse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          , E. Sigler,
          <string-name>
            <given-names>M.</given-names>
            <surname>Litwin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chess</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Berner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>McCandlish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
          <article-title>Language models are few-shot learners</article-title>
          , in: H.
          <string-name>
            <surname>Larochelle</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Ranzato</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Hadsell</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Balcan</surname>
          </string-name>
          , H. Lin (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>33</volume>
          ,
          <string-name>
            <surname>Curran</surname>
            <given-names>Associates</given-names>
          </string-name>
          , Inc.,
          <year>2020</year>
          , pp.
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          . URL: https://proceedings.neurips.cc/paper_files/ paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>G.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Eisner</surname>
          </string-name>
          ,
          <article-title>Learning how to ask: Querying LMs with mixtures of soft prompts</article-title>
          , in: K.
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Rumshisky</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Zettlemoyer</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Hakkani-Tur</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <string-name>
            <surname>Beltagy</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Bethard</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Cotterell</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Chakraborty</surname>
          </string-name>
          , Y. Zhou (Eds.),
          <source>Proceedings of the</source>
          <year>2021</year>
          <article-title>Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics</article-title>
          , Online,
          <year>2021</year>
          , pp.
          <fpage>5203</fpage>
          -
          <lpage>5212</lpage>
          . URL: https://aclanthology.org/
          <year>2021</year>
          .naacl-main.
          <volume>410</volume>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2021</year>
          . naacl-main.
          <volume>410</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , BERT:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          , in: J.
          <string-name>
            <surname>Burstein</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Doran</surname>
          </string-name>
          , T. Solorio (Eds.),
          <source>Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers),
          <source>Association for Computational Linguistics</source>
          , Minneapolis, Minnesota,
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          . URL: https://aclanthology.org/N19-1423. doi:
          <volume>10</volume>
          .18653/v1/
          <fpage>N19</fpage>
          -1423.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>Roberta: A robustly optimized bert pretraining approach</article-title>
          ,
          <year>2019</year>
          . arXiv:
          <year>1907</year>
          .11692.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>García-Silva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Berrío</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Gómez-Pérez</surname>
          </string-name>
          ,
          <article-title>Textual entailment for efective triple validation in object prediction</article-title>
          , in: T. R.
          <string-name>
            <surname>Payne</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Presutti</surname>
            , G. Qi,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Poveda-Villalón</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>