<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Classifying Gas Pipe Damage Descriptions in Low-Diversity Corpora</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luca Catalano</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Federico D'Asaro</string-name>
          <email>federico.dasaro@polito.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michele Pantaleo</string-name>
          <email>michele.pantaleo@studenti.polito.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Minal Jamshed</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Prima Acharjee</string-name>
          <email>prima.acharjee@studenti.polito.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicola Giulietti</string-name>
          <email>n.giulietti@composite-research.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eugenio Fossat</string-name>
          <email>e.fossat@composite-research.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuseppe Rizzo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CLiC-it 2025: Eleventh Italian Conference on Computational Linguistics</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Composite Research - Torino</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Politecnico di Torino - Torino</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>LINKS Foundation - Torino</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This paper introduces a retrieval-based text classification framework tailored for language corpora in the domain of gas pipe damage description analysis, with a specific focus on determining patch applicability. Due to the scarcity of free-text damage descriptions in this domain, we construct a synthetic binary classification dataset, referred to as CoRe-S. This dataset consists of 11,904 damage descriptions generated from structured attributes, where each instance is labeled as either Patchable (True) or Unpatchable (False). The CoRe-S dataset presents two primary challenges: (i) a class imbalance, where positive cases are the minority, and (ii) frequent use of domain-specific terminology, which results in low lexical diversity across descriptions. To quantify this lack of variation, we introduce the Corpus Pairwise Diversity statistic, which measures the degree of lexical dissimilarity between documents in a corpus. We adopt a training-free, retrieval-based text classification approach and demonstrate that Sentence-BERT-NLI is the most effective encoder under low-diversity conditions, as it excels at capturing subtle lexical and semantic differences between otherwise similar documents. To address the class imbalance, we apply random under-sampling, which outperforms other under-sampling strategies in our experiments. Our results show that the proposed retrieval-based classifier significantly outperforms other training-free text classification methods, whether zero-shot, few-shot, or similarity-based, achieving an improvement of approximately 35.2% in macro F1-score over the second-best method. Our code is publicly available at: https://github.com/links-ads/core-unimodal-retrieval-for-classification.</p>
      </abstract>
      <kwd-group>
        <kwd>Gas pipe damage description analysis</kwd>
        <kwd>Training-free text classification</kwd>
        <kwd>Low lexical diversity</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Text classification is the task of assigning predefined
labels to a given text and has been applied to a wide range
of domains, including sentiment analysis [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], emotion
recognition [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], news classification [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], and spam
detection [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Early approaches typically decomposed the task
into two stages: feature extraction using neural
models such as Recurrent Neural Networks (RNNs) [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ] or
Convolutional Neural Networks (CNNs) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], followed
by feeding the extracted features into a classifier [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] to
predict labels. With the emergence of transformer
architectures [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], Large Pretrained Models (LPMs) such as
BERT [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and GPT [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] have become the foundation
for modern NLP systems. Trained on massive textual
corpora, these models demonstrate strong
generalization capabilities across various downstream tasks, often
without requiring additional task-specific training data.
      </p>
      <p>In this work, we address the task of text classification
over gas pipe damage descriptions, with the objective
of determining whether a patch is applicable (True) or
not (False). Due to the limited availability of free-text
damage reports in this domain, we construct a synthetic
binary classification dataset, referred to as CoRe-S. This
dataset comprises 11,904 damage descriptions generated
from structured attributes such as pipe material, lesion
type, and pipe exposure. This setting poses two main
challenges: (i) a class imbalance, where positive cases are
the minority; and (ii) low lexical diversity, as descriptions
tend to be highly similar across classes, relying heavily
on domain-specific terminology and recurring linguistic
patterns. Consequently, texts from different categories
may be lexically indistinguishable, complicating
classification based on surface-level features.</p>
      <p>
        To quantify this lexical variability, we introduce a
novel statistic, Corpus Pairwise Diversity, which
measures the degree of lexical dissimilarity between
documents within a corpus. When applied to our dataset, this
statistic produces significantly lower values compared
to generalist corpora such as 20NewsGroups [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], which are characterized by a broader vocabulary and greater
topical diversity.
      </p>
      <p>For the classification task, we employ a training-free,
retrieval-based framework, depicted in Figure 1,
that leverages PLMs, consisting of a document encoder
and a similarity-based classifier. Given the low corpus
diversity and frequent repetition of domain-specific
terms, regardless of class, conventional semantic search
models may underperform in this setting, as they often
fail to capture fine-grained linguistic distinctions. For
instance, two descriptions may differ only in a subtle
feature such as pressure level, which can determine whether
a leak is patchable.</p>
      <p>
        This observation motivates the hypothesis that
encoders focusing on logical inference, rather than relying
solely on surface-level semantic similarity, are better
suited for classification in such contexts. Accordingly, we
employ the Sentence-BERT model pre-trained on Natural
Language Inference (NLI), a task that requires
determining whether a hypothesis can be logically inferred,
contradicted, or is neutral with respect to a given premise. We
adopt SBERT-NLI [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], which effectively captures subtle
lexical and semantic differences between near-identical
documents. To mitigate the effects of class imbalance,
we apply random under-sampling to the retrieval corpus,
which achieves superior performance compared to
alternative imbalance-handling strategies in our experiments.
Experimental results demonstrate that our text
classification model consistently outperforms state-of-the-art
training-free approaches, including zero-shot, few-shot,
and similarity-based methods.
      </p>
      <p>The main contributions of this work are as follows:
• We introduce CoRe-S, a novel dataset in the domain of
gas pipe damage descriptions, which, to the best of our
knowledge, is the first dataset developed in this domain.
• We introduce a novel statistic, Corpus Pairwise
Diversity, to quantify the lexical dissimilarity
between documents within a corpus.
• We demonstrate that in low-diversity settings, a
Natural Language Inference pretrained encoder,
specifically SBERT-NLI, outperforms standard
semantic similarity models by effectively capturing
subtle distinctions between documents belonging
to different classes.</p>
    </sec>
    <sec id="sec-1b">
      <title>2. Background on Training-Free Text Classification</title>
      <p>
        With the advent of transformer architectures equipped
with attention mechanisms [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], a new wave of Large-scale
Pretrained Models (LPMs) has emerged. These
models are trained on vast textual corpora such as
BooksCorpus (800M words) [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] and Common Crawl [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Modern
PLMs are predominantly based on either the BERT [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] or GPT [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] architectures. BERT utilizes a transformer
encoder to produce dense contextual representations of
input text, making it well-suited for language
understanding tasks. In contrast, GPT adopts a decoder-only
architecture originally designed for generative applications,
though it has also shown strong performance in
classification tasks [
        <xref ref-type="bibr" rid="ref16 ref17">16, 17</xref>
        ]. Both architectural families exhibit
strong transfer learning capabilities, enabling effective
adaptation to a variety of downstream tasks, and paving
the way for training-free approaches to text classification.
      </p>
      <p>
        BERT-based approaches leverage embeddings to
compare semantic similarity between pieces of text.
Depending on the nature of the task, these methods can
be broadly categorized into: (i) zero-shot methods, which
compare the input text directly with class labels or their
representative keywords [
        <xref ref-type="bibr" rid="ref13 ref18 ref19">18, 19, 13</xref>
        ]; and (ii) retrieval-based
methods, which perform semantic search over a
database containing auxiliary knowledge [
        <xref ref-type="bibr" rid="ref20 ref21">20, 21</xref>
        ]. Schopf et al. [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] presented, for the first group of methods
(zero-shot), two different approaches. The first one
consists of representing each document as the average
of its paragraph embeddings. Similarly, each label is
represented as the average embedding of a set of predefined
keywords associated with that label. Classification is
then performed by computing the similarity between
the document and label embeddings, assigning the label
with the highest similarity score. The second approach,
instead, implements a zero-shot entailment technique.
Each input document is paired with a hypothesis
representing a candidate label, and the model predicts whether
the hypothesis is entailed by the input.
      </p>
      <p>GPT-based approaches, on the other hand, leverage
the full potential of natural language processing and the
generative capabilities embedded in the models. These
methods are typically applied in either (i) a zero-shot
fashion, where predictions are made without any labeled
examples, or (ii) a few-shot fashion, where a small number
of labeled examples is included in the prompt.</p>
    </sec>
    <sec id="sec-2">
      <title>3. CoRe-S Dataset</title>
      <sec id="sec-2-1">
        <p>To explore this idea and assess its feasibility, we construct a synthetic dataset by transforming existing structured tabular data, originally collected in the field, into natural language descriptions.</p>
        <p>The original tabular dataset comprises 11,904 pipe
repair interventions. Each intervention is described
using 11 categorical or boolean features—listed in
Table 1—which capture the condition of the pipe at the time
of the damage. Additionally, each record is labeled as
Patchable (True) or Not Patchable (False), depending
on whether the intervention involved a successful patch
or required replacement of the pipe segment. Among all
interventions, only 126 examples (1.06%) are labeled as
successful patches, while the remaining 11,778 (98.94%)
represent replacements.</p>
        <p>We generate the textual descriptions using the large
language model (LLM) Mistral-7B Instruct v0.3 (https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3).</p>
        <p>Figure 2 illustrates through an example the pipeline
used to generate the dataset, where a prompt—shown in
Figure 3—combines (i) a randomly selected example from
a curated set of 36 real technician-written descriptions
and (ii) a structured template filled with the most
informative features extracted from the tabular dataset, enabling
the LLM to produce realistic and domain-specific textual
representations of pipe failures.</p>
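<p>As a rough illustration of this prompt-construction step, the sketch below assembles a prompt from a structured record and a style example. The field names, template wording, and the helper build_prompt are hypothetical; the actual prompt used by the authors is the one shown in Figure 3.</p>

```python
# Hypothetical sketch of the prompt-construction step: field names and
# template wording are illustrative, not the paper's actual prompt (Figure 3).
def build_prompt(record: dict, example_description: str) -> str:
    template = (
        "You are a gas-network technician. Using the style of the example "
        "below, write a short damage description.\n"
        "Example: {example}\n"
        "Damage attributes: material={material}, lesion={lesion}, "
        "exposed={exposed}.\n"
        "Description:"
    )
    return template.format(example=example_description, **record)

record = {"material": "Cast iron", "lesion": "Hole", "exposed": True}
prompt = build_prompt(record, "Small hole on an exposed pipe, sealed with a patch.")
```

<p>The filled prompt is then sent to the LLM, which returns the free-text description paired with the record's label.</p>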
        <p>Specifically, for each entry x in the original tabular
dataset, we extract the relevant feature values and insert
them into the template prompt, together with the example
used to guide the writing style.</p>
        <p>The label y ∈ {True, False} indicates whether the
intervention was resolved via patching (y = True) or required
pipe replacement (y = False), and is directly inherited
from the original dataset.</p>
        <p>Figure 3: Prompt template used for converting tabular data
representing pipe damage into textual descriptions. The
prompt is composed of: (1) the features relevant for generating
the specific content, and (2) an example description written by a
technician to guide the style.</p>
        <p>The resulting CoRe-S dataset consists of pairs (t, y),
where t is the synthetic textual description generated
from the structured features of intervention x, and y is
the corresponding repair label.</p>
        <p>To ensure the quality and reliability of the generated
descriptions, we perform a human review process to: (i)
verify stylistic consistency with real examples written
by technicians, and (ii) randomly assess the semantic
alignment between each description and the original
feature vector x.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Corpus Pairwise Diversity Statistic</title>
      <p>This section introduces the formal definition of the
Corpus Pairwise Diversity statistic, which serves as a
foundational element for both the design and evaluation of
our retrieval-based classifier. By measuring the average
dissimilarity between the vocabularies of document pairs,
this statistic informs downstream components that rely
on accurate estimations of inter-document similarity.</p>
      <sec id="sec-4-1">
        <title>4.1. Definition</title>
        <p>Let C = {d_1, . . . , d_N} be a corpus of N documents,
where each document d_i is represented as the set of its
unique terms. The Jaccard distance between two
documents d_i and d_j is
J(d_i, d_j) = 1 − |d_i ∩ d_j| / |d_i ∪ d_j| ∈ [0, 1].
We then define the Corpus Pairwise Diversity statistic as
CPD(C) = (N choose 2)^(−1) · Σ_{1 ≤ i &lt; j ≤ N} J(d_i, d_j).
By construction, CPD(C) ∈ [0, 1]; low values
indicate high overall similarity, and high values indicate high
overall dissimilarity among documents. The statistic is
non-negative, symmetric, and unaffected by the order of the
set C. Moreover, it is invariant to document length and term
frequency, even when vocabulary sizes differ substantially.</p>
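<p>For concreteness, the statistic can be computed with the short Python sketch below. This is an illustration, not the authors' implementation; documents are tokenized by whitespace here, which is an assumption.</p>

```python
from itertools import combinations

def jaccard_distance(a: set, b: set) -> float:
    # J(d_i, d_j) = 1 - |intersection| / |union|
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def corpus_pairwise_diversity(corpus: list) -> float:
    # Each document is reduced to the set of its unique whitespace tokens.
    docs = [set(doc.lower().split()) for doc in corpus]
    pairs = list(combinations(range(len(docs)), 2))
    if not pairs:
        return 0.0
    # Average Jaccard distance over all N-choose-2 document pairs.
    return sum(jaccard_distance(docs[i], docs[j]) for i, j in pairs) / len(pairs)
```

<p>Identical documents yield a CPD of 0, fully disjoint vocabularies yield 1, matching the [0, 1] range stated above.</p>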
        <sec id="sec-4-1-1">
          <title>4.2. Empirical Analysis</title>
          <p>Table 2: Corpus Pairwise Diversity across text classification
datasets. |C| indicates the number of documents and |V|
the vocabulary size.
Dataset | |C| | |V| | CPD(C)
20NewsGroups | 10,998 | 85,551 | 0.99
Yahoo! Answers | 1,375,428 | 739,655 | 0.99
CoRe-S | 11,903 | 2,283 | 0.69</p>
          <p>To better understand the behaviour of the CPD statistic,
we compute it across multiple corpora. Table 2 shows that
datasets like 20NewsGroups and Yahoo! Answers
generally obtain higher diversity scores CPD(C), indicating
increased textual heterogeneity and more extensive
vocabularies. In contrast, the CoRe-S dataset exhibits lower
diversity, which can be attributed to its specialized
terminology and repetitive textual patterns. This is likely
a consequence of the constrained set of attributes used
during the generation process (see Section 3), which
restricts variability in term usage. As a result, it becomes
challenging to distinguish between damage descriptions
across different categories.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Retrieval-based Classifier</title>
      <p>We adopt a zero-shot learning approach, depicted in
Figure 4, built around a retrieval-based pipeline. The
strategy involves retrieving the top-k most similar labeled
textual descriptions based on embedding similarity and
assigning the final label using a majority voting
mechanism over the retrieved documents.</p>
      <sec id="sec-5-1">
        <title>5.1. Formal Description</title>
        <p>Let D ⊆ Σ* be the set of all documents, where Σ is a
finite alphabet of symbols. The dataset D is partitioned
into two subsets: the query set Q ⊆ D and the corpus
C ⊆ D, which contains descriptions of past pipe failures,
each labeled as patchable (True) or not patchable (False).
For each query q ∈ Q, the system retrieves relevant
documents from the corpus C. Let f : Σ* → R^n be
an encoding function that maps a document into an
n-dimensional embedding space using a pre-trained model,
and let s : R^n × R^n → R be a similarity function that
measures the closeness between two embedded
documents. The retrieval process is defined as:</p>
        <p>Re_C,k,s(q) = arg max_{C' ⊆ C, |C'| = k} Σ_{c ∈ C'} s(f(q), f(c))   (1)</p>
        <p>where C is the corpus, k is the number of top retrieved
documents, and s(f(q), f(c)) is the similarity score between
the query document q and the corpus document c. We
denote the resulting top-k retrieved documents for a
given query q as:</p>
        <p>C*_q,k = Re_C,k,s(q)   (2)</p>
        <p>Finally, the system produces its final prediction by
applying majority voting over the labels of the documents
in C*_q,k:</p>
        <p>ŷ = MajorityVote({label(c) | c ∈ C*_q,k})   (3)</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Encoder and Similarity Metrics Selection</title>
        <p>
          For our training-free classification pipeline, we
explore several pre-trained encoders to generate
high-quality semantic embeddings for both queries and
corpus documents. All selected encoders are transformer-based
models chosen for their zero-shot capabilities,
strong performance on general-purpose semantic
similarity benchmarks, and availability through the
sentence-transformers library, which facilitates
seamless integration into our pipeline. Specifically, we
test all-mpnet-base-v2 (https://huggingface.co/sentence-transformers/all-mpnet-base-v2), a sentence-transformer
model based on MPNet [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], fine-tuned on over 1
billion sentence pairs for semantic similarity tasks.
We also include multi-qa-mpnet-base (https://huggingface.co/sentence-transformers/multi-qa-mpnetbase-dot-v1), a variant
of MPNet fine-tuned on multiple question-answering
datasets, including Natural Questions, TriviaQA, and
SQuAD, to better handle question-style inputs [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]. Finally, we use bert-base-nli-mean-tokens (https://huggingface.co/sentence-transformers/bert-base-nlimean-tokens), a BERT-based
encoder trained on the SNLI and MultiNLI datasets
for natural language inference (NLI) [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
        </p>
        <p>We evaluate two popular similarity metrics for
comparing document embeddings: the dot product, which
captures the directional similarity between embeddings, and
the Euclidean distance (ℓ2), which measures the
straight-line distance between vectors in the embedding space.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Corpus Under-sampling Techniques</title>
        <p>To address class imbalance in our dataset, we use several
under-sampling strategies that reduce the number of
documents in the corpus set of the majority class. We
test different algorithms: Random Under-sampling,
NearMiss with its three different versions, and Edited
Nearest Neighbors.</p>
        <sec id="sec-5-3-1">
          <title>5.3.1. Random Under-sampling</title>
          <p>This simple technique randomly removes examples from
the majority class until the desired class distribution is
reached.</p>
        </sec>
        <sec id="sec-5-3-2">
          <title>5.3.2. NearMiss</title>
          <p>The algorithm preserves the samples from the majority
class that are most relevant for the classification task,
based on the evaluation of distances between samples
from the majority and minority classes. There are three
versions of the algorithm:
• NearMiss-1 selects majority class samples with the
smallest average distance to the closest samples of the
minority class.
• NearMiss-2 selects majority class samples with the
smallest average distance to the farthest samples of the
minority class.
• NearMiss-3 first selects a subset of minority samples
and retains their nearest neighbors among the majority.
Then, it keeps the majority class samples with the largest
average distance to their selected neighbors.</p>
        </sec>
        <sec id="sec-5-3-3">
          <title>5.3.3. Edited Nearest Neighbors (ENN)</title>
          <p>The Edited Nearest Neighbors (ENN) technique uses
a K-Nearest Neighbors (KNN) approach to filter out noisy
or ambiguous samples from the majority class. The
procedure involves training a KNN classifier on the entire
corpus, then, for each instance in the majority class,
identifying its k nearest neighbors and removing the
instance if any or most of its neighbors belong to a
different class.</p>
        </sec>
      </sec>
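<p>A minimal sketch of the retrieval-and-vote procedure in Equations (1)-(3), assuming embeddings have already been computed (plain Python lists stand in for the encoder output; the actual pipeline uses sentence-transformers encoders):</p>

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def retrieve_top_k(query_emb, corpus_embs, k, metric="dot"):
    # Score every corpus document against the query (Eq. 1): with the dot
    # product higher is better; with Euclidean distance lower is better.
    if metric == "dot":
        order = sorted(range(len(corpus_embs)),
                       key=lambda i: dot(query_emb, corpus_embs[i]),
                       reverse=True)
    else:
        order = sorted(range(len(corpus_embs)),
                       key=lambda i: euclidean(query_emb, corpus_embs[i]))
    return order[:k]

def classify(query_emb, corpus_embs, corpus_labels, k=7):
    # Majority vote over the labels of the top-k retrieved documents (Eq. 3).
    top = retrieve_top_k(query_emb, corpus_embs, k)
    votes = [corpus_labels[i] for i in top]
    return max(set(votes), key=votes.count)
```

<p>Swapping the metric argument switches between the two similarity functions evaluated in Section 5.2.</p>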
    </sec>
    <sec id="sec-6-exp">
      <title>6. Experiments</title>
      <sec id="sec-6-1">
        <title>6.1. Experimental Details</title>
        <p>Experiments are conducted using an NVIDIA GeForce
RTX 2080 Ti GPU. Model performance is primarily
evaluated using the F1-Macro score to ensure a balanced
assessment across classes. Additionally, all results are
obtained through 5-fold cross-validation, which involves
changing the split of the corpus and query set in each
fold to ensure robust evaluation. For the main results, we
also report the Recall-Macro and Precision-Macro scores.</p>
      </sec>
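<p>The fold construction can be sketched as follows. The half-and-half corpus/query proportion mirrors the 5,952-query split reported in the results, but the exact resampling protocol is an assumption:</p>

```python
import random

# Illustrative sketch: in each fold the dataset is reshuffled and split into
# a retrieval corpus and a query set (split proportion is an assumption).
def folds(n_items, n_folds=5, corpus_fraction=0.5):
    for fold in range(n_folds):
        idx = list(range(n_items))
        random.Random(fold).shuffle(idx)   # a different split per fold
        cut = int(n_items * corpus_fraction)
        yield idx[:cut], idx[cut:]          # (corpus indices, query indices)
```
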
      <sec id="sec-5-6">
        <title>6.2.1. Comparison with Zero-Shot Classification Methods</title>
        <p>We compare our training-free retrieval-based classification approach with several zero-shot and few-shot classification baselines.</p>
        <p>
          The Baseline approach [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] represents each
document as the average of its paragraph embeddings.
Similarly, each label is represented as the average
embedding of a set of predefined keywords associated with
that label. Classification is performed by computing the
similarity between the document and label embeddings,
assigning the label with the highest similarity score.
We evaluate this method using two different encoders:
all-MiniLM-L6-v2 and all-mpnet-base-v2.
        </p>
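<p>The label-embedding baseline described above can be sketched as follows. The embed() function is a hashed bag-of-words stand-in used only to make the sketch self-contained; the evaluated baselines use the all-MiniLM-L6-v2 and all-mpnet-base-v2 encoders:</p>

```python
import math

DIM = 64

def embed(text):
    # Placeholder embedding: hashed bag-of-words, L2-normalized.
    vec = [0.0] * DIM
    for tok in text.lower().split():
        vec[hash(tok) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def classify_zero_shot(doc, label_keywords):
    # Each label is the average embedding of its keywords; the document
    # receives the label with the highest similarity score.
    doc_emb = embed(doc)
    sims = {}
    for label, keywords in label_keywords.items():
        label_emb = mean([embed(w) for w in keywords])
        sims[label] = sum(a * b for a, b in zip(doc_emb, label_emb))
    return max(sims, key=sims.get)
```
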
        <p>
          We also implement a zero-shot entailment
technique [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ], using pre-trained models such as
DistilBERT, BART-large, and DeBERTa. Each input
document is paired with a hypothesis representing a
candidate label, and the model predicts whether the
hypothesis is entailed by the input.
        </p>
      </sec>
      <sec id="sec-5-7">
        <title>6.2.2. Ablation Study</title>
        <p>Similarity Metric Selection. In our zero-shot pipeline,
we evaluate the two most used similarity metrics: the dot
product and the Euclidean distance (ℓ2). Figure 5
illustrates the performance of our retrieval-classification
strategy under both similarity functions. Both metrics
are tested across all selected encoders.</p>
        <p>Corpus Under-sampling. Figure 7 shows how different
sampling strategies impact performance on the
CoRe-S dataset across values of k from 1 to 15. The results
reported in the figure represent the best outcomes
obtained across the tested hyperparameter configurations.
When no under-sampling is applied, macro-F1 peaks at
k = 7 (0.609), but then declines, as if additional neighbors
introduce semantic noise. In contrast, applying
under-sampling leads to higher macro-F1 scores across all values
of k. Notably, random under-sampling achieves the best
overall performance, improving from 0.601 at k = 1 to
a peak of 0.687 at k = 15. This suggests that random
under-sampling effectively balances the class distribution
in the corpus, enabling the model to achieve stronger
generalization and more robust performance.</p>
        <p>The use of NearMiss under-sampling, on the other
hand, significantly degrades performance. Although the
Edited Nearest Neighbors strategy performs better
than using no under-sampling at all, it still falls short
of the results achieved with random under-sampling.
This may be because these strategies remove fewer
training examples and may not sufficiently rebalance the
corpus. In fact, the high similarity between textual
descriptions labeled patchable or non-patchable can lead to very
close embeddings and, as a result, these strategies might
remove fewer examples. Random under-sampling,
instead, operates solely based on a reduction ratio threshold,
resulting in a more pronounced reduction and a more
effective rebalancing of the corpus. The best performance
is achieved with a reduced corpus of 962 training samples
and the full set of 5,952 query instances.</p>
        <p>Figure 6: t-SNE representations of document embeddings:
(a) MPNet, (b) SBERT NLI, (c) SBERT NLI + Undersampling.</p>
        <p>Figure 6(c) illustrates a t-SNE representation of the
document embeddings produced by the best encoder,
SBERT-NLI, after applying random under-sampling to the corpus.
As previously shown, SBERT-NLI naturally clusters green
points (true labels) and red points (false labels) near each
other, reflecting its ability to capture fine-grained
semantic distinctions. After under-sampling, however, the red
points are pushed further away from the green points,
creating clearer separations between classes. This
enhanced separation corresponds to improved macro-F1
performance, demonstrating how under-sampling helps
the model better distinguish between patchable and
non-patchable instances by reducing class imbalance and
mitigating semantic noise.</p>
      </sec>
      <sec id="sec-5-8">
        <title>6.2.3. Cross-Corpus Encoder Selection with Varying Lexical Diversity</title>
        <p>To further explore the influence of corpus lexical
diversity on model performance, we expand our evaluation
beyond CoRe-S to include two additional text
classification datasets, 20NewsGroups and Yahoo! Answers, both
of which demonstrate higher lexical variability, as shown
in Section 4 using our proposed Corpus Pairwise
Diversity statistic. We compare the performance of three
document encoders within the same retrieval-based
classification framework: SBERT-NLI, MPNet and QA-MPNet.
For evaluation, each dataset's test set is evenly split into two
subsets: one half is used as the retrieval corpus and the
other half as the query set, where classification
performance is measured.</p>
        <p>Table 4 reports the best F1 scores achieved by each
encoder on the respective datasets. The results reveal a
clear interaction between corpus lexical diversity and
encoder effectiveness. On the low-diversity CoRe-S dataset,
SBERT-NLI achieves the highest F1 score, supporting
our hypothesis that NLI-pretrained models are better
suited for distinguishing fine-grained linguistic nuances
between similar documents. In contrast, on the
higher-diversity datasets 20NewsGroups and Yahoo! Answers,
MPNet consistently outperforms the other encoders. In these
settings, MPNet's enhanced ability to capture broad
semantic content makes it more effective at handling lexical
variation.</p>
      </sec>
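<p>The random under-sampling step evaluated above can be sketched as follows. This is an illustrative stdlib implementation driven by a reduction ratio; the paper does not name the specific library it uses:</p>

```python
from collections import Counter
import random

def random_undersample(docs, labels, ratio=1.0, seed=0):
    # Keep all minority examples; randomly keep |minority| / ratio
    # majority examples, so ratio=1.0 yields a balanced corpus.
    counts = Counter(labels)
    minority = min(counts, key=counts.get)
    majority = max(counts, key=counts.get)
    rng = random.Random(seed)
    maj_idx = [i for i, y in enumerate(labels) if y == majority]
    keep_n = int(counts[minority] / ratio)
    kept = set(rng.sample(maj_idx, min(keep_n, len(maj_idx))))
    pairs = [(d, y) for i, (d, y) in enumerate(zip(docs, labels))
             if y == minority or i in kept]
    return [d for d, _ in pairs], [y for _, y in pairs]
```
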
    </sec>
    <sec id="sec-6">
      <title>7. Limitations</title>
      <p>A key limitation of this study is the reliance on synthetic
data. While synthetic fault descriptions are necessary due
to the lack of large-scale real-world technician-written
reports, they may not fully capture the noise, variation,
and contextual complexity present in actual field
documentation. This may affect the generalizability of the
findings when applied to real-world scenarios. Future
work should explore the collection and use of
authentic, technician-authored data to validate and refine the
proposed method.</p>
    </sec>
    <sec id="sec-7">
      <title>8. Conclusion</title>
      <sec id="sec-7-1">
        <p>In this paper, we address the task of classifying gas pipe damage descriptions. Starting from a set of damage
features and real examples, we generate a new dataset called
CoRe-S, the first of its kind in this domain. This dataset
exhibits low lexical diversity, characterized by a restricted
and repetitive vocabulary, along with severe class
imbalance. To quantify lexical diversity within a corpus, we
propose the Corpus Pairwise Diversity statistic.</p>
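        <p>To make the notion concrete, a corpus-level pairwise diversity statistic can be instantiated, for example, as the mean dissimilarity over all document pairs. The token-level Jaccard variant below is an illustrative assumption; the paper's exact definition of the Corpus Pairwise Diversity statistic may differ.</p>

```python
from itertools import combinations

def corpus_pairwise_diversity(docs):
    """Mean pairwise Jaccard dissimilarity between token sets.
    An illustrative stand-in for a Corpus Pairwise Diversity statistic;
    0 means all documents share the same vocabulary, 1 means no overlap."""
    token_sets = [set(doc.lower().split()) for doc in docs]
    dissims = []
    for a, b in combinations(token_sets, 2):
        union = a | b
        jaccard = len(a & b) / len(union) if union else 1.0
        dissims.append(1.0 - jaccard)
    return sum(dissims) / len(dissims) if dissims else 0.0
```

        <p>Under this instantiation, a corpus of near-identical fault descriptions scores close to 0, matching the low-diversity regime described for CoRe-S.</p>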
        <p>To overcome these challenges, we design a training-free retrieval-based text classifier that leverages SBERT-NLI to handle low lexical diversity, combined with under-sampling techniques to mitigate class imbalance. Experimental results demonstrate that our method outperforms other training-free approaches, including zero-shot, few-shot, and similarity-based methods. Additional experiments suggest that natural language inference pretrained text encoders are particularly effective in low-diversity scenarios where subtle differences between texts of different labels must be captured.</p>
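        <p>The training-free retrieval-based classification scheme can be sketched as follows. The bag-of-words encoder here is a toy stand-in for the SBERT-NLI embeddings used in the paper, and all function names and parameters are illustrative assumptions.</p>

```python
from collections import Counter
from math import sqrt

def bow_embed(text):
    # Toy sparse encoder; the paper uses SBERT-NLI sentence embeddings.
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def retrieve_label(query, retrieval_set, encode=bow_embed, k=1):
    """Training-free classification: label the query with the majority
    label of its k most similar retrieval-set documents."""
    q = encode(query)
    ranked = sorted(retrieval_set,
                    key=lambda dy: cosine(q, encode(dy[0])),
                    reverse=True)
    top_labels = [label for _, label in ranked[:k]]
    return Counter(top_labels).most_common(1)[0][0]
```

        <p>No parameters are fitted: classification quality depends entirely on the encoder's ability to place same-label descriptions close together, which is why the choice of text encoder matters so much in the low-diversity setting.</p>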
        <p>Table 4: Best F1 scores obtained with each encoder across datasets, using the same retrieval-based classification framework.</p>
        <p>Future work may involve a more extensive comparison of text encoder effectiveness across various text classification datasets exhibiting different levels of lexical diversity.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <sec id="sec-8-1">
        <p>The authors acknowledge that this work has been partially funded by the European Union and by the
Italian Ministry of Enterprises and Made in Italy (MIMIT),
through the EXPAND project, Grant Agreement No.
101083443.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <source>Sentiment analysis and opinion mining</source>
          , Springer Nature,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>F.</given-names>
            <surname>D'Asaro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. J. M.</given-names>
            <surname>Villacís</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Rizzo</surname>
          </string-name>
          ,
          <article-title>Transfer learning of large speech models for italian speech emotion recognition</article-title>
          ,
          <source>in: 2024 IEEE 18th International Conference on Application of Information and Communication Technologies (AICT)</source>
          , IEEE,
          <year>2024</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Rai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kaushik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Raj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ali</surname>
          </string-name>
          ,
          <article-title>Fake news classification using transformer based enhanced lstm and bert</article-title>
          ,
          <source>International Journal of Cognitive Computing in Engineering</source>
          <volume>3</volume>
          (
          <year>2022</year>
          )
          <fpage>98</fpage>
          -
          <lpage>105</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Mo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <article-title>Spam detection and classification based on distilbert deep learning algorithm</article-title>
          ,
          <source>Applied Science and Engineering Journal for Advanced Research</source>
          <volume>3</volume>
          (
          <year>2024</year>
          )
          <fpage>6</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Henao</surname>
          </string-name>
          , L. Carin,
          <article-title>Joint embedding of words and labels for text classification</article-title>
          ,
          <source>arXiv preprint arXiv:1805.04174</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dai</surname>
          </string-name>
          , E. Hovy,
          <string-name>
            <given-names>T.</given-names>
            <surname>Luong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <article-title>Unsupervised data augmentation for consistency training</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>6256</fpage>
          -
          <lpage>6268</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zou</surname>
          </string-name>
          ,
          <article-title>EDA: Easy data augmentation techniques for boosting performance on text classification tasks</article-title>
          ,
          <source>arXiv preprint arXiv:1901.11196</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Jacovi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. S.</given-names>
            <surname>Shalom</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Goldberg</surname>
          </string-name>
          ,
          <article-title>Understanding convolutional neural networks for text classification</article-title>
          ,
          <source>arXiv preprint arXiv:1809.08037</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , Ł. Kaiser,
          <string-name>
            <surname>I. Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>30</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>Bert: Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>in: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Narasimhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Salimans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          , et al.,
          <article-title>Improving language understanding by generative pre-training</article-title>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>K.</given-names>
            <surname>Lang</surname>
          </string-name>
          ,
          <article-title>Newsweeder: Learning to filter netnews</article-title>
          ,
          <source>in: Machine learning proceedings 1995</source>
          , Elsevier,
          <year>1995</year>
          , pp.
          <fpage>331</fpage>
          -
          <lpage>339</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Gurevych</surname>
          </string-name>
          ,
          <article-title>Sentence-bert: Sentence embeddings using siamese bert-networks</article-title>
          ,
          <source>arXiv preprint arXiv:1908.10084</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kiros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zemel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Urtasun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Torralba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Fidler</surname>
          </string-name>
          ,
          <article-title>Aligning books and movies: Towards story-like visual explanations by watching movies and reading books</article-title>
          ,
          <source>in: Proceedings of the IEEE international conference on computer vision</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>19</fpage>
          -
          <lpage>27</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Shao</surname>
          </string-name>
          , S. Liu,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          , et al.,
          <article-title>A comprehensive capability analysis of gpt-3 and gpt-3.5 series models</article-title>
          ,
          <source>arXiv preprint arXiv:2303.10420</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>X.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , G. Wang,
          <article-title>Text classification via large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2305.08377</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Pang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Large language models are zero-shot text classifiers</article-title>
          ,
          <source>arXiv preprint arXiv:2312.01044</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>T.</given-names>
            <surname>Schopf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Braun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Matthes</surname>
          </string-name>
          ,
          <article-title>Lbl2vec: An embedding-based approach for unsupervised document retrieval on predefined topics</article-title>
          ,
          <source>arXiv preprint arXiv:2210.06023</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>K.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lu</surname>
          </string-name>
          , T.-Y. Liu,
          <article-title>Mpnet: Masked and permuted pre-training for language understanding</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>16857</fpage>
          -
          <lpage>16867</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ahmadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shah</surname>
          </string-name>
          , E. Fox,
          <article-title>Retrieval-based text selection for addressing class-imbalanced data in classification</article-title>
          ,
          <source>arXiv preprint arXiv:2307.14899</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>T.</given-names>
            <surname>Abdullahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Eickhoff</surname>
          </string-name>
          ,
          <article-title>Retrieval augmented zero-shot text classification</article-title>
          ,
          <source>in: Proceedings of the 2024 ACM SIGIR international conference on theory of information retrieval</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>195</fpage>
          -
          <lpage>203</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>T.</given-names>
            <surname>Schopf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Braun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Matthes</surname>
          </string-name>
          ,
          <article-title>Evaluating unsupervised text classification: zero-shot and similaritybased approaches</article-title>
          ,
          <source>in: Proceedings of the 2022 6th International Conference on Natural Language Processing and Information Retrieval</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>6</fpage>
          -
          <lpage>15</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>O.</given-names>
            <surname>Rubin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Herzig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Berant</surname>
          </string-name>
          ,
          <article-title>Learning to retrieve prompts for in-context learning</article-title>
          ,
          <source>arXiv preprint arXiv:2112.08633</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>H.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kasai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. H.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ostendorf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Smith</surname>
          </string-name>
          , et al.,
          <article-title>Selective annotation makes language models better few-shot learners</article-title>
          ,
          <source>arXiv preprint arXiv:2209.01975</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>N.</given-names>
            <surname>Thakur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Daxenberger</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Gurevych</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Anand</surname>
          </string-name>
          ,
          <article-title>Beir: A heterogeneous benchmark for zero-shot evaluation of information retrieval models</article-title>
          ,
          <source>Proceedings of the 30th ACM International Conference on Information and Knowledge Management (CIKM)</source>
          (
          <year>2021</year>
          ). URL: https://arxiv.org/abs/2104.08663.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>