Detailed Descriptions for Text Classification Applications Gorka Artola* 1 , German Rigau1 1 University of the Basque Country (UPV/EHU), Faculty of Informatics, Manuel Lardizabal pasealekua, 1, 20018 Donostia-San Sebastián, Spain Abstract The development of effective domain specific text classification systems generally requires the availability of large amounts of high quality labeled domain data. In domains such as BioNLP, eHealth, NLP for Legal Purposes, NLP for Social Media and Journalism, etc., obtaining the needed volume of data manually-labeled by domain experts is not usually feasible or affordable. In this work we propose a new method for text classification based on the use of detailed class descriptions instead of using a large number of labeled instances for training the classifiers. Our method, experimentally tested on the classification of titles of scientific papers on the domain of the Sustainable Development Goals of the United Nations, consistently outperforms mainstream NLP classification approaches, radically faster and at a fraction of their cost due to it does not need a previous process of hand-labelling thousands of samples. Keywords text classification, class descriptions, sustainable development goals 1. Introduction 1.1. Description of the Task The dominant approach in the last few years for specific The selected application domain for our experimentation domain Natural Language Processing (NLP) applications is the 2030 Agenda for Sustainable Development of the is the use of transformer-based [1] general language mod- United Nations (UN) with its 17 Goals (SDGs). Being a els (PLMs) fine-tuned on domain specific and task rele- relatively recent concept, there is not an extensive corpus vant labeled data. To reach top performance, these down- of data in the subject on which PLMs can be pre-trained stream training processes easily require several thou- or fine-tuned, but their official descriptions elaborated sands of high quality labeled samples in the shape of collaboratively by the members of the UN are detailed the targeted task, but for many real-world applications and public. The selected NLP task for this study is the obtaining the minimum volume of data manually-labeled multi-class classification of titles of scientific papers by by domain experts is not affordable or even feasible. For SDGs as described in Figure 1: given a title of a scientific these cases, there is a variety of zero-shot classification paper the method or system must select the SDGs most models and techniques [2] using the PLMs, but they of- related to the paper. fer worse results than the mentioned methods based on As an example of the relevance of the work proposed fine-tuning the models with large amounts of data. in this paper, the generation of the hand-labeled AU- In some cases, despite not having labeled samples, we RORA dataset [3], used in this work to train the main have detailed descriptions of the main classes of the spe- baseline model for comparison purposes, has required cific domain, usually crafted under consensus of relevant the design of an international survey, the participation of stakeholders. Our goal in this work is to explore the po- 244 expert respondents from Europe and North America tential impact of using these detailed descriptions instead from October 2019 to January 2020 to gather the raw of the labeled samples for text classification tasks in spe- data, a post-processing phase to generate a labeled set of cific application domains. Despite descriptions of classes samples that ended in May 2020, and a multi-disciplinary and samples of the task to be performed may differ in work-team of 15 people for whole dataset generation pro- both shape and domain, our main hypothesis is that the cess. The method proposed in this paper avoids all this use of detailed descriptions of the classes will multiply trouble using instead just the SDG descriptions provided the transfer-learning capacity of PLMs and noticeably by the UN to train a competitive classifier immediately improve the performance of NLP methods. and for free. 1.2. Summary of Contributions SEPLN-2024: 40th Conference of the Spanish Society for Natural The main contributions of this work are: Language Processing. Valladolid, Spain. 24-27 September 2024. $ gorka.artola@ehu.eus (G. Artola* ); german.rigau@ehu.eus • We propose the use of already existing or hand- (G. Rigau) crafted detailed descriptions of the classes for © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). multi-label sentence classification with PLMs as CEUR Workshop Proceedings (CEUR-WS.org) a better performing and more resource-efficient CEUR http://ceur-ws.org Workshop ISSN 1613-0073 Proceedings * Corresponding author CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings Figure 1: SDG headlines and an example of the targeted task. In this example, the system should classify the title of the scientific paper in SDGs 1 "No poverty", 5 "Gender equality", 8 "Decent work and economic growth", and/or 10 "Reduced inequalities" way than investing in manual-labelling of sam- Xia et al. [10], that proposes the use of dense text ples. representation techniques in semantic spaces. • We propose guidelines to decide between working • In the case we have large amounts of unlabeled in the generation of detailed descriptions or in- data, but non related to the domain or the task, vesting in hand-labeling samples, considering the Unsupervised text classification techniques availability or not of either detailed descriptiosn [11] show the capacity to improve text classifica- or labeled samples, and depending on this deci- tion. sion, to select the most appropriate multi-class • When lacking of any data, PLMs allow the gener- classification technique with PLMs. ation of improved semantically meaningful text • We establish a new SOTA for classification of representation models like Sentence-BERT [12], titles of scientific papers by SDGs. and the enunciation of the text classification task • We publicly disclose the most relevant datasets as a natural language inference (NLI) problem are and code used in our experimentation. the SOTA techniques [2]. Recently, Schopf et al. [13] proposed the combination of the embedding- based method Lbl2Vec and transformer-based 2. Related Work PLMs to further improve their performance on unsupervised text classification. PLMs, such as 𝐵𝐸𝑅𝑇 [4] and 𝐺𝑃 𝑇 [5], have achieved state-of-art performance on many NLP tasks [6], and Focusing specifically on the use of descriptions of among them on multi-class text classification [7]. The classes, there is also a body of research studying question research community has developed several lines of work answering task embodiment for text classification like the one proposed by Chai et al. [14]. These techniques to improve text classification in different data availability scenarios: in combination with strategies for the development of better class descriptions [15], label noise reduction meth- • When we have abundant unlabeled data related to ods [16], and the recent emergence of generative large the specific application domain but lack of labeled language models (LLMs) [17] set the ground for future data Weakly-supervised techniques [8] show research in the use of descriptions for specific domain promising results. The most recent of them lever- NLP applications. age the capacities of transformer-based PLMs, Regarding the classification of scientific papers by like LOTClass [9], which uses label names as SDGs, related literature describes several approaches initial keywords and augments the keywords grouped in two different working principles: with 𝐵𝐸𝑅𝑇 ’s MLM module to train classifica- tion models on unlabeled data, or FastClass by • Boolean query based approaches for information retrieval from databases like the ones developed Table 1 Macro-averaged F1-score, tested on the test split of the Paper Titles Gold Dataset, of baselines and SDG-Descriptions based scientific paper classifiers Available Model F1-score labeled samples Baselines General PLM fine-tuned on labeled gold dataset 8,339 66.14% Zero-shot classification with NLI-PLM on prompted keywords 17 61.20% Description based models General PLM fine-tuned on SDG-Descriptions 447 64.56% Zero-shot classification with NLI-PLM on prompted SDG-Titles 17 51.68% Few-shot classification with NLI-PLM fine-tuned on SDG-Descriptions 447 67.12% by Elservier [18] [19] [20], Digital Science [21], This Gold Dataset is a subset of the AURORA the University of Bergen [22], the University of dataset [3], elaborated surveying expert scien- Auckland [23] and the AURORA European Uni- tist, and that shows a human agreement level of versity Alliance* . The most relevant among them 70.10% in this task. We have developed several is the AURORA SDG Queries v5 method [24] [25]. splits of this dataset for training, development • NLP based methods like the AURORA-ML* and evaluation purposes. The train-split contains method [26] [27]. This approach comprises 169 more than 8,000 positive samples, and the test- 𝑀 𝑢𝑙𝑡𝑖𝑙𝑖𝑛𝑔𝑢𝑎𝑙 𝐵𝐸𝑅𝑇 [28] based models, one split contains 2,086 labeled paper titles unevenly for each SDG target, fine-tuned on abstracts of distributed by SDG but with the same amount papers obtained with the AURORA SDG Queries of positive and negative samples for each one of v5 method. them. The Classification Approaches and Models we have experimented with are: 3. Experimental Setup • Fine-tuning classifiers from general PLMs. After The following Datasets contain all the data used in our experimenting with different general PLMs we experimentation: have selected 𝐵𝐴𝑅𝑇𝐿𝐴𝑅𝐺𝐸 [29] for its better results. We have developed different classifiers • The "SDG-Descriptions Dataset" comprising 447 fine-tuning 𝐵𝐴𝑅𝑇𝐿𝐴𝑅𝐺𝐸 on different amounts sentences of different semantic natures (SDG- of samples of the train-split of the Paper Titles Headlines, SDG-Titles, SDG-Targets and SDG- Gold Dataset, on different amounts of samples of Indicators) developed by the UN and published the SDG-Descriptions Dataset, and on the combi- in a dedicated website* describing the 17 SDGs. nations of both of them. Altogether, we name the samples of this dataset • Zero-shot classification with NLI-PLMs. Af- SDG-Descriptions. Considering we have 447 de- ter experimenting with different NLI-PLMs and scriptive sentences of SDGs, we have built train- querying/prompting setups, we have obtained ing dataset with 430 entailment samples and 7,152 the best results querying 𝐵𝐴𝑅𝑇𝐿𝐴𝑅𝐺𝐸 𝑀 𝑁 𝐿𝐼 contradiction samples. [29] with either SDG-Headlines or SDG-Titles • The "Paper Titles Gold Dataset" with 9,382 scien- and prompting the queries with the expression tific paper titles labeled by experts. This dataset "The subject is ". includes two families of samples that are disjoint, • Few-shot classification. Building upon the pre- i.e., no paper title appears in both families: vious approaches, we have developed a new – "Positive samples" of titles labeled to one method for multi-class text classification fine- or more specific SDGs they are related to. tuning 𝐵𝐴𝑅𝑇𝐿𝐴𝑅𝐺𝐸 𝑀 𝑁 𝐿𝐼 on pairs of SDG- – "Negative samples" of titles labeled to one Description sentence/SDG-Headline, and apply- or more specific SDGS they are not related ing the resulting model for NLI based zero-shot to. classification of paper titles. For the initial fine- tuning we have built a training dataset with sam- * https://aurora-universities.eu/ ples composed by pairs of sentences, being the * https://github.com/Aurora-Network-Global/TMD first each one of the SDG description sentences * https://metadata.un.org/sdg/ and the second each one of the SDG-Headlines prompted with the text "The Sustainable Devel- In our study we simulate this scenario fine-tuning opment Goal is". This way we have generated the base models with different numbers of labeled 17 samples from each SDG description sentence, samples (75%, 50%, 25%, 10%, 5% and 1%) of the out of which the one pairing the sentence with train split of the "Paper Titles Gold Dataset". its correspondent SDG-Headline is labeled as "en- • A "Class descriptions available" scenario, in tailment" and all the rest (16) as "contradiction". which different amounts and types of descrip- We generate a zero-shot classifier fine tuning the tion sentences of the classes are available. We 𝐵𝐴𝑅𝑇𝐿𝐴𝑅𝐺𝐸 𝑀 𝑁 𝐿𝐼 model with this dataset. simulate this in our study by splitting the The classification of each test sample is finally SDG-Descriptions Dataset in sub-sets of SDG- performed querying the model with the SDG- Headlines, SDG-Titles, SDG-Targets and SDG- Headlines and prompting the queries with the Indicators and fine-tuning the base models in ac- expression "This is". cumulative combinations of them. Considering that a paper title may be related to several Considering these simulations of scenarios, we have SDGs, our Metrics on the experiments consider true studied how the baseline and the different SDG- positives (TP) the right predictions on positive samples, Description based models evolve with increasing num- false positives (FP) the wrong predictions on negative bers of available samples and descriptive sentences for samples and true negatives (TN) the right predictions training. In Figure 2 we can observe that the general base- on negative samples. The Prediction Criterion used line (blue line) requires almost 3,000 labeled samples to in this analysis of the results is Topk-3, i.e., the top 3 overcome our most simple model exclusively trained with scores given by the models for each tested sample are 447 description sentences (yellow line). Furthermore, if considered predictions for all considerations. we continue training our description based models with The current SOTA for the studied task and domain is increasing numbers of hand-labeled samples, we can ob- the top macro averaged F1-score of 55% offered by the serve that the obtained fine-tuned classifier (light-green AURORA-ML method referenced in section 2. In the ex- line), beats the top F1-score of the general baseline (+8000 perimentation we have observed that the F1-score regis- labeled samples) with only around 500 labeled samples tered in a vanilla fine-tuning of 𝐵𝐴𝑅𝑇𝐿𝐴𝑅𝐺𝐸 on the full additional to the SDG-Descriptions. Furthermore, our train split of the "Paper Titles Gold Dataset" goes above few-shot classification model defines a new estate of art 60%. Therefore, we have considered this vanilla approach for classification of scientific papers by SDGs when using our Baseline or the analysis of the impact of the use of all SDG-Descriptions and hand-labeled samples. At this SDG-Descriptions. In the zero-shot approach the consid- point, the peak measured F1-score is 71.01%, slightly over ered baseline is the direct use of 𝐵𝐴𝑅𝑇𝐿𝐴𝑅𝐺𝐸 𝑀 𝑁 𝐿𝐼 the human agreement level of 70.10% observed in the with a collection of keywords, namely SDG-Subjects, also AURORA dataset. enunciated by the UN and related to the SDGs that we Up to now we have studied the results as a whole, but have not considered part of the SDG-Descriptions be- the task includes 17 different classes that may behave dif- cause they are not shaped as the descriptive sentences ferently. Table 2 shows the detailed global and per SDG we intend to study. results of the test performed with our Few-Shot approach These choices are the result of an extensive experi- on the Titles-Test split of the Paper Titles-Gold dataset. mentation process comprising different PLMs and meta- SDG 6 "Clean water and sanitation" and SDG 17 "Partner- parameters looking for the best performing ones. ship for the goals" show the worst results. The model has been trained with 21 sentences describing SDG 6 and 46 4. Results sentences of SDG 17, similar or higher than the number of sentences used to train other much better perform- Table 1 shows a comparison between the best macro- ing SDGs like SDG 7 "Affordable and clean energy" (13 averaged F1-scores obtained with our description-based sentences, F1-score 83.93%) or SDG 3 "Good health and models and the baselines. Our few-shot classifica- well-being" (42 sentences, F1-score 67.7%). This suggests tion method using 447 publicly available SDG- that there is no clear correlation or proportion between Descriptions overcomes the general baseline trained the number of sentences included in the description and with over 8,000 hand-labeled samples. On the other hand, the performance of the model, and that the reasons for a our zero-shot classification using SDG-Descriptions lags better classification may relay on other features proba- far behind the zero-shot baseline bly related to the semantics of the description sentences For the analysis of these results we will consider the and the sentences to be classified. The study of the fea- following two scenarios of data availability: tures that make a description good for this classification • A "Labeled samples available" scenario, in which approach are lines for further research. different amounts of labeled samples are available. Figure 2: Macro-averaged F1-score curves by number of samples, on the test split of the Paper Titles Gold Dataset, of baselines and SDG description based scientific paper classifiers 5. Error Analysis Overall, the behaviour of the model seems to follow what could be expected by common sense on the sce- We focus the error analysis on the results of the few-shot nario we are working on, considering (i) that each tested model, the best performing model among those that use paper title is most likely related with several SDGs but exclusively SDG-Descriptions for training and classifi- not in the same extent, (ii) that the label given to each cation. Table 3 summarizes how the model gives right, test sample is not necessarily the one of the SDG they are wrong or inconclusive predictions. More than 90% of most related to, and (iii) that the more SDGs a paper title the good predictions are obtained with the first (Top 1 - is related to, the lower score it will give at each one of 74.00%) and second (Top 2 - 17.79%) highest scores. The them individually. Coherently, the scores observed in the average scoring pattern gives a relatively high value at "no coincidence" predictions of the positive samples hap- Top 1 (0.55-0.77) and drops significantly at every next pen to be ones with lowest scores. Also, the low average prediction, scoring in the range 0.03-0.15 at Top 2 and scores registered in the false positive coincidences in Top 0.002-0.02 at Top 3. Nevertheless, the highest average 2 and Top 3 can be explained as those debatable cases Top 2 (0.1514) and Top 3 (0.0214) scores correspond to that even with human observers reduce the agreement right predictions obtained at second and third guesses. level to the previously mentioned 70.1%. Nevertheless, Both right and wrong predictions at Top 1 score on aver- the following results appear to be relevant failures of the age around 0.77, meaning that the model is particularly model worth to be analysed in detail: mistaken in the wrong predictions. Deepening one step further, the test dataset has a par- • The 63.02% of false positive coincidences with the ticular set of samples; those that have been labeled pos- highest scores (0.7795) at Top 1 prediction. itively or negatively by more than one expert. They • The 12 bad predictions or false positive coinci- could be referred as "strong true samples" or "strong false dences on strong false samples. samples" if they are either positive or negative labeled • The high rate of undetected true positives on SDG samples. In opposition, we call the test samples labeled 6 and SDG 17. by a single person as "weak true samples" or "weak false samples". Table 4 shows the results of the test on these Appendix A shows several examples of these failures. particular samples. The model has been able to classify Regarding the false positives in weak false samples, the correctly all the strong true samples with a particularly wrong guesses are absolutely arguable and may fall in high average Top 1 score of over 0.88, but at the same the side of the measured roughly 30% of human disagree- time has classified incorrectly around 30% (12) of the ment level, with the exception of the example of SDG strong false samples. 17 "Partnership for the goals" with the title "Tuple-based Table 2 Testing of the few-shot model: number of sentences in the SDG-Descriptions dataset used for training the model, number of positive (True) and negative (False) samples in the Titles-Test split of the Paper Titles-Gold dataset, number of true positive (TP) and false positive (FP) predictions on this test dataset, obtained recall (r), precision (p), F1-score (f1) and accuracy (acc), and global micro and macro averages for the same metrics Sentences True False SDG in SDG test test TP FP r p f1 acc descriptions samples samples 1 23 2 2 2 0 100.00% 100.00% 100.00% 100.00% 2 23 30 30 25 23 83.33% 52.08% 64.10% 53.33% 3 42 313 304 283 240 90.42% 54.11% 67.70% 56.24% 4 23 87 87 73 42 83.91% 63.48% 72.28% 67.82% 5 25 94 96 63 25 67.02% 71.59% 69.23% 70.53% 6 21 62 62 19 20 30.65% 48.72% 37.62% 49.19% 7 13 63 69 47 2 74.60% 95.92% 83.93% 86.36% 8 31 17 17 13 10 76.47% 56.52% 65.00% 58.82% 9 22 65 65 34 16 52.31% 68.00% 59.13% 63.85% 10 23 31 31 17 9 54.84% 65.38% 59.65% 62.90% 11 27 57 57 50 28 87.72% 64.10% 74.07% 69.30% 12 26 48 49 40 31 83.33% 56.34% 67.23% 59.79% 13 15 36 36 24 13 66.67% 64.86% 65.75% 65.28% 14 22 17 17 13 1 76.47% 92.86% 83.87% 85.29% 15 28 77 77 60 35 77.92% 63.16% 69.77% 66.23% 16 37 40 40 33 28 82.50% 54.10% 65.35% 56.25% 17 46 29 29 8 7 27.59% 53.33% 36.36% 51.72% Total 447 1,068 1,068 804 530 Micro avg. 75.28% 60.27% 66.94% 62.83% Macro avg. 71.51% 66.15% 67.12% 66.05% Table 3 Testing of the few-shot model: share (%) of positions of predictions matching the label of the test samples for true positive (TP) and false positive (FP) predictions, and average prediction scores registered in each position (average score of matching positions in bold) for TP, FP and predictions not matching neither positive nor negative samples Position of prediction Top 1 Top 2 Top 3 Positive samples Prediction-test label coincidence 74.00% 17.79% 8.21% Average prediction scores TP coincidence in Top 1 0.7741 0.0528 0.0040 TP coincidence in Top 2 0.6667 0.1514 0.0117 TP coincidence in Top 3 0.5558 0.0975 0.0214 No coincidence 0.6313 0.0655 0.0097 Negative samples Prediction-test label coincidence 63.02% 23.58% 13.40% Average prediction scores FP coincidence in Top 1 0.7795 0.0393 0.0038 FP coincidence in Top 2 0.6379 0.0782 0.0027 FP coincidence in Top 3 0.6087 0.0751 0.0207 No coincidence 0.6175 0.0703 0.0134 semantic and structural mapping for sustainable interop- • Debatable or arguable labelling. erability" not objectively relatable with this SDG. When • A possible tendency of the model to relate tobacco it comes to the false positives related to strong false sam- with health (SDG 3), and a tendency of experts not ples, that have happened exclusively for samples of the to do it when the paper titles refer to its economic SDG 3 "Good health and well-being", we can observe dimensions. several possible reasons for the failures like: • A difficulty of the model to distinguish between Table 4 Testing of the few-shot model: number of samples and average Topk-3 predictions scores for good and bad predictions on positive (strong true) and negative (strong false) samples labeled coincidentally by more than one expert Number of Average prediction scores samples Top 1 Top 2 Top 3 Strong True samples 59 Good prediction 59 0.8816 0.0092 0.0020 Bad prediction 0 - - - Strong False Samples 38 Good prediction 26 0.6390 0.0726 0.0034 Bad prediction 12 0.8030 0.0223 0.0046 Figure 3: Testing of the few-shot model: co-occurrence matrix, considering the positive samples of the test dataset animal health and human health. and sanitation" as the model predicts. These phenomena are most likely related to the ev- In the case of the undetected positive samples of SDG ident overlaps that exist between the SDGs. Figures 3 6 "Clean water and sanitation" and SDG 17 "Partnership and 4 depict the co-occurrence and confusion matrices of for the goals", all cases appear to be very debatable. An the test. The co-occurrence matrix plots all Topk-3 pre- explanation may be that in these cases the titles of the dictions of the model on the positive samples of the test papers do not describe properly the contents of the paper, dataset. Generally the model predicts more frequently or even may be misleading, but the experts have labeled the right SDG, but we can also observe that SDG 15 "Life the papers not by their title but by their content. For on land" is remarkably more predicted than the other instance, the paper titled "Local renewable energy coop- SDGs, followed by SDG 12 "Responsible consumption eratives: revolution in disguise?", may be related with and production", SDG 1 "No poverty" and SDG 8 "Decent the SDG 6 "Clean water and sanitation", but the title itself work and economic growth". SDG 15 "Life on land" is suggests it may be more related to SDG 7 "Affordable and even more prevalent than the labeled SDG in the case clean energy" as the model predicts, or the paper titled of SDG 6 "Clean water and sanitation", SDG 13 "Climate "Sustainability of small water supplies: Lessons from a action", SDG 14 "Life below the water" and SDG 17 "Part- Brazilian program (SESP/FSESP)" may of course be re- nership for the goals". The confusion matrix plots whe lated to SDG 17 "Partnership for the goals" but the title wrong predictions of the model. In this case most fre- suggests it may be mainly related to SDG 6 "Clean water Figure 4: Testing of the few-shot model: confusion matrix, considering only wrong predictions on positive samples of test dataset quently mistaken prediction is for SDG 12 "Responsible for zero-shot exclusively with class descriptions, consumption and production", followed in this case by have the potential to outperform conventional the same SDGs outstanding in the co-occurrence matrix. classifiers fine-tuned on PLMs with thousands of According to these results, the SDGs that most overlap hand labeled samples. with the rest are the SDG 15 "Life on land", the SDG 12 • In the case of both detailed descriptions and la- "Responsible consumption and production", SDG 1 "No beled samples available: Conventional PLM clas- poverty" and SDG 8 "Decent work and economic growth". sifiers fine-tuned with a combination of class de- This may mean that these SDGs are the ones that most scriptions and labeled samples have the potential diversely may impact the UN 2030 Agenda for Sustainable to reduce the need of labeling by an order of mag- Development, what could be an excellent bonus insight nitude, being able to establish a new SOTA in our offered by the model, but once again, this may be related case study. only to the different quality of the descriptions of each • On the contrary, on a pure zero-shot approach, SDG, and for sure a question worth to be further studied. in cases with only a single keyword or descrip- tion sentence available per class, the classical prompted keyword classification seems to be bet- 6. Conclusions and future work ter than any similar description sentence based This work offers initial experimental evidences that using classifier. detailed descriptions of the main classes that shape an The use of class descriptions instead of labeled task specific domain has the potential to benefit Text Classifi- samples may not require sophisticated NLP approaches, cation. All the experiments reported have been developed offering results comparable to human classification in the classifying automatically scientific papers to UN SDGs. studied case, using only conventional and widely-used The use of class descriptions may reduce significantly NLP methods and models. or even eliminate the need to develop hand-labeled sam- To benefit from these advantages, the class descrip- ples for training NLP models, reducing drastically the tions may be written by non domain experts in plain development cost. Depending on the availability on de- non-scientific or technical language. In our experimen- scriptions of classes we recommend: tal case, SDG-Descriptions are designed for the general • In scenarios with detailed descriptions avail- public understanding in a public-policy style language able: Few shot approaches, fine tuning NLI-PLM while paper titles are written in a specialized scientific language. Also, not all descriptive sentences offer the Improving language understanding by generative same improvement potential: single sentences describing pre-training, 2018. the whole class (SDG titles) and collection of single sen- [6] B. Min, H. Ross, E. Sulem, A. P. B. Veyseh, T. H. tences describing each one a particular relevant aspect Nguyen, O. Sainz, E. Agirre, I. Heinz, D. Roth, Re- of the class (SDG targets) contribute the most. cent advances in natural language processing via Finally, the results of this initial experimental study large pre-trained language models: A survey, 2021. suggest the following future lines of research: arXiv:2111.01243. [7] Q. Li, H. Peng, J. Li, C. Xia, R. Yang, L. Sun, P. S. • Extending the study to further specific domain Yu, L. He, A survey on text classification: From NLP applications to generate further evidence traditional to deep learning, ACM Transactions about the potential benefits of using class descrip- on Intelligent Systems and Technology (TIST) 13 tions and grasp its limitations. (2022) 1 – 41. • Apply the use of class descriptions in methods [8] Y. Meng, J. Shen, C. Zhang, J. Han, Weakly- more sophisticated than the conventional NLP supervised neural text classification, in: Proceed- approaches applied in this work to validate or ings of the 27th ACM International Conference on refuse the hypothesis that advanced NLP tech- Information and Knowledge Management, CIKM niques like generative LLMs and QA tasks may ’18, Association for Computing Machinery, New also benefit from them. York, NY, USA, 2018, pp. 983–992. URL: https: • Deep dive in what makes a description good for //doi.org/10.1145/3269206.3271737. doi:10.1145/ NLP applications and explore how advanced de- 3269206.3271737. scription development and improvement tech- [9] Y. Meng, Y. Zhang, J. Huang, C. Xiong, H. Ji, niques can contribute. C. Zhang, J. Han, Text classification using label names only: A language model self-training ap- proach, in: Proceedings of the 2020 Conference References on Empirical Methods in Natural Language Pro- [1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, cessing (EMNLP), Association for Computational L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, At- Linguistics, Online, 2020, pp. 9006–9017. URL: https: tention is all you need, 2017. arXiv:1706.03762. //aclanthology.org/2020.emnlp-main.724. doi:10. 18653/v1/2020.emnlp-main.724. [2] W. Yin, J. Hay, D. Roth, Benchmarking zero-shot text classification: Datasets, evaluation and entail- [10] T. Xia, Y. Wang, Y. Tian, Y. Chang, FastClass: ment approach, in: Proceedings of the 2019 Con- A time-efficient approach to weakly-supervised ference on Empirical Methods in Natural Language text classification, in: Proceedings of the 2022 Processing and the 9th International Joint Con- Conference on Empirical Methods in Natural Lan- ference on Natural Language Processing (EMNLP- guage Processing, Association for Computational IJCNLP), Association for Computational Linguis- Linguistics, Abu Dhabi, United Arab Emirates, 2022, tics, Hong Kong, China, 2019, pp. 3914–3923. URL: pp. 4746–4758. URL: https://aclanthology.org/2022. https://aclanthology.org/D19-1404. doi:10.18653/ emnlp-main.313. v1/D19-1404. [11] M. Thangaraj, M. Sivakami, Text classification tech- [3] M. Vanderfeesten, E. Spielberg, Y. Gunes, Survey niques: A literature review, Interdisciplinary Jour- data of "Mapping Research Output to the Sustain- nal of Information, Knowledge, and Management able Development Goals (SDGs)", 2020. URL: https: 13 (2018) 117–135. //doi.org/10.5281/zenodo.3813230. doi:10.5281/ [12] N. Reimers, I. Gurevych, Sentence-BERT: Sentence zenodo.3813230. embeddings using Siamese BERT-networks, in: [4] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Proceedings of the 2019 Conference on Empirical Pre-training of deep bidirectional transformers for Methods in Natural Language Processing and the language understanding, in: Proceedings of the 9th International Joint Conference on Natural Lan- 2019 Conference of the North American Chap- guage Processing (EMNLP-IJCNLP), Association ter of the Association for Computational Linguis- for Computational Linguistics, Hong Kong, China, tics: Human Language Technologies, Volume 1 2019, pp. 3982–3992. URL: https://aclanthology.org/ (Long and Short Papers), Association for Com- D19-1410. doi:10.18653/v1/D19-1410. putational Linguistics, Minneapolis, Minnesota, [13] T. Schopf, D. Braun, F. Matthes, Evaluating unsuper- 2019, pp. 4171–4186. URL: https://aclanthology.org/ vised text classification: Zero-shot and similarity- N19-1423. doi:10.18653/v1/N19-1423. based approaches, 2023. arXiv:2211.16285. [5] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, [14] D. Chai, W. Wu, Q. Han, F. Wu, J. Li, Descrip- tion based text classification with reinforcement learning, in: International Conference on Machine doi:10.5281/zenodo.4883250. Learning, 2020. [25] F. Schmidt, M. Vanderfeesten, Evaluation on accu- [15] X. Zhu, Z. Peng, J. Guo, S. Dietze, Generating ef- racy of mapping science to the United Nations’ Sus- fective label description for label-aware sentiment tainable Development Goals (SDGs) of the Aurora classification, Expert Syst. Appl. 213 (2023). URL: SDG queries, 2021. URL: https://doi.org/10.5281/ https://doi.org/10.1016/j.eswa.2022.119194. doi:10. zenodo.4964606. doi:10.5281/zenodo.4964606. 1016/j.eswa.2022.119194. [26] M. Vanderfeesten, R. Jaworek, L. Keßler, AI for map- [16] X. Gao, Z. Zhu, X. Chu, Y. Wang, W. Ruan, J. Zhao, ping multi-lingual academic papers to the United Enhancing robust text classification via category Nations’ Sustainable Development Goals (SDGs), description, 2022 IEEE International Conference 2022. URL: https://doi.org/10.5281/zenodo.6487606. on Data Mining (ICDM) (2022) 151–160. doi:10.5281/zenodo.6487606. [17] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, [27] R. Jaworeck, SDG BERT - Multi-language Multi- Y. Min, B. Zhang, J. Zhang, Z. Dong, Y. Du, C. Yang, label BERT model for classifying texts to Sus- Y. Chen, Z. Chen, J. Jiang, R. Ren, Y. Li, X. Tang, tainable Development Goals (SDGs) based on Au- Z. Liu, P. Liu, J.-Y. Nie, J.-R. Wen, A survey of large rora SDG Query Model v5, 2022. URL: https: language models, 2023. arXiv:2303.18223. //doi.org/10.5281/zenodo.7304547. doi:10.5281/ [18] B. ayabalasingham, R. Boverhof, K. Agnew, zenodo.7304547. L. Klein, Identifying research supporting the united [28] J. Devlin, M. Chang, K. Lee, K. Toutanova, nations sustainable development goals, 2019. URL: BERT: pre-training of deep bidirectional trans- https://elsevier.digitalcommonsdata.com/datasets/ formers for language understanding, CoRR 87txkw7khs/1. doi:10.17632/87txkw7khs.1. abs/1810.04805 (2018). URL: http://arxiv.org/abs/ [19] M. Rivest, Y. Kashnitsky, A. Bédard-Vallée, 1810.04805. arXiv:1810.04805. D. Campbell, P. Khayat, I. Labrosse, S. Pinheiro, [29] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mo- Henrique; Provençal, G. Roberge, C. James, Im- hamed, O. Levy, V. Stoyanov, L. Zettlemoyer, BART: proving the scopus and aurora queries to identify denoising sequence-to-sequence pre-training for research that supports the united nations sustain- natural language generation, translation, and com- able development goals (sdgs) 2021, 2021. URL: prehension, CoRR abs/1910.13461 (2019). URL: http: https://elsevier.digitalcommonsdata.com/datasets/ //arxiv.org/abs/1910.13461. arXiv:1910.13461. 9sxdykm8s4/. doi:10.17632/9sxdykm8s4.4. [20] G. Roberge, Y. Kashnitsky, C. James, Else- vier 2022 sustainable development goals (sdg) mapping, 2022. URL: https://elsevier. A. Examples of wrong predictions digitalcommonsdata.com/datasets/6bjy52jkm9. doi:10.17632/6bjy52jkm9.1. [21] D. Science, J. Wastl, S. Porter, H. Draux, B. Fane, D. Hook, Contextualizing sustainable develop- ment research, 2020. URL: https://digitalscience. figshare.com/articles/report/Contextualizing_ Sustainable_Development_Research/12200081. doi:10.6084/m9.figshare.12200081.v2. [22] C. Armitage, M. Lorenz, S. Mikki, Replication data for: Mapping scholarly publications related to the Sustainable Development Goals: Do independent bibliometric approaches get the same results?, 2020. URL: https://doi.org/10.18710/98CMDR. doi:10. 18710/98CMDR. [23] W. Wang, W. Kang, J. Mu, Mapping re- search to the sustainable development goals (sdgs), 2023. URL: https://www. researchsquare.com/article/rs-2544385/v2. doi:10.21203/rs.3.rs-2544385/v2. [24] M. Vanderfeesten, R. Otten, E. Spielberg, Search Queries for "Mapping Research Output to the Sustainable Development Goals (SDGs)" v5.0.2, 2020. URL: https://doi.org/10.5281/zenodo.4883250. Table 5 Testing of NLI-based few-shot approach with Topk-3 classification strategy: examples of wrong predictions # Paper title Gold SDG Predicted SGD False postives in weak false samples # Headline Top 1 Top 2 Top 3 Food insecurity and effectiveness of behavioral 1 interventions to reduce blood pressure 2 Zero hunger 2 3 8 , New York City, 2012-2013 Global governance for facilitating access to Good health 2 3 3 12 8 medicines: Role of world health organization and wellbeing Equipping Preservice Elementary Teachers Quality 3 4 4 8 12 for Data Use in the Classroom education How to study varieties of opposition to gender+ equality in Europe?: Lessons from this Gender 4 5 5 10 8 book, conceptual building blocks, and puzzles equality to address RETRACTED ARTICLE: Comparative advantage Clean water 5 analysis for water utilization in Hubei province 6 6 8 1 and sanitation based on NRCA model A study on factors affecting the youth employment Decent work 6 rate: Focusing on data from 31 cities and counties 8 and economic 8 11 9 in Gyeonggi-do, South Korea growth Industry, Analysis of the inclusions in 38Si7 spring steel 7 9 innovation and 9 12 8 with fatigue failure infrastructure The development and transition of urban walking Sustainable 8 grey space in China, based on a unique model " 11 cities and 11 1 12 Langpeng" communities Corporate sustainability in emerging markets: Responsble 9 Insights from the practices reported by the 12 consumption 12 8 10 Brazilian retailers and production Sensitivity analysis with the regional climate model 10 13 Climate action 13 15 8 COSMO-CLM over the CORDEX-MENA domain Rainforest tourism, conservation and management: 11 15 Life on land 15 12 8 Challenges for sustainable development Capitalizing on Criminal Accomplices: Peace, justice 12 Considering the Relationship between 16 16 1 10 and institutions Co-offending and Illegal Earnings Tuple-based semantic and structural mapping Partnership 13 17 17 12 9 for a sustainable interoperability for the goals False positives in strong false samples # Headline Top 1 Top 2 Top 3 Nature, scope and use of economic evaluation of Good health 14 healthcare programmes: With special reference 3 3 15 12 and wellbeing to Pakistan Endovascular Aortic Repair for Thoracic Good health 15 3 3 8 1 Aortic Injuries and wellbeing Comparison between online and offline price of Good health 16 3 3 12 8 tobacco products using novel datasets and wellbeing An Assessment of the Forward-Looking Hypothesis Good health 17 3 3 8 12 of the Demand for Cigarettes and wellbeing Mycobacterium marinum infection in fish and man: Good health 18 Epidemiology, pathophysiology and management; 3 14 3 6 and wellbeing a review Undetected SDG 6 and SDG 17 true positives # Headline Top 1 Top 2 Top 3 An exploration of the boundaries of ‘community’ in Clean water 19 community renewable energy projects: Navigating 6 7 11 10 and sanitation between motivations and context Typology of future clean energy communities: An Clean water 20 6 7 8 12 exploratory structure, opportunities, and challenges and sanitation A review of renewable energy investment Partnership 21 in the BRICS countries: History, models, 17 7 12 8 for the goals problems and solutions Sustainability of small water supplies: Lessons Partnership 22 17 6 10 8 from a brazilian program (SESP/FSESP) for the goals