<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Detailed Descriptions for Text Classification Applications</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gorka Artola</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>German Rigau</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of the Basque Country (UPV/EHU), Faculty of Informatics</institution>
          ,
          <addr-line>Manuel Lardizabal pasealekua, 1, 20018 Donostia-San Sebastián</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
<p>The development of effective domain-specific text classification systems generally requires the availability of large amounts of high-quality labeled domain data. In domains such as BioNLP, eHealth, NLP for Legal Purposes, NLP for Social Media and Journalism, etc., obtaining the needed volume of data manually labeled by domain experts is not usually feasible or affordable. In this work we propose a new method for text classification based on the use of detailed class descriptions instead of a large number of labeled instances for training the classifiers. Our method, experimentally tested on the classification of titles of scientific papers in the domain of the Sustainable Development Goals of the United Nations, consistently outperforms mainstream NLP classification approaches, radically faster and at a fraction of their cost, since it does not need a previous process of hand-labelling thousands of samples.</p>
      </abstract>
      <kwd-group>
<kwd>text classification</kwd>
        <kwd>class descriptions</kwd>
        <kwd>sustainable development goals</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>1.1. Description of the Task</title>
      </sec>
      <sec id="sec-1-2">
        <title>1.2. Summary of Contributions</title>
        <sec id="sec-1-2-1">
          <title>The main contributions of this work are:</title>
<p>• We propose the use of already existing or handcrafted detailed descriptions of the classes for multi-label sentence classification with PLMs, as a better performing and more resource-efficient approach than investing in manual labelling of samples.
• We propose guidelines to decide between working on the generation of detailed descriptions or investing in hand-labeling samples, considering the availability of either detailed descriptions or labeled samples, and, depending on this decision, to select the most appropriate multi-class classification technique with PLMs.
• We establish a new SOTA for the classification of titles of scientific papers by SDGs.
• We publicly release the most relevant datasets and code used in our experimentation.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        PLMs, such as BERT [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and GPT [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], have achieved state-of-the-art performance on many NLP tasks [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], and among them on multi-class text classification [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The research community has developed several lines of work to improve text classification in different data availability scenarios:
• When we have abundant unlabeled data related to the specific application domain but lack labeled data, weakly-supervised techniques [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] show promising results. The most recent of them leverage the capacities of transformer-based PLMs, like LOTClass [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], which uses label names as initial keywords and augments the keywords with BERT's MLM module to train classification models on unlabeled data, or FastClass by
      </p>
      <sec id="sec-2-1">
        <p>Xia et al. [<xref ref-type="bibr" rid="ref10">10</xref>], which proposes the use of dense text representation techniques in semantic spaces.
• In the case we have large amounts of unlabeled data, but not related to the domain or the task,</p>
        <p>
          Unsupervised text classification techniques [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] show the capacity to improve text classification.
• When lacking any data, PLMs allow the generation of improved, semantically meaningful text representation models like Sentence-BERT [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], and the enunciation of the text classification task as a natural language inference (NLI) problem is the SOTA technique [
          <xref ref-type="bibr" rid="ref2 ref25">2</xref>
          ]. Recently, Schopf et al. [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] proposed the combination of the embedding-based method Lbl2Vec and transformer-based PLMs to further improve their performance on unsupervised text classification.
        </p>
        <p>
          Focusing specifically on the use of descriptions of classes, there is also a body of research studying the framing of text classification as a question answering task, like the one proposed by Chai et al. [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. These techniques, in combination with strategies for the development of better class descriptions [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], label noise reduction methods [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], and the recent emergence of generative large language models (LLMs) [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], set the ground for future research in the use of descriptions for specific-domain NLP applications.
        </p>
        <p>
          Regarding the classification of scientific papers by SDGs, the related literature describes several approaches grouped under two different working principles:
• Boolean-query-based approaches for information retrieval from databases, like the ones developed by Elsevier [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ], Digital Science [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ], the University of Bergen [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ], the University of Auckland [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] and the AURORA European University Alliance*. The most relevant among them is the AURORA SDG Queries v5 method [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ] [25].
• NLP-based methods like the AURORA-ML* method [26] [27]. This approach comprises 169 models based on [28], one for each SDG target, fine-tuned on abstracts of papers obtained with the AURORA SDG Queries v5 method.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experimental Setup</title>
      <p>The following datasets contain all the data used in our experimentation:
• The "SDG-Descriptions Dataset", comprising 447 sentences of different semantic natures (SDG-Headlines, SDG-Titles, SDG-Targets and SDG-Indicators) developed by the UN and published in a dedicated website* describing the 17 SDGs.</p>
      <p>Altogether, we name the samples of this dataset SDG-Descriptions. Considering we have 447 descriptive sentences of SDGs, we have built a training dataset with 430 entailment samples and 7,152 contradiction samples.
• The "Paper Titles Gold Dataset", with 9,382 scientific paper titles labeled by experts. This dataset includes two families of samples that are disjoint, i.e., no paper title appears in both families:
– "Positive samples" of titles labeled with one</p>
      <p>
        or more specific SDGs they are related to.
– "Negative samples" of titles labeled with one or more specific SDGs they are not related to.
*https://aurora-universities.eu/
*https://github.com/Aurora-Network-Global/TMD
*https://metadata.un.org/sdg/
This Gold Dataset is a subset of the AURORA dataset [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], elaborated by surveying expert scientists, and it shows a human agreement level of 70.10% in this task. We have developed several splits of this dataset for training, development and evaluation purposes. The train-split contains more than 8,000 positive samples, and the test-split contains 2,086 labeled paper titles unevenly distributed by SDG but with the same amount of positive and negative samples for each one of them.
      </p>
      <p>The Classification Approaches and Models we have experimented with are:
• Fine-tuning classifiers from general PLMs. After experimenting with different general PLMs, we have selected the model of [29] for its better results. We have developed different classifiers by fine-tuning it on different amounts of samples of the train-split of the Paper Titles Gold Dataset, on different amounts of samples of the SDG-Descriptions Dataset, and on combinations of both of them.
• Zero-shot classification with NLI-PLMs. After experimenting with different NLI-PLMs and querying/prompting setups, we have obtained the best results querying the NLI-PLM of [29] with either SDG-Headlines or SDG-Titles and prompting the queries with the expression "The subject is".
• Few-shot classification. Building upon the previous approaches, we have developed a new method for multi-class text classification, fine-tuning the selected NLI-PLM on pairs of SDG-Description sentence / SDG-Headline, and applying the resulting model for NLI-based zero-shot classification of paper titles. For the initial fine-tuning we have built a training dataset with samples composed of pairs of sentences, the first being each one of the SDG description sentences and the second each one of the SDG-Headlines prompted with the text "The Sustainable Development Goal is". This way we have generated 17 samples from each SDG description sentence, out of which the one pairing the sentence with its corresponding SDG-Headline is labeled as "entailment" and all the rest (16) as "contradiction".</p>
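      <p>As a rough illustration, the pair-generation step described above can be sketched as follows (the function and variable names are ours, hypothetical, and not taken from the released code):</p>

```python
# Hypothetical sketch of the training-pair construction described above.
# `descriptions` holds (sdg_id, sentence) pairs for the SDG-Description
# sentences; `headlines` maps each of the 17 SDG ids to its SDG-Headline.
PROMPT = "The Sustainable Development Goal is"

def build_nli_pairs(descriptions, headlines):
    """Pair every description sentence with all 17 prompted SDG-Headlines:
    the pair matching the sentence's own SDG is labeled 'entailment',
    the other 16 pairs 'contradiction'."""
    pairs = []
    for sdg_id, sentence in descriptions:
        for hid, headline in headlines.items():
            label = "entailment" if hid == sdg_id else "contradiction"
            pairs.append((sentence, f"{PROMPT} {headline}", label))
    return pairs
```

      <p>With the 447 SDG-Description sentences this yields 447 × 16 = 7,152 contradiction pairs, matching the dataset size reported above.</p>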
      <p>We generate a zero-shot classifier by fine-tuning the selected model with this dataset.</p>
      <p>The classification of each test sample is finally performed by querying the model with the SDG-Headlines and prompting the queries with the expression "This is".</p>
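      <p>A minimal sketch of this querying step, with the NLI model abstracted behind a scoring callable (`entail_score` is a hypothetical stand-in for the fine-tuned model's entailment probability, not part of the original code):</p>

```python
PROMPT = "This is"

def classify_title(title, headlines, entail_score, k=3):
    """Score a paper title against each prompted SDG-Headline and return
    the k best-scoring SDG ids (the Topk-3 prediction criterion).
    `entail_score(premise, hypothesis)` returns a number: higher means
    the model considers the hypothesis more entailed by the premise."""
    scores = {sdg: entail_score(title, f"{PROMPT} {headline}")
              for sdg, headline in headlines.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

      <p>Any NLI-PLM exposing an entailment score can be plugged in as `entail_score`; the sketch only fixes the prompting and the Topk-3 selection.</p>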
      <p>These choices are the result of an extensive experimentation process comprising different PLMs and meta-parameters, looking for the best performing ones.</p>
      <p>Considering that a paper title may be related to several SDGs, our Metrics on the experiments consider as true positives (TP) the right predictions on positive samples, as false positives (FP) the wrong predictions on negative samples, and as true negatives (TN) the right predictions on negative samples. The Prediction Criterion used in this analysis of the results is Topk-3, i.e., the top 3 scores given by the models for each tested sample are considered predictions for all purposes.</p>
      <p>The current SOTA for the studied task and domain is the top macro-averaged F1-score of 55% offered by the AURORA-ML method referenced in Section 2. In the experimentation we have observed that the F1-score registered in a vanilla fine-tuning of the selected PLM on the full train-split of the "Paper Titles Gold Dataset" goes above 60%. Therefore, we have considered this vanilla approach our Baseline for the analysis of the impact of the use of SDG-Descriptions. In the zero-shot approach the considered baseline is the direct use of the selected NLI-PLM with a collection of keywords, namely SDG-Subjects, also enunciated by the UN and related to the SDGs, which we have not considered part of the SDG-Descriptions because they are not shaped as the descriptive sentences we intend to study.</p>
      <sec id="sec-3-1">
        <title>4. Results</title>
        <p>Table 1 shows a comparison between the best macro-averaged F1-scores obtained with our description-based models and the baselines. Our few-shot classification method using 447 publicly available SDG-Descriptions overcomes the general baseline trained with over 8,000 hand-labeled samples. On the other hand, our zero-shot classification using SDG-Descriptions lags far behind the zero-shot baseline.</p>
        <p>For the analysis of these results we consider the following two scenarios of data availability:
• A "Labeled samples available" scenario, in which different amounts of labeled samples are available. In our study we simulate this scenario by fine-tuning the base models with different fractions of labeled samples (75%, 50%, 25%, 10%, 5% and 1%) of the train-split of the "Paper Titles Gold Dataset".
• A "Class descriptions available" scenario, in which different amounts and types of description sentences of the classes are available. We simulate this in our study by splitting the SDG-Descriptions Dataset into subsets of SDG-Headlines, SDG-Titles, SDG-Targets and SDG-Indicators and fine-tuning the base models on cumulative combinations of them.</p>
        <p>Considering these simulated scenarios, we have studied how the baseline and the different SDG-Description-based models evolve with increasing numbers of available samples and descriptive sentences for training. In Figure 2 we can observe that the general baseline (blue line) requires almost 3,000 labeled samples to overcome our simplest model, trained exclusively on 447 description sentences (yellow line). Furthermore, if we continue training our description-based models with increasing numbers of hand-labeled samples, we can observe that the obtained fine-tuned classifier (light-green line) beats the top F1-score of the general baseline (+8,000 labeled samples) with only around 500 labeled samples in addition to the SDG-Descriptions. Furthermore, our few-shot classification model defines a new state of the art for the classification of scientific papers by SDGs when using all SDG-Descriptions and hand-labeled samples. At this point, the peak measured F1-score is 71.01%, slightly over the human agreement level of 70.10% observed in the AURORA dataset.</p>
        <p>Up to now we have studied the results as a whole, but the task includes 17 different classes that may behave differently. Table 2 shows the detailed global and per-SDG results of the test performed with our Few-Shot approach on the Titles-Test split of the Paper Titles Gold Dataset. SDG 6 "Clean water and sanitation" and SDG 17 "Partnership for the goals" show the worst results. The model has been trained with 21 sentences describing SDG 6 and 46 sentences describing SDG 17, similar to or higher than the number of sentences used to train other much better performing SDGs, like SDG 7 "Affordable and clean energy" (13 sentences, F1-score 83.93%) or SDG 3 "Good health and well-being" (42 sentences, F1-score 67.7%). This suggests that there is no clear correlation between the number of sentences included in the description and the performance of the model, and that the reasons for a better classification may lie in other features, probably related to the semantics of the description sentences and the sentences to be classified. The study of the features that make a description good for this classification approach is a line for further research.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Error Analysis</title>
      <p>We focus the error analysis on the results of the few-shot model, the best performing model among those that use exclusively SDG-Descriptions for training and classification. Table 3 summarizes how the model gives right, wrong or inconclusive predictions. More than 90% of the good predictions are obtained with the first (Top 1, 74.00%) and second (Top 2, 17.79%) highest scores. The average scoring pattern gives a relatively high value at Top 1 (0.55-0.77) and drops significantly at every next prediction, scoring in the range 0.03-0.15 at Top 2 and 0.002-0.02 at Top 3. Nevertheless, the highest average Top 2 (0.1514) and Top 3 (0.0214) scores correspond to right predictions obtained at second and third guesses. Both right and wrong predictions at Top 1 score on average around 0.77, meaning that the model is particularly confident even in its wrong predictions.</p>
      <p>Deepening one step further, the test dataset has a particular set of samples: those that have been labeled positively or negatively by more than one expert. They can be referred to as "strong true samples" or "strong false samples", depending on whether they are positively or negatively labeled. In opposition, we call the test samples labeled by a single person "weak true samples" or "weak false samples". Table 4 shows the results of the test on these particular samples. The model has been able to classify correctly all the strong true samples with a particularly high average Top 1 score of over 0.88, but at the same time has classified incorrectly around 30% (12) of the strong false samples.</p>
      <p>Overall, the behaviour of the model seems to follow what could be expected by common sense in the scenario we are working on, considering (i) that each tested paper title is most likely related to several SDGs but not to the same extent, (ii) that the label given to each test sample is not necessarily the one of the SDG it is most related to, and (iii) that the more SDGs a paper title is related to, the lower the score it will give to each one of them individually. Coherently, the scores observed in the "no coincidence" predictions of the positive samples happen to be the ones with the lowest scores. Also, the low average scores registered in the false positive coincidences at Top 2 and Top 3 can be explained as those debatable cases that, even among human observers, reduce the agreement level to the previously mentioned 70.1%. Nevertheless, the following results appear to be relevant failures of the model worth analysing in detail:
• The 63.02% of false positive coincidences with the highest scores (0.7795) at Top 1 prediction.
• The 12 bad predictions or false positive coincidences on strong false samples.
• The high rate of undetected true positives on SDG 6 and SDG 17.</p>
      <p>Appendix A shows several examples of these failures. Regarding the false positives in weak false samples, the wrong guesses are absolutely arguable and may fall within the measured roughly 30% human disagreement level, with the exception of the example of SDG 17 "Partnership for the goals" with the title "Tuple-based semantic and structural mapping for a sustainable interoperability", not objectively relatable to this SDG. When it comes to the false positives related to strong false samples, which have happened exclusively for samples of SDG 3 "Good health and well-being", we can observe several possible reasons for the failures, like:
• Debatable or arguable labelling.
• A possible tendency of the model to relate tobacco to health (SDG 3), and a tendency of experts not to do so when the paper titles refer to its economic dimensions.
• A difficulty of the model in distinguishing between animal health and human health.</p>
      <p>In the case of the undetected positive samples of SDG 6 "Clean water and sanitation" and SDG 17 "Partnership for the goals", all cases appear to be very debatable. An explanation may be that in these cases the titles of the papers do not properly describe the contents of the paper, or may even be misleading, but the experts have labeled the papers not by their title but by their content. For instance, the paper titled "Local renewable energy cooperatives: revolution in disguise?" may be related to SDG 6 "Clean water and sanitation", but the title itself suggests it may be more related to SDG 7 "Affordable and clean energy", as the model predicts; and the paper titled "Sustainability of small water supplies: Lessons from a Brazilian program (SESP/FSESP)" may of course be related to SDG 17 "Partnership for the goals", but the title suggests it may be mainly related to SDG 6 "Clean water and sanitation", as the model predicts.</p>
      <p>These phenomena are most likely related to the evident overlaps that exist between the SDGs. Figures 3 and 4 depict the co-occurrence and confusion matrices of the test. The co-occurrence matrix plots all Topk-3 predictions of the model on the positive samples of the test dataset. Generally the model predicts the right SDG most frequently, but we can also observe that SDG 15 "Life on land" is remarkably more predicted than the other SDGs, followed by SDG 12 "Responsible consumption and production", SDG 1 "No poverty" and SDG 8 "Decent work and economic growth". SDG 15 "Life on land" is even more prevalent than the labeled SDG in the case of SDG 6 "Clean water and sanitation", SDG 13 "Climate action", SDG 14 "Life below water" and SDG 17 "Partnership for the goals". The confusion matrix plots the wrong predictions of the model. In this case the most frequently mistaken prediction is for SDG 12 "Responsible consumption and production", followed by the same SDGs outstanding in the co-occurrence matrix.</p>
      <p>According to these results, the SDGs that most overlap with the rest are SDG 15 "Life on land", SDG 12 "Responsible consumption and production", SDG 1 "No poverty" and SDG 8 "Decent work and economic growth".</p>
      <p>This may mean that these SDGs are the ones that may most diversely impact the UN 2030 Agenda for Sustainable Development, which could be an excellent bonus insight offered by the model; but, once again, this may be related only to the different quality of the descriptions of each SDG, and it is certainly a question worth further study.</p>
      <sec id="sec-4-1">
        <title>6. Conclusions and future work</title>
        <p>This work offers initial experimental evidence that using detailed descriptions of the main classes that shape a specific domain has the potential to benefit Text Classification. All the experiments reported have been developed classifying scientific papers automatically to UN SDGs.</p>
        <p>The use of class descriptions may significantly reduce or even eliminate the need to develop hand-labeled samples for training NLP models, drastically reducing the development cost. Depending on the availability of descriptions of classes we recommend:
• […] for zero-shot exclusively with class descriptions, have the potential to outperform conventional classifiers fine-tuned on PLMs with thousands of hand-labeled samples.
• In the case of both detailed descriptions and labeled samples available: Conventional PLM classifiers fine-tuned with a combination of class descriptions and labeled samples have the potential to reduce the need of labeling by an order of magnitude, being able to establish a new SOTA in our case study.
• On the contrary, on a pure zero-shot approach, in cases with only a single keyword or description sentence available per class, the classical prompted keyword classification seems to be better than any similar description-sentence-based classifier.</p>
        <p>Also, not all descriptive sentences offer the same improvement potential: single sentences describing the whole class (SDG titles) and collections of single sentences each describing a particular relevant aspect of the class (SDG targets) contribute the most.</p>
        <p>Finally, the results of this initial experimental study suggest the following future lines of research:
• Extending the study to further specific-domain NLP applications to generate further evidence about the potential benefits of using class descriptions and to grasp their limitations.
• Applying the use of class descriptions in methods more sophisticated than the conventional NLP approaches applied in this work, to validate or refute the hypothesis that advanced NLP techniques like generative LLMs and QA tasks may also benefit from them.
• Deep-diving into what makes a description good for NLP applications, and exploring how advanced description development and improvement techniques can contribute.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Appendix A. Examples of relevant failures</title>
        <p>False positives in weak false samples (paper title, gold SDG):
1. "Food insecurity and effectiveness of behavioral interventions to reduce blood pressure, New York City, 2012-2013", SDG 3 "Good health and well-being".
2. "Global governance for facilitating access to medicines: Role of world health organization", SDG 3 "Good health and well-being".
3. "Equipping Preservice Elementary Teachers for Data Use in the Classroom", SDG 4 "Quality education".
4. "How to study varieties of opposition to gender+ equality in Europe?: Lessons from this book, conceptual building blocks, and puzzles to address", SDG 5 "Gender equality".
5. "RETRACTED ARTICLE: Comparative advantage analysis for water utilization in Hubei province based on NRCA model", SDG 6 "Clean water and sanitation".
6. "A study on factors affecting the youth employment rate: Focusing on data from 31 cities and counties in Gyeonggi-do, South Korea", SDG 8 "Decent work and economic growth".
7. "Analysis of the inclusions in 38Si7 spring steel with fatigue failure", SDG 9 "Industry, innovation and infrastructure".
8. "The development and transition of urban walking grey space in China, based on a unique model 'Langpeng'", SDG 11 "Sustainable cities and communities".
9. "Corporate sustainability in emerging markets: Insights from the practices reported by the Brazilian retailers", SDG 12 "Responsible consumption and production".
10. "Sensitivity analysis with the regional climate model COSMO-CLM over the CORDEX-MENA domain", SDG 13 "Climate action".
11. "Rainforest tourism, conservation and management: Challenges for sustainable development", SDG 15 "Life on land".
12. "Capitalizing on Criminal Accomplices: Considering the Relationship between Co-offending and Illegal Earnings", SDG 16 "Peace, justice and strong institutions".
13. "Tuple-based semantic and structural mapping for a sustainable interoperability", SDG 17 "Partnership for the goals".</p>
        <p>False positives in strong false samples (all SDG 3 "Good health and well-being"):
14. "Nature, scope and use of economic evaluation of healthcare programmes: With special reference to Pakistan".
15. "Endovascular Aortic Repair for Thoracic Aortic Injuries".
16. "Comparison between online and offline price of tobacco products using novel datasets".
17. "An Assessment of the Forward-Looking Hypothesis of the Demand for Cigarettes".
18. "Mycobacterium marinum infection in fish and man: Epidemiology, pathophysiology and management; a review".</p>
        <p>Undetected SDG 6 and SDG 17 true positives:
19. "An exploration of the boundaries of 'community' in community renewable energy projects: Navigating between motivations and context", SDG 6 "Clean water and sanitation".
20. "Typology of future clean energy communities: An exploratory structure, opportunities, and challenges", SDG 6 "Clean water and sanitation".
21. "A review of renewable energy investment in the BRICS countries: History, models, problems and solutions", SDG 17 "Partnership for the goals".
22. "Sustainability of small water supplies: Lessons from a Brazilian program (SESP/FSESP)", SDG 17 "Partnership for the goals".</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          , Attention is all you need,
          <year>2017</year>
          . arXiv:1706.03762.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>W.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Roth</surname>
          </string-name>
          ,
          <article-title>Benchmarking zero-shot text classification: Datasets, evaluation and entailment approach</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          , Association for Computational Linguistics, Hong Kong, China,
          <year>2019</year>
          , pp.
          <fpage>3914</fpage>
          -
          <lpage>3923</lpage>
          . URL: https://aclanthology.org/D19-1404. doi:10.18653/v1/D19-1404.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Vanderfeesten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Spielberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gunes</surname>
          </string-name>
          ,
          <article-title>Survey data of "Mapping Research Output to the Sustainable Development Goals (SDGs)"</article-title>
          ,
          <year>2020</year>
          . URL: https://doi.org/10.5281/zenodo.3813230. doi:10.5281/zenodo.3813230.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers),
          <source>Association for Computational Linguistics</source>
          , Minneapolis, Minnesota,
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          . URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Narasimhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Salimans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <article-title>Improving language understanding by generative pre-training</article-title>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>B.</given-names>
            <surname>Min</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ross</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sulem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. P. B.</given-names>
            <surname>Veyseh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. H.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Sainz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Agirre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Heinz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Roth</surname>
          </string-name>
          ,
          <article-title>Recent advances in natural language processing via large pre-trained language models: A survey</article-title>
          ,
          <year>2021</year>
          . arXiv:2111.01243.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. S.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <article-title>A survey on text classification: From traditional to deep learning</article-title>
          ,
          <source>ACM Transactions on Intelligent Systems and Technology (TIST)</source>
          <volume>13</volume>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>41</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name><given-names>J.</given-names> <surname>Han</surname></string-name>
          ,
          <article-title>Weakly-supervised neural text classification</article-title>
          ,
          <source>in: Proceedings of the 27th ACM International Conference on Information and Knowledge Management</source>
          , CIKM '18, Association for Computing Machinery, New York, NY, USA,
          <year>2018</year>
          , pp.
          <fpage>983</fpage>
          -
          <lpage>992</lpage>
          . URL: https://doi.org/10.1145/3269206.3271737. doi:10.1145/3269206.3271737.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name><given-names>J.</given-names> <surname>Han</surname></string-name>
          ,
          <article-title>Text classification using label names only: A language model self-training approach</article-title>
          ,
          <source>in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          , Association for Computational Linguistics, Online,
          <year>2020</year>
          , pp.
          <fpage>9006</fpage>
          -
          <lpage>9017</lpage>
          . URL: https://aclanthology.org/2020.emnlp-main.724. doi:10.18653/v1/2020.emnlp-main.724.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>T.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name><given-names>Y.</given-names> <surname>Chang</surname></string-name>
          ,
          <article-title>FastClass: A time-efficient approach to weakly-supervised text classification</article-title>
          ,
          <source>in: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Abu Dhabi, United Arab Emirates,
          <year>2022</year>
          , pp.
          <fpage>4746</fpage>
          -
          <lpage>4758</lpage>
          . URL: https://aclanthology.org/2022.emnlp-main.313.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Thangaraj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sivakami</surname>
          </string-name>
          ,
          <article-title>Text classification techniques: A literature review</article-title>
          ,
          <source>Interdisciplinary Journal of Information, Knowledge, and Management</source>
          <volume>13</volume>
          (
          <year>2018</year>
          )
          <fpage>117</fpage>
          -
          <lpage>135</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name><given-names>I.</given-names> <surname>Gurevych</surname></string-name>
          ,
          <article-title>Sentence-BERT: Sentence embeddings using Siamese BERT-networks</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          , Association for Computational Linguistics, Hong Kong, China,
          <year>2019</year>
          , pp.
          <fpage>3982</fpage>
          -
          <lpage>3992</lpage>
          . URL: https://aclanthology.org/D19-1410. doi:10.18653/v1/D19-1410.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>T.</given-names>
            <surname>Schopf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Braun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Matthes</surname>
          </string-name>
          ,
          <article-title>Evaluating unsupervised text classification: Zero-shot and similarity-based approaches</article-title>
          ,
          <year>2023</year>
          . arXiv:2211.16285.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name><given-names>D.</given-names> <surname>Chai</surname></string-name>,
          <string-name><given-names>W.</given-names> <surname>Wu</surname></string-name>,
          <string-name><given-names>Q.</given-names> <surname>Han</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Wu</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Li</surname></string-name>,
          <article-title>Description based text classification with reinforcement learning</article-title>,
          <source>in: International Conference on Machine Learning</source>,
          <year>2020</year>.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name><given-names>X.</given-names> <surname>Zhu</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Peng</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Guo</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Dietze</surname></string-name>,
          <article-title>Generating effective label description for label-aware sentiment classification</article-title>,
          <source>Expert Syst. Appl.</source>
          <volume>213</volume>
          (<year>2023</year>) 119194. URL: https://doi.org/10.1016/j.eswa.2022.119194. doi:10.1016/j.eswa.2022.119194.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name><given-names>X.</given-names> <surname>Gao</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Zhu</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Chu</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>W.</given-names> <surname>Ruan</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Zhao</surname></string-name>,
          <article-title>Enhancing robust text classification via category description</article-title>,
          <source>2022 IEEE International Conference on Data Mining (ICDM)</source>
          (<year>2022</year>)
          <fpage>151</fpage>-<lpage>160</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name><given-names>W. X.</given-names> <surname>Zhao</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Zhou</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Tang</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Hou</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Min</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Zhang</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Zhang</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Dong</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Du</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Yang</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Chen</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Chen</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Jiang</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Ren</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Tang</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Liu</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Liu</surname></string-name>,
          <string-name><given-names>J.-Y.</given-names> <surname>Nie</surname></string-name>,
          <string-name><given-names>J.-R.</given-names> <surname>Wen</surname></string-name>,
          <article-title>A survey of large language models</article-title>,
          <year>2023</year>. arXiv:2303.18223.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name><given-names>B.</given-names> <surname>Jayabalasingham</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Boverhof</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Agnew</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Klein</surname></string-name>,
          <article-title>Identifying research supporting the united nations sustainable development goals</article-title>,
          <year>2019</year>. URL: https://elsevier.digitalcommonsdata.com/datasets/87txkw7khs/1. doi:10.17632/87txkw7khs.1.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name><given-names>M.</given-names> <surname>Rivest</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Kashnitsky</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Bédard-Vallée</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Campbell</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Khayat</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Labrosse</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Pinheiro</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Provençal</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Roberge</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>James</surname></string-name>,
          <article-title>Improving the scopus and aurora queries to identify research that supports the united nations sustainable development goals (sdgs) 2021</article-title>,
          <year>2021</year>. URL: https://elsevier.digitalcommonsdata.com/datasets/9sxdykm8s4/. doi:10.17632/9sxdykm8s4.4.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name><given-names>G.</given-names> <surname>Roberge</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Kashnitsky</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>James</surname></string-name>,
          <article-title>Elsevier 2022 sustainable development goals (sdg) mapping</article-title>,
          <year>2022</year>. URL: https://elsevier.digitalcommonsdata.com/datasets/6bjy52jkm9. doi:10.17632/6bjy52jkm9.1.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          Digital Science,
          <string-name><given-names>J.</given-names> <surname>Wastl</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Porter</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Draux</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Fane</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Hook</surname></string-name>,
          <source>Contextualizing sustainable development research</source>,
          <year>2020</year>. URL: https://digitalscience.figshare.com/articles/report/Contextualizing_Sustainable_Development_Research/12200081. doi:10.6084/m9.figshare.12200081.v2.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name><given-names>C.</given-names> <surname>Armitage</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Lorenz</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Mikki</surname></string-name>,
          <article-title>Replication data for: Mapping scholarly publications related to the Sustainable Development Goals: Do independent bibliometric approaches get the same results?</article-title>,
          <year>2020</year>. URL: https://doi.org/10.18710/98CMDR. doi:10.18710/98CMDR.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name><given-names>W.</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>W.</given-names> <surname>Kang</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Mu</surname></string-name>,
          <article-title>Mapping research to the sustainable development goals (sdgs)</article-title>,
          <year>2023</year>. URL: https://www.researchsquare.com/article/rs-2544385/v2. doi:10.21203/rs.3.rs-2544385/v2.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name><given-names>M.</given-names> <surname>Vanderfeesten</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Otten</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>Spielberg</surname></string-name>,
          <article-title>Search Queries for "Mapping Research Output to the Sustainable Development Goals (SDGs)" v5.0.2</article-title>,
          <year>2020</year>. URL: https://doi.org/10.5281/zenodo.4883250. doi:10.5281/zenodo.4883250.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name><given-names>F.</given-names> <surname>Schmidt</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Vanderfeesten</surname></string-name>,
          <article-title>Evaluation on accuracy of mapping science to the United Nations' Sustainable Development Goals (SDGs) of the Aurora SDG queries</article-title>,
          <year>2021</year>. URL: https://doi.org/10.5281/zenodo.4964606. doi:10.5281/zenodo.4964606.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name><given-names>M.</given-names> <surname>Vanderfeesten</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Jaworek</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Keßler</surname></string-name>,
          <article-title>AI for mapping multi-lingual academic papers to the United Nations' Sustainable Development Goals (SDGs)</article-title>,
          <year>2022</year>. URL: https://doi.org/10.5281/zenodo.6487606. doi:10.5281/zenodo.6487606.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name><given-names>R.</given-names> <surname>Jaworeck</surname></string-name>,
          <article-title>SDG BERT - Multi-language Multi-label BERT model for classifying texts to Sustainable Development Goals (SDGs) based on Aurora SDG Query Model v5</article-title>,
          <year>2022</year>. URL: https://doi.org/10.5281/zenodo.7304547. doi:10.5281/zenodo.7304547.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name><given-names>J.</given-names> <surname>Devlin</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Chang</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Lee</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Toutanova</surname></string-name>,
          <article-title>BERT: pre-training of deep bidirectional transformers for language understanding</article-title>,
          <source>CoRR abs/1810.04805</source>
          (<year>2018</year>). URL: http://arxiv.org/abs/1810.04805. arXiv:1810.04805.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name><given-names>M.</given-names> <surname>Lewis</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Liu</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Goyal</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Ghazvininejad</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Mohamed</surname></string-name>,
          <string-name><given-names>O.</given-names> <surname>Levy</surname></string-name>,
          <string-name><given-names>V.</given-names> <surname>Stoyanov</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Zettlemoyer</surname></string-name>,
          <article-title>BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension</article-title>,
          <source>CoRR abs/1910.13461</source>
          (<year>2019</year>). URL: http://arxiv.org/abs/1910.13461. arXiv:1910.13461.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>