<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Overview of the FoRC@NSLP2025 Shared Task: Field of Research Classification for Computational Linguistics and Natural Language Processing Publications</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Maria Francis</string-name>
          <email>maria.francis287@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Raia Abu Ahmad</string-name>
          <email>raia.abu_ahmad@dfki.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ekaterina Borisova</string-name>
          <email>ekaterina.borisova@dfki.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Georg Rehm</string-name>
          <email>georg.rehm@dfki.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Center for Mind/Brain Sciences (CIMeC), University of Trento</institution>
          ,
          <addr-line>Trento</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Deutsches Forschungszentrum für Künstliche Intelligenz GmbH</institution>
          ,
          <addr-line>Berlin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>The winning system combines k-NN retrieval over paper embeddings with zero-shot prompting using large language models, achieving a micro-F1 score of 0.68 and a macro-F1 of 0.66, outperforming our baselines as well as results from last year's iteration.</p>
        <p>Keywords: field of research classification, research topic classification, shared task, scholarly information processing</p>
        <p>In the face of an increasingly interconnected research landscape, effectively organising scholarly knowledge is an essential task. As the number of scientific publications continues to expand - doubling approximately every 17 years [1] - efficient ways to organise and navigate scholarly literature are becoming more relevant. Digital platforms, such as academic search engines, bibliographic databases, and knowledge graphs, now play a crucial role in representing both the content of individual works and the relationships among them [2, 3, 4, 5]. Such systems depend on classification schemes to provide a standardised way to group papers by their disciplinary focus, which in turn plays a role in positioning publications within their research contexts.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Subfields in CL/NLP evolve rapidly and often overlap [9, 10, 11]. Yet, one of the field’s primary resources, the ACL
Anthology [12], does not offer a built-in topic classification system, limiting its application to structured
search, exploration, and analysis.</p>
      <p>To address this gap, we organise the second iteration of the Field of Research Classification (FoRC)
shared task at the Natural Scientific Language Processing (NSLP) Workshop 2025. The previous iteration
hosted two subtasks: 1) single-label classification of general academic papers into a broad research
taxonomy, and 2) fine-grained multi-label classification of CL/NLP publications [13]. This year, we
focus exclusively on the second subtask. We use the same dataset as the previous iteration, namely
FoRC4CL – a manually annotated corpus of 1,500 ACL Anthology papers labelled using Taxonomy4CL,
a hierarchical taxonomy of CL/NLP research areas [11]. To encourage participation and facilitate
the exploration of low-resource learning methods, for this year’s iteration, we expand the dataset by
releasing a large-scale, weakly labelled extension consisting of over 41,000 ACL publications [14].1</p>
      <p>The task was hosted on CodaBench [15], and although ten teams registered to participate, there
was only one system submission. Nevertheless, this submission demonstrates substantial performance
enhancements over last year’s results, improving macro-precision from 0.39 to 0.65 and macro-F1 from
0.43 to 0.66 – an increase of 0.26 and 0.23, respectively. The shared task had the following schedule:
• Training and testing data release: February 18, 2025
• System submissions deadline: March 25, 2025
• Paper submissions: March 27, 2025
• Notification of acceptance: April 10, 2025
• Camera-ready Submission: April 17, 2025</p>
      <p>The rest of the paper is structured as follows. Section 2 presents related work on extreme multi-label
classification techniques, which achieved strong performance in last year’s shared task, as well as
related work in weak supervision. Section 3 provides a detailed description of the task. Section 4
outlines the methodology for constructing and weakly annotating the supplementary dataset. Section 5
presents the baseline systems and submitted models from both this and last year’s iterations. Finally,
Section 6 offers a discussion of the results and findings, and Section 7 concludes our paper.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>A wide range of approaches have been proposed to address Extreme Multi-Label Classification (XMLC)
[16], a task characterised by a large label space and label imbalance. Although XMLC problems typically
span far more classes than seen in FoRC, the highest-performing submission from the 2024 iteration
achieved good results by treating it as an XMLC problem, demonstrating the adaptability of this
paradigm even in more constrained settings [17].</p>
      <p>One approach for XMLC is the use of One-vs-All classifiers [ 18, 19, 20], which treat each label as
an independent binary classification task. While conceptually simple, these methods face scalability
challenges due to their high computational cost associated with training and inference across thousands
of binary classifiers [ 20].</p>
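As a minimal illustration of the One-vs-All idea (a toy sketch with invented data, not the DiSMEC/PD-Sparse implementations cited above), scikit-learn's `OneVsRestClassifier` fits one independent binary classifier per label:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy multi-label data: 200 documents, 20 features, 4 labels (invented).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
W = rng.normal(size=(20, 4))
Y = (X @ W > 0.5).astype(int)  # binary indicator matrix, one column per label

# One-vs-Rest reduces multi-label classification to one binary task per label.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X, Y)

print(len(clf.estimators_))      # 4 underlying binary classifiers
print(clf.predict(X[:5]).shape)  # (5, 4): one decision per document and label
```

Because one estimator is trained per label, training and inference cost grow linearly with the size of the label space, which is exactly the scalability issue noted above.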
      <p>To mitigate this, embedding-based approaches have been proposed, which project instances and labels
into a shared low-dimensional space and perform prediction via similarity search. For example, SLEEC
[21] uses sparse local embeddings to capture non-linear label manifolds, while its successor, AnnexML
[22], improves scalability by constructing a k-nearest neighbors (k-NN) graph of label embeddings for
approximate nearest neighbor search.</p>
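The embedding-plus-similarity-search idea can be sketched as follows; random vectors stand in for learned embeddings, and this is a simplification rather than SLEEC or AnnexML itself:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
# Invented stand-ins for learned low-dimensional embeddings and label sets.
doc_emb = rng.normal(size=(100, 8))  # 100 training documents, 8 dimensions
doc_labels = [set(map(int, rng.choice(20, 3, replace=False)))
              for _ in range(100)]

# Index the embeddings, then predict a query by looking up its neighbours.
index = NearestNeighbors(n_neighbors=5, metric="cosine").fit(doc_emb)
query = rng.normal(size=(1, 8))
dist, idx = index.kneighbors(query)

# Simplest possible decoding: union of the neighbours' label sets.
predicted = set().union(*(doc_labels[i] for i in idx[0]))
print(sorted(predicted))
```

AnnexML's contribution is making the neighbour lookup approximate and graph-based so it stays fast for very large label spaces.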
      <p>Tree-based approaches, such as Parabel [23], Bonsai [24], or FastXML [25], tackle XMLC by recursively
splitting the feature or label space, reducing both training and prediction complexity while maintaining
competitive performance.</p>
      <p>1 The dataset is publicly available at https://zenodo.org/records/14901529</p>
      <p>More recently, generative models have reframed XMLC as a sequence generation task. XLGen [26],
for example, fine-tunes T5 [ 27] and BART [28] to generate label sequences, further enhancing output
coherence by integrating hierarchical clustering to better model label dependencies.</p>
      <p>In cases where annotated data is scarce or expensive to obtain, weak supervision has emerged as an
alternative to traditional supervised learning. Weak supervision refers to strategies that generate noisy
labels using inexpensive or indirect sources, for instance external knowledge bases [29], crowdsourced
annotations of varying quality [30], heuristic rules [31], feature annotations [32], or predictions from
pre-trained models.</p>
      <p>To avoid poor generalisation, specialised learning strategies exist to mitigate the noise introduced
by the weak labels – BOND [33] and COSINE [34], for example, apply teacher-student frameworks.
Another strategy is to filter weakly labelled data for instances where labels are likely to be false – CleaR
[35] preferentially exposes parameter-efficient fine-tuning modules to clean data while bypassing the
noisy ones. Non-neural approaches such as bagging, boosting, outlier detection, and k-NN have also
been employed to identify and discard erroneous labels [36, 37, 38, 39].</p>
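As one illustration of the non-neural filtering strategies mentioned above, a k-NN check can flag weakly labelled instances whose label disagrees with their neighbourhood. The data below is synthetic and invented for the sketch:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated clusters on a line; weak labels with injected noise.
X = np.concatenate([np.linspace(0, 1, 50),
                    np.linspace(10, 11, 50)]).reshape(-1, 1)
y_weak = np.array([0] * 50 + [1] * 50)
y_weak[[3, 17, 42]] = 1   # three noisy weak labels inside cluster 0

# k-NN cleaning: flag instances whose weak label disagrees with the label
# predicted from their neighbourhood.
knn = KNeighborsClassifier(n_neighbors=7).fit(X, y_weak)
suspect = np.where(knn.predict(X) != y_weak)[0]
print(suspect)   # indices of likely-mislabelled instances
```

Flagged instances can then be discarded or down-weighted before training the final model.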
      <p>The effectiveness of weak supervision, however, has been questioned in recent work. Zhu et al.
[40] demonstrate that many weakly supervised learning methods depend heavily on access to a clean
validation set, and that models trained directly on this clean subset can outperform weakly supervised
learning (WSL) methods when as few as 15 cleanly labelled examples per class are available. Nonetheless,
these findings are based on balanced classification tasks with up to 10 classes. In contrast, the FoRC
shared task has 170 labels with a highly unbalanced label distribution, where many classes indeed have
fewer than 10 annotated examples. Therefore, we find that weak supervision may still prove useful in
the context of our task.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Task Description</title>
      <p>The FoRC 2025 task is a fine-grained multi-label classification task with 170 total classes represented
over three hierarchical levels. The task is described as follows:</p>
      <p>Given an ACL publication’s (meta-)data, predict all associated labels that describe
the main contributions of the publication from a taxonomy of 170 (sub-)topics in</p>
      <p>CL/NLP.</p>
      <p>We use FoRC4CL [11] as a dataset, which consists of 1,500 CL/NLP articles extracted from the ACL
Anthology [12] from the years 2016 to 2022. The data includes three splits, which were created by
shuffling the corpus randomly into 70/15/15 for training, validation, and testing, respectively. The
articles were manually annotated by graduate students using Taxonomy4CL, a three-level taxonomy
of (sub-)topics in CL/NLP. For more detailed information on the taxonomy creation, its topics, and the
annotation process of FoRC4CL, we refer the reader to our previous work [11].</p>
      <p>Originally, the following metadata was available for each publication: ACL Anthology ID, title,
abstract, author(s), URL to the PDF, publisher, publication year and month, proceedings title, DOI, and
venue. This year, instead of only the URL to the full text being given, the full text of each article is
directly available in the dataset. The data is additionally supplemented with 41,000 weakly labelled
articles; the overall available splits and their sizes are summarised in Table 1. The task is evaluated
using micro, macro, and weighted scores of precision, recall, and F1.</p>
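The three reported averages can be computed directly with scikit-learn; the indicator matrices below are toy values invented for the sketch:

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Toy binary indicator matrices: rows are papers, columns are taxonomy labels.
y_true = np.array([[1, 0, 1, 0],
                   [1, 1, 0, 0],
                   [0, 0, 1, 1]])
y_pred = np.array([[1, 0, 1, 0],
                   [1, 0, 0, 0],
                   [0, 0, 1, 0]])

# The task reports micro-, macro-, and weighted-averaged precision/recall/F1.
scores = {}
for avg in ("micro", "macro", "weighted"):
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average=avg, zero_division=0)
    scores[avg] = (p, r, f1)
    print(f"{avg}: P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

Micro-averaging pools all label decisions before computing the scores, macro-averaging gives every label equal weight regardless of frequency, and weighted-averaging weights each label's score by its support, which is why macro scores are the most sensitive to rare classes in an imbalanced taxonomy like this one.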
    </sec>
    <sec id="sec-4">
      <title>4. Construction of the Weakly Labelled Dataset</title>
      <p>
        To construct the weakly labelled supplementary dataset, we start by sourcing 80,013 scholarly articles
from the ACL Anthology Corpus [41]. To prevent data leakage, we exclude any articles already present
in the FoRC4CL dataset. We include only papers published between 2013 and 2022, which ensures
alignment with the classes in Taxonomy4CL while maintaining a dataset of sufficient size. We extract
metadata fields consistent with those available in FoRC4CL, namely ACL Anthology ID, abstract, URL
to the full text, publisher, year and month of publication, book title, author(s), DOI, and venue. When
metadata values are missing in either FoRC4CL or the supplementary dataset, we add them using
OpenAlex [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] whenever possible. The dataset for the 2025 iteration of the FoRC shared task is publicly
available on Zenodo [14].
      </p>
      <p>Prior work in last year’s iteration of the FoRC shared task demonstrated that using full article texts
for model training is a promising strategy, with the top-performing teams in both subtasks using this
approach [42, 17]. Therefore, we retrieve the full text of each article and integrate it into both datasets.
We train a One-vs-Rest Logistic Regression model on the full FoRC4CL dataset including full article
texts, and we apply that model to the supplementary dataset to generate weak labels. The reason
for choosing this model is based on the experiments presented in Section 4.1. As a postprocessing
step, we remove any predicted labels whose corresponding superclass was not also predicted to ensure
consistency within the label hierarchy.</p>
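The hierarchy-consistency postprocessing described above can be sketched as follows. The parent map and label names are hypothetical stand-ins for Taxonomy4CL, and the iteration strategy is our assumption, not the authors' exact code:

```python
# Hypothetical parent map for a tiny hierarchy (None marks a top-level topic).
PARENT = {
    "Machine Translation": None,
    "Low-Resource MT": "Machine Translation",
    "Multilingual MT": "Machine Translation",
    "Semantics": None,
    "Word Sense Disambiguation": "Semantics",
}

def enforce_hierarchy(labels):
    """Drop any predicted label whose parent was not also predicted."""
    kept = set(labels)
    changed = True
    while changed:  # iterate: removing a parent may orphan deeper labels
        changed = False
        for lab in list(kept):
            parent = PARENT.get(lab)
            if parent is not None and parent not in kept:
                kept.discard(lab)
                changed = True
    return kept

print(enforce_hierarchy({"Low-Resource MT", "Machine Translation",
                         "Word Sense Disambiguation"}))
```

Here "Word Sense Disambiguation" is dropped because its parent "Semantics" was not predicted, while the "Machine Translation" branch survives intact.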
      <p>Taxonomy4CL comprises 181 classes, of which 46 belong to the first, 109 to the second, and 26 to the
third hierarchical level. Not all classes appear as labels in the FoRC4CL dataset, with 7 level-1, 7 level-2,
and 4 level-3 labels remaining unused. The set of labels present in the FoRC4CL dataset is identical to
those found in the weakly labelled dataset. Both the FoRC4CL and the weakly labelled datasets exhibit
significant class imbalance, with their label distributions closely adhering to Zipf’s Law (see Figure 1).</p>
      <sec id="sec-4-1">
        <title>4.1. Selection of Model for Weak Labeling</title>
        <p>To select a model to use for weakly labelling the additional ACL publications, we assess the performance
of various models by training and evaluating them on the FoRC4CL dataset. We evaluate the models
using micro, macro, and weighted F1 scores. We assign the most importance to the macro F1 score. Our
experimentation covers classic machine learning (ML) algorithms and Transformer models.</p>
        <sec id="sec-4-1-1">
          <title>4.1.1. Classic ML Algorithms</title>
          <p>Prior research suggests that directly training a transformer model may not be ideal for this task, and
that simpler machine learning models can yield superior results [17]. Given this insight, we evaluate the
effectiveness of various ensemble learning techniques, including One-vs-Rest Logistic Regression,
One-vs-Rest Support Vector Classification (SVC), Random Forest, and XGBoost classifiers, across different
hyperparameter configurations. Initial experiments show that One-vs-Rest Logistic Regression achieves
slightly higher F1 scores when the input text is stemmed, as shown in Table 2. Thus, we adopt this
preprocessing strategy for all subsequent experiments. We tokenize our input using a unigram
tf-idf tokenizer. We also explore using bi- and trigram tf-idf tokenization, but observe a decrease in
performance (see Table 3) alongside an increase in computational cost.</p>
          <p>The results of a subset of our experiments are presented in Table 4, and a comprehensive set of
experimental results is provided in Table 9 in the Appendix. All Logistic Regression and SVC models
employ the One vs. Rest ensemble technique. We achieve the best performance on the test set using
the One vs. Rest Logistic Regression model with a liblinear solver and an L1 penalty, which achieves
a weighted F1-score of 0.65. More detailed results of micro-, macro-, and weighted-F1 scores per
hierarchical level are shown in Table 5. We notice that Random Forest models achieve the worst
performance – this may be because each tree in the forest only sees a portion of the total dataset,
making it more likely to miss out on some uncommon labels. Training each tree on the full dataset
instead may mitigate this issue and increase performance. Notably, we find that incorporating the full
text of the article as part of training data increases classification performance in most cases. The final
model performs better on common classes than on sparse ones: a simple linear regression analysis
revealed a weak but significant positive relationship between class size in the training set and class
performance on the test set (β = 0.0018, SE = 0.0003, p = 1.975e-07, R² = 0.141).</p>
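The reported regression can be reproduced in form with `scipy.stats.linregress`; the per-class data below is synthetic, generated to mirror the reported slope, so the printed statistics will not match the paper's exact values:

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(42)
# Synthetic stand-ins: per-class training frequency and per-class test F1
# with a weak positive relationship, mirroring the analysis above.
class_size = rng.integers(1, 300, size=170)
f1 = np.clip(0.3 + 0.0018 * class_size + rng.normal(0, 0.15, 170), 0, 1)

fit = linregress(class_size, f1)
print(f"slope={fit.slope:.4f}  p={fit.pvalue:.2e}  R^2={fit.rvalue**2:.3f}")
```

A positive, significant slope with a small R² is exactly the "weak but significant" pattern the paper describes: class frequency helps, but explains only a small share of the variance in per-class performance.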
        </sec>
        <sec id="sec-4-1-2">
          <title>4.1.2. Transformer Models</title>
          <p>We extend our experiments to include various transformer-based models, which are trained, validated,
and tested on FoRC4CL. We include BERT [43],2 DeBERTa [44],3 SciBERT [45],4 SPECTER [46],5 and
SciNCL [47].6 The latter three models are pre-trained on scientific corpora and use a vocabulary derived
from scientific texts. The results of these experiments are presented in Table 6.
2https://huggingface.co/google-bert/bert-base-uncased
3https://huggingface.co/microsoft/deberta-v3-base
4https://huggingface.co/allenai/scibert_scivocab_uncased
5https://huggingface.co/allenai/specter
6https://huggingface.co/malteos/scincl</p>
          <p>We test our models on two loss functions – Binary Cross-Entropy with Logits (BCE Loss) and Focal
Loss. Focal Loss handles strong class imbalances by down-weighting classes that are easy to predict
[48]. All models use AdamW [49] as an optimiser, and a learning rate of 4e-5, which we selected based
on preliminary trials on SciBERT and SciNCL. The results of these trials are shown in Figure 2. To
prevent overfitting, we implement early stopping with a patience of 3 on the validation loss.</p>
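The effect of Focal Loss relative to BCE can be illustrated with a small numpy sketch (per-label binary losses on invented probabilities, with γ=2; the actual training used PyTorch-style losses on logits):

```python
import numpy as np

def bce(p, y, eps=1e-9):
    # Standard per-label binary cross-entropy on predicted probabilities.
    return -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def focal(p, y, gamma=2.0, eps=1e-9):
    # Focal loss down-weights easy, well-classified labels by (1 - p_t)^gamma.
    p_t = np.where(y == 1, p, 1 - p)
    return -((1 - p_t) ** gamma) * np.log(p_t + eps)

y = np.array([1.0, 0.0])
p_easy = np.array([0.95, 0.05])   # confident, correct predictions
p_hard = np.array([0.30, 0.70])   # uncertain predictions

# Focal loss shrinks the easy example far more than BCE does.
print(bce(p_easy, y).sum(), focal(p_easy, y).sum())
print(bce(p_hard, y).sum(), focal(p_hard, y).sum())
```

Because confident correct predictions are damped almost to zero while hard examples keep most of their loss, the gradient signal concentrates on the rare, difficult labels, which is the motivation for trying it on this imbalanced taxonomy.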
          <p>The best-performing transformer model, SPECTER with BCE loss, trained for 30 epochs before early
stopping and achieved a weighted F1-score of 0.4. While this performance is not necessarily poor, we
manually check the outputs of the model and find that these models predict only a small subset of
the possible labels. SPECTER + BCE uses only 30 of the 170 possible classes, predominantly the most
common ones. This adherence to common labels is also reflected in the low macro-F1 score compared
to weighted-F1. Furthermore, none of the models produced predictions from the second or third levels
of the label hierarchy. These findings suggest that these models are not well-suited for generating
useful weakly labelled instances for this task.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. System Descriptions and Results</title>
      <p>The following section provides an overview of all systems evaluated in the current and previous edition
of the FoRC shared task, including both baseline and submitted models. The results for each system are
presented in Table 7. We include models from the 2024 task to contextualise progress, particularly in
light of the limited number of submissions in the 2025 edition. While results from 2024 are reported in
terms of micro and macro scores, we additionally report weighted scores for systems from 2025.</p>
      <sec id="sec-5-1">
        <title>5.1. Baseline Systems</title>
        <p>Due to its simplicity and strong performance in the preliminary experiments for constructing the weakly
labelled dataset, we adopt Logistic Regression as the baseline model for this year’s iteration. The model
is trained with balanced class weights, using the liblinear solver and an L1 regularisation penalty. We
train the baseline on both the FoRC4CL training set and the weakly labelled supplementary dataset.
For completeness, we train once using full texts and once without. Otherwise, all available metadata
are used in training for both settings.</p>
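The baseline configuration described above can be sketched as a scikit-learn model; the One-vs-Rest wrapping follows Section 4, the toy data is invented, and all unstated hyperparameters are scikit-learn defaults:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# 2025 baseline configuration: balanced class weights, liblinear solver,
# L1 regularisation penalty.
base = LogisticRegression(class_weight="balanced", solver="liblinear",
                          penalty="l1", C=1.0)
clf = OneVsRestClassifier(base)

rng = np.random.default_rng(7)
X = rng.normal(size=(120, 10))
Y = (X[:, :3] > 0).astype(int)  # toy 3-label indicator matrix
clf.fit(X, Y)
print(clf.predict(X[:2]).shape)
```

In the real setting, `X` would be the tf-idf features over the (stemmed) metadata and, in one configuration, the full texts.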
        <p>The baseline model for the 2024 iteration of the task was SciNCL fine-tuned on the train split of
the FoRC4CL dataset. The input features were article titles and abstracts, and taxonomy labels were
multi-hot encoded. Hierarchical information was not used during training and instead was flattened.
The baseline was trained for three epochs with BCEWithLogits as the loss function and AdamW as
the optimiser. All other hyperparameters were left at their default values as defined in Hugging Face’s
AutoModelForSequenceClassification class.7</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Submitted Systems</title>
        <p>Submission from the 2025 iteration. The 2025 iteration of the shared task received one submission
from the team KBOGAS, who explored three approaches to the FoRC classification problem: 1) k-NN
over article embeddings, 2) graph neural networks (GNNs), and 3) zero-shot classification using large
language models (LLMs). Their best-performing approach combines the first and third strategies in a
two-stage pipeline, first retrieving likely candidate labels for a given query via k-NN, then refining the
selection using zero-shot LLMs. The k-NN approach alone, which constitutes the baseline, achieves
high recall, while the LLM boosts precision. To describe in more detail: the k-NN approach uses
Sentence Transformers [50] to embed the titles and abstracts of the training set. Then, the query item is
embedded, and the k most similar embeddings by cosine similarity are selected. A subset of their labels
is chosen as the final classification using a distance-weighted voting scheme with a voting threshold
of 0.3. With increasing values of k, recall increases, converging around 90% at k=20. The team therefore
chose a fixed k of 19. The GNN approach constructs a graph using embedding-based similarity
and applies a node classification model, but this does not outperform the baseline. For the LLM
component, the team experimented with several prompting strategies, varying the selection of possible
labels that are passed to the LLM. Either the full set of Taxonomy4CL labels was included in the prompt,
or only a smaller set of labels that were deemed as likely candidates using the aforementioned k-NN
algorithm was included. The authors also experiment with the inclusion of hierarchical information.
All components were trained and evaluated using only titles and abstracts; the provided weakly labelled
data and full texts were not utilised. KBOGAS’ best model achieves a micro-F1 of 0.68, macro-F1 of 0.66,
and a weighted-F1 of 0.69, surpassing baselines and last year’s results in almost all metrics.</p>
        <p>Submissions from the 2024 iteration. Two systems were submitted for the 2024 iteration of the
shared task: one by CAU&amp;ZBE [17] and another by CUFE. CAU&amp;ZBE outperform CUFE on all metrics.
As CUFE did not provide any system description, we proceed to describe the system submitted by
CAU&amp;ZBE. Given the large label set, the imbalanced class distribution, and the hierarchical structure
of the taxonomy, CAU&amp;ZBE approach FoRC4CL as an XMLC task. Accordingly, they experimented
with models commonly used in the XMLC literature: Parabel [23] and X-Transformer [51]. Parabel
is a tree-based approach that deals with label imbalance by recursively dividing the label space to
create balanced clusters. Two versions of Parabel were trained: Once using only the FoRC4CL training
set, and once additionally using the article full texts. Interestingly, the model trained with full texts
underperformed compared to the model trained without, indicating that full texts do not always provide
useful signal for the task. For the X-Transformer, fine-tuning proceeds in three phases: a) clustering
the label space, b) assigning the input publication to one of the clusters by relevance, and c) using
a ranker to score individual labels within the selected cluster based on their relevance to the input.
Two X-Transformer variants were trained, one on the FoRC4CL dataset without full texts, and another,
referred to as the weak X-Transformer, trained on the same data augmented with approximately 70,000
additional ACL Anthology articles. The extra 70,000 articles are weakly labelled using a simple tf-idf
classifier. Their best results were achieved using the X-Transformer model with the weakly supervised
data.
7https://huggingface.co/transformers/v3.0.2/model_doc/auto.html#automodelforsequenceclassification</p>
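KBOGAS' distance-weighted voting stage can be sketched as follows. The exact weighting and normalisation are our assumptions (the paper only states a similarity-weighted vote with threshold 0.3), and the label names and similarities are invented:

```python
import numpy as np

def knn_vote(neighbor_labels, similarities, threshold=0.3):
    """Keep a label if its similarity-weighted vote share, relative to the
    total neighbour weight, reaches the threshold. The weighting scheme is
    an assumption, not the team's exact code."""
    weights = np.asarray(similarities, dtype=float)
    total = weights.sum()
    scores = {}
    for labels, w in zip(neighbor_labels, weights):
        for lab in labels:
            scores[lab] = scores.get(lab, 0.0) + w
    return {lab for lab, s in scores.items() if s / total >= threshold}

# k=4 neighbours with their cosine similarities and gold label sets (invented).
neigh = [{"MT", "Low-Resource"}, {"MT"}, {"Parsing"}, {"MT", "Parsing"}]
sims = [0.9, 0.8, 0.3, 0.2]
print(sorted(knn_vote(neigh, sims)))
```

With these numbers, "Parsing" is supported only by the two least similar neighbours and falls below the 0.3 threshold, while "MT" and "Low-Resource" survive; the retained set would then be passed to the zero-shot LLM stage for refinement.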
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <p>The performance of all models is shown in Table 7. KBOGAS achieves the strongest overall performance
across nearly all evaluation metrics, with the exception of micro-recall, weighted-recall, and weighted-F1,
where the 2025 baselines remain slightly ahead. The fact that the most notable gains made from KBOGAS’
approach are found in the macro-averaged scores suggests that its improvements are concentrated on the
underrepresented labels, rather than the majority classes. This is additionally reflected in the weighted
scores, where KBOGAS only improves in precision over the baseline. In general, KBOGAS performs
remarkably well in precision, improving by over 0.25 in macro-precision over the next best model,
CAU&amp;ZBE. We find this indicative of the model’s ability to make semantically informed predictions,
rather than simply predicting more common labels. Overall, KBOGAS’ approach yields impressive
results, driving progress in the areas of the task that have proved most difficult in the past. By avoiding
direct fine-tuning on the data, their approach mitigates the overfitting that typically occurs in such
low-resource settings. We expect that with a carefully-curated set of manually labelled instances, this
approach could be improved even further.</p>
      <p>Despite its simplicity, the Logistic Regression baseline performs surprisingly well on micro and
weighted metrics. In comparison, it underperforms in macro scores, which suggests that the baseline
handles common labels effectively, but struggles with rare ones. Comparing the results shown here to
those in Section 4.1, we observe that the additional training on weakly labelled data does indeed lead
to improvements in macro- and weighted-F1, compared to training on clean data alone. Interestingly,
the utility of full-text inputs seems to depend on data quality. While adding full-texts to training data
improved performance when training on the smaller, clean FoRC4CL dataset, it degraded performance
when training with the weakly labelled dataset. This suggests that, in the context of weak supervision,
full texts introduce more noise than signal.</p>
      <p>One of the conclusions of last year’s FoRC shared task was the difficulty posed by the limited
availability of high-quality annotated data for training. These challenges are compounded by the
large number of labels and the heavy class imbalance, which makes the classification of rare labels
particularly difficult. Despite these constraints – and with only a single system submission – the 2025
edition saw substantial progress in the task, especially in improving classification performance on
underrepresented labels, all without access to any additional annotated data. To guide future research
towards improvement in this task, a qualitative analysis of the predictions made by the KBOGAS model
may provide insights towards its weaknesses.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>In this paper, we presented an overview of the second iteration of the FoRC shared task, which was
held under NSLP 2025. This year, the shared task focused on the fine-grained classification of papers in
computational linguistics as a hierarchical, multi-label classification problem. The taxonomy used was
Taxonomy4CL, which includes 170 (sub-)topics, and the dataset was FoRC4CL, a manually annotated
corpus of 1500 papers from the ACL Anthology. This year, we additionally provided 41,000 weakly
labelled ACL Anthology papers, which were labelled using a simple Logistic Regression model. One of
the main challenges of the task is the highly unbalanced nature of both datasets, making classification of
underrepresented labels particularly difficult. While only one team, KBOGAS, participated in the task,
their system outperformed last year’s winners by a large margin, particularly excelling in precision
and in the classification of uncommon labels. Their method combines k-NN clustering over paper
embeddings with zero-shot prompting of large language models – the first stage alone achieves high
recall, while the second stage improves precision by refining the final prediction. This year’s baseline
model achieves surprisingly high scores, particularly in recall, and improves in performance through
additional training on the weakly labelled dataset. Both datasets are publicly available, and we hope to
support future work on FoRC through this contribution.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This work was supported by the consortium NFDI for Data Science and Artificial Intelligence (NFDI4DS) 8
as part of the non-profit association National Research Data Infrastructure (NFDI e. V.). The consortium
is funded by the Federal Republic of Germany and its states through the German Research Foundation
(DFG) project NFDI4DS (no. 460234259).</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used GPT-3.5 and GPT-4 for grammar and
spelling checking. After using these tools/services, the author(s) reviewed and edited the content as
needed and take(s) full responsibility for the publication’s content.</p>
      <p>[7] M. Daradkeh, L. Abualigah, S. Atalla, W. Mansoor, Scientometric analysis and classification
of research using convolutional neural networks: A case study in data science and analytics,
Electronics 11 (2022) 2066.
[8] F. Hoppe, D. Dessì, H. Sack, Deep learning meets knowledge graphs for scholarly data classification,
in: Companion proceedings of the web conference 2021, 2021, pp. 417–421.</p>
      <p>[9] X. Chen, H. Xie, X. Tao, Vision, status, and research topics of natural language processing, 2022.
[10] A. A. Salatino, T. Thanapalasingam, A. Mannocci, F. Osborne, E. Motta, The computer science
ontology: A large-scale taxonomy of research areas, in: International Semantic Web Conference,
2018, pp. 187–205.
[11] R. Abu Ahmad, E. Borisova, G. Rehm, FoRC4CL: A fine-grained field of research classification and
annotated dataset of NLP articles, in: Proceedings of the 2024 Joint International Conference on
Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 2024, pp.
7389–7394.
[12] S. Bird, R. Dale, B. Dorr, B. Gibson, M. Joseph, M.-Y. Kan, D. Lee, B. Powley, D. Radev, Y. F. Tan, The
ACL Anthology Reference Corpus: A reference dataset for bibliographic research in computational
linguistics, in: Proceedings of the Sixth International Conference on Language Resources and
Evaluation (LREC’08), 2008.
[13] R. Abu Ahmad, E. Borisova, G. Rehm, FoRC@NSLP2024: Overview and insights from the field of
research classification shared task, in: International Workshop on Natural Scientific Language
Processing and Research Knowledge Graphs, Springer, 2024, pp. 189–204.
[14] M. Francis, R. Abu Ahmad, E. Borisova, G. Rehm, FoRC@NSLP2025 Dataset, 2025. URL:
https://doi.org/10.5281/zenodo.14901529. doi:10.5281/zenodo.14901529.
[15] Z. Xu, S. Escalera, A. Pavão, M. Richard, W.-W. Tu, Q. Yao, H. Zhao, I. Guyon, Codabench:
Flexible, easy-to-use, and reproducible meta-benchmark platform, Patterns 3 (2022) 100543. URL:
https://www.sciencedirect.com/science/article/pii/S2666389922001465. doi:10.1016/j.patter.2022.100543.
[16] J. Liu, W.-C. Chang, Y. Wu, Y. Yang, Deep learning for extreme multi-label text classification, in:
Proceedings of the 40th international ACM SIGIR conference on research and development in
information retrieval, 2017, pp. 115–124.
[17] L. R. Bashyam, R. Krestel, Advancing automatic subject indexing: combining weak supervision
with extreme multi-label classification, in: Proceedings of the 1st International Workshop on
Natural Scientific Language Processing and Research Knowledge Graphs (NSLP 2024). Hersonissos,
Crete, Greece, volume 27, 2024.
[18] R. Babbar, B. Schölkopf, Dismec: Distributed sparse machines for extreme multi-label classification,
in: Proceedings of the tenth ACM international conference on web search and data mining, 2017,
pp. 721–729.
[19] I. E. Yen, X. Huang, W. Dai, P. Ravikumar, I. Dhillon, E. Xing, Ppdsparse: A parallel primal-dual
sparse method for extreme classification, in: Proceedings of the 23rd ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, 2017, pp. 545–553.
[20] I. E.-H. Yen, X. Huang, P. Ravikumar, K. Zhong, I. Dhillon, Pd-sparse: A primal and dual sparse
approach to extreme multiclass and multilabel classification, in: International conference on
machine learning, PMLR, 2016, pp. 3069–3077.
[21] K. Bhatia, H. Jain, P. Kar, M. Varma, P. Jain, Sparse local embeddings for extreme multi-label
classification, Advances in neural information processing systems 28 (2015).
[22] Y. Tagami, Annexml: Approximate nearest neighbor search for extreme multi-label classification,
in: Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and
data mining, 2017, pp. 455–464.
[23] Y. Prabhu, A. Kag, S. Harsola, R. Agrawal, M. Varma, Parabel: Partitioned label trees for extreme
classification with application to dynamic search advertising, in: Proceedings of the 2018 World
Wide Web Conference, 2018, pp. 993–1002.
[24] S. Khandagale, H. Xiao, R. Babbar, Bonsai: diverse and shallow trees for extreme multi-label
classification, Machine Learning 109 (2020) 2099–2119.
[25] Y. Prabhu, M. Varma, Fastxml: A fast, accurate and stable tree-classifier for extreme multi-label
learning, in: Proceedings of the 20th ACM SIGKDD international conference on Knowledge
discovery and data mining, 2014, pp. 263–272.
[26] T. Jung, J.-K. Kim, S. Lee, D. Kang, Cluster-guided label generation in extreme multi-label
classification, arXiv preprint arXiv:2302.09150 (2023).
[27] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring
the limits of transfer learning with a unified text-to-text transformer, Journal of machine learning
research 21 (2020) 1–67.
[28] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, L. Zettlemoyer,
Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation,
and comprehension, arXiv preprint arXiv:1910.13461 (2019).
[29] R. Hoffmann, C. Zhang, X. Ling, L. Zettlemoyer, D. S. Weld, Knowledge-based weak supervision
for information extraction of overlapping relations, in: Proceedings of the 49th annual meeting of
the association for computational linguistics: human language technologies, 2011, pp. 541–550.
[30] M.-C. Yuen, I. King, K.-S. Leung, A survey of crowdsourcing systems, in: 2011 IEEE third
international conference on privacy, security, risk and trust and 2011 IEEE third international
conference on social computing, IEEE, 2011, pp. 766–773.
[31] A. Awasthi, S. Ghosh, R. Goyal, S. Sarawagi, Learning from rules generalizing labeled exemplars,
arXiv preprint arXiv:2004.06025 (2020).
[32] G. S. Mann, A. McCallum, Generalized expectation criteria for semi-supervised learning with
weakly labeled data., Journal of machine learning research 11 (2010).
[33] C. Liang, Y. Yu, H. Jiang, S. Er, R. Wang, T. Zhao, C. Zhang, Bond: Bert-assisted open-domain
named entity recognition with distant supervision, in: Proceedings of the 26th ACM SIGKDD
international conference on knowledge discovery &amp; data mining, 2020, pp. 1054–1064.
[34] Y. Yu, S. Zuo, H. Jiang, W. Ren, T. Zhao, C. Zhang, Fine-tuning pre-trained language model
with weak supervision: A contrastive-regularized self-training approach, arXiv preprint
arXiv:2010.07835 (2020).
[35] Y. Kim, J. Kim, S. Lee, Clear: Towards robust and generalized parameter-efficient fine-tuning for
noisy label learning, arXiv preprint arXiv:2411.00873 (2024).
[36] V. Wheway, Using boosting to detect noisy data, in: Advances in Artificial Intelligence. PRICAI
2000 Workshop Reader: Four Workshops held at PRICAI 2000, Melbourne, Australia, August 28 –
September 1, 2000, Revised Papers 6, Springer, 2001, pp. 123–130.
[37] B. Sluban, D. Gamberger, N. Lavrač, Ensemble-based noise detection: noise ranking and visual
performance evaluation, Data mining and knowledge discovery 28 (2014) 265–303.
[38] S. J. Delany, N. Segata, B. Mac Namee, Profiling instances in noise reduction, Knowledge-Based
Systems 31 (2012) 28–40.
[39] J. Thongkam, G. Xu, Y. Zhang, F. Huang, Support vector machine for outlier detection in breast
cancer survivability prediction, in: Advanced Web and Network Technologies, and Applications:
APWeb 2008 International Workshops: BIDM, IWHDM, and DeWeb Shenyang, China, April 26-28,
2008. Revised Selected Papers 10, Springer, 2008, pp. 99–109.
[40] D. Zhu, X. Shen, M. Mosbach, A. Stephan, D. Klakow, Weaker than you think: A critical look at
weakly supervised learning, arXiv preprint arXiv:2305.17442 (2023).
[41] S. Rohatgi, ACL Anthology corpus with full text, GitHub, 2022. URL: https://github.com/shauryr/
ACL-anthology-corpus.
[42] F. Ruosch, R. Vasu, R. Wang, L. Rossetto, A. Bernstein, Single-label multi-modal field of research
classification, in: International Workshop on Natural Scientific Language Processing and Research
Knowledge Graphs, Springer Nature Switzerland Cham, 2024, pp. 224–233.
[43] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers
for language understanding, in: Proceedings of the 2019 conference of the North American chapter
of the association for computational linguistics: human language technologies, volume 1 (long
and short papers), 2019, pp. 4171–4186.
[44] P. He, X. Liu, J. Gao, W. Chen, Deberta: Decoding-enhanced bert with disentangled attention,
arXiv preprint arXiv:2006.03654 (2020).
[45] I. Beltagy, K. Lo, A. Cohan, Scibert: A pretrained language model for scientific text, arXiv preprint
arXiv:1903.10676 (2019).
[46] A. Cohan, S. Feldman, I. Beltagy, D. Downey, D. S. Weld, Specter: Document-level representation
learning using citation-informed transformers, arXiv preprint arXiv:2004.07180 (2020).
[47] M. Ostendorff, N. Rethmeier, I. Augenstein, B. Gipp, G. Rehm, Neighborhood contrastive learning
for scientific document representations with citation embeddings, arXiv preprint arXiv:2202.06671
(2022).
[48] T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, Focal loss for dense object detection, in: Proceedings
of the IEEE international conference on computer vision, 2017, pp. 2980–2988.
[49] I. Loshchilov, F. Hutter, Decoupled weight decay regularization, arXiv preprint arXiv:1711.05101
(2017).
[50] N. Reimers, I. Gurevych, Sentence-bert: Sentence embeddings using siamese bert-networks, arXiv
preprint arXiv:1908.10084 (2019).
[51] J. Zhang, W.-C. Chang, H.-F. Yu, I. Dhillon, Fast multi-resolution transformer fine-tuning for
extreme multi-label text classification, Advances in Neural Information Processing Systems 34
(2021) 7267–7280.</p>
      <p>During our experiments with different transformer architectures, we also tried various dropout
rates. These modifications did not lead to any improvements in performance. The results of these
experiments are presented in Table 8.</p>
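      <p>A dropout sweep of the kind described above can be enumerated as a simple grid over model
and rate. The following sketch is purely illustrative: the model identifiers, the dropout values,
and the configuration keys are assumptions for the example, not the exact setup used in the
experiments.</p>
      <p>
```python
# Illustrative sketch of a dropout-rate sweep over transformer fine-tuning
# configurations. Model names, rates, and config keys are hypothetical.
from itertools import product


def build_configs(models, dropout_rates):
    """Enumerate one fine-tuning configuration per (model, dropout) pair."""
    return [
        {
            "model": model,
            # Apply the same rate to hidden and attention dropout for the sweep.
            "hidden_dropout_prob": rate,
            "attention_probs_dropout_prob": rate,
        }
        for model, rate in product(models, dropout_rates)
    ]


configs = build_configs(
    models=["bert-base-uncased", "allenai/scibert_scivocab_uncased"],
    dropout_rates=[0.1, 0.2, 0.3],
)
print(len(configs))  # 2 models x 3 rates = 6 configurations
```
      </p>
      <p>Each configuration would then be passed to the same fine-tuning loop, so that only the
dropout rate varies between runs.</p>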
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Bornmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Haunschild</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mutz</surname>
          </string-name>
          ,
          <article-title>Growth rates of modern science: a latent piecewise growth curve approach to model publication numbers from established and new literature databases</article-title>
          ,
          <source>Humanities and Social Sciences Communications</source>
          <volume>8</volume>
          (
          <year>2021</year>
          )
          <fpage>1</fpage>
          -
          <lpage>15</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M. Y.</given-names>
            <surname>Jaradeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Oelen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. E.</given-names>
            <surname>Farfar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Prinz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>D'Souza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kismihók</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stocker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,
          <article-title>Open research knowledge graph: next generation infrastructure for semantic scholarly knowledge</article-title>
          ,
          <source>in: Proceedings of the 10th international conference on knowledge capture</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>243</fpage>
          -
          <lpage>246</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A. D.</given-names>
            <surname>Wade</surname>
          </string-name>
          ,
          <article-title>The semantic scholar academic graph (s2ag)</article-title>
          ,
          <source>in: Companion Proceedings of the Web Conference 2022</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>739</fpage>
          -
          <lpage>739</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>G.</given-names>
            <surname>Hendricks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tkaczyk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Feeney</surname>
          </string-name>
          ,
          <article-title>Crossref: The sustainable source of community-owned scholarly metadata</article-title>
          ,
          <source>Quantitative Science Studies</source>
          <volume>1</volume>
          (
          <year>2020</year>
          )
          <fpage>414</fpage>
          -
          <lpage>427</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Priem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Piwowar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Orr</surname>
          </string-name>
          ,
          <article-title>Openalex: A fully-open index of scholarly works, authors, venues, institutions, and concepts</article-title>
          ,
          <source>arXiv preprint arXiv:2205.01833</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Eykens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Guns</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. C.</given-names>
            <surname>Engels</surname>
          </string-name>
          ,
          <article-title>Fine-grained classification of social science journal articles using textual data: A comparison of supervised machine learning approaches</article-title>
          ,
          <source>Quantitative Science Studies</source>
          <volume>2</volume>
          (
          <year>2021</year>
          )
          <fpage>89</fpage>
          -
          <lpage>110</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>