<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>International Workshop of IT-professionals on Artificial Intelligence, October</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Text Classification System using Natural Language Processing and Machine Learning with Generative Adversarial Networks⋆</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Victor Sineglazov</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anna Lytovchenko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”</institution>
          ,
          <addr-line>37, Prospect Beresteiskyi, Kyiv, 03056</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>State University “Kyiv Aviation Institute”</institution>
          ,
          <addr-line>1, Prospect Liubomyra Huzara, Kyiv, 03058</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>1</volume>
      <fpage>5</fpage>
      <lpage>17</lpage>
      <abstract>
        <p>This work develops a scalable multi-label classification system for Norwegian texts. We propose a novel architecture that fuses contextual embeddings from the NbAiLab/nb-bert-base model with a feature-level generative augmentation module based on f-VAEGAN-D2. By synthesizing label-conditioned embeddings for underrepresented classes and applying on-the-fly generative oversampling during classifier training, our method alleviates class imbalance and enhances recognition performance for both frequent and rare categories. We adapt the f-VAEGAN-D2 discriminator to operate on text embedding spaces, yielding substantial recall improvements on tail labels. We also offer practical guidelines for integration into municipal electronic document-routing systems that support both Bokmål and Nynorsk.</p>
      </abstract>
      <kwd-group>
        <kwd>Multi-label classification</kwd>
        <kwd>Norwegian language</kwd>
        <kwd>machine learning</kwd>
        <kwd>large language models</kwd>
        <kwd>BERT embeddings</kwd>
        <kwd>f-VAEGAN-D2</kwd>
        <kwd>class imbalance</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>As information volumes grow, the problem of processing and classifying textual data remains
highly relevant. Text classification is widely used for tasks such as spam detection, sentiment
classification (sentiment analysis), and document categorization. In general, the classification
process, namely assigning texts to a predefined set of classes, requires significant time and human
resources when dealing with large-scale tasks and data volumes; machine learning is therefore a
more appropriate option for text classification.</p>
      <p>
        This is especially relevant for public authorities that receive thousands of emails per day which
must be processed and routed [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. For automatic distribution of emails by categories, it is necessary
to analyze their content, identify key topics, and forward them to the email address of the relevant
departments. Existing solutions based on manual processing or simple algorithms (such as keyword
filters) are ineffective due to subjectivity, high time costs, and the growing variety of linguistic
constructions in emails. Of special interest are the Scandinavian countries, in particular Norway,
since according to the report by Statistisk sentralbyrå (SSB) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] the share of Norwegian
municipalities that do not use electronic tools, including email, fell from 37% (2018) to 6.5% (2022);
95.5% of municipalities in 2022 used email as the main channel of communication with citizens,
which leads to additional load due to the large number of electronic requests. Norwegian language,
having two official written standards and limited annotated corpora, poses a challenge for
traditional machine learning methods applied to classification tasks.
      </p>
      <p>
        Natural Language Processing (NLP) is a machine learning (ML) technology that enables
computers to interpret, manipulate, and understand human language [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Programming within NLP
combines linguistics and computer science with the aim of decoding the structure of language and
the rules of its use in order to detect, decompose into components, and extract meaningful
information from text and speech [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. By combining computational linguistics with statistical
models, machine learning, and deep learning, NLP enables computers to recognize, analyze, and
generate text and speech [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The field traces back to the Turing Test proposed by Alan Turing in the
1950s [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Subsequent milestones include the 1954 Georgetown experiment in machine translation
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], rule–based systems such as ELIZA in the 1960s [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], the rise of corpus–based statistical methods
in the 1980s and 2000s (Penn Treebank, WordNet, SVMs, HMMs) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], and the launch of Google
Translate in 2006 [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. From 2000 to 2010, ML and neural networks transformed NLP; today, models
such as BERT [26], GPT, and LLaMA achieve high accuracy across tasks, and the market is projected to reach
USD 92.7 billion by 2028 [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. In practice, approaches span rules-based NLP, statistical NLP, and
deep-learning-based NLP.
      </p>
      <p>
        Norwegian presents specific challenges for NLP. Two written standards – Bokmål (≈85–90%)
and Nynorsk (≈10–15%) – differ in orthography, grammar, and lexicon, preventing a universal
model without multi-corpus preparation [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. The language is morphologically productive with
extensive compounding (e.g., høyhastighetstog), which complicates tokenization [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. Rich
inflection, flexible word order, and numerous regional dialects further increase variability,
impacting parsing and representation learning [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
      </p>
      <p>This research explores the integration of an intelligent multi-label text classification system for
Norwegian using NLP, machine learning, and generative adversarial neural networks. The system
is aimed at automatically determining to which category or categories an input text belongs.
Special attention is paid to limited training data; we therefore apply a generative learning approach
based on the f-VAEGAN-D2 framework, which augments the training corpus with high-quality
synthetic examples.</p>
      <sec id="sec-1-1">
        <title>2. Literature review</title>
        <p>
          In Natural Language Processing, text classification is commonly organized as a
staged pipeline that converts raw messages into machine-interpretable features [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. The literature
describes a progression from tokenization, sentence and word segmentation, stop-word filtering,
normalization, and vectorization to downstream classifiers ranging from linear models and
ensembles to deep neural networks and transformers; ensemble and hybrid designs are well studied
in this context [30]. For Norwegian – where two written standards (Bokmål and Nynorsk) coexist
and morphological productivity is high – the quality of each stage has a measurable effect on final
metrics, making preprocessing and representation learning particularly consequential [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ].
These choices can be framed as multi-criteria trade-offs among accuracy, robustness, and cost, for
which formal multi-criteria optimization perspectives are relevant [33].
        </p>
        <p>
          Research on tokenization for Norwegian addresses ambiguous periods (abbreviations, domains,
decimals), hyphenated constructions, fixed expressions, and compound nouns. Practical systems
combine rules, regular expressions, and machine learning, while modern pipelines favor subword
approaches such as SentencePiece within AutoTokenizer, which remove fixed-vocabulary
dependence and better handle compounding and orthographic variation between Bokmål and
Nynorsk [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]. Preserving semantics at this earliest stage improves the fidelity of later vectorization
and classification.
        </p>
        <p>Stop-word filtering, normalization, and lemmatization are standard tools for reducing noise and
vocabulary size. Off-the-shelf Norwegian stop-lists (e.g., in NLTK) are often a starting point but
typically require domain adaptation. Normalization via stemming or lemmatization reduces type
sparsity and stabilizes frequency statistics, which is useful for both inflection and compounding.
These steps tend to improve efficiency and, when tuned to the domain, can improve effectiveness
by sharpening the signal available to classifiers. The selection and topology of models used
downstream are also influenced by foundational analyses of artificial neuron and network
topologies [31].</p>
        <p>Vectorization has shifted from Bag-of-Words and TF-IDF (simple, interpretable, but
context-agnostic representations) to contextual embeddings that encode meaning as a function of
surrounding tokens. BERT-style representations operate at the subtoken level, capture context, and
preserve multi-word expressions, delivering higher accuracy on Norwegian classification tasks than
sparse, high-dimensional count vectors. Contextualization is especially valuable where inflection and
compounding would otherwise explode the vocabulary and obscure semantic relatedness across
forms.</p>
        <p>
          Classical classifiers remain relevant reference points. Naive Bayes and logistic regression are
strong baselines for short texts but rely on assumptions–conditional independence and linear
separability – that limit performance on longer sequences and multi-label settings [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. Support
Vector Machines (SVM) perform well with TF-IDF features and small training sets but are sensitive
to kernel choice and regularization and do not scale gracefully to a large number of labels without
reduction schemes [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. Ensembles such as Random Forests and XGBoost capture nonlinearities
and are robust to sparse or noisy inputs, yet, like linear models, they lack explicit modeling of word
order and long-range dependencies.
        </p>
        <p>
          Early neural approaches for text classification addressed these gaps by modeling sequential
context. Recurrent Neural Networks (RNN) removed the independence assumption but suffered
from vanishing and exploding gradients on long sequences. LSTM and GRU introduced gating
mechanisms that significantly improved long-distance dependencies, at the expense of slower
training and limited parallelism [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. Convolutional Neural Networks (CNN) for text offered speed
and the ability to learn local n-gram-like patterns useful for sentiment, toxicity, and stylistic cues,
but they are less suited to capturing global discourse structure than attention-based
models. When multiple, often conflicting, objectives arise (e.g., accuracy vs. latency vs. robustness),
multi-criteria and evolutionary optimization methods, including genetic-algorithm-based
conditional optimization, can guide model and threshold selection [32], [33]. Transformers
fundamentally changed the state of the art. Architectures such as BERT, GPT, T5, and RoBERTa
leverage self-attention to use full-sentence context [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. BERT introduced bidirectional encoding
and the [CLS] token as a document-level aggregate, which became a de facto standard for
classification heads. For Norwegian, NB-BERT/NbAiLab variants adapted to Bokmål/Nynorsk
consistently outperform classical methods on categorization tasks. In practice, however, scarcity of
labeled data and severe label imbalance remain barriers, leading to overfitting on frequent classes
and low recall in the long tail - effects that are amplified in multi-label regimes.
        </p>
        <p>
          To reduce reliance on large labeled corpora, several works explore adversarial generation for
text [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. SeqGAN casts the generator as a reinforcement-learning agent that receives a reward
from the discriminator after sequence completion, enabling GANs to operate over discrete tokens
[
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. TextGAN introduces a feature-matching loss that encourages the generator to align
distributions of discriminator-level features between real and generated sentences [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], [27].
MaliGAN reduces gradient variance by reparameterizing rewards, improving training stability [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ].
RankGAN replaces binary discrimination with pairwise ranking, which correlates better with
graded text quality [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. Despite these advances, such models emphasize generation rather than
many–to–one classification and seldom incorporate label information explicitly during training.
        </p>
        <p>
          Hybrid approaches combine the strengths of autoencoding and adversarial training.
f-VAEGAN-D2 generates discriminative feature vectors in an embedding space and supports
any-shot scenarios (zero-/few-shot) by pairing a conditional discriminator with an unconditional that
improves the marginal feature distribution [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. The presence of an encoder permits many-to-one
usage that aligns with classification. In text adaptations, contextual vectors (e.g., the BERT [CLS]
embedding) serve as inputs, and the generator synthesizes label–conditioned features to expand rare
classes without duplicating real examples. Such synthetic feature–level oversampling tends to
preserve semantics better than simple data–level heuristics such as EDA or back–translation and
integrates naturally with multi-label optimization and per-label threshold calibration. For
Norwegian’s dual standards, compounding, and dialectal variability, robust preprocessing and
subword tokenization (SentencePiece) are necessary components [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] – [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]. Building on these
findings, our approach fuses NbAiLab/nb–bert–base with f–VAEGAN–D2 to target rare–label
enrichment in a multi–label setting, addressing gaps left by prior work in handling imbalance and
preserving class semantics during training.
        </p>
      </sec>
      <sec id="sec-1-2">
        <title>3. Problem Statement</title>
        <p>
          The problem is to build a multi-label text classification system for Norwegian under scarce
annotations and label imbalance. The input corpus contains labeled samples L = {(tᵢ, yᵢ)}, i = 1, …, N_L,
where each text tᵢ is accompanied by a binary label vector yᵢ ∈ {0, 1}^K, and unlabeled samples
U = {tⱼ}, j = 1, …, N_U. Texts are encoded with contextual features using NbAiLab/nb-bert-base [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. The document embedding is taken from the [CLS] token (classification token), which is added at the
beginning of the input sequence [28]. After passing through several transformer layers, this token
aggregates contextual information about the entire text, as in:
        </p>
        <p>x_real = BERT_[CLS]([t₁, t₂, …, tₙ]). (1)</p>
        <p>
          The goal is to estimate the conditional probability p(y | x) of the label vector y given the
embedding x and to learn a mapping, as in:
        </p>
        <p>fθ : ℝ⁷⁶⁸ → [0, 1]^K, (2)</p>
        <p>ŷ = fθ(x), (3)</p>
        <p>
          with independent per-label thresholding ŷₖ ≥ τₖ, where ŷₖ is the model-predicted
probability that the document belongs to label k and τₖ is the threshold set for label k.
        </p>
        <p>
          The function fθ is a parameterized model that takes as input a 768-dimensional document
embedding (the output of the BERT [CLS] token) and returns K values in the range [0, 1]. Each of
these values represents the predicted probability that the document belongs to the corresponding
label.
        </p>
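<p>As a minimal numerical sketch of the mapping in (2)–(3) and the per-label thresholding ŷₖ ≥ τₖ: the single linear layer and random weights below are illustrative stand-ins for the trained classifier, not the actual model.</p>

```python
import numpy as np

def sigmoid(s: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-s))

def f_theta(x: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Map a 768-dim document embedding to K per-label probabilities.

    A single linear layer stands in for the deeper MLP used in the paper;
    the input/output contract (R^768 -> [0,1]^K) is the same.
    """
    return sigmoid(x @ W + b)  # shape (K,)

def predict(probs: np.ndarray, tau: np.ndarray) -> np.ndarray:
    """Independent per-label thresholding: y_hat_k = 1 iff p_k >= tau_k."""
    return (probs >= tau).astype(int)

rng = np.random.default_rng(0)
K = 17                                   # number of municipal categories
x = rng.standard_normal(768)             # stand-in for a BERT [CLS] embedding
W = rng.standard_normal((768, K)) * 0.01
b = np.zeros(K)

probs = f_theta(x, W, b)
labels = predict(probs, tau=np.full(K, 0.5))
print(probs.shape, labels.shape)         # (17,) (17,)
```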
      </sec>
      <sec id="sec-1-3">
        <title>4. Method overview</title>
        <sec id="sec-1-3-1">
          <title>4.1. Neural Network Model</title>
          <p>The key idea is to use the generative model f-VAEGAN-D2 as a classifier booster and as a source of
synthetic features (vector representations) for rare classes. This enables robust learning on the
imbalanced datasets typical of real-world classification of municipal email inquiries.</p>
          <sec id="sec-1-3-1-1">
            <title>4.1.1. Encoder</title>
            <p>The encoder approximates the posterior of a latent variable z given text embeddings and class
labels. It maps [x, y], where x ∈ ℝ⁷⁶⁸ is the BERT embedding and y ∈ {0, 1}^K is the multi-label
vector, to latent parameters z ∈ ℝ⁶⁴ with q(z | x, y) ∼ N(μ, σ²I), meaning the approximate
posterior over z is modeled as a multivariate normal distribution whose mean μ(x, y) and diagonal
covariance σ²(x, y)I are output by the encoder.</p>
            <p>Architecture: one hidden layer with 128 units and two heads outputting μ and log σ².
Reparameterization:</p>
            <p>z = μ(x, y) + σ(x, y) ⊙ ε,  ε ∼ N(0, I), (4)
where μ(x, y) and σ(x, y) are the mean and standard-deviation vectors output by the encoder
for input (x, y) and ε ∼ N(0, I) is random noise drawn from the standard normal distribution.</p>
            <p>The encoder loss includes (i) the reconstruction MSE between original and reconstructed embeddings
and (ii) the KL divergence between the approximate posterior q(z | x, y) and the prior p(z) = N(0, I):</p>
            <p>Lenc = MSE(x, x̂) + β · DKL(q(z | x, y) ∥ p(z)), (5)
where x̂ is the reconstruction of the input embedding and β is a weighting coefficient that balances
reconstruction (MSE) against latent-space regularization (encouraging proximity to N(0, I)).</p>
            <p>This yields a smooth, meaningful latent space suitable for reconstruction and feature generation
for downstream classification.</p>
          </sec>
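<p>The reparameterization in (4) and the KL regularizer in (5) can be sketched as follows; the closed-form KL for a diagonal Gaussian against N(0, I) is the standard VAE expression, assumed here rather than quoted from the paper.</p>

```python
import numpy as np

def reparameterize(mu: np.ndarray, log_var: np.ndarray, rng) -> np.ndarray:
    """z = mu + sigma * eps with eps ~ N(0, I): the reparameterization in (4)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu: np.ndarray, log_var: np.ndarray) -> float:
    """Closed-form D_KL(N(mu, sigma^2 I) || N(0, I)), the regularizer in (5)."""
    return float(-0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var)))

rng = np.random.default_rng(0)
mu, log_var = np.zeros(64), np.zeros(64)   # encoder heads for one (x, y) pair
z = reparameterize(mu, log_var, rng)        # one latent sample
print(z.shape)                              # (64,)
```

When μ = 0 and log σ² = 0 the posterior equals the prior and the KL term vanishes, which is the behavior the β-weighted regularizer pushes toward.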
        </sec>
        <sec id="sec-1-3-2">
          <title>4.1.2. Generator</title>
          <p>The generator synthesizes text embeddings. It is a two-layer MLP that takes (z, y) with z ∈ ℝ⁶⁴
and y ∈ {0, 1}^K. Layer 1: 128 units, ReLU. Layer 2: output x̂ ∈ ℝ⁷⁶⁸, matching BERT’s embedding size.
It models two regimes:</p>
          <p>– reconstruction: z from the encoder is used to recover x ≈ x̂;
– generation: z ∼ N(0, I) with an arbitrary y is used to synthesize embeddings for zero-/few-shot support,
especially for rare classes. The synthetic samples augment the training set prior to
classification.</p>
          <p>Generator loss:</p>
          <p>LG = Ladv + λrec · LMSE + λKL · LKL + λFM · LFM, (6)
where Ladv is the adversarial loss; LMSE is the reconstruction error between real and generated
embeddings, as in (7); LKL is the latent regularization via the encoder, as in (8); LFM is feature
matching between intermediate discriminator features φ(·), as in (9); and λrec, λKL, λFM ≥ 0 weight
the respective terms.</p>
          <p>LMSE = ‖x − x̂‖², (7)</p>
          <p>LKL = DKL(q(z | x, y) ∥ N(0, I)), (8)</p>
          <p>LFM = ‖φ(x̂, y) − φ(x, y)‖₁. (9)</p>
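<p>A minimal sketch of the generator forward pass and the weighted combination in (6); the layer sizes follow the text, while the λ values and random weights are illustrative assumptions, not the trained parameters.</p>

```python
import numpy as np

def relu(v: np.ndarray) -> np.ndarray:
    return np.maximum(v, 0.0)

def generator(z, y, W1, b1, W2, b2):
    """Two-layer MLP G(z, y): concat(z, y) -> 128 ReLU units -> 768-dim embedding."""
    h = relu(np.concatenate([z, y]) @ W1 + b1)
    return h @ W2 + b2

def generator_loss(l_adv, l_mse, l_kl, l_fm, lam_rec=1.0, lam_kl=0.1, lam_fm=1.0):
    """Weighted sum from (6): L_G = L_adv + lam_rec*L_MSE + lam_kl*L_KL + lam_fm*L_FM."""
    return l_adv + lam_rec * l_mse + lam_kl * l_kl + lam_fm * l_fm

rng = np.random.default_rng(1)
K, d_z, d_h, d_x = 17, 64, 128, 768
W1, b1 = rng.standard_normal((d_z + K, d_h)) * 0.01, np.zeros(d_h)
W2, b2 = rng.standard_normal((d_h, d_x)) * 0.01, np.zeros(d_x)

z = rng.standard_normal(d_z)            # generation regime: z ~ N(0, I)
y = np.zeros(K); y[3] = 1.0             # one-hot label for a rare class
x_fake = generator(z, y, W1, b1, W2, b2)
print(x_fake.shape)                     # (768,)
```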
        </sec>
        <sec id="sec-1-3-3">
          <title>4.1.3. Discriminator</title>
          <p>The discriminator solves two tasks:
1. adversarial discrimination between real x and synthetic G(z, y);
2. feature matching, to stabilize training by aligning hidden-layer statistics.</p>
          <p>The input is the concatenation of an embedding and its label; processing follows [29]:</p>
          <p>h₁ = ReLU(W₁[x, y] + b₁), (10)</p>
          <p>h₁ = Dropout(h₁, p = 0.4), (11)</p>
          <p>D(x, y) = σ(W₂h₁ + b₂), (12)
where σ is the sigmoid, D(x, y) ∈ (0, 1) is the probability that (x, y) is real, Wᵢ are the weight
matrices, and bᵢ are the biases. The hidden activation h₁ is also returned for feature matching.</p>
          <p>The total discriminator loss combines an adversarial loss (binary cross-entropy), as in:</p>
          <p>Ladv = −E(x,y)∼p[log D(x, y)] − Ez∼N(0,I), y∼py[log(1 − D(G(z, y), y))], (13)
where E(x,y)∼p[·] is the expectation over real pairs (x, y) and Ez∼N(0,I), y∼py[·] is the expectation over
latent noise z and sampled labels y,</p>
          <p>and feature matching comparing hidden means:</p>
          <p>LFM = ‖Ex[h(x, y)] − Ez[h(G(z, y), y)]‖₂, (14)
where h is the discriminator’s hidden feature vector. The hidden vector h is also returned to
compute LFM in the generator.</p>
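<p>The forward pass (10)–(12) and the feature-matching distance (14) can be sketched as below; dropout is omitted (it is a training-time operation) and the weights are random stand-ins.</p>

```python
import numpy as np

def relu(v): return np.maximum(v, 0.0)
def sigmoid(v): return 1.0 / (1.0 + np.exp(-v))

def discriminator(x, y, W1, b1, W2, b2):
    """Forward pass of (10)-(12), returning both the real/fake probability
    and the hidden activation h1, which is reused for feature matching."""
    h1 = relu(np.concatenate([x, y]) @ W1 + b1)   # (10); dropout (11) omitted
    prob = sigmoid(h1 @ W2 + b2)                  # (12)
    return prob, h1

def feature_matching(h_real, h_fake):
    """L_FM from (14): L2 distance between batch-mean hidden features."""
    return float(np.linalg.norm(h_real.mean(axis=0) - h_fake.mean(axis=0)))

rng = np.random.default_rng(2)
K, d_x, d_h = 17, 768, 128
W1, b1 = rng.standard_normal((d_x + K, d_h)) * 0.01, np.zeros(d_h)
W2, b2 = rng.standard_normal(d_h) * 0.01, 0.0

x = rng.standard_normal(d_x)
y = np.zeros(K); y[0] = 1.0
prob, h1 = discriminator(x, y, W1, b1, W2, b2)
print(h1.shape)   # (128,)
```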
        </sec>
        <sec id="sec-1-3-4">
          <title>4.1.4. Classifier</title>
          <p>The classifier is a three-layer MLP (384 → 192 → output) trained on the expanded dataset of
original and synthetic embeddings. Each class uses its own decision threshold, optimized for F1.</p>
        </sec>
        <sec id="sec-1-3-4a">
          <title>4.2. GANs Training Problems</title>
          <p>GAN training is known to be unstable; the main failure modes and the mitigations used in this
work are outlined below.</p>
        </sec>
        <sec id="sec-1-3-5">
          <title>4.2.1. Vanishing Gradient Problem</title>
          <p>When the discriminator D becomes too accurate early in training, the generator G receives
almost no learning signal. With the “saturating” generator objective</p>
          <p>LG = Ez∼pz(z)[log(1 − D(G(z, y), y))], (15)</p>
          <p>we get LG → 0 and ∇θLG → 0, so training stalls [24]. To stabilize gradients, we use the
Wasserstein objective, which optimizes the Earth Mover’s Distance [24]:</p>
          <p>LWGAN = Ex∼pdata[D(x, y)] − Ez∼pz[D(G(z, y), y)]. (16)</p>
        </sec>
        <sec id="sec-1-3-6">
          <title>4.2.2. Mode collapse</title>
          <p>Mode collapse occurs when G outputs only a few patterns that fool D but do not cover the data
distribution [23]:</p>
          <p>G(z, y) ≈ x_y  ∀ z ∼ N(0, I), (17)</p>
          <p>i.e., almost every latent vector is mapped to (nearly) the same output for a given label y. The
Wasserstein objective reduces collapse because minimizing a distance, not a log-probability,
continues to provide usable updates even when D separates modes well. An equivalent form is</p>
          <p>L = Ex∼pdata[D(x, y)] − Ez∼pz[D(G(z, y), y)], (18)</p>
          <p>which preserves non-degenerate gradients when diversity drops.</p>
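<p>The contrast between the saturating objective (15) and the Wasserstein objective (16)/(18) can be illustrated numerically; the critic scores below are hypothetical values, chosen only to show that the Wasserstein loss keeps scaling with the real/fake gap while the saturating loss flattens.</p>

```python
import numpy as np

def saturating_g_loss(d_fake_probs):
    """Eq. (15): E[log(1 - D(G(z,y), y))]; near-zero once D is confident."""
    return float(np.mean(np.log(1.0 - d_fake_probs)))

def wgan_loss(d_real_scores, d_fake_scores):
    """Eq. (16)/(18): E[D(x,y)] - E[D(G(z,y),y)]; D is an unbounded critic."""
    return float(np.mean(d_real_scores) - np.mean(d_fake_scores))

# When D confidently rejects fakes (probabilities near 0), the saturating
# loss is almost exactly 0, so its gradient is almost exactly 0 as well...
confident = saturating_g_loss(np.array([1e-6, 2e-6]))

# ...while the Wasserstein objective still scales with the score gap.
gap = wgan_loss(np.array([3.0, 2.5]), np.array([-1.0, -0.5]))
print(round(gap, 2))   # 3.5
```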
        </sec>
        <sec id="sec-1-3-7">
          <title>4.2.3. Non-Convergence</title>
          <p>To mitigate non-convergence, a small Gaussian noise is added to the real data to make the task
slightly harder for D and give G more time to adapt. Penalties on excessively large weights in D are
also applied, which makes its task harder and helps preserve competition [22].</p>
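<p>Both mitigations are one-liners in practice; the noise level and penalty coefficient below are illustrative assumptions, not values reported in the paper.</p>

```python
import numpy as np

def add_instance_noise(x_real, sigma, rng):
    """Blur real samples with small Gaussian noise so D's task stays non-trivial."""
    return x_real + sigma * rng.standard_normal(x_real.shape)

def weight_penalty(weights, lam=1e-4):
    """L2 penalty on D's weight matrices, discouraging an over-confident D."""
    return lam * sum(float(np.sum(w**2)) for w in weights)

rng = np.random.default_rng(3)
x_real = rng.standard_normal((8, 768))       # a batch of real embeddings
x_noisy = add_instance_noise(x_real, sigma=0.05, rng=rng)
penalty = weight_penalty([np.ones((4, 4))])  # 16 weights of 1.0 -> 16 * 1e-4
print(x_noisy.shape)                         # (8, 768)
```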
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>5. Results</title>
      <p>The algorithm was tested on a manually collected dataset of emails in Norwegian, drawn
from the internal correspondence of the Nord-Aurdal kommune. Each email could belong
to one or more of 17 predefined municipal categories (Aurdal omsorgssenter, Barnehage
virksomhetsleder, Brannvesenet, Eiendom, Fagernes legesenter, Helse og omsorg,
Helsesøstertjenesten, Interkommunal barneverntjenest, Kultur, Miljø og Naering, Nord-Aurdal
folkebibliotek, PP-tjeneste for Valdres, Regnskap, Skole virksomhetsleder, Teknisk, Økonomi),
forming a multi-label classification task.</p>
      <p>The dataset was split 70/15/15 into training (70%), validation (15%), and test (15%)
subsets.</p>
      <p>One of the key issues identified during the dataset analysis is a significant imbalance between
categories: several classes (such as Kultur and Regnskap) are represented by fewer than 20 examples.
This situation prevents stable classifier training: the model tends either to ignore rare labels
completely or to overfit on noisy patterns. For each underrepresented class, 1,000 new embeddings
were synthesized by passing a random latent vector through the generator along with the
corresponding one-hot label representation. The generated embeddings were integrated into the
training set, augmenting the real examples. The classification model (an ensemble of MLPs) was then
trained on this extended dataset for 50 epochs with a batch size of 32 (learning rate = 1e-3). The
GAN was trained for 12 epochs with a batch size of 64 using Adam (lr = 1e-3) for the encoder,
generator, and discriminator. The loss combined BCE (adversarial), MSE (reconstruction),
KL divergence, and feature matching; gradients were batch-averaged. Three
random initializations (7, 42, 2025) were ensembled by probability averaging.</p>
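<p>The augmentation loop described above can be sketched as follows; the stub generator stands in for the trained f-VAEGAN-D2 generator, and the rare-label indices are hypothetical.</p>

```python
import numpy as np

def synthesize_for_rare_classes(generate, rare_labels, num_classes,
                                per_class=1000, latent_dim=64, seed=0):
    """For each rare label, pass random latents plus a one-hot label through
    the generator and collect the synthetic embeddings with their labels."""
    rng = np.random.default_rng(seed)
    X_syn, Y_syn = [], []
    for k in rare_labels:
        y = np.zeros(num_classes); y[k] = 1.0
        for _ in range(per_class):
            z = rng.standard_normal(latent_dim)   # z ~ N(0, I)
            X_syn.append(generate(z, y))
            Y_syn.append(y)
    return np.array(X_syn), np.array(Y_syn)

# Stub standing in for the trained generator: maps a 64-dim latent to 768 dims.
fake_G = lambda z, y: np.tile(z, 12)

X, Y = synthesize_for_rare_classes(fake_G, rare_labels=[8, 12],
                                   num_classes=17, per_class=1000)
print(X.shape, Y.shape)   # (2000, 768) (2000, 17)
```

The returned arrays would then be concatenated with the real training embeddings before classifier training.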
      <p>To evaluate the performance of the proposed neural network model, several standard metrics
are used: precision, recall, and F1-score. Two forms of F1-score were analyzed separately:
1. Micro-F1 is calculated globally across all labels at once (i.e., all TP, FP, and FN are
summed). It is sensitive to classes with a large number of examples.
2. Macro-F1 is the average of the F1-scores computed separately for each class. Unlike Micro-F1,
it is not affected by class frequency and better reflects performance on rare categories.</p>
      <p>To evaluate the reliability of threshold-based predictions, calibration metrics are used:
AUPRC (Area Under the Precision–Recall Curve) and the Brier score (the mean squared error
between predicted probabilities and actual labels). This allows us to clearly compare model
versions and ensure the system works reliably in practical scenarios.</p>
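<p>The difference between the two F1 variants is easy to see on a toy case where one label is predicted perfectly and another is missed entirely: macro-F1 averages the two per-label scores, while micro-F1 pools the counts.</p>

```python
import numpy as np

def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(y_true, y_pred):
    """Micro-F1 pools TP/FP/FN over all labels; Macro-F1 averages per-label F1."""
    tp = ((y_true == 1) & (y_pred == 1)).sum(axis=0)
    fp = ((y_true == 0) & (y_pred == 1)).sum(axis=0)
    fn = ((y_true == 1) & (y_pred == 0)).sum(axis=0)
    micro = f1(tp.sum(), fp.sum(), fn.sum())
    macro = float(np.mean([f1(t, p, n) for t, p, n in zip(tp, fp, fn)]))
    return micro, macro

# Label 0 is predicted perfectly (F1 = 1); label 1 is missed entirely (F1 = 0).
y_true = np.array([[1, 1], [1, 0], [0, 1]])
y_pred = np.array([[1, 0], [1, 0], [0, 0]])
micro, macro = micro_macro_f1(y_true, y_pred)
print(round(micro, 3), round(macro, 3))   # 0.667 0.5
```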
      <p>Per-label thresholds were tuned on the validation split to maximize Macro-F1 and then frozen
for testing. An ablation study (no augmentation vs. class-conditional augmentation) showed
consistent gains on tail labels when synthetic, label-conditioned embeddings were included.
Precision–Recall curves (micro and macro) further confirm robustness under class imbalance.
Calibration was evaluated with a mean AUPRC of 0.83 and a mean Brier score of 0.17, with good
alignment between predicted probabilities and true label frequencies.</p>
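<p>Per-label threshold tuning can be done with a simple grid search on validation probabilities, a plausible minimal sketch of the procedure (grid resolution and toy data are assumptions):</p>

```python
import numpy as np

def f1_binary(y_true, y_pred):
    tp = int(((y_true == 1) & (y_pred == 1)).sum())
    fp = int(((y_true == 0) & (y_pred == 1)).sum())
    fn = int(((y_true == 1) & (y_pred == 0)).sum())
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def tune_thresholds(probs, y_true, grid=np.linspace(0.05, 0.95, 19)):
    """Pick, for every label, the threshold on the validation split that
    maximizes that label's F1; the thresholds are then frozen for testing."""
    taus = np.empty(probs.shape[1])
    for k in range(probs.shape[1]):
        scores = [f1_binary(y_true[:, k], (probs[:, k] >= t).astype(int))
                  for t in grid]
        taus[k] = grid[int(np.argmax(scores))]
    return taus

rng = np.random.default_rng(4)
probs = rng.random((50, 3))            # toy validation probabilities, 3 labels
y_true = (probs > 0.6).astype(int)     # toy ground truth
taus = tune_thresholds(probs, y_true)
print(taus.shape)                      # (3,)
```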
      <sec id="sec-2-1">
        <title>6. Conclusions</title>
        <p>This work presented a system that combines BERT-based embeddings with generative data
augmentation for multi-label classification of Norwegian texts. The pipeline produces normalized,
denoised text, cleaned of noise and redundant morphology.</p>
        <p>Text vectorization is carried out using the NbAiLab/nb-bert-base model, which produces deep,
contextualized embeddings. To address class imbalance, we used an f-VAEGAN-D2 architecture to
synthesize additional embeddings for rare categories, preserving the latent-space structure and
enhancing classification quality.</p>
        <p>Inference is performed using an ensemble of neural networks trained on both real and synthetic
embeddings, with per-label probability thresholds optimized for each category. Architectural
choices, regularization techniques, and a carefully designed training regimen prevent common
GAN-related failures—gradient vanishing, unstable convergence, and mode collapse—even in the
challenging setting of multi-label text classification.</p>
        <p>Evaluation on the test dataset by macro-F1 and micro-F1 (0.823 and 0.68, respectively) confirms
that overall performance improved and rare-class accuracy rose, reducing neglect of underrepresented
labels. A mean AUPRC of 0.83 and a Brier score of 0.17 indicate strong calibration. Per-label
thresholds and ensemble inference ensured stable, accurate detection of rare categories. The
architecture therefore shows strong potential for practical classification deployments.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used GPT-4 in order to: Grammar and spelling
check. After using these tools, the authors reviewed and edited the content as needed and take full
responsibility for the publication’s content.
[22] “GAN — Why it is so hard to train Generative Adversarial Networks!,” Medium.</p>
      <p>Available: https://jonathan-hui.medium.com/gan-why-it-is-so-hard-to-train-generativeadvisory-networks-819a86b3750b.
[23] M. Zamorski, A. Zdobylak, M. Zięba, and J. Świątek, “Generative Adversarial Networks: recent
developments,” arXiv, 2019. doi: 10.48550/arXiv.1903.12266.
[24] M. M. Saad, R. O’Reilly, and M. H. Rehmani, “A Survey on Training Challenges in Generative
Adversarial Networks for Biomedical Image Analysis,” arXiv, 2022. doi:
10.48550/arXiv.2201.07646.
[25] “Evaluation Metrics in Machine Learning,” GeeksforGeeks.</p>
      <p>Available: https://www.geeksforgeeks.org/machine-learning/metrics-for-machine-learningmodel/.
[26] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional
transformers for language understanding, in: Proceedings of the 2019 Conference of the North
American Chapter of the Association for Computational Linguistics, Minneapolis, Minnesota,
2019, pp. 4171–4186.
[27] Y. Zhang, Z. Gan, K. Fan, Z. Chen, R. Henao, D. Shen, L. Carin, Adversarial feature matching
for text generation, in: 34th International Conference on Machine Learning, 2017, pp. 4006–
4015.
[28] W. Nie, N. Narodytska, A. Patel, Relgan: Relational generative adversarial networks for text
generation, in: 7th International Conference on Learning Representations, 2019.
[29] L. Chen, S. Dai, C. Tao, H. Zhang, Z. Gan, D. Shen, Y. Zhang, G. Wang, R. Zhang, L. Carin,
Adversarial text generation via feature-mover’s dis- tance, in: Advances in Neural Information
Processing Systems, 2018, pp. 4666–4677.
[30] Sineglazov, V., Kot, A.Design of Hybrid Neural Networks of the Ensemble Structure Eastern</p>
      <p>European Journal of Enterprise Technologies, 2021, 1, pp. 31–45.
[31] Zgurovsky, M., Sineglazov, V., Chumachenko, E.Classification and Analysis Topologies
Known Artificial Neurons and Neural Networks Studies in Computational Intelligence, 2021,
904, pp. 1–58.
[32] Zgurovsky, M., Sineglazov, V., Chumachenko, E. Classification and Analysis of Multicriteria</p>
      <p>Optimization Methods Studies in Computational Intelligence, 2021, 904, pp. 59–174.
[33] Sineglazov, V.M., Riazanovskiy, K.D., Chumachenko, O.I., Multicriteria conditional
optimization based on genetic algorithms, System Research and Information
Technologies, 2020, 2020(3), pp. 89–104.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] “How many work emails is too many?”, <source>The Guardian</source>, Apr. 8, <year>2019</year>. Available: https://www.theguardian.com/technology/shortcuts/2019/apr/08/how-many-work-emails-is-too-many.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2] “<article-title>Digitalisation in the Norwegian municipalities: Development from 2018 to 2022</article-title>,” Statistics Norway (SSB). Available: https://www.ssb.no/en/teknologi-og-innovasjon/informasjons-og-kommunikasjonsteknologi-ikt/artikler/digitalisation-in-the-norwegian-municipalities-development-from-2018-to-2022.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3] “Natural language processing”, Wikipedia. Available: https://en.wikipedia.org/wiki/Natural_language_processing.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] “Natural Language Processing”, Engati Glossary. Available: https://www.engati.com/glossary/natural-language-processing.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] “What is NLP?”, IBM Think. Available: https://www.ibm.com/think/topics/natural-language-processing.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6] NbAiLab, “NbAiLab/nb-bert-base,” Hugging Face. Accessed: Aug. 31, <year>2025</year>. [Online]. Available: https://huggingface.co/NbAiLab/nb-bert-base.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7] <string-name><given-names>R. S. T.</given-names> <surname>Lee</surname></string-name>, <source>Natural Language Processing: A Textbook with Python Implementation</source>. Cham: Springer, <year>2023</year>, 437 p.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8] “Evolution of NLP: From Past Limitations to Modern Capabilities,” Medium. Available: https://medium.com/@social_65128/evolution-of-nlp-from-past-limitations-to-modern-capabilities-6dc1505faeb6.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9] “Understanding the NLP Pipeline: A Comprehensive Guide,” Medium. Available: https://medium.com/@asjad_ali/understanding-the-nlp-pipeline-a-comprehensive-guide-828b2b3cd4e2.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10] <string-name><given-names>J.</given-names> <surname>Devlin</surname></string-name>, <string-name><given-names>M.-W.</given-names> <surname>Chang</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Lee</surname></string-name>, and <string-name><given-names>K.</given-names> <surname>Toutanova</surname></string-name>, “<article-title>BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</article-title>,” arXiv, <year>2019</year>. doi: 10.48550/arXiv.1810.04805.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11] <string-name><given-names>F.</given-names> <surname>Sebastiani</surname></string-name>, “<article-title>Machine learning in automated text categorization</article-title>,” <source>ACM Computing Surveys</source>, <year>2002</year>.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Goldberg</surname>
          </string-name>
          , “
          <article-title>A Primer on Neural Network Models for Natural Language Processing</article-title>
          ,” 
          <source>Journal of Artificial Intelligence Research</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13] <string-name><given-names>J.</given-names> <surname>Langr</surname></string-name> and <string-name><given-names>V.</given-names> <surname>Bok</surname></string-name>, <source>GANs in Action: Deep Learning with Generative Adversarial Networks</source>. Shelter Island, NY: Manning Publications, <year>2019</year>, 240 p.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>L.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yu</surname>
          </string-name>
          , “
          <article-title>SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient</article-title>
          ,” arXiv,
          <year>2017</year>
          . Available: https://arxiv.org/abs/1609.05473.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Gan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Henao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Shen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Carin</surname>
          </string-name>
          , “
          <article-title>Adversarial Feature Matching for Text Generation</article-title>
          ,” arXiv,
          <year>2017</year>
          . Available: https://arxiv.org/abs/1706.03850.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16] <string-name><given-names>T.</given-names> <surname>Che</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>R. D.</given-names> <surname>Hjelm</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Song</surname></string-name>, and <string-name><given-names>Y.</given-names> <surname>Bengio</surname></string-name>, “<article-title>Maximum-Likelihood Augmented Discrete Generative Adversarial Networks</article-title>,” arXiv, <year>2017</year>. Available: https://arxiv.org/abs/1702.07983.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>K.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , and M.-T. Sun, “
          <article-title>Adversarial Ranking for Language Generation</article-title>
          ,” arXiv,
          <year>2017</year>
          . Available: https://arxiv.org/abs/1705.11001.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18] <string-name><given-names>Y.</given-names> <surname>Xian</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Sharma</surname></string-name>, <string-name><given-names>B.</given-names> <surname>Schiele</surname></string-name>, and <string-name><given-names>Z.</given-names> <surname>Akata</surname></string-name>, “<article-title>f-VAEGAN-D2: A Feature Generating Framework for Any-Shot Learning</article-title>,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, <year>2019</year>, pp. <fpage>10275</fpage>–<lpage>10284</lpage>. doi: 10.1109/CVPR.2019.01053.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19] Språkrådet, “<article-title>Om norsk språk og standarder</article-title>,” <year>2023</year>. Available: https://www.sprakradet.no.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20] <string-name><given-names>L.</given-names> <surname>Øvrelid</surname></string-name> and <string-name><given-names>E.</given-names> <surname>Velldal</surname></string-name>, “<article-title>Syntactic variation and parsing of Norwegian</article-title>,” in <source>Proceedings of VarDial</source>, <year>2020</year>.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21] <string-name><surname>Teodorescu</surname>, <given-names>M. H.</given-names></string-name>, <article-title>Machine Learning Methods for Strategy Research</article-title>. <source>HBS Working Paper 18-011, Harvard Business School</source>, <year>2017</year>. 59 p.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>