<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Context-Aware Search Space Adaptation of Hyperparameters and Architectures for AutoML in Text Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Parisa Safikhani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Broneske</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Otto von Guericke University</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>The German Centre for Higher Education Research and Science Studies (DZHW)</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>While Automated Machine Learning (AutoML) systems have shown strong performance on structured data, their application to natural language processing (NLP) tasks remains limited by static, task-agnostic search spaces. In this work, we propose a context-aware extension of AutoPyTorch that dynamically adapts both the hyperparameter search space and neural architecture configuration based on corpus-level meta-features. Our approach extracts interpretable textual statistics, such as average sequence length, vocabulary richness, and class imbalance, to guide the configuration of key hyperparameters. We also introduce two adaptive neural backbones whose structures are shaped by these meta-features to improve model expressiveness and generalization. Experiments on 20 diverse text classification datasets, including subsets of GLUE, selected Kaggle benchmarks, and private corpora, demonstrate consistent performance improvements over strong baselines, particularly on datasets with limited training samples or severe class imbalance. Our results highlight the effectiveness of integrating dataset-level insights into the AutoML search process for NLP.</p>
      </abstract>
      <kwd-group>
        <kwd>AutoML</kwd>
        <kwd>Text Classification</kwd>
        <kwd>Hyperparameter Optimization</kwd>
        <kwd>Meta-Features</kwd>
        <kwd>Neural Architecture Search</kwd>
        <kwd>Context-Aware Modeling</kwd>
        <kwd>AutoPyTorch</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>AutoML frameworks have significantly advanced the
democratization of machine learning by automating the
design and optimization of learning pipelines. While
these systems have shown strong performance on
structured data, their extension to NLP tasks remains limited
due to the inherent complexity and diversity of textual
data. Text classification, a core NLP task, presents unique
challenges stemming from variable input lengths, diverse
syntactic structures, and high lexical variation—factors
that are often overlooked in conventional AutoML
workflows.</p>
      <p>
        Most current AutoML approaches for NLP adopt static
pipeline configurations and search spaces, treating all
datasets uniformly regardless of their linguistic
characteristics. Even when modern frameworks include neural
networks or transformer models, their hyperparameter
search is usually performed within generic, manually
designed boundaries. This static design neglects crucial
dataset-specific properties such as text length
distribution, vocabulary richness, or class imbalance, which are
known to affect both model architecture performance and
training dynamics [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. As a result, these frameworks
may perform poorly on unusual or domain-specific text
datasets, where generic configurations fail to address
context-specific requirements.
      </p>
      <p>To address this gap, we propose a context-aware
extension of an AutoML Framework that dynamically adapts
its hyperparameter search space and model architecture
decisions based on corpus-level meta-features. Our
approach integrates a systematic extraction of statistical
and linguistic characteristics from each dataset—such
as text length variability, lexical diversity, sample size,
and class distribution—and uses these to inform both the
configuration of search spaces and the structural design
of neural backbones. By leveraging these insights, the
system can better align model complexity, optimization
schedules, and architectural choices with the demands
of the data.</p>
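      <p>As an illustration of the kind of corpus-level statistics involved, the following is a minimal sketch in Python (the function name and feature keys are our own, not the system's actual implementation):</p>

```python
# Illustrative sketch: corpus-level meta-features of the kind described
# above (length statistics, lexical diversity, class distribution).
from collections import Counter
from statistics import mean, pstdev

def extract_meta_features(texts, labels):
    """Compute simple corpus-level statistics from texts and labels."""
    lengths = [len(t.split()) for t in texts]            # token count per text
    tokens = [w for t in texts for w in t.lower().split()]
    class_counts = Counter(labels)
    return {
        "n_samples": len(texts),
        "avg_length": mean(lengths),
        "std_length": pstdev(lengths),
        # unique-to-total word ratio as a cheap lexical-diversity proxy
        "unique_word_ratio": len(set(tokens)) / max(len(tokens), 1),
        "n_classes": len(class_counts),
        # ratio between the most and least frequent class
        "imbalance_ratio": max(class_counts.values()) / min(class_counts.values()),
    }
```

      <p>Statistics of this form are cheap to compute once per dataset and can then condition every later stage of the search.</p>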
      <p>This paper makes two main contributions: First, we
introduce a context-aware mechanism for dynamically
adapting the hyperparameter search space in AutoML
based on text-level meta-features such as text length,
vocabulary diversity, and class imbalance. This enables the
AutoML process to tailor its optimization bounds—e.g.,
for batch size, learning rate, and dropout—according to
the statistical profile of each dataset. Second, we propose
two adaptive neural backbones, MetaMLP and
ContextualAttentionNet, whose configurations are shaped by
statistical and lexical characteristics of the input text. These
backbones enable the system to construct models from
scratch that better reflect the structural and distributional
properties of the data. Together, these innovations
facilitate a more robust and efficient adaptation of AutoML
pipelines to the unique demands of text classification
tasks.</p>
      <sec id="sec-1-1">
        <title>2. Related Works</title>
        <p>Automated Machine Learning (AutoML) aims to streamline model development by automating the processes of feature engineering, model selection, and hyperparameter tuning. While AutoML has become widely successful for structured data, its adaptation to natural language processing (NLP) tasks, particularly text classification, poses unique challenges due to the complexity and diversity of textual data. In this section, we review existing research relevant to AutoML applications in NLP, focusing on the limitations of current AutoML frameworks, the role of dataset-driven meta-features, and recent developments in customizing both hyperparameter search spaces and neural architectures specifically tailored to text data.</p>
        <sec id="sec-1-1-1">
          <title>2.1. AutoML for NLP Tasks and Search Space Design</title>
          <p>Automated machine learning has traditionally excelled on structured (tabular) data, whereas applying it to raw text required additional effort to convert text into features [<xref ref-type="bibr" rid="ref2">2</xref>]. In recent years, several AutoML frameworks have been extended to handle text classification, integrating NLP-specific models and pipeline steps. For example, AutoGluon and AutoKeras can handle deep NLP models (including modern transformers) for classification, with search spaces that encompass state-of-the-art architectures like BERT and RoBERTa [<xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>]. AutoKeras even adjusts its search space based on the task modality: it detects when the input is text and accordingly includes appropriate text vectorization and neural network blocks in the configuration space [<xref ref-type="bibr" rid="ref3">3</xref>]. Cloud-based AutoML services such as Azure AutoML typically treat text as generic input features (e.g., via TF-IDF or bag-of-words) and do not customize hyperparameter settings based on dataset-specific characteristics [<xref ref-type="bibr" rid="ref5">5</xref>].</p>
          <p>Notably, researchers have evaluated general AutoML tools on NLP tasks by converting text to fixed embeddings (e.g., using Sentence-BERT to obtain features) to fit into a tabular AutoML pipeline [<xref ref-type="bibr" rid="ref6 ref7 ref8">6, 7, 8</xref>]. These tools can discover effective models for text data, but they typically operate within broad, fixed search spaces and often lack mechanisms for fine-grained hyperparameter tuning tailored to a specific corpus. In other words, current AutoML frameworks for NLP tend to follow a one-size-fits-all approach, leaving potential efficiency gains from dataset-specific adaptation largely untapped. Moreover, they rely on selecting from existing machine learning or neural network architectures, rather than dynamically constructing models based on the unique characteristics of the textual data.</p>
        </sec>
        <sec id="sec-1-1-2">
          <title>2.2. Meta-Features and Meta-Learning for AutoML in NLP</title>
          <p>To guide model selection and hyperparameter optimization, many studies have leveraged dataset characteristics (so-called meta-features). Early work by Lam and Lai [<xref ref-type="bibr" rid="ref1">1</xref>] characterized text datasets with a small set of features (e.g., number of documents, vocabulary size, average document length) to predict the classification error of different algorithms, thus recommending the best classifier for the task. This pioneering meta-learning approach demonstrated that simple corpus-level metrics can inform algorithm selection. Subsequent research greatly expanded the repertoire of meta-features for NLP. For instance, Madrid et al. [<xref ref-type="bibr" rid="ref2">2</xref>] define 72 corpus-level attributes – covering general dataset properties, class imbalance, lexical diversity, stylometry, statistical measures, and readability indices – to drive automated selection of text representation techniques. Many of these features – such as the average and standard deviation of document length, vocabulary richness (unique word ratios), and the number of classes – capture precisely the kind of information used in our approach, albeit for a different goal: automatic customization of the search space rather than selection of a text representation method. By extracting such meta-features from a new text dataset, one can compare them to previously seen tasks and infer which models or hyperparameter settings might be appropriate.</p>
          <p>Researchers have applied meta-features in various meta-learning systems for NLP. Gomez et al. [<xref ref-type="bibr" rid="ref9">9</xref>] introduced an evolutionary meta-learning method (ELMR) that uses 11 statistical meta-features of a text corpus to evolve rules for selecting the optimal classifier. Their approach automatically learned decision rules (via a genetic algorithm) to identify, for example, when a Naïve Bayes vs. SVM vs. neural model would be most effective, based on corpus characteristics. In a broader approach, Ferreira and Brazdil [<xref ref-type="bibr" rid="ref10">10</xref>] leveraged an active testing strategy to recommend full text-classification pipelines, evaluating candidate preprocessing methods and classifiers on small data samples and using meta-features to pick the best pipeline. Meta-learning has also been used to warm-start hyperparameter optimization in general AutoML frameworks. Ferreira and Brazdil [<xref ref-type="bibr" rid="ref10">10</xref>] successfully employed 46 dataset descriptors to initialize Bayesian hyperparameter search in Auto-Sklearn, improving efficiency by starting from configurations that worked well on similar prior datasets. More recently, Desai et al. [<xref ref-type="bibr" rid="ref11">11</xref>] built a text AutoML system that uses a minimal set of only three meta-features (e.g., dataset size, average sentence length) to choose among three Transformer architectures (BERT, ALBERT, XLNet) for a classification task. Despite its limited scope (restricted to only a few models), this work demonstrated the promise of corpus features in guiding model selection for NLP. Our approach extends these concepts further by integrating a set of corpus-level characteristics to dynamically guide not only architecture selection but also hyperparameter tuning within AutoPyTorch, leveraging its capability to construct neural networks from scratch, which is essential for effectively handling the diverse and complex nature of textual data.</p>
          <p>In summary, prior research shows that incorporating dataset-derived features, ranging from simple counts to complex linguistic metrics, can significantly enhance automated model selection and configuration in NLP. However, these approaches predominantly focus on selecting among predefined models or representations. To the best of our knowledge, this work is the first to dynamically adjust the hyperparameter search space itself based on dataset-derived meta-knowledge, specifically aimed at constructing deep learning models from scratch.</p>
        </sec>
        <sec id="sec-1-1-3">
          <title>2.3. Hyperparameter Search Space Adaptation in AutoML</title>
          <p>Typical AutoML systems rely on a fixed, expert-designed search space intended to be generic across many datasets. For example, Auto-WEKA formalized the Combined Algorithm Selection and Hyperparameter optimization (CASH) problem—searching over a joint space of 27 base classifiers, their respective hyperparameters, and various feature-selection techniques—using Bayesian optimization to navigate hundreds of parameters without dataset-specific specialization [<xref ref-type="bibr" rid="ref12">12</xref>]. Auto-Sklearn similarly constructs a broad configuration space of 15 classifier types and over 110 hyperparameters (spanning preprocessing and classifiers) yet remains agnostic to the particular characteristics of the input data [<xref ref-type="bibr" rid="ref13">13</xref>]. While such comprehensive spaces can cover many scenarios, they are often inefficient: many configurations may be irrelevant or suboptimal for a particular text dataset. For instance, a small set of short tweets likely does not require deep ensembles or large n-gram ranges, yet a static search space devotes trials indiscriminately to these options. This inefficiency has motivated research into reducing or tuning the search space based on prior knowledge.</p>
          <p>One line of work is search space transfer via meta-learning. Wistuba et al. [<xref ref-type="bibr" rid="ref14">14</xref>] first proposed to leverage experience from previous hyperparameter optimizations to constrain the search for a new task. In their approach, the hyperparameter space is narrowed to a region (defined by a center point and radius) believed to contain good solutions, effectively pruning away less promising regions. They explored designing a smaller, task-specific search space for the target problem instead of using the default full space. By transferring knowledge of what configurations worked well on similar datasets, these methods aim to accelerate HPO by focusing on the most relevant parts of the space. Such techniques have shown benefits in general AutoML settings, reducing the dimensionality or bounds of hyperparameters to improve search efficiency. However, applying this idea in the NLP domain remains relatively unexplored – current AutoML tools do not automatically adjust fundamental hyperparameter ranges (e.g., maximum vocabulary size, network depth, learning rate schedules) based on text-specific data characteristics. The search space is usually defined a priori (often by human experts) and stays fixed regardless of whether the text data consists of tweets or pages of an encyclopedia, or whether the vocabulary is 500 words or 50,000 words.</p>
          <p>Recently, a few nascent approaches have hinted at the potential of dataset-driven search space adaptation. Notably, Zero-Shot AutoML techniques combine meta-learning with model selection to configure pipelines without any trial-and-error on the new data. For example, the ZAP framework by Öztürk et al. [<xref ref-type="bibr" rid="ref15">15</xref>] attempts to directly select a pretrained model and its fine-tuning hyperparameters for a new dataset in a zero-shot manner. ZAP trains a meta-model on a large collection of prior tasks, using only trivial meta-features of each dataset (such as image resolution or the number of classes) to predict the best pipeline. In their vision-domain experiments, this approach could successfully pick an appropriate model and hyperparameter configuration without searching, underscoring that even coarse dataset descriptors can be informative for hyperparameter decisions. This idea is very much in line with our goal of text-aware search space customization. However, aside from such cutting-edge research prototypes, mainstream AutoML for NLP still lacks the capability to dynamically tailor the hyperparameter search space based on the dataset.</p>
        </sec>
        <sec id="sec-1-1-4">
          <title>2.4. Text-Oriented Architecture Search &amp; Pruning</title>
          <p>Recent research in AutoML for NLP has focused on tailoring neural architectures to the needs of text data. Neural Architecture Search (NAS) techniques, when specialized for textual tasks, have proven effective in discovering model structures that outperform generic designs. For example, Wang et al. [<xref ref-type="bibr" rid="ref16">16</xref>] propose TextNAS, a search space explicitly designed for text representation, and show that automatically discovered architectures can surpass manually crafted networks on sentiment analysis and inference tasks. These results highlight that text-specific search spaces – incorporating layers like CNNs or RNNs suited to sequence data – can yield state-of-the-art performance where off-the-shelf image-inspired architectures falter. In parallel, the emergence of large pre-trained language models has motivated architecture pruning and adaptation strategies. Rather than treat one model size as fit-for-all, researchers leverage NAS to compress or select architectures appropriate for a given task's resource constraints. For instance, NAS-BERT uses neural architecture search to automatically prune BERT, producing a family of smaller models that retain accuracy across tasks while meeting various latency or memory requirements [<xref ref-type="bibr" rid="ref17">17</xref>]. Collectively, these efforts underscore that architecture-level customization is crucial for optimizing NLP pipelines. By adjusting neural backbones to text characteristics (lengthy inputs, specialty domains, etc.), NAS and pruning approaches lay the foundation for more adaptive AutoML solutions.</p>
          <p>While significant advances have been made in both meta-feature-driven hyperparameter tuning and architecture-level customization (NAS/pruning), these areas have evolved somewhat separately. To date, there remains an absence of integrated methods that dynamically combine architecture selection with hyperparameter optimization based on explicit text dataset characteristics. Our paper directly addresses this gap by introducing a unified framework within AutoPyTorch that adapts both model architectures and hyperparameter configurations based on corpus-specific meta-features. This approach ensures that every component of the AutoML pipeline—from model structure to training parameters—is tailored specifically for the dataset at hand, leading to a more efficient and robust text-classification solution.</p>
        </sec>
      </sec>
      <sec id="sec-2">
        <title>3. Methodology</title>
        <p>Our objective is to enhance the adaptability and performance of AutoML systems for text classification by dynamically customizing both the hyperparameter search space and neural architectures based on intrinsic properties of the input dataset. We implement this within the AutoPyTorch framework, which offers modular extensibility, fine-grained pipeline control, and full support for deep learning models constructed from scratch. This flexibility is especially valuable for textual data, where architectural decisions—such as incorporating attention mechanisms or shaping MLPs—must align with dataset-specific traits like sequence length, lexical diversity, and class imbalance [<xref ref-type="bibr" rid="ref18 ref19">18, 19</xref>].</p>
        <p>Unlike other AutoML frameworks that rely on fixed pipelines or pre-trained models, our approach enables the construction of neural architectures that are directly informed by corpus-level characteristics. Prior work has shown that such dynamic, data-driven architecture generation leads to better generalization and improved performance, particularly in heterogeneous or domain-specific scenarios [<xref ref-type="bibr" rid="ref20 ref21">20, 21</xref>]. These findings motivate our design of a context-aware adaptation mechanism that leverages meta-features to steer both model configuration and training strategy during the AutoML search, effectively bridging the gap between static AutoML systems and the flexible demands of NLP tasks.</p>
        <sec id="sec-2-1">
          <title>3.1. Text-Level Meta-Feature Extraction</title>
          <p>To support both hyperparameter configuration and model architecture design (e.g., number of neurons and layers), we extract a comprehensive set of text-level meta-features using an enhanced analysis function. These include:</p>
          <sec id="sec-2-1-1">
            <title>3.1.1. Text Length</title>
            <p>Text length is a critical meta-feature in NLP that impacts both architecture selection and hyperparameter configuration. Short texts (e.g., fewer than 10 tokens) lack sufficient semantic context, leading to poor model performance, as shown in McCartney et al. [<xref ref-type="bibr" rid="ref22">22</xref>]. Conversely, very long texts exceed transformer input limits (e.g., 512 tokens in BERT) and require either truncation or specialized architectures such as Hierarchical Attention Networks (HAN) [<xref ref-type="bibr" rid="ref23">23</xref>] or Longformer [<xref ref-type="bibr" rid="ref24">24</xref>].</p>
            <p>To address these issues, we compute the average and standard deviation of text length at the corpus level and incorporate them into multiple stages of the AutoML pipeline. Specifically, long average sequence lengths trigger smaller batch sizes (e.g., 8–16 for texts &gt;300 characters), shorter warm-up periods in cosine annealing schedules, and reduced learning rates to stabilize training. Additionally, we adapt the architectural shape of candidate MLP backbones: datasets with long inputs receive “long funnel” configurations to compress high-dimensional sequences, while very short texts invoke compact “diamond” shapes to avoid overfitting. High variance in the length distribution increases regularization (via dropout) to ensure generalization across variable-length inputs.</p>
            <p>This integration ensures that the AutoML system dynamically aligns model complexity and optimization behavior with the distributional characteristics of the input text, improving both efficiency and robustness in the search process.</p>
          </sec>
          <sec id="sec-2-1-2">
            <title>3.1.2. Vocabulary Richness and Lexical Diversity</title>
            <p>Vocabulary richness—commonly measured using the type-token ratio (TTR) or corpus-level approximations such as the unique-to-total word ratio—reflects the semantic complexity of a text corpus. Higher lexical diversity increases the dimensionality of the input space and often correlates with more complex linguistic structure [<xref ref-type="bibr" rid="ref25 ref26">25, 26</xref>], requiring models with greater expressive capacity. From a theoretical standpoint, diverse corpora demand models with higher VC dimension and wider hypothesis classes to capture nuanced patterns [<xref ref-type="bibr" rid="ref27">27</xref>].</p>
            <p>To account for this, our system dynamically adapts architectural complexity based on measured lexical diversity. For datasets with a high unique word ratio (e.g., &gt;0.3), we increase the number of neuron groups and expand the maximum layer width (max_units) in our text-aware MLP-based backbone, allowing the model to better capture semantic variation. Conversely, for low-diversity corpora, we reduce network width and depth to prevent overfitting. In addition, the backbone shape is adjusted: high-diversity texts favor “long funnel” architectures, while simpler datasets default to “diamond”-shaped or regular “brick-like” architectures composed of repeated modules. We also modify activation functions: when diversity is low and the default ReLU may underperform, GELU is automatically selected to improve representation power for simple patterns.</p>
            <p>These adaptations ensure that both the search space and the resulting architectures reflect the semantic variability of the input corpus, allowing the AutoML process to match model expressiveness with linguistic richness.</p>
          </sec>
          <sec id="sec-2-1-3">
            <title>3.1.3. Number of Samples</title>
            <p>The number of training examples is a fundamental meta-feature that influences model complexity, training dynamics, and generalization behavior. Small datasets tend to increase the risk of overfitting—particularly when using high-capacity neural networks—whereas large datasets enable the use of deeper models, longer training schedules, and reduced regularization. This is grounded in statistical learning theory, which links generalization error to both the size of the hypothesis class and the number of available training samples [<xref ref-type="bibr" rid="ref28">28</xref>]. Empirical studies support this connection: Domhan et al. [<xref ref-type="bibr" rid="ref29">29</xref>] and Probst et al. [<xref ref-type="bibr" rid="ref30">30</xref>] show that both training regimes and optimal hyperparameter values (e.g., learning rate, dropout) scale with dataset size.</p>
            <p>In our approach, we compute the number of training samples during meta-feature extraction and use this to adapt the AutoML search space. For datasets with fewer than 1,000 examples, we expand the dropout search space (up to 0.8), reduce learning rates, and favor simpler backbones such as narrow MLPs or shallow attention blocks. Training budgets are also capped to avoid overfitting under data scarcity. In contrast, datasets with more than 10,000 samples prompt relaxed regularization and enable higher-capacity configurations, such as increased max_units and longer training horizons. These modifications ensure that the resulting models are appropriately scaled to the statistical regime of the dataset, improving both robustness and computational efficiency.</p>
          </sec>
          <sec id="sec-2-1-4">
            <title>3.1.4. Label Distribution and Class Imbalance</title>
            <p>Imbalanced class distributions are a common challenge in text classification, where certain categories (e.g., hate speech, fraud cases) are underrepresented but critically important. When the class imbalance ratio—the proportion between the most and least frequent class—exceeds a certain threshold, classification performance for minority classes deteriorates due to model bias toward majority labels [<xref ref-type="bibr" rid="ref31 ref32">31, 32</xref>]. This bias arises from the difficulty of estimating rare class probabilities under skewed priors, which leads to inaccurate posterior approximations, especially when using symmetric loss functions such as cross-entropy.</p>
            <p>In our AutoML framework, class imbalance is measured during the meta-feature extraction phase and directly influences the search space configuration. For datasets with imbalance ratios exceeding 3.0, we expand the dropout range (e.g., up to 0.8), reduce learning rates, and increase the warm-up period in cosine learning rate schedules. These measures are designed to stabilize training under uneven gradient updates and reduce overfitting to dominant classes. Conversely, for nearly balanced datasets (imbalance ratio below 1.5), regularization is relaxed to allow more expressive learning.</p>
            <p>Although architectural constraints are not enforced strictly based on imbalance, our search space prioritizes configurations that are empirically robust to imbalance, such as residual-normalized attention layers or funnel-shaped MLPs. Together, these mechanisms enable the AutoML system to maintain balanced performance across both major and minor classes.</p>
          </sec>
        </sec>
      </sec>
      <sec id="sec-1-2">
        <title>4. Experiments</title>
        <p>To evaluate the effectiveness of our context-aware AutoML framework for text classification, we conducted comprehensive experiments on 20 diverse datasets. Our experiments were designed to compare the performance of our proposed context-aware AutoPyTorch against a strong baseline using static configurations in AutoPyTorch.</p>
        <sec id="sec-1-2-2">
          <title>4.1. Datasets</title>
          <p>We conduct experiments on a diverse collection of datasets, including a stratified 30% subset of each task from the GLUE benchmark [<xref ref-type="bibr" rid="ref33">33</xref>], a widely used evaluation suite for natural language understanding. GLUE (General Language Understanding Evaluation) consists of multiple sentence-level and sentence-pair classification tasks, covering linguistic phenomena such as entailment, paraphrase detection, sentiment analysis, and grammaticality judgment. Our subset selection balances computational feasibility and label distribution fidelity, enabling efficient neural architecture search within AutoPyTorch while maintaining representative task characteristics.</p>
          <p>In addition to GLUE, we evaluate our approach on selected Kaggle datasets that span various text classification domains (e.g., emotion detection, spam filtering), as well as two private corpora in German. These private datasets address real-world classification tasks and introduce additional linguistic and domain-specific variability, allowing us to assess the generalizability of our context-aware AutoML framework across both English and German texts. Detailed dataset statistics are provided in Table 1.</p>
        </sec>
        <p>Our search space includes two adaptive neural backbones:
• MetaMLP – a custom MLP architecture whose depth, width, and shape are dynamically adapted based on meta-features such as text length, lexical diversity, number of samples, and class imbalance.
• ContextualAttentionNet – a lightweight attention-based model built with multi-head self-attention layers, with structural parameters (e.g., number of heads, embedding dimensions) conditioned on input characteristics.</p>
        <p>These architectures were treated as a categorical hyperparameter within the AutoML pipeline, allowing the search process to explore and select the most appropriate model type using Bayesian optimization.</p>
        <p>The text-aware version of our pipeline integrates the meta-feature extraction step at the beginning of each AutoPyTorch run. The extracted corpus-level properties are then used to dynamically adapt the hyperparameter search space. Key adaptations include:
• Batch size adjustments based on average sequence length;
• Learning rate and dropout range scaling based on dataset size and class imbalance;
• Architectural shaping (e.g., diamond vs. funnel MLPs) based on input diversity and length variance.</p>
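        <p>The adaptation rules above can be sketched as simple threshold logic over hyperparameter bounds. This is a hedged illustration using plain (low, high) tuples rather than AutoPyTorch's actual configuration-space API; the default bounds and the function name are assumptions of the sketch, while the thresholds (texts &gt;300 characters, fewer than 1,000 samples, imbalance ratio &gt;3.0, dropout up to 0.8) come from the text:</p>

```python
# Sketch of rule-based search-space adaptation as described above, using
# plain (low, high) bounds instead of a real ConfigSpace object.
def adapt_search_space(meta):
    space = {                              # assumed default static bounds
        "batch_size": (16, 128),
        "learning_rate": (1e-4, 1e-2),
        "dropout": (0.0, 0.5),
    }
    if meta["avg_length"] > 300:           # long texts: smaller batches
        space["batch_size"] = (8, 16)
    if meta["n_samples"] < 1000 or meta["imbalance_ratio"] > 3.0:
        space["dropout"] = (0.0, 0.8)      # stronger regularization
        lo, hi = space["learning_rate"]
        space["learning_rate"] = (lo / 10, hi / 10)  # reduced learning rates
    return space
```

        <p>The optimizer then samples configurations only from the adapted bounds, so trials are concentrated in the region the meta-features indicate is plausible for the dataset at hand.</p>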
      </sec>
      <sec id="sec-1-3">
        <title>4.5. Results</title>
        <sec id="sec-1-3-1">
          <title>Tables 2 and 3 summarize the performance of our context</title>
          <p>aware AutoML pipeline in comparison to a static
AutoPyTorch baseline across 20 text classification datasets.
The evaluation was based on four widely used metrics:
Accuracy, F1-micro, Matthews Correlation Coeficient</p>
        </sec>
        <sec id="sec-1-3-2">
          <title>All experiments were constrained to a wall-clock time</title>
          <p>of 3,000 seconds (approx. 2 hours) and a per-model
training time of 600 seconds. We used multi-fidelity
optimization via Successive Halving with a training budget
4.2. Embedding Method ranging from 10 to 100 epochs.</p>
          <p>All runs were executed on a single NVIDIA A100 GPU
aFmtoeordcaeollnlfterexoxpmteuratihlmiezeeSndetntset,xewnteceemuTbsreeaddndstfihnoergmsa.leTrlsh-elMibmirnoairdyLeMlt-oenLgc6eo-ndveer2s- lfomaatcinhginpeowinitthpr4e0ciGsiBono. f memory, using standard 32-bit
each input text into a fixed-size dense vector of 384
dimensions. 4.4. Baseline Configuration</p>
          <p>To ensure a controlled comparison between the
baseline and our proposed method, the embedding layer was
kept identical across all experimental conditions.</p>
        </sec>
        <sec id="sec-1-3-3">
          <title>To establish a fair comparison, we define a strong baseline</title>
          <p>using the unmodified AutoPyTorch framework and the
same text embedding method (MiniLM). In this setting,
the hyperparameter search space remains static and is
not influenced by any dataset-specific meta-features.</p>
        </sec>
      </sec>
      <sec id="sec-1-4">
        <title>4.3. Model Framework &amp; Search Strategy</title>
        <p>We implemented all experiments using the AutoPyTorch
framework, leveraging its modular design for deep
learning pipelines and extensible search space control. To
ensure focus on neural architecture optimization, the
traditional machine learning components (e.g., random
forests, SVMs) were disabled for our approach. Only deep
learning backbones were allowed in the search space.
</p>
        <p>These metrics were selected to reflect the characteristics of each task: for example, Accuracy for balanced classification tasks, F1-micro for imbalanced or multi-class problems, MCC for binary grammaticality judgments (e.g., CoLA), and Pearson correlation for sentence similarity (STS-B). Notably, these are also the official evaluation metrics adopted in the GLUE benchmark [33], ensuring compatibility and comparability with prior NLP research.</p>
        <table-wrap id="tbl2">
          <label>Table 2</label>
          <caption>
            <p>Comparison between Baseline and Custom Hyperparameter Search across GLUE datasets (Accuracy/F1-Micro/MCC/Pearson, depending on the task).</p>
          </caption>
          <table>
            <thead>
              <tr><th>Dataset</th><th>Baseline</th><th>Custom</th></tr>
            </thead>
            <tbody>
              <tr><td>CoLA</td><td>0.05</td><td>-0.2</td></tr>
              <tr><td>MRPC</td><td>72.7%</td><td>74.13%</td></tr>
              <tr><td>QNLI</td><td>53.3%</td><td>50.3%</td></tr>
              <tr><td>QQP</td><td>49.17%</td><td>49.17%</td></tr>
              <tr><td>RTE</td><td>52.8%</td><td>53.5%</td></tr>
              <tr><td>SST-2</td><td>2.1%</td><td>83.4%</td></tr>
              <tr><td>STS-B</td><td>24.7%</td><td>30.72%</td></tr>
              <tr><td>WNLI</td><td>17.6%</td><td>54.9%</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <table-wrap id="tbl3">
          <label>Table 3</label>
          <caption>
            <p>Prediction performance comparison between Baseline and Custom Hyperparameter Search (Accuracy/F1-Micro, %).</p>
          </caption>
          <table>
            <thead>
              <tr><th>Dataset</th><th>Baseline</th><th>Custom</th></tr>
            </thead>
            <tbody>
              <tr><td>Occupation</td><td>77.9</td><td>77.3</td></tr>
              <tr><td>BBC</td><td>97.3</td><td>96.85</td></tr>
              <tr><td>Cyber</td><td>76.57</td><td>77.2</td></tr>
              <tr><td>Emails</td><td>66.9</td><td>76.15</td></tr>
              <tr><td>Emotion</td><td>87.1</td><td>88.3</td></tr>
              <tr><td>Framing</td><td>69.1</td><td>71.3</td></tr>
              <tr><td>Humor</td><td>91.8</td><td>92.87</td></tr>
              <tr><td>Spam</td><td>96.8</td><td>96.9</td></tr>
              <tr><td>Math</td><td>15.14</td><td>16.3</td></tr>
              <tr><td>Job</td><td>96.46</td><td>96.27</td></tr>
              <tr><td>Finished Sentence</td><td>79.43</td><td>81.12</td></tr>
              <tr><td>Troll</td><td>52.88</td><td>56.73</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>Overall, our method demonstrates consistent improvements, particularly on tasks characterized by limited training data, class imbalance, or high lexical diversity.</p>
        <p>On the GLUE benchmark, our pipeline yields significant gains on several tasks. Notably, WNLI accuracy increases from 17.6% to 54.9%, and SST-2 sees a dramatic rise from 2.1% to 83.4%. These results highlight the effectiveness of our adaptive architecture and regularization mechanisms in low-resource and sentiment-sensitive tasks. We also observe improvements in STS-B, where the Pearson correlation increases from 24.7% to 30.7%.</p>
        <p>Conversely, slight performance drops are observed in QNLI and CoLA, which may suggest that the current adaptation strategy occasionally introduces suboptimal regularization or architectural choices. For QQP, both the baseline and our pipeline failed to build a viable neural or ensemble model within the computational budget, resulting in fallback to a dummy classifier.</p>
        <p>Our pipeline also outperforms the baseline on the majority of Kaggle and private datasets. Substantial gains are observed on Emails (from 66.9% to 76.2%), Troll (from 52.9% to 56.7%), and Finished Sentence (from 79.4% to 81.1%), indicating that context-aware adaptation improves performance in tasks with either noisy data or subtle class distinctions. A slight performance decrease is observed on a few tasks, such as BBC and Occupation. In the case of Occupation, the drop in performance can be attributed to the nature of the dataset: it consists of short, open-ended answers to questions about a person’s job, which are then mapped to the top-level labels of the German occupation classification system (KldB). These free-text responses are often terse (e.g., one- or two-word entries like “Technician” or “Sales”) and lack sufficient contextual information to support nuanced classification. As a result, the dynamic adaptation mechanisms, which are designed to adjust architectures and hyperparameters based on richer linguistic cues, have limited room to operate effectively. The scarcity of semantic context may also hinder the effectiveness of embeddings and prevent the model from learning discriminative patterns across fine-grained occupational categories.</p>
        <p>Overall, the results support the utility of dynamic search space tailoring across varied domains and textual characteristics.</p>
        <p><bold>5. Conclusion and Future Works</bold></p>
        <p>In this work, we proposed a context-aware AutoML framework that dynamically adapts the hyperparameter search space and neural architecture configurations in response to corpus-level text features. Implemented within the AutoPyTorch ecosystem, our approach integrates dataset-driven meta-feature extraction with a modular design for backbone selection and training parameter control. Experiments across 20 datasets, including subsets of GLUE and diverse public corpora, demonstrate consistent improvements in classification performance, particularly in scenarios with imbalanced classes, small training sets, or high lexical diversity.</p>
        <p>By coupling structural and optimization-level decisions to dataset-specific traits, our framework offers a promising direction for more efficient and effective AutoML in NLP. The results validate that even lightweight corpus features (e.g., text length, label imbalance) can yield meaningful adaptations to both model topology and hyperparameter scheduling. Note, however, that our current evaluation focuses solely on classification tasks using moderately sized monolingual datasets; the applicability of our approach to large-scale corpora, multilingual benchmarks, or more complex NLP tasks (e.g., sequence labeling or generation) remains unexplored.</p>
        <p>While our method demonstrates strong empirical gains, several important avenues remain for future research.</p>
        <p>First, our current system relies on a limited set of meta-features, such as average text length, vocabulary diversity, and class imbalance. In future work, we aim to extend this analysis to include finer-grained linguistic and structural features such as average sentence length, part-of-speech density, punctuation density, unique character ratio, and readability scores. These features may offer deeper insight into the semantic and syntactic complexity of text, enabling more informed search space adjustments.</p>
        <p>Second, while we implement a contextual search space
by mapping meta-features to hyperparameter ranges, this
process currently uses static, hand-crafted rules. More
expressive and structured search spaces could allow
hyperparameter relevance and conditionality to adapt
dynamically based on dataset characteristics. For instance,
certain architecture components or regularization
parameters could be activated only when specific linguistic
conditions are met, allowing for more flexible and
principled adaptation.</p>
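        <p>Such conditional activation can be sketched in a toy form as follows; the conditions, thresholds, and hyperparameter names are hypothetical illustrations (AutoPyTorch itself expresses conditionality through its ConfigSpace-based search space definitions):</p>
        <preformat>
```python
# Toy sketch of a conditional search space: a hyperparameter only becomes
# part of the space when a linguistic condition on the dataset holds.
# All names and thresholds here are illustrative, not our actual rules.

def build_space(meta):
    space = {"learning_rate": (1e-5, 1e-2)}
    # Attention pooling is only considered for long inputs.
    if meta["avg_len"] > 64:
        space["attention_heads"] = [2, 4, 8]
    # Label smoothing is only searched when classes are imbalanced.
    if meta["imbalance"] > 2.0:
        space["label_smoothing"] = (0.0, 0.2)
    return space
```
        </preformat>
        <p>The optimizer then only spends budget on a hyperparameter when the corresponding linguistic condition is met.</p>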
        <p>Finally, our current fusion strategy for resolving conflicts between feature influences on the same hyperparameter is based on simple heuristics, such as averaging
suggested values or intersecting ranges. In future work,
we plan to investigate more flexible fusion mechanisms,
such as weighting meta-features by importance or
learning fusion policies from prior task performance. These
improvements could make the contextual adaptation
process more scalable, robust, and interpretable across a
wide range of text classification tasks.</p>
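        <p>The two heuristics mentioned above can be sketched as follows; the function names and the fallback rule for non-overlapping ranges are our own illustrative choices, not the exact rules of our implementation:</p>
        <preformat>
```python
# Illustrative fusion heuristics: when several meta-features suggest
# different settings for one hyperparameter, intersect suggested ranges
# and average suggested point values. Names and fallback are hypothetical.

def fuse_ranges(ranges):
    # Intersection of (lo, hi) suggestions; falls back to averaging the
    # bounds when the suggested ranges do not overlap.
    lo = max(r[0] for r in ranges)
    hi = min(r[1] for r in ranges)
    if lo > hi:
        lo = sum(r[0] for r in ranges) / len(ranges)
        hi = sum(r[1] for r in ranges) / len(ranges)
    return (lo, hi)

def fuse_values(values):
    # Point-value suggestions are simply averaged.
    return sum(values) / len(values)
```
        </preformat>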
      </sec>
    </sec>
    <sec id="sec-2">
      <title>6. Limitations</title>
      <sec id="sec-2-1">
        <p>Despite the overall effectiveness of our context-aware search space design, several limitations remain.</p>
        <p>First, while our system considers multiple meta-features to guide hyperparameter configurations, their influence is combined using static heuristics. This rule-based fusion does not account for potential interactions or conflicts between features, and lacks the flexibility to adapt based on task-specific dynamics or prior performance.</p>
        <p>Second, the increased complexity introduced by text-specific search space customization results in higher computational cost. In most cases, we observed longer processing times due to the additional overhead from meta-feature analysis, search space updates, and more expansive architecture evaluations. This may limit the method’s applicability in time-constrained or resource-limited environments.</p>
        <p>[12] C. Thornton, F. Hutter, H. H. Hoos, K. Leyton-Brown, Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms, in: Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, 2013, pp. 847–855.</p>
        <p>[13] M. Feurer, F. Hutter, Automated machine learning, Cham: Springer (2019) 113–134.</p>
        <p>[14] M. Wistuba, N. Schilling, L. Schmidt-Thieme, Hyperparameter search space pruning – a new component for sequential model-based hyperparameter optimization, in: Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2015, Porto, Portugal, September 7–11, 2015, Proceedings, Part II 15, Springer, 2015, pp. 104–119.</p>
        <p>[15] E. Öztürk, F. Ferreira, H. Jomaa, L. Schmidt-Thieme, J. Grabocka, F. Hutter, Zero-shot AutoML with pretrained models, in: International Conference on Machine Learning, PMLR, 2022, pp. 17138–17155.</p>
        <p>[16] Y. Wang, Y. Yang, Y. Chen, J. Bai, C. Zhang, G. Su, X. Kou, Y. Tong, M. Yang, L. Zhou, TextNAS: A neural architecture search space tailored for text representation, in: Proceedings of the AAAI conference on artificial intelligence, volume 34, 2020, pp. 9242–9249.</p>
        <p>[17] J. Xu, X. Tan, R. Luo, K. Song, J. Li, T. Qin, T.-Y. Liu, NAS-BERT: Task-agnostic and adaptive-size BERT compression with neural architecture search, in: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery &amp; Data Mining, 2021, pp. 1933–1943.</p>
        <p>[18] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), 2019, pp. 4171–4186.</p>
        <p>[19] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, Q. V. Le, XLNet: Generalized autoregressive pretraining for language understanding, Advances in neural information processing systems 32 (2019).</p>
        <p>[20] L. Zimmer, M. Lindauer, F. Hutter, Auto-PyTorch: Multi-fidelity metalearning for efficient and robust AutoDL, IEEE transactions on pattern analysis and machine intelligence 43 (2021) 3079–3090.</p>
        <p>[21] Y. Li, Y. Shen, W. Zhang, Y. Chen, H. Jiang, M. Liu, J. Jiang, J. Gao, W. Wu, Z. Yang, et al., OpenBox: A generalized black-box optimization service, in: Proceedings of the 27th ACM SIGKDD conference on knowledge discovery &amp; data mining, 2021, pp. 3209–3219.</p>
        <p>[22] A. McCartney, S. Hensman, L. Longo, How short is a piece of string?: the impact of text length and text augmentation on short-text classification accuracy (2017).</p>
        <p>[23] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, E. Hovy, Hierarchical attention networks for document classification, in: Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: human language technologies, 2016, pp. 1480–1489.</p>
        <p>[24] I. Beltagy, M. E. Peters, A. Cohan, Longformer: The long-document transformer, arXiv preprint arXiv:2004.05150 (2020).</p>
        <p>[25] M. Monteiro, C. K. James, M. Kloft, S. Fellenz, Characterizing text datasets with psycholinguistic features, in: Findings of the Association for Computational Linguistics: EMNLP 2024, 2024, pp. 14977–14990.</p>
        <p>[26] M. Sokolova, Big text advantages and challenges: classification perspective, International Journal of Data Science and Analytics 5 (2018) 1–10.</p>
        <p>[27] C. Zhang, S. Bengio, M. Hardt, B. Recht, O. Vinyals, Understanding deep learning requires rethinking generalization, arXiv preprint arXiv:1611.03530 (2016).</p>
        <p>[28] V. Vapnik, Statistical learning theory, 1998.</p>
        <p>[29] T. Domhan, J. T. Springenberg, F. Hutter, Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves, in: IJCAI, volume 15, 2015, pp. 3460–3468.</p>
        <p>[30] P. Probst, M. N. Wright, A.-L. Boulesteix, Hyperparameters and tuning strategies for random forest, Wiley Interdisciplinary Reviews: data mining and knowledge discovery 9 (2019) e1301.</p>
        <p>[31] J. L. Leevy, T. M. Khoshgoftaar, R. A. Bauder, N. Seliya, A survey on addressing high-class imbalance in big data, Journal of Big Data 5 (2018) 1–30.</p>
        <p>[32] S. Uddin, H. Lu, Dataset meta-level and statistical features affect machine learning performance, Scientific Reports 14 (2024) 1670.</p>
        <p>[33] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, S. R. Bowman, GLUE: A multi-task benchmark and analysis platform for natural language understanding, arXiv preprint arXiv:1804.07461 (2018).</p>
        <p><bold>Declaration on Generative AI.</bold> During the preparation of this work, the authors used ChatGPT (OpenAI) and Grammarly to improve writing style and to check grammar and spelling. After using these tools, the authors reviewed and edited the content as needed and take full responsibility for the publication’s content.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>W.</given-names>
            <surname>Lam</surname>
          </string-name>
          , K.-
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lai</surname>
          </string-name>
          ,
          <article-title>A meta-learning approach for text categorization</article-title>
          ,
          <source>in: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval</source>
          ,
          <year>2001</year>
          , pp.
          <fpage>303</fpage>
          -
          <lpage>309</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J. G.</given-names>
            <surname>Madrid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. J.</given-names>
            <surname>Escalante</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Morales</surname>
          </string-name>
          ,
          <article-title>Metalearning of textual representations</article-title>
          ,
          <source>in: Joint European Conference on Machine Learning and Knowledge Discovery in Databases</source>
          , Springer,
          <year>2019</year>
          , pp.
          <fpage>57</fpage>
          -
          <lpage>67</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Chollet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <article-title>Autokeras: An automl library for deep learning</article-title>
          ,
          <source>Journal of machine Learning research 24</source>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>X.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mueller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Erickson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Smola</surname>
          </string-name>
          ,
          <article-title>Benchmarking multimodal automl for tabular data with text fields</article-title>
          ,
          <source>arXiv preprint arXiv:2111.02705</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.-A.</given-names>
            <surname>Zöller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. F.</given-names>
            <surname>Huber</surname>
          </string-name>
          ,
          <article-title>Benchmark and survey of automated machine learning frameworks</article-title>
          ,
          <source>Journal of artificial intelligence research 70</source>
          (
          <year>2021</year>
          )
          <fpage>409</fpage>
          -
          <lpage>472</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Saleem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kumarapathirage</surname>
          </string-name>
          ,
          <article-title>A systematic review of automl for text classification: From theory to practice</article-title>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>P.</given-names>
            <surname>Safikhani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Broneske</surname>
          </string-name>
          ,
          <article-title>Enhancing autonlp with fine-tuned bert models: an evaluation of text representation methods for autopytorch</article-title>
          ,
          <source>Available at SSRN</source>
          <volume>4585459</volume>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P.</given-names>
            <surname>Safikhani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Broneske</surname>
          </string-name>
          ,
          <article-title>Automl meets hugging face: Domain-aware pretrained model selection for text classification</article-title>
          ,
          <source>in: Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)</source>
          ,
          <year>2025</year>
          , pp.
          <fpage>466</fpage>
          -
          <lpage>473</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hoskens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-F.</given-names>
            <surname>Moens</surname>
          </string-name>
          ,
          <article-title>Evolutionary learning of meta-rules for text classification</article-title>
          ,
          <source>in: Proceedings of the Genetic and Evolutionary Computation Conference Companion</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>131</fpage>
          -
          <lpage>132</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Ferreira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Brazdil</surname>
          </string-name>
          ,
          <article-title>Workflow recommendation for text classification with active testing method</article-title>
          ,
          <source>in: Workshop AutoML</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>R.</given-names>
            <surname>Desai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kothari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Surve</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shekokar</surname>
          </string-name>
          ,
          <article-title>Textbrew: Automated model selection and hyperparameter optimization for text classification</article-title>
          ,
          <source>International Journal of Advanced Computer Science and Applications 13</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>