<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Context-Aware Search Space Adaptation of Hyperparameters and Architectures for AutoML in Text Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Parisa Safikhani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Broneske</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Otto von Guericke University</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>The German Centre for Higher Education Research and Science Studies (DZHW)</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>While Automated Machine Learning (AutoML) systems have shown strong performance on structured data, their application to natural language processing (NLP) tasks remains limited by static, task-agnostic search spaces. In this work, we propose a context-aware extension of AutoPyTorch that dynamically adapts both the hyperparameter search space and neural architecture configuration based on corpus-level meta-features. Our approach extracts interpretable textual statistics, such as average sequence length, vocabulary richness, and class imbalance, to guide the configuration of key hyperparameters. We also introduce two adaptive neural backbones whose structures are shaped by these meta-features to improve model expressiveness and generalization. Experiments on 20 diverse text classification datasets, including subsets of GLUE, selected Kaggle benchmarks, and private corpora, demonstrate consistent performance improvements over strong baselines, particularly on datasets with limited training samples or severe class imbalance. Our results highlight the effectiveness of integrating dataset-level insights into the AutoML search process for NLP.</p>
      </abstract>
      <kwd-group>
        <kwd>AutoML</kwd>
        <kwd>Text Classification</kwd>
        <kwd>Hyperparameter Optimization</kwd>
        <kwd>Meta-Features</kwd>
        <kwd>Neural Architecture Search</kwd>
        <kwd>Context-Aware Modeling</kwd>
        <kwd>AutoPyTorch</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>AutoML frameworks have significantly advanced the
democratization of machine learning by automating the
design and optimization of learning pipelines. While
these systems have shown strong performance on
structured data, their extension to NLP tasks remains limited
due to the inherent complexity and diversity of textual
data. Text classification, a core NLP task, presents unique
challenges stemming from variable input lengths, diverse
syntactic structures, and high lexical variation—factors
that are often overlooked in conventional AutoML
workflows.</p>
      <p>
        Most current AutoML approaches for NLP adopt static
pipeline configurations and search spaces, treating all
datasets uniformly regardless of their linguistic
characteristics. Even when modern frameworks include neural
networks or transformer models, their hyperparameter
search is usually performed within generic, manually
designed boundaries. This static design neglects crucial
dataset-specific properties such as text length
distribution, vocabulary richness, or class imbalance, which are
known to affect both model architecture performance and
training dynamics [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. As a result, these frameworks
may perform poorly on unusual or domain-specific text
datasets, where generic configurations fail to address
context-specific requirements.
      </p>
      <p>To address this gap, we propose a context-aware
extension of an AutoML Framework that dynamically adapts
its hyperparameter search space and model architecture
decisions based on corpus-level meta-features. Our
approach integrates a systematic extraction of statistical
and linguistic characteristics from each dataset—such
as text length variability, lexical diversity, sample size,
and class distribution—and uses these to inform both the
configuration of search spaces and the structural design
of neural backbones. By leveraging these insights, the
system can better align model complexity, optimization
schedules, and architectural choices with the demands
of the data.</p>
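      <p>As an illustration of the kind of corpus-level statistics involved, the following is a minimal sketch in Python (the function name and feature keys are our own, not the system's actual implementation):</p>

```python
# Illustrative sketch: corpus-level meta-features of the kind described
# above (length statistics, lexical diversity, class distribution).
from collections import Counter
from statistics import mean, pstdev

def extract_meta_features(texts, labels):
    """Compute simple corpus-level statistics from texts and labels."""
    lengths = [len(t.split()) for t in texts]            # token count per text
    tokens = [w for t in texts for w in t.lower().split()]
    class_counts = Counter(labels)
    return {
        "n_samples": len(texts),
        "avg_length": mean(lengths),
        "std_length": pstdev(lengths),
        # unique-to-total word ratio as a cheap lexical-diversity proxy
        "unique_word_ratio": len(set(tokens)) / max(len(tokens), 1),
        "n_classes": len(class_counts),
        # ratio between the most and least frequent class
        "imbalance_ratio": max(class_counts.values()) / min(class_counts.values()),
    }
```

      <p>Statistics of this form are cheap to compute once per dataset and can then condition every later stage of the search.</p>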
      <p>This paper makes two main contributions: First, we
introduce a context-aware mechanism for dynamically
adapting the hyperparameter search space in AutoML
based on text-level meta-features such as text length,
vocabulary diversity, and class imbalance. This enables the
AutoML process to tailor its optimization bounds—e.g.,
for batch size, learning rate, and dropout—according to
the statistical profile of each dataset. Second, we propose
two adaptive neural backbones, MetaMLP and
ContextualAttentionNet, whose configurations are shaped by
statistical and lexical characteristics of the input text. These
backbones enable the system to construct models from
scratch that better reflect the structural and distributional
properties of the data. Together, these innovations
facilitate a more robust and efficient adaptation of AutoML
pipelines to the unique demands of text classification
tasks.</p>
      <sec id="sec-1-1">
        <title>2. Related Works</title>
        <p>Automated Machine Learning (AutoML) aims to streamline model development by automating the processes of feature engineering, model selection, and hyperparameter tuning. While AutoML has become widely successful for structured data, its adaptation to natural language processing (NLP) tasks, particularly text classification, poses unique challenges due to the complexity and diversity of textual data. In this section, we review existing research relevant to AutoML applications in NLP, focusing on the limitations of current AutoML frameworks, the role of dataset-driven meta-features, and recent developments in customizing both hyperparameter search spaces and neural architectures specifically tailored to text data.</p>
        <sec id="sec-1-1-1">
          <title>2.1. AutoML for NLP Tasks and Search Space Design</title>
          <p>Automated machine learning has traditionally excelled on structured (tabular) data, whereas applying it to raw text required additional effort to convert text into features [<xref ref-type="bibr" rid="ref2">2</xref>]. In recent years, several AutoML frameworks have been extended to handle text classification, integrating NLP-specific models and pipeline steps. For example, AutoGluon and AutoKeras can handle deep NLP models (including modern transformers) for classification, with search spaces that encompass state-of-the-art architectures like BERT and RoBERTa [<xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>]. AutoKeras even adjusts its search space based on the task modality: it detects when the input is text and accordingly includes appropriate text vectorization and neural network blocks in the configuration space [<xref ref-type="bibr" rid="ref3">3</xref>]. Cloud-based AutoML services such as Azure AutoML typically treat text as generic input features (e.g., via TF-IDF or bag-of-words) and do not customize hyperparameter settings based on dataset-specific characteristics [<xref ref-type="bibr" rid="ref5">5</xref>].</p>
          <p>Notably, researchers have evaluated general AutoML tools on NLP tasks by converting text to fixed embeddings (e.g., using Sentence-BERT to obtain features) to fit into a tabular AutoML pipeline [<xref ref-type="bibr" rid="ref6 ref7 ref8">6, 7, 8</xref>]. These tools can discover effective models for text data, but they typically operate within broad, fixed search spaces and often lack mechanisms for fine-grained hyperparameter tuning tailored to a specific corpus. In other words, current AutoML frameworks for NLP tend to follow a one-size-fits-all approach, leaving potential efficiency gains from dataset-specific adaptation largely untapped. Moreover, they rely on selecting from existing machine learning or neural network architectures, rather than dynamically constructing models based on the unique characteristics of the textual data.</p>
        </sec>
        <sec id="sec-1-1-2">
          <title>2.2. Meta-Features and Meta-Learning for AutoML in NLP</title>
          <p>To guide model selection and hyperparameter optimization, many studies have leveraged dataset characteristics (so-called meta-features). Early work by Lam and Lai [<xref ref-type="bibr" rid="ref1">1</xref>] characterized text datasets with a small set of features (e.g., number of documents, vocabulary size, average document length) to predict the classification error of different algorithms, thus recommending the best classifier for the task. This pioneering meta-learning approach demonstrated that simple corpus-level metrics can inform algorithm selection. Subsequent research greatly expanded the repertoire of meta-features for NLP. For instance, Madrid et al. [<xref ref-type="bibr" rid="ref2">2</xref>] define 72 corpus-level attributes – covering general dataset properties, class imbalance, lexical diversity, stylometry, statistical measures, and readability indices – to drive automated selection of text representation techniques. Many of these features – such as the average and standard deviation of document length, vocabulary richness (unique word ratios), and the number of classes – capture precisely the kind of information used in our approach, albeit for a different goal: automatic customization of the search space rather than selection of a text representation method. By extracting such meta-features from a new text dataset, one can compare them to previously seen tasks and infer which models or hyperparameter settings might be appropriate.</p>
          <p>Researchers have applied meta-features in various meta-learning systems for NLP. Gomez et al. [<xref ref-type="bibr" rid="ref9">9</xref>] introduced an evolutionary meta-learning method (ELMR) that uses 11 statistical meta-features of a text corpus to evolve rules for selecting the optimal classifier. Their approach automatically learned decision rules (via a genetic algorithm) to identify, for example, when a Naïve Bayes vs. SVM vs. neural model would be most effective, based on corpus characteristics. In a broader approach, Ferreira and Brazdil [<xref ref-type="bibr" rid="ref10">10</xref>] leveraged an active testing strategy to recommend full text-classification pipelines, evaluating candidate preprocessing methods and classifiers on small data samples and using meta-features to pick the best pipeline. Meta-learning has also been used to warm-start hyperparameter optimization in general AutoML frameworks. Ferreira and Brazdil [<xref ref-type="bibr" rid="ref10">10</xref>] successfully employed 46 dataset descriptors to initialize Bayesian hyperparameter search in Auto-Sklearn, improving efficiency by starting from configurations that worked well on similar prior datasets. More recently, Desai et al. [<xref ref-type="bibr" rid="ref11">11</xref>] built a text AutoML system that uses a minimal set of only three meta-features (e.g., dataset size, average sentence length) to choose among three Transformer architectures (BERT, ALBERT, XLNet) for a classification task. Despite its limited scope (restricted to only a few models), this work demonstrated the promise of corpus features in guiding model selection for NLP. Our approach extends these concepts further by integrating a set of corpus-level characteristics to dynamically guide not only architecture selection but also hyperparameter tuning within AutoPyTorch, leveraging its capability to construct neural networks from scratch, which is essential for effectively handling the diverse and complex nature of textual data.</p>
          <p>In summary, prior research shows that incorporating dataset-derived features, ranging from simple counts to complex linguistic metrics, can significantly enhance automated model selection and configuration in NLP. However, these approaches predominantly focus on selecting among predefined models or representations. To the best of our knowledge, this work is the first to dynamically adjust the hyperparameter search space itself based on dataset-derived meta-knowledge, specifically aimed at constructing deep learning models from scratch.</p>
        </sec>
        <sec id="sec-1-1-3">
          <title>2.3. Hyperparameter Search Space Adaptation in AutoML</title>
          <p>Typical AutoML systems rely on a fixed, expert-designed search space intended to be generic across many datasets. For example, Auto-WEKA formalized the Combined Algorithm Selection and Hyperparameter optimization (CASH) problem—searching over a joint space of 27 base classifiers, their respective hyperparameters, and various feature-selection techniques—using Bayesian optimization to navigate hundreds of parameters without dataset-specific specialization [<xref ref-type="bibr" rid="ref12">12</xref>]. Auto-Sklearn similarly constructs a broad configuration space of 15 classifier types and over 110 hyperparameters (spanning preprocessing and classifiers) yet remains agnostic to the particular characteristics of the input data [<xref ref-type="bibr" rid="ref13">13</xref>]. While such comprehensive spaces can cover many scenarios, they are often inefficient: many configurations may be irrelevant or suboptimal for a particular text dataset. For instance, a small set of short tweets likely does not require deep ensembles or large n-gram ranges, yet a static search space devotes trials indiscriminately to these options. This inefficiency has motivated research into reducing or tuning the search space based on prior knowledge.</p>
          <p>One line of work is search space transfer via meta-learning. Wistuba et al. [<xref ref-type="bibr" rid="ref14">14</xref>] first proposed to leverage experience from previous hyperparameter optimizations to constrain the search for a new task. In their approach, the hyperparameter space is narrowed to a region (defined by a center point and radius) believed to contain good solutions, effectively pruning away less promising regions. They explored designing a smaller, task-specific search space for the target problem instead of using the default full space. By transferring knowledge of what configurations worked well on similar datasets, these methods aim to accelerate HPO by focusing on the most relevant parts of the space. Such techniques have shown benefits in general AutoML settings, reducing the dimensionality or bounds of hyperparameters to improve search efficiency. However, applying this idea in the NLP domain remains relatively unexplored – current AutoML tools do not automatically adjust fundamental hyperparameter ranges (e.g., maximum vocabulary size, network depth, learning rate schedules) based on text-specific data characteristics. The search space is usually defined a priori (often by human experts) and stays fixed regardless of whether the text data consists of tweets or pages of an encyclopedia, or whether the vocabulary is 500 words or 50,000 words.</p>
          <p>Recently, a few nascent approaches have hinted at the potential of dataset-driven search space adaptation. Notably, Zero-Shot AutoML techniques combine meta-learning with model selection to configure pipelines without any trial-and-error on the new data. For example, the ZAP framework by Öztürk et al. [<xref ref-type="bibr" rid="ref15">15</xref>] attempts to directly select a pretrained model and its fine-tuning hyperparameters for a new dataset in a zero-shot manner. ZAP trains a meta-model on a large collection of prior tasks, using only trivial meta-features of each dataset (such as image resolution or the number of classes) to predict the best pipeline. In their vision-domain experiments, this approach could successfully pick an appropriate model and hyperparameter configuration without searching, underscoring that even coarse dataset descriptors can be informative for hyperparameter decisions. This idea is very much in line with our goal of text-aware search space customization. However, aside from such cutting-edge research prototypes, mainstream AutoML for NLP still lacks the capability to dynamically tailor the hyperparameter search space based on the dataset.</p>
        </sec>
        <sec id="sec-1-1-4">
          <title>2.4. Text-Oriented Architecture Search &amp; Pruning</title>
          <p>Recent research in AutoML for NLP has focused on tailoring neural architectures to the needs of text data. Neural Architecture Search (NAS) techniques, when specialized for textual tasks, have proven effective in discovering model structures that outperform generic designs. For example, Wang et al. [<xref ref-type="bibr" rid="ref16">16</xref>] propose TextNAS, a search space explicitly designed for text representation, and show that automatically discovered architectures can surpass manually crafted networks on sentiment analysis and inference tasks. These results highlight that text-specific search spaces – incorporating layers like CNNs or RNNs suited to sequence data – can yield state-of-the-art performance where off-the-shelf image-inspired architectures falter. In parallel, the emergence of large pre-trained language models has motivated architecture pruning and adaptation strategies. Rather than treat one model size as fit-for-all, researchers leverage NAS to compress or select architectures appropriate for a given task's resource constraints. For instance, NAS-BERT uses neural architecture search to automatically prune BERT, producing a family of smaller models that retain accuracy across tasks while meeting various latency or memory requirements [<xref ref-type="bibr" rid="ref17">17</xref>]. Collectively, these efforts underscore that architecture-level customization is crucial for optimizing NLP pipelines. By adjusting neural backbones to text characteristics (lengthy inputs, specialty domains, etc.), NAS and pruning approaches lay the foundation for more adaptive AutoML solutions.</p>
          <p>While significant advances have been made in both meta-feature-driven hyperparameter tuning and architecture-level customization (NAS/pruning), these areas have evolved somewhat separately. To date, there remains an absence of integrated methods that dynamically combine architecture selection with hyperparameter optimization based on explicit text dataset characteristics. Our paper directly addresses this gap by introducing a unified framework within AutoPyTorch that adapts both model architectures and hyperparameter configurations based on corpus-specific meta-features. This approach ensures that every component of the AutoML pipeline—from model structure to training parameters—is tailored specifically for the dataset at hand, leading to a more efficient and robust text-classification solution.</p>
        </sec>
      </sec>
      <sec id="sec-2">
        <title>3. Methodology</title>
        <p>Our objective is to enhance the adaptability and performance of AutoML systems for text classification by dynamically customizing both the hyperparameter search space and neural architectures based on intrinsic properties of the input dataset. We implement this within the AutoPyTorch framework, which offers modular extensibility, fine-grained pipeline control, and full support for deep learning models constructed from scratch. This flexibility is especially valuable for textual data, where architectural decisions—such as incorporating attention mechanisms or shaping MLPs—must align with dataset-specific traits like sequence length, lexical diversity, and class imbalance [<xref ref-type="bibr" rid="ref18 ref19">18, 19</xref>].</p>
        <p>Unlike other AutoML frameworks that rely on fixed pipelines or pre-trained models, our approach enables the construction of neural architectures that are directly informed by corpus-level characteristics. Prior work has shown that such dynamic, data-driven architecture generation leads to better generalization and improved performance, particularly in heterogeneous or domain-specific scenarios [<xref ref-type="bibr" rid="ref20 ref21">20, 21</xref>]. These findings motivate our design of a context-aware adaptation mechanism that leverages meta-features to steer both model configuration and training strategy during the AutoML search, effectively bridging the gap between static AutoML systems and the flexible demands of NLP tasks.</p>
        <sec id="sec-2-1">
          <title>3.1. Text-Level Meta-Feature Extraction</title>
          <p>To support both hyperparameter configuration and model architecture design (e.g., number of neurons and layers), we extract a comprehensive set of text-level meta-features using an enhanced analysis function. These include:</p>
          <sec id="sec-2-1-1">
            <title>3.1.1. Text Length</title>
            <p>Text length is a critical meta-feature in NLP that impacts both architecture selection and hyperparameter configuration. Short texts (e.g., fewer than 10 tokens) lack sufficient semantic context, leading to poor model performance, as shown in McCartney et al. [<xref ref-type="bibr" rid="ref22">22</xref>]. Conversely, very long texts exceed transformer input limits (e.g., 512 tokens in BERT) and require either truncation or specialized architectures such as Hierarchical Attention Networks (HAN) [<xref ref-type="bibr" rid="ref23">23</xref>] or Longformer [<xref ref-type="bibr" rid="ref24">24</xref>].</p>
            <p>To address these issues, we compute the average and standard deviation of text length at the corpus level and incorporate them into multiple stages of the AutoML pipeline. Specifically, long average sequence lengths trigger smaller batch sizes (e.g., 8–16 for texts &gt;300 characters), shorter warm-up periods in cosine annealing schedules, and reduced learning rates to stabilize training. Additionally, we adapt the architectural shape of candidate MLP backbones: datasets with long inputs receive “long funnel” configurations to compress high-dimensional sequences, while very short texts invoke compact “diamond” shapes to avoid overfitting. High variance in the length distribution increases regularization (via dropout) to ensure generalization across variable-length inputs.</p>
            <p>This integration ensures that the AutoML system dynamically aligns model complexity and optimization behavior with the distributional characteristics of the input text, improving both efficiency and robustness in the search process.</p>
          </sec>
          <sec id="sec-2-1-2">
            <title>3.1.2. Vocabulary Richness and Lexical Diversity</title>
            <p>Vocabulary richness—commonly measured using the type-token ratio (TTR) or corpus-level approximations such as the unique-to-total word ratio—reflects the semantic complexity of a text corpus. Higher lexical diversity increases the dimensionality of the input space and often correlates with more complex linguistic structure [<xref ref-type="bibr" rid="ref25 ref26">25, 26</xref>], requiring models with greater expressive capacity. From a theoretical standpoint, diverse corpora demand models with higher VC dimension and wider hypothesis classes to capture nuanced patterns [<xref ref-type="bibr" rid="ref27">27</xref>].</p>
            <p>To account for this, our system dynamically adapts architectural complexity based on measured lexical diversity. For datasets with a high unique word ratio (e.g., &gt;0.3), we increase the number of neuron groups and expand the maximum layer width (max_units) in our text-aware MLP-based backbone, allowing the model to better capture semantic variation. Conversely, for low-diversity corpora, we reduce network width and depth to prevent overfitting. In addition, the backbone shape is adjusted: high-diversity texts favor “long funnel” architectures, while simpler datasets default to “diamond”-shaped or regular “brick-like” architectures composed of repeated modules. We also modify activation functions: when diversity is low and the default ReLU may underperform, GELU is automatically selected to improve representation power for simple patterns.</p>
            <p>These adaptations ensure that both the search space and the resulting architectures reflect the semantic variability of the input corpus, allowing the AutoML process to match model expressiveness with linguistic richness.</p>
          </sec>
          <sec id="sec-2-1-3">
            <title>3.1.3. Number of Samples</title>
            <p>The number of training examples is a fundamental meta-feature that influences model complexity, training dynamics, and generalization behavior. Small datasets tend to increase the risk of overfitting—particularly when using high-capacity neural networks—whereas large datasets enable the use of deeper models, longer training schedules, and reduced regularization. This is grounded in statistical learning theory, which links generalization error to both the size of the hypothesis class and the number of available training samples [<xref ref-type="bibr" rid="ref28">28</xref>]. Empirical studies support this connection: Domhan et al. [<xref ref-type="bibr" rid="ref29">29</xref>] and Probst et al. [<xref ref-type="bibr" rid="ref30">30</xref>] show that both training regimes and optimal hyperparameter values (e.g., learning rate, dropout) scale with dataset size.</p>
            <p>In our approach, we compute the number of training samples during meta-feature extraction and use this to adapt the AutoML search space. For datasets with fewer than 1,000 examples, we expand the dropout search space (up to 0.8), reduce learning rates, and favor simpler backbones such as narrow MLPs or shallow attention blocks. Training budgets are also capped to avoid overfitting under data scarcity. In contrast, datasets with more than 10,000 samples prompt relaxed regularization and enable higher-capacity configurations, such as increased max_units and longer training horizons. These modifications ensure that the resulting models are appropriately scaled to the statistical regime of the dataset, improving both robustness and computational efficiency.</p>
          </sec>
          <sec id="sec-2-1-4">
            <title>3.1.4. Label Distribution and Class Imbalance</title>
            <p>Imbalanced class distributions are a common challenge in text classification, where certain categories (e.g., hate speech, fraud cases) are underrepresented but critically important. When the class imbalance ratio—the proportion between the most and least frequent class—exceeds a certain threshold, classification performance for minority classes deteriorates due to model bias toward majority labels [<xref ref-type="bibr" rid="ref31 ref32">31, 32</xref>]. This bias arises from the difficulty of estimating rare class probabilities under skewed priors, which leads to inaccurate posterior approximations, especially when using symmetric loss functions such as cross-entropy.</p>
            <p>In our AutoML framework, class imbalance is measured during the meta-feature extraction phase and directly influences the search space configuration. For datasets with imbalance ratios exceeding 3.0, we expand the dropout range (e.g., up to 0.8), reduce learning rates, and increase the warm-up period in cosine learning rate schedules. These measures are designed to stabilize training under uneven gradient updates and reduce overfitting to dominant classes. Conversely, for nearly balanced datasets (imbalance ratio below 1.5), regularization is relaxed to allow more expressive learning.</p>
            <p>Although architectural constraints are not enforced strictly based on imbalance, our search space prioritizes configurations that are empirically robust to imbalance, such as residual-normalized attention layers or funnel-shaped MLPs. Together, these mechanisms enable the AutoML system to maintain balanced performance across both major and minor classes.</p>
          </sec>
        </sec>
      </sec>
      <sec id="sec-1-2">
        <title>4. Experiments</title>
        <p>To evaluate the effectiveness of our context-aware AutoML framework for text classification, we conducted comprehensive experiments on 20 diverse datasets. Our experiments were designed to compare the performance of our proposed context-aware AutoPyTorch against a strong baseline using static configurations in AutoPyTorch.</p>
        <sec id="sec-1-2-2">
          <title>4.1. Datasets</title>
          <p>We conduct experiments on a diverse collection of datasets, including a stratified 30% subset of each task from the GLUE benchmark [<xref ref-type="bibr" rid="ref33">33</xref>], a widely used evaluation suite for natural language understanding. GLUE (General Language Understanding Evaluation) consists of multiple sentence-level and sentence-pair classification tasks, covering linguistic phenomena such as entailment, paraphrase detection, sentiment analysis, and grammaticality judgment. Our subset selection balances computational feasibility and label distribution fidelity, enabling efficient neural architecture search within AutoPyTorch while maintaining representative task characteristics.</p>
          <p>In addition to GLUE, we evaluate our approach on selected Kaggle datasets that span various text classification domains (e.g., emotion detection, spam filtering), as well as two private corpora in German. These private datasets address real-world classification tasks and introduce additional linguistic and domain-specific variability, allowing us to assess the generalizability of our context-aware AutoML framework across both English and German texts. Detailed dataset statistics are provided in Table 1.</p>
        </sec>
        <p>Our search space includes two adaptive neural backbones:
• MetaMLP – a custom MLP architecture whose depth, width, and shape are dynamically adapted based on meta-features such as text length, lexical diversity, number of samples, and class imbalance.
• ContextualAttentionNet – a lightweight attention-based model built with multi-head self-attention layers, with structural parameters (e.g., number of heads, embedding dimensions) conditioned on input characteristics.</p>
        <p>These architectures were treated as a categorical hyperparameter within the AutoML pipeline, allowing the search process to explore and select the most appropriate model type using Bayesian optimization.</p>
        <p>The text-aware version of our pipeline integrates the meta-feature extraction step at the beginning of each AutoPyTorch run. The extracted corpus-level properties are then used to dynamically adapt the hyperparameter search space. Key adaptations include:
• Batch size adjustments based on average sequence length;
• Learning rate and dropout range scaling based on dataset size and class imbalance;
• Architectural shaping (e.g., diamond vs. funnel MLPs) based on input diversity and length variance.</p>
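        <p>The adaptation rules above can be sketched as simple threshold logic over hyperparameter bounds. This is a hedged illustration using plain (low, high) tuples rather than AutoPyTorch's actual configuration-space API; the default bounds and the function name are assumptions of the sketch, while the thresholds (texts &gt;300 characters, fewer than 1,000 samples, imbalance ratio &gt;3.0, dropout up to 0.8) come from the text:</p>

```python
# Sketch of rule-based search-space adaptation as described above, using
# plain (low, high) bounds instead of a real ConfigSpace object.
def adapt_search_space(meta):
    space = {                              # assumed default static bounds
        "batch_size": (16, 128),
        "learning_rate": (1e-4, 1e-2),
        "dropout": (0.0, 0.5),
    }
    if meta["avg_length"] > 300:           # long texts: smaller batches
        space["batch_size"] = (8, 16)
    if meta["n_samples"] < 1000 or meta["imbalance_ratio"] > 3.0:
        space["dropout"] = (0.0, 0.8)      # stronger regularization
        lo, hi = space["learning_rate"]
        space["learning_rate"] = (lo / 10, hi / 10)  # reduced learning rates
    return space
```

        <p>The optimizer then samples configurations only from the adapted bounds, so trials are concentrated in the region the meta-features indicate is plausible for the dataset at hand.</p>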
      </sec>
      <sec id="sec-1-3">
        <title>4.5. Results</title>
        <sec id="sec-1-3-1">
          <title>Tables 2 and 3 summarize the performance of our context</title>
          <p>aware AutoML pipeline in comparison to a static
AutoPyTorch baseline across 20 text classification datasets.
The evaluation was based on four widely used metrics:
Accuracy, F1-micro, Matthews Correlation Coeficient</p>
        </sec>
        <sec id="sec-1-3-2">
          <title>All experiments were constrained to a wall-clock time</title>
          <p>of 3,000 seconds (approx. 2 hours) and a per-model
training time of 600 seconds. We used multi-fidelity
optimization via Successive Halving with a training budget
4.2. Embedding Method ranging from 10 to 100 epochs.</p>
          <p>All runs were executed on a single NVIDIA A100 GPU
aFmtoeordcaeollnlfterexoxpmteuratihlmiezeeSndetntset,xewnteceemuTbsreeaddndstfihnoergmsa.leTrlsh-elMibmirnoairdyLeMlt-oenLgc6eo-ndveer2s- lfomaatcinhginpeowinitthpr4e0ciGsiBono. f memory, using standard 32-bit
each input text into a fixed-size dense vector of 384
dimensions. 4.4. Baseline Configuration</p>
          <p>To ensure a controlled comparison between the
baseline and our proposed method, the embedding layer was
kept identical across all experimental conditions.</p>
        </sec>
        <sec id="sec-1-3-3">
          <title>To establish a fair comparison, we define a strong baseline</title>
          <p>using the unmodified AutoPyTorch framework and the
same text embedding method (MiniLM). In this setting,
the hyperparameter search space remains static and is
not influenced by any dataset-specific meta-features.</p>
        </sec>
      </sec>
      <sec id="sec-1-4">
        <title>4.3. Model Framework &amp; Search Strategy</title>
        <p>We implemented all experiments using the AutoPyTorch
framework, leveraging its modular design for deep
learning pipelines and extensible search space control. To
ensure focus on neural architecture optimization, the
traditional machine learning components (e.g., random
forests, SVMs) were disabled for our approach. Only deep
learning backbones were allowed in the search space.
</p>
        <p>These metrics were selected to reflect the characteristics of each task: for example, Accuracy for balanced classification tasks, F1-micro for imbalanced or multi-class problems, MCC for binary grammaticality judgments (e.g., CoLA), and Pearson correlation for sentence similarity (STS-B). Notably, these are also the official evaluation metrics adopted in the GLUE benchmark [33], ensuring compatibility and comparability with prior NLP research.</p>
        <table-wrap id="tbl2">
          <label>Table 2</label>
          <caption>
            <p>Comparison between Baseline and Custom Hyperparameter Search across GLUE datasets (Accuracy/F1-Micro/MCC/Pearson, depending on the task).</p>
          </caption>
          <table>
            <thead>
              <tr><th>Dataset</th><th>Baseline</th><th>Custom</th></tr>
            </thead>
            <tbody>
              <tr><td>CoLA</td><td>0.05</td><td>-0.2</td></tr>
              <tr><td>MRPC</td><td>72.7%</td><td>74.13%</td></tr>
              <tr><td>QNLI</td><td>53.3%</td><td>50.3%</td></tr>
              <tr><td>QQP</td><td>49.17%</td><td>49.17%</td></tr>
              <tr><td>RTE</td><td>52.8%</td><td>53.5%</td></tr>
              <tr><td>SST-2</td><td>2.1%</td><td>83.4%</td></tr>
              <tr><td>STS-B</td><td>24.7%</td><td>30.72%</td></tr>
              <tr><td>WNLI</td><td>17.6%</td><td>54.9%</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <table-wrap id="tbl3">
          <label>Table 3</label>
          <caption>
            <p>Prediction performance comparison between Baseline and Custom Hyperparameter Search (Accuracy/F1-Micro, %).</p>
          </caption>
          <table>
            <thead>
              <tr><th>Dataset</th><th>Baseline</th><th>Custom</th></tr>
            </thead>
            <tbody>
              <tr><td>Occupation</td><td>77.9</td><td>77.3</td></tr>
              <tr><td>BBC</td><td>97.3</td><td>96.85</td></tr>
              <tr><td>Cyber</td><td>76.57</td><td>77.2</td></tr>
              <tr><td>Emails</td><td>66.9</td><td>76.15</td></tr>
              <tr><td>Emotion</td><td>87.1</td><td>88.3</td></tr>
              <tr><td>Framing</td><td>69.1</td><td>71.3</td></tr>
              <tr><td>Humor</td><td>91.8</td><td>92.87</td></tr>
              <tr><td>Spam</td><td>96.8</td><td>96.9</td></tr>
              <tr><td>Math</td><td>15.14</td><td>16.3</td></tr>
              <tr><td>Job</td><td>96.46</td><td>96.27</td></tr>
              <tr><td>Finished Sentence</td><td>79.43</td><td>81.12</td></tr>
              <tr><td>Troll</td><td>52.88</td><td>56.73</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>Overall, our method demonstrates consistent improvements, particularly on tasks characterized by limited training data, class imbalance, or high lexical diversity.</p>
        <p>On the GLUE benchmark, our pipeline yields significant gains on several tasks. Notably, WNLI accuracy increases from 17.6% to 54.9%, and SST-2 sees a dramatic rise from 2.1% to 83.4%. These results highlight the effectiveness of our adaptive architecture and regularization mechanisms in low-resource and sentiment-sensitive tasks. We also observe improvements in STS-B, where the Pearson correlation increases from 24.7% to 30.7%.</p>
        <p>Conversely, slight performance drops are observed in QNLI and CoLA, which may suggest that the current adaptation strategy occasionally introduces suboptimal regularization or architectural choices. For QQP, both the baseline and our pipeline failed to build a viable neural or ensemble model within the computational budget, resulting in fallback to a dummy classifier.</p>
        <p>Our pipeline also outperforms the baseline on the majority of Kaggle and private datasets. Substantial gains are observed on Emails (from 66.9% to 76.2%), Troll (from 52.9% to 56.7%), and Finished Sentence (from 79.4% to 81.1%), indicating that context-aware adaptation improves performance in tasks with either noisy data or subtle class distinctions. A slight performance decrease is observed on a few tasks, such as BBC and Occupation. In the case of Occupation, the drop in performance can be attributed to the nature of the dataset: it consists of short, open-ended answers to questions about a person’s job, which are then mapped to the top-level labels of the German occupation classification system (KldB). These free-text responses are often terse (e.g., one- or two-word entries like “Technician” or “Sales”) and lack sufficient contextual information to support nuanced classification. As a result, the dynamic adaptation mechanisms, which are designed to adjust architectures and hyperparameters based on richer linguistic cues, have limited room to operate effectively. The scarcity of semantic context may also hinder the effectiveness of embeddings and prevent the model from learning discriminative patterns across fine-grained occupational categories.</p>
        <p>Overall, the results support the utility of dynamic search space tailoring across varied domains and textual characteristics.</p>
        <p><bold>5. Conclusion and Future Works</bold></p>
        <p>In this work, we proposed a context-aware AutoML framework that dynamically adapts the hyperparameter search space and neural architecture configurations in response to corpus-level text features. Implemented within the AutoPyTorch ecosystem, our approach integrates dataset-driven meta-feature extraction with a modular design for backbone selection and training parameter control. Experiments across 20 datasets, including subsets of GLUE and diverse public corpora, demonstrate consistent improvements in classification performance, particularly in scenarios with imbalanced classes, small training sets, or high lexical diversity.</p>
        <p>By coupling structural and optimization-level decisions to dataset-specific traits, our framework offers a promising direction for more efficient and effective AutoML in NLP. The results validate that even lightweight corpus features (e.g., text length, label imbalance) can yield meaningful adaptations to both model topology and hyperparameter scheduling. Note, however, that our current evaluation focuses solely on classification tasks using moderately sized monolingual datasets; the applicability of our approach to large-scale corpora, multilingual benchmarks, or more complex NLP tasks (e.g., sequence labeling or generation) remains unexplored.</p>
        <p>While our method demonstrates strong empirical gains, several important avenues remain for future research.</p>
        <p>First, our current system relies on a limited set of meta-features, such as average text length, vocabulary diversity, and class imbalance. In future work, we aim to extend this analysis to include finer-grained linguistic and structural features such as average sentence length, part-of-speech density, punctuation density, unique character ratio, and readability scores. These features may offer deeper insight into the semantic and syntactic complexity of text, enabling more informed search space adjustments.</p>
        <p>Second, while we implement a contextual search space
by mapping meta-features to hyperparameter ranges, this
process currently uses static, hand-crafted rules. More
expressive and structured search spaces could allow
hyperparameter relevance and conditionality to adapt
dynamically based on dataset characteristics. For instance,
certain architecture components or regularization
parameters could be activated only when specific linguistic
conditions are met, allowing for more flexible and
principled adaptation.</p>
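        <p>Such conditional activation can be sketched in a toy form as follows; the conditions, thresholds, and hyperparameter names are hypothetical illustrations (AutoPyTorch itself expresses conditionality through its ConfigSpace-based search space definitions):</p>
        <preformat>
```python
# Toy sketch of a conditional search space: a hyperparameter only becomes
# part of the space when a linguistic condition on the dataset holds.
# All names and thresholds here are illustrative, not our actual rules.

def build_space(meta):
    space = {"learning_rate": (1e-5, 1e-2)}
    # Attention pooling is only considered for long inputs.
    if meta["avg_len"] > 64:
        space["attention_heads"] = [2, 4, 8]
    # Label smoothing is only searched when classes are imbalanced.
    if meta["imbalance"] > 2.0:
        space["label_smoothing"] = (0.0, 0.2)
    return space
```
        </preformat>
        <p>The optimizer then only spends budget on a hyperparameter when the corresponding linguistic condition is met.</p>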
        <p>Finally, our current fusion strategy for resolving conflicts between feature influences on the same hyperparameter is based on simple heuristics, such as averaging
suggested values or intersecting ranges. In future work,
we plan to investigate more flexible fusion mechanisms,
such as weighting meta-features by importance or
learning fusion policies from prior task performance. These
improvements could make the contextual adaptation
process more scalable, robust, and interpretable across a
wide range of text classification tasks.</p>
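        <p>The two heuristics mentioned above can be sketched as follows; the function names and the fallback rule for non-overlapping ranges are our own illustrative choices, not the exact rules of our implementation:</p>
        <preformat>
```python
# Illustrative fusion heuristics: when several meta-features suggest
# different settings for one hyperparameter, intersect suggested ranges
# and average suggested point values. Names and fallback are hypothetical.

def fuse_ranges(ranges):
    # Intersection of (lo, hi) suggestions; falls back to averaging the
    # bounds when the suggested ranges do not overlap.
    lo = max(r[0] for r in ranges)
    hi = min(r[1] for r in ranges)
    if lo > hi:
        lo = sum(r[0] for r in ranges) / len(ranges)
        hi = sum(r[1] for r in ranges) / len(ranges)
    return (lo, hi)

def fuse_values(values):
    # Point-value suggestions are simply averaged.
    return sum(values) / len(values)
```
        </preformat>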
      </sec>
    </sec>
    <sec id="sec-2">
      <title>6. Limitations</title>
      <sec id="sec-2-1">
        <p>Despite the overall effectiveness of our context-aware search space design, several limitations remain.</p>
        <p>First, while our system considers multiple meta-features to guide hyperparameter configurations, their influence is combined using static heuristics. This rule-based fusion does not account for potential interactions or conflicts between features, and lacks the flexibility to adapt based on task-specific dynamics or prior performance.</p>
        <p>Second, the increased complexity introduced by text-specific search space customization results in higher computational cost. In most cases, we observed longer processing times due to the additional overhead from meta-feature analysis, search space updates, and more expansive architecture evaluations. This may limit the method’s applicability in time-constrained or resource-limited environments.</p>
        <p>[12] C. Thornton, F. Hutter, H. H. Hoos, K. Leyton-Brown, Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms, in: Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, 2013, pp. 847–855.</p>
        <p>[13] M. Feurer, F. Hutter, Automated machine learning, Cham: Springer (2019) 113–134.</p>
        <p>[14] M. Wistuba, N. Schilling, L. Schmidt-Thieme, Hyperparameter search space pruning – a new component for sequential model-based hyperparameter optimization, in: Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2015, Porto, Portugal, September 7–11, 2015, Proceedings, Part II 15, Springer, 2015, pp. 104–119.</p>
        <p>[15] E. Öztürk, F. Ferreira, H. Jomaa, L. Schmidt-Thieme, J. Grabocka, F. Hutter, Zero-shot AutoML with pretrained models, in: International Conference on Machine Learning, PMLR, 2022, pp. 17138–17155.</p>
        <p>[16] Y. Wang, Y. Yang, Y. Chen, J. Bai, C. Zhang, G. Su, X. Kou, Y. Tong, M. Yang, L. Zhou, TextNAS: A neural architecture search space tailored for text representation, in: Proceedings of the AAAI conference on artificial intelligence, volume 34, 2020, pp. 9242–9249.</p>
        <p>[17] J. Xu, X. Tan, R. Luo, K. Song, J. Li, T. Qin, T.-Y. Liu, NAS-BERT: Task-agnostic and adaptive-size BERT compression with neural architecture search, in: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery &amp; Data Mining, 2021, pp. 1933–1943.</p>
        <p>[18] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), 2019, pp. 4171–4186.</p>
        <p>[19] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, Q. V. Le, XLNet: Generalized autoregressive pretraining for language understanding, Advances in neural information processing systems 32 (2019).</p>
        <p>[20] L. Zimmer, M. Lindauer, F. Hutter, Auto-PyTorch: Multi-fidelity metalearning for efficient and robust AutoDL, IEEE transactions on pattern analysis and machine intelligence 43 (2021) 3079–3090.</p>
        <p>[21] Y. Li, Y. Shen, W. Zhang, Y. Chen, H. Jiang, M. Liu, J. Jiang, J. Gao, W. Wu, Z. Yang, et al., OpenBox: A generalized black-box optimization service, in: Proceedings of the 27th ACM SIGKDD conference on knowledge discovery &amp; data mining, 2021, pp. 3209–3219.</p>
        <p>[22] A. McCartney, S. Hensman, L. Longo, How short is a piece of string?: the impact of text length and text augmentation on short-text classification accuracy (2017).</p>
        <p>[23] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, E. Hovy, Hierarchical attention networks for document classification, in: Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: human language technologies, 2016, pp. 1480–1489.</p>
        <p>[24] I. Beltagy, M. E. Peters, A. Cohan, Longformer: The long-document transformer, arXiv preprint arXiv:2004.05150 (2020).</p>
        <p>[25] M. Monteiro, C. K. James, M. Kloft, S. Fellenz, Characterizing text datasets with psycholinguistic features, in: Findings of the Association for Computational Linguistics: EMNLP 2024, 2024, pp. 14977–14990.</p>
        <p>[26] M. Sokolova, Big text advantages and challenges: classification perspective, International Journal of Data Science and Analytics 5 (2018) 1–10.</p>
        <p>[27] C. Zhang, S. Bengio, M. Hardt, B. Recht, O. Vinyals, Understanding deep learning requires rethinking generalization, arXiv preprint arXiv:1611.03530 (2016).</p>
        <p>[28] V. Vapnik, Statistical learning theory, 1998.</p>
        <p>[29] T. Domhan, J. T. Springenberg, F. Hutter, Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves, in: IJCAI, volume 15, 2015, pp. 3460–3468.</p>
        <p>[30] P. Probst, M. N. Wright, A.-L. Boulesteix, Hyperparameters and tuning strategies for random forest, Wiley Interdisciplinary Reviews: data mining and knowledge discovery 9 (2019) e1301.</p>
        <p>[31] J. L. Leevy, T. M. Khoshgoftaar, R. A. Bauder, N. Seliya, A survey on addressing high-class imbalance in big data, Journal of Big Data 5 (2018) 1–30.</p>
        <p>[32] S. Uddin, H. Lu, Dataset meta-level and statistical features affect machine learning performance, Scientific Reports 14 (2024) 1670.</p>
        <p>[33] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, S. R. Bowman, GLUE: A multi-task benchmark and analysis platform for natural language understanding, arXiv preprint arXiv:1804.07461 (2018).</p>
        <p><bold>Declaration on Generative AI.</bold> During the preparation of this work, the authors used ChatGPT (OpenAI) and Grammarly to improve writing style and to check grammar and spelling. After using these tools, the authors reviewed and edited the content as needed and take full responsibility for the publication’s content.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>W.</given-names>
            <surname>Lam</surname>
          </string-name>
          , K.-
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lai</surname>
          </string-name>
          ,
          <article-title>A meta-learning approach for text categorization</article-title>
          ,
          <source>in: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval</source>
          ,
          <year>2001</year>
          , pp.
          <fpage>303</fpage>
          -
          <lpage>309</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J. G.</given-names>
            <surname>Madrid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. J.</given-names>
            <surname>Escalante</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Morales</surname>
          </string-name>
          ,
          <article-title>Metalearning of textual representations</article-title>
          ,
          <source>in: Joint European Conference on Machine Learning and Knowledge Discovery in Databases</source>
          , Springer,
          <year>2019</year>
          , pp.
          <fpage>57</fpage>
          -
          <lpage>67</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Chollet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <article-title>Autokeras: An automl library for deep learning</article-title>
          ,
          <source>Journal of machine Learning research 24</source>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>X.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mueller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Erickson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Smola</surname>
          </string-name>
          ,
          <article-title>Benchmarking multimodal automl for tabular data with text fields</article-title>
          ,
          <source>arXiv preprint arXiv:2111.02705</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.-A.</given-names>
            <surname>Zöller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. F.</given-names>
            <surname>Huber</surname>
          </string-name>
          ,
          <article-title>Benchmark and survey of automated machine learning frameworks</article-title>
          ,
          <source>Journal of artificial intelligence research 70</source>
          (
          <year>2021</year>
          )
          <fpage>409</fpage>
          -
          <lpage>472</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Saleem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kumarapathirage</surname>
          </string-name>
          ,
          <article-title>A systematic review of automl for text classification: From theory to practice</article-title>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>P.</given-names>
            <surname>Safikhani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Broneske</surname>
          </string-name>
          ,
          <article-title>Enhancing autonlp with fine-tuned bert models: an evaluation of text representation methods for autopytorch</article-title>
          ,
          <source>Available at SSRN</source>
          <volume>4585459</volume>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P.</given-names>
            <surname>Safikhani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Broneske</surname>
          </string-name>
          ,
          <article-title>Automl meets hugging face: Domain-aware pretrained model selection for text classification</article-title>
          ,
          <source>in: Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)</source>
          ,
          <year>2025</year>
          , pp.
          <fpage>466</fpage>
          -
          <lpage>473</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hoskens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-F.</given-names>
            <surname>Moens</surname>
          </string-name>
          ,
          <article-title>Evolutionary learning of meta-rules for text classification</article-title>
          ,
          <source>in: Proceedings of the Genetic and Evolutionary Computation Conference Companion</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>131</fpage>
          -
          <lpage>132</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Ferreira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Brazdil</surname>
          </string-name>
          ,
          <article-title>Workflow recommendation for text classification with active testing method</article-title>
          ,
          <source>in: Workshop AutoML</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>R.</given-names>
            <surname>Desai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kothari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Surve</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shekokar</surname>
          </string-name>
          ,
          <article-title>Textbrew: Automated model selection and hyperparameter optimization for text classification</article-title>
          ,
          <source>International Journal of Advanced Computer Science and Applications 13</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>