<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Data Preprocessing for Clustering: Challenges and Future Directions</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Leonard Labes</string-name>
          <email>leonard.labes@ipvs.uni-stuttgart.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Data Preprocessing, Automated Preprocessing, Clustering, Pipeline Optimization, AutoML</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute for Parallel and Distributed Systems, University of Stuttgart</institution>
          ,
          <addr-line>Universitätsstr. 38, Stuttgart</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This paper addresses the challenge of automating data preprocessing for clustering within unsupervised machine learning. A structured conceptual process is proposed, including imperfection diagnosis, candidate pipeline generation, quality estimation, selection and refinement, and result presentation. Through systematic identification and discussion, significant challenges such as diverse preprocessing techniques, combinatorial search space explosion, the selection of appropriate evaluation metrics, effective refinement strategies, and the need for visualization and explainability are highlighted. An assessment of existing approaches reveals research gaps, showing that current solutions only partially address these challenges. This work outlines promising directions for future research, aiming towards comprehensive, robust, efficient, and explainable automated preprocessing to enhance the effectiveness of clustering analyses.</p>
      </abstract>
      <kwd-group>
        <kwd>Data Preprocessing</kwd>
        <kwd>Automated Preprocessing</kwd>
        <kwd>Clustering</kwd>
        <kwd>Pipeline Optimization</kwd>
        <kwd>AutoML</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Clustering is a widely used technique for discovering patterns in unlabeled data, thereby supporting
decision-making in various fields, including biology, computer vision, marketing, and many others [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
In recent decades, researchers have proposed a wide range of clustering algorithms, each addressing
different aspects of similarity, cluster shapes, scalability, or improved clustering performance in specific
domains [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. These algorithms are typically developed and evaluated on synthetic or selected
benchmark datasets that have been cleaned or fitted to the algorithm’s assumptions. In contrast,
real-world data often exhibits characteristics that standard algorithms cannot handle directly, or the search
for a suitable algorithm is non-trivial. For example, most clustering algorithms cannot be applied when
missing values are present. Furthermore, feature scales that differ by orders of magnitude, variables of
different types (e.g., numerical and categorical), or high-dimensional spaces can significantly reduce
overall cluster quality [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ].
      </p>
      <p>
        To address these characteristics and obtain good clustering results, the data must first be preprocessed
(e.g., by imputing missing values), an important part of the KDD process [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and a step that has repeatedly
been shown to improve clustering performance [
        <xref ref-type="bibr" rid="ref3 ref4 ref6">3, 4, 6</xref>
        ]. Depending on the experience and domain
knowledge of the data analyst, identifying relevant data characteristics and defining a suitable
preprocessing pipeline can be a challenging and time-consuming task. In the context of
clustering analyses, analysts require guidance in this process.
      </p>
      <p>
        In recent years, such guidance has been studied as AutoML [
        <xref ref-type="bibr" rid="ref10 ref7 ref8 ref9">7, 8, 9, 10</xref>
        ]. In these approaches, the
goal is to support data analysts in finding appropriate clustering algorithms and their parameters. In a
similar way, automating data preprocessing in analytics pipelines would support data analysts in their
decision-making and reduce the trial-and-error search they typically face. However, current AutoML
approaches for unsupervised learning often neglect data preprocessing and do not address the additional
or different challenges that occur in this context.
      </p>
      <fig id="fig1">
        <caption>
          <p>The five-step process for automated preprocessing: (1) Imperfection Diagnosis, (2) Candidate Generation, (3) Quality Estimation (score pipelines based on CVIs), (4) Selection and Refinement (iterate back to Step 3), and (5) Present the Results.</p>
        </caption>
      </fig>
      <p>While an automated approach for preprocessing may identify some characteristics, such as missing
values, relatively easily, more complex decisions, such as determining a suitable target dimension
for a dimensionality reduction algorithm, are difficult to define automatically. Hence, identifying the
underlying data characteristics is a primary challenge for an automated approach.</p>
      <p>Even with the characteristics identified, an automated approach needs to include one or several
appropriate methods to clean the dataset. Often, there are multiple methods available for the same
purpose, e.g., multiple dimensionality reduction methods. These methods have parameters and may be
interdependent. So, defining a search space and selecting a suitable set of preprocessing steps, as well
as their order and parameters, is a challenging task.</p>
      <p>
        A further challenge is based on the fact that clustering is inherently unsupervised. Therefore,
no ground truth is available, and internal Cluster Validity Indices (CVIs) are needed to evaluate the
clustering results. However, no single metric is always the right choice [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], making it challenging to
determine whether a preprocessing pipeline actually improves the quality of the clustering.
      </p>
      <p>Lastly, practical constraints such as runtime and scalability need to be factored in. High-quality
clustering results are not the only consideration: they must be balanced with efficiency,
especially when working with large or complex datasets. All of these uncertainties make
the automation of the preprocessing phase challenging for unsupervised learning.</p>
      <p>To make important challenges of automated preprocessing for unsupervised clustering explicit, this
paper (i) formulates a five-stage conceptual process covering imperfection diagnosis through result
presentation, (ii) distills five overarching research challenges from this process, (iii) compares existing
work against these challenges, and (iv) sketches directions for future solutions.</p>
      <p>The remainder of this paper is structured as follows. Section 2 describes the proposed process, outlines
what an automated approach needs to handle, and identifies challenges for actual implementations. In
Section 3, existing solutions are assessed with respect to these challenges to illustrate the state of the
art in this field. Section 4 outlines future directions and describes general techniques that can be used
to further develop automated preprocessing for clustering. Finally, Section 5 concludes this work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Conceptual Process for Automated Preprocessing</title>
      <p>The proposed process is divided into five sequential steps, as shown in Figure 1. Starting with the
diagnosis of data imperfections, we proceed to generating candidates for preprocessing pipelines,
followed by quality estimation, selection, and refinement stages, and finally presenting the results.
In contrast to the strict sequence of Figure 1, existing work may repeat steps differently or several times, or merge or
split them. However, this process provides a mental model for better defining the challenges of an
automated approach.</p>
      <sec id="sec-2-1">
        <title>2.1. Step 1: Imperfection Diagnosis</title>
        <p>The initial step is to identify the relevant data characteristics. This step should automatically inspect
the raw data and report any existing problems. Examples that could be covered include
missing-value handling, outlier detection, feature scaling, duplicate-feature removal, and transformations. It is
essential to identify such potential issues, as without this knowledge, the subsequent steps of the process
become significantly more challenging. The complexity of this step depends on the data characteristics
that need to be identified and can therefore range from a straightforward detection mechanism to
complex data analysis.</p>
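        <p>As an illustration, a minimal rule-based diagnosis could look like the following Python sketch (assuming pandas and NumPy; the thresholds and the IQR rule are illustrative assumptions, not prescribed by the process):</p>
        <preformat>
# Minimal rule-based imperfection diagnosis (sketch); thresholds are
# illustrative assumptions.
import numpy as np
import pandas as pd

def diagnose(df: pd.DataFrame) -> dict:
    report = {}
    # Missing values: a simple rule check suffices.
    report["missing_values"] = df.columns[df.isna().any()].tolist()
    num = df.select_dtypes(include=np.number)
    # Feature scales differing by orders of magnitude.
    ranges = num.max() - num.min()
    if len(ranges) > 1 and ranges.min() > 0:
        report["scale_disparity"] = bool(ranges.max() / ranges.min() > 100)
    # Outliers via the IQR rule (one of many possible detectors).
    q1, q3 = num.quantile(0.25), num.quantile(0.75)
    iqr = q3 - q1
    outliers = (num.lt(q1 - 1.5 * iqr) | num.gt(q3 + 1.5 * iqr)).any(axis=1)
    report["outlier_rows"] = int(outliers.sum())
    # Duplicate features.
    report["duplicate_features"] = num.columns[num.T.duplicated()].tolist()
    return report
        </preformat>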
        <p>Challenges: The primary challenge for the identification of all data characteristics is that each
characteristic requires a different detection method. Simple general rules can detect certain characteristics
in all kinds of data, but more advanced methods often depend on the domain, structure, or distribution
of the data. Examples of rule checks include the detection of missing values, whereas detecting outliers
is a more complex task that may involve statistical measures.</p>
        <p>
          Furthermore, the features to treat and those not to treat may be problem-dependent, as there may
be cases where outliers should be retained, e.g., if they are interesting for machine learning tasks. In
other cases, outliers reduce the clustering quality and should therefore be removed. Lastly, there are
many options to detect characteristics. For outlier detection alone, various methods exist, and the
results produced vary depending on the method and parameter choices [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. Therefore, the diversity of
available methods and parameter settings yields a large search space of detection techniques, which is
difficult to constrain given the challenges mentioned above.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Step 2: Candidate Generation</title>
        <p>The goal of this step is to provide a first candidate set of preprocessing pipelines that could address the
characteristics identified in Step 1. For example, if missing values are present and possible outliers are
detected, the set of candidates includes multiple strategies for imputing missing values or removing
outliers. Such candidates can range from a single method that resolves one problem, like removing
outliers, to a multi-step pipeline that addresses several characteristics consecutively, e.g., first imputing
missing values and then scaling the features.</p>
        <p>Challenges: The main challenge in this step is to balance the pipelines in terms of diversity,
coverage of the identified characteristics, efficiency, and applicability, thus shaping the search space. To
make this space manageable, constraints must be introduced, such as limits on pipeline length, rules on how often a
method may recur, and explicit exclusions of incompatible method combinations.
With these constraints defined, what is needed is a candidate set that is broad enough to cover every relevant characteristic,
small enough to keep the downstream evaluation efficient, and diverse enough to leave multiple options
for later improvement instead of leading the optimization into a single local optimum of the space.
Finding the right balance among breadth, compactness, and diversity while respecting the
defined constraints remains an open question.</p>
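        <p>A minimal sketch of such constrained candidate generation (the method names, the issue mapping, and the incompatibility rule are illustrative assumptions):</p>
        <preformat>
# Constrained candidate generation (sketch); all names are illustrative.
from itertools import permutations

METHODS = {  # method -> characteristic it addresses
    "impute_mean": "missing_values",
    "impute_knn": "missing_values",
    "minmax_scale": "scale_disparity",
    "standard_scale": "scale_disparity",
    "iqr_outlier_removal": "outliers",
}
INCOMPATIBLE = {("minmax_scale", "standard_scale")}  # never combine scalers

def candidates(detected, max_len=3):
    relevant = [m for m, issue in METHODS.items() if issue in detected]
    for length in range(1, max_len + 1):
        for pipe in permutations(relevant, length):  # no method recurs
            pairs = [(a, b) for i, a in enumerate(pipe) for b in pipe[i + 1:]]
            if any(p in INCOMPATIBLE or p[::-1] in INCOMPATIBLE for p in pairs):
                continue
            yield pipe

# Issues found in Step 1 drive which pipelines are generated.
for pipe in candidates({"missing_values", "scale_disparity"}, max_len=2):
    print(pipe)
        </preformat>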
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Step 3: Quality Estimation</title>
        <p>
          In Step 3, the candidates are evaluated, i.e., the improvement of the cluster quality is assessed. Because
clustering tasks have no ground truth, internal CVIs are used to validate the clustering result [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
        </p>
        <p>
          Challenges: The core challenge here is to assess the quality using internal CVIs. Internal CVIs
measure cluster quality differently, and when choosing one, such as the silhouette score [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] or the
Davies-Bouldin Index [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], the risk exists that the preprocessing is optimized towards this heuristic
metric rather than the actual clustering quality. At the same time, an exhaustive evaluation with
multiple CVIs might be cost-intensive.
        </p>
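        <p>For illustration, scoring one candidate with two internal CVIs could look as follows (a sketch using scikit-learn; KMeans, StandardScaler, and the synthetic data stand in for an arbitrary pipeline and dataset):</p>
        <preformat>
# Scoring a candidate pipeline with internal CVIs (sketch).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
X_pre = StandardScaler().fit_transform(X)  # the candidate pipeline
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_pre)

# Higher silhouette is better, lower Davies-Bouldin is better, so the
# two indices cannot simply be averaged.
print("silhouette:", silhouette_score(X_pre, labels))
print("davies-bouldin:", davies_bouldin_score(X_pre, labels))
        </preformat>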
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Step 4: Selection and Refinement</title>
        <p>Next, the best-performing pipelines need to be selected, considering the results from the previous
quality estimation step. While the focus of Step 3 is only on quality, this step may also consider different
aspects, such as the efficiency of the pipeline. Depending on the specific approach, the best-performing
pipelines are then either refined and the process goes back to Step 3 for further quality assessment
or the refinement part is finished and the process continues with Step 5 (cf. Figure 1). The goal of
the refinement is to improve the already promising pipelines. The details of how this is done depend
on the specific approach, but it typically involves generating new pipelines by modifying the existing
ones. Examples of this include adding, removing, or replacing methods, changing parameter values, or
altering the sequence of the pipeline.</p>
        <p>
          Challenges: The difficulty in this step is steering the search towards improved clustering quality
without refining a vast number of candidates. Intuitively, pipelines that indicate promising
improvements from the previous step should be considered for further refinement. However, an approach
should still consider new or more drastic changes to avoid being trapped in a local optimum. This case is
typically described as an exploration versus exploitation dilemma [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] and is challenging to balance.
Multi-objective trade-offs, such as balancing quality and runtime, further complicate the selection. Deciding
how to weigh runtime against quality is far from clear, and it may be use-case dependent. Consequently,
the refinement stage must explore an even larger search space of possible pipeline configurations than
the initial candidate generation step (Step 2). To keep this search space from becoming too large, it is critical
to decide how many best-performing pipelines should be considered in this step.
        </p>
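        <p>A minimal sketch of such a refinement step (the method pool and the uniform choice of mutation are illustrative assumptions; parameter-value changes would work analogously):</p>
        <preformat>
# Refining a promising pipeline by mutation (sketch).
import random

METHOD_POOL = ["impute_mean", "impute_knn", "minmax_scale",
               "standard_scale", "iqr_outlier_removal"]

def mutate(pipeline, max_len=4):
    """Return a modified copy: add, remove, or replace a method,
    or alter the sequence, i.e., the operations named above."""
    pipe = list(pipeline)
    op = random.choice(["add", "remove", "replace", "reorder"])
    if op == "add" and max_len > len(pipe):
        pipe.insert(random.randrange(len(pipe) + 1), random.choice(METHOD_POOL))
    elif op == "remove" and len(pipe) > 1:
        pipe.pop(random.randrange(len(pipe)))
    elif op == "replace":
        pipe[random.randrange(len(pipe))] = random.choice(METHOD_POOL)
    elif op == "reorder" and len(pipe) > 1:
        random.shuffle(pipe)
    return pipe

print(mutate(["impute_mean", "standard_scale"]))
        </preformat>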
      </sec>
      <sec id="sec-2-5">
        <title>2.5. Step 5: Present the Results</title>
        <p>This step concludes the process and consists mainly of two parts. The first part presents the user
with the final result of the clustering, which can range from displaying the achieved improvement and
final scores to a visualization of the identified clusters. Secondly, explainability is essential. If such an
automated preprocessing approach is applied in the real world, it is not only interesting to see what
scores were achieved, but also how they were achieved. The explainability should include the identified
characteristics from Step 1 and the applied transformations during preprocessing.</p>
        <p>
Challenges: This step faces two challenges. The first is visualizing the final result to the user. Visualization
of results is a general challenge in machine learning [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], but it becomes more complicated when
multiple quality measures are considered. Showing only one CVI or an oversimplified version of
the data may hide important information. The second is explainability. Telling the user
why and how the improvement has been achieved in a comprehensive but understandable way is not
straightforward.
        </p>
      </sec>
      <sec id="sec-2-6">
        <title>2.6. Summary and Derived Challenges</title>
        <p>From the challenges described for each step, five main challenges were identified to summarize the
process and ease comparison with existing work. Since some challenges can occur in more than one
step, Table 1 provides an overview of which challenges are relevant for which step. An automated
preprocessing solution for clustering should solve:</p>
        <p>C1 Diverse Characteristic Detection and Preprocessing Techniques (Steps 1, 2, and 4):
Automated preprocessing must detect a wide range of data characteristics (e.g., missing values, outliers,
skewness) using different methods for each type of characteristic. Each issue may require specialized
detection logic, and the domain or task context can influence whether an imperfection should be
corrected or left in place. This heterogeneity makes a comprehensive diagnosis difficult. The same holds
for preprocessing. A diverse set of techniques must be available to clean the detected characteristics.</p>
        <p>C2 Search Space Explosion (Steps 1, 2, and 4): There are many possible methods and parameter
settings for detecting data issues, leading to an enormous combinatorial search space. For instance,
dozens of outlier-detection algorithms exist, and each may yield different results. These options create
a huge search space of detection techniques.</p>
        <p>Exhaustively defining and exploring this space is infeasible and complicates the diagnosis step.
Furthermore, generating and refining candidate preprocessing pipelines requires a balance between
diversity and tractability. The candidate set must cover all identified data issues in multiple ways, yet
remain small enough to evaluate later. In practice, generating many diverse pipelines can improve the
chances of finding a good solution, but it also risks a combinatorial explosion and high computational
cost. To illustrate this, consider a search space with only three methods, each with two parameters,
where each parameter can take five discrete values: this gives 3 × 5² = 75 choices per step. With
repeats allowed and a maximal pipeline length of three, there are 75 + 75² + 75³ = 427,575
possible candidates, and usually there are more methods, more parameters, and more parameter values.</p>
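        <p>The count can be reproduced directly (a sketch of the arithmetic above):</p>
        <preformat>
# 3 methods, each with 2 parameters of 5 values: 3 * 5**2 = 75 choices
# per step; repeats allowed, pipeline length 1 to 3.
choices_per_step = 3 * 5**2                          # 75
total = sum(choices_per_step**l for l in range(1, 4))
print(total)                                         # 427575
        </preformat>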
        <p>
          C3 Evaluation Metric (Step 3): Without ground truth (especially in clustering tasks), quality must
be assessed by internal CVIs. Relying on a single index (e.g., Silhouette Score [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]) risks optimizing
preprocessing towards that surrogate, rather than true clustering quality. Evaluating multiple CVIs can
mitigate this bias, but it is computationally expensive. Balancing metric choice and evaluation cost is,
therefore, a challenge.
        </p>
        <p>
          C4 Refinement Search Strategy (Step 4): After selecting promising pipelines, the approach
must decide how to refine them. Creating a detailed approach that covers many possibilities is very
challenging. Furthermore, there is an exploration-exploitation dilemma [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]: one can either make small
(exploitative) tweaks to top pipelines or try entirely new pipeline variations. Finding the right balance
is difficult. At the same time, multi-objective trade-offs (improving quality vs. keeping runtime low)
further complicate the selection of which pipelines to refine.
        </p>
        <p>C5 Visualization and Explainability (Step 5): The final preprocessing results must be conveyed
effectively to users, often involving multiple quality measures. Visualizing improvements and clusterings
is non-trivial, especially when showing more than one metric. An oversimplified view (e.g., a single
CVI) may hide important details. Designing visual summaries that capture multi-criteria results without
overwhelming the user is a significant challenge. Additionally, the user needs to see not only whether
there is an improvement, but also why. This should include answers to the questions of why a
specific preprocessing technique was selected and how it has improved the clustering quality.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Systematic Comparison of Related Work</title>
      <p>Section 2 describes the automation of steps in the data analysis process. Existing research examines
various aspects of the process from different angles, addressing individual points in distinct ways. For a
better overview, this section is grouped into the following parts: AutoML was initially developed for
supervised learning but has now also been extended to unsupervised learning, thereby demonstrating
automated processes in the context of unsupervised learning (Section 3.1). There is also research that
has already dealt with automated preprocessing, but not for unsupervised learning (Section 3.2). Some
of these research proposals have also been integrated into widely available libraries, thus serving as
the "default" automation toolkits (Section 3.3). In addition, there are tools that provide information on
transformation steps, focusing on visualization and explainability (Section 3.4). In the following, the
approaches in each of these categories are briefly described and the extent to which they address the
challenges outlined in Section 2.6 is evaluated. Section 3.5 summarizes this evaluation.</p>
      <sec id="sec-3-1">
        <title>3.1. Unsupervised AutoML Approaches</title>
        <p>
          Several approaches address automated clustering [
          <xref ref-type="bibr" rid="ref17 ref7 ref9">7, 9, 17</xref>
          ]. While these approaches have a strong model
focus rather than a preprocessing focus, they provide solutions for automatically evaluating clustering
results and show how optimization in the context of unsupervised machine learning can be accomplished.
        </p>
        <p>ML2DAC [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] addresses the Combined Algorithm Selection and Hyperparameter Optimization (CASH)
problem. It uses a knowledge base, meta-learning, and a trained classifier to suggest clustering
algorithms, their hyperparameters, and a suitable CVI. This selection is then further optimized with
Bayesian optimization. While showing how a dataset can be automatically clustered and evaluated,
ML2DAC leaves the preprocessing to the user.</p>
        <p>
          From a broad perspective, AutoClust [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] applies a similar approach to ML2DAC [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. It uses
meta-learning and Bayesian optimization to identify a suitable clustering algorithm and promising parameter
values. However, instead of a single CVI, 10 are used. A Multilayer Perceptron Regressor (MLP) is
trained and then utilized to predict an external CVI based on these 10 internal CVIs to overcome the
absence of ground truth while still having a single metric that resembles a ground-truth evaluation. This
predicted score is then used for the optimization. Also, as with ML2DAC, no preprocessing is performed
during optimization.
        </p>
        <p>
          TPE-AutoClust [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] utilizes a knowledge base and meta-learning as well, but chooses an evolutionary
algorithm to optimize the suggested pipelines and further applies a majority vote of three CVIs to
evaluate the pipelines. In contrast to the other approaches, TPE-AutoClust applies preprocessing during
the optimization process. However, the considered preprocessing is limited to four functions from the
areas of missing value imputation, dimensionality reduction, and scaling. For these four functions, a
limited number of parameters is considered.
        </p>
        <p>
          Most research in AutoML for clustering focuses on the clustering algorithms, their parameters, and
the evaluation metric [
          <xref ref-type="bibr" rid="ref7 ref9">7, 9</xref>
          ], and only partially considers preprocessing [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. Therefore, ML2DAC [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]
and AutoClust [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] do not cover C1, C2, C4, and C5 from a preprocessing perspective, but showcase
how unsupervised learning can be evaluated, thus addressing C3. ML2DAC [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] completely addresses
C3 efficiently because it picks one CVI based on meta-learning and then only uses this CVI during
optimization. AutoClust [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] only partially addresses C3 because its strategy includes calculating
multiple CVIs during the optimization process and therefore neglects efficiency [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. TPE-AutoClust
is the only approach that includes preprocessing, but only in a limited way. The restriction to only a
few methods and few parameters per method addresses C1 only in a limited way, and the question
of whether this approach can be applied to a larger, more diverse search space remains unanswered,
thus only partially addressing C2 and C4. Additionally, its majority vote over CVIs
includes only a limited number of CVIs, and always considering every CVI may not be the best way of
evaluating clustering [
          <xref ref-type="bibr" rid="ref7 ref9">7, 9</xref>
          ], therefore only partially addressing C3. Since none of the solutions focus on
explainability, none addresses C5.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Data Preprocessing for Supervised Machine Learning</title>
        <p>Besides unsupervised AutoML, some approaches solely focus on supervised learning or data cleaning
in general, but address preprocessing more specifically than the unsupervised approaches.</p>
        <p>Saga [18] is a general-purpose data cleaning framework. It identifies suitable data preprocessing
pipelines for a given dataset through a two-step optimization process. First, an evolutionary algorithm
is used to assemble methods into a pipeline. In a second step, the Hyperband algorithm [30] is
used to further fine-tune the parameters of the pipelines identified in the first step. Saga can apply
several preprocessing functions from the areas of outlier removal, missing value imputation, encoding,
and string handling [18].</p>
        <p>CtxPipe [19] additionally considers the context of the data. It uses pretrained embedding models
to capture the context or domain-specific information of the data. Then it uses a deep reinforcement
learning approach to construct a pipeline based on embeddings of statistical features and the extracted
context. CtxPipe can apply data preprocessing methods for missing value imputation, encoding, scaling,
and feature selection, but does not consider the parameters of these functions.</p>
        <p>LLMClean [20] technically does not perform preprocessing itself; it uses Large Language Models (LLMs)
to construct a context model with Ontology Functional Dependencies (OFDs) to detect errors and repair
options in IoT data, and it includes a generalization to all types of data. It can detect missing values, look
up additional information such as minimum and maximum values from external sources, and check for
violations of functional dependencies.</p>
        <p>Saga and CtxPipe do not explicitly detect specific characteristics, but discover many characteristics
during their optimization. Several techniques of Saga require labels, thus only partially addressing
C1, while CtxPipe only contains methods that are also applicable to unsupervised learning, therefore
addressing C1 completely. LLMClean focuses more on violations within the data and not classical
preprocessing, thus it also addresses C1 only in a limited way. Due to the design of Saga, C2 is addressed
thoroughly. CtxPipe limits the search space since it does not optimize parameters, so it addresses
C2 only to a limited extent. Furthermore, LLMClean provides no concrete pipeline optimization, which means it
does not cover C2. C3 is not addressed at all, because all approaches are designed and evaluated with
supervised machine learning algorithms. Saga and CtxPipe provide extensive tuning of the pipelines
and therefore fully address C4. Due to its design, LLMClean does not address C4. The focus of Saga
and CtxPipe is not on visualization or explainability, as both primarily provide methods for finding and
optimizing preprocessing pipelines. Thus, they do not address C5. LLMClean provides the user with a
context model, which displays derived rules, but the process by which these rules are obtained is not
explainable. Additionally, there is no visualization, so C5 is only addressed in a limited way.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Libraries</title>
        <p>Several libraries have emerged from research and focus on AutoML, including preprocessing to a certain
extent. Auto-Weka [21, 22], Auto-Sklearn [23, 24], TPOT [26], and AutoGluon [25] are considered here
as prominent examples to evaluate how well-established libraries optimize machine learning pipelines.</p>
        <p>Auto-Weka [21, 22] is one of the first libraries in the context of AutoML. It utilizes Bayesian
optimization to optimize the selection of machine learning algorithms, along with their parameters,
for supervised machine learning tasks. The only preprocessing step Auto-Weka considers during its
optimization is automatic feature selection.</p>
        <p>Auto-Sklearn [23, 24] utilizes Bayesian optimization as well and is developed on top of the well-known
scikit-learn project [31] using its functions to optimize classification or regression tasks. Auto-Sklearn
considers preprocessing in two phases: first, data preprocessing, such as missing value imputation or
scaling; second, feature preprocessing, which primarily involves dimensionality reduction methods.</p>
        <p>In contrast to Auto-Sklearn and Auto-Weka, AutoGluon [25] uses ensemble learning in combination
with supervised models to provide a good machine learning result. It includes preprocessing steps such
as imputing missing values, encoding, and text expansions. However, this is not applied in the search
space, but as a fixed step before starting the ensemble learning.</p>
        <p>Similar to the other libraries, TPOT [26] only supports supervised tasks. It uses a genetic
tree-based optimization strategy to evolve a population of different pipelines. In contrast to the others,
TPOT explicitly includes preprocessing as part of its evolutionary optimization. Additionally, various
preprocessing methods from the area of normalization, dimensionality reduction, and feature selection
are included, automatically selected, and their parameters are tuned during the optimization process.</p>
        <p>Most libraries (Auto-Weka, Auto-Sklearn, AutoGluon) apply different strategies to automate
machine learning tasks, yet primarily focus on the downstream tasks instead of preprocessing [21, 22,
23, 24, 25]. Although they do not address the challenges C1 and C3-C5 concerning preprocessing, they
may show directions on how large search spaces can be handled, therefore partially contributing to C2.
TPOT [26] is the only library that considers preprocessing as part of the optimization problem. TPOT
includes various preprocessing techniques. However, real-world data may have common characteristics,
such as outliers, that TPOT does not cover. Additionally, some preprocessing methods are only
applicable to supervised machine learning, therefore only partially addressing C1. The same holds true
for challenges C2 and C4: TPOT showcases how this can be accomplished, but for a limited number
of characteristics. TPOT does not allow unsupervised evaluation and does not provide a dedicated
explanation of why the methods have been applied and what the effect is. It therefore addresses
neither C3 nor C5.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Semi-Automated Visualization Tools</title>
        <p>There is limited research on automated tools that also prioritize explainability or result visualization.
However, several tools exist that provide an interactive way to apply and suggest transformations,
covering both these topics.</p>
        <p>Wrangler [27] supports users by visualizing the data in a spreadsheet style. It combines user input
with suggestions of transform types to generate multiple options for how the data can be transformed, and
shows their effect. It supports reformatting, splitting or combining columns, imputing missing values,
and detecting type anomalies.</p>
        <p>Another option is Vizier [28], which combines spreadsheet-style visualization with notebooks for
code execution and plots to provide the user with multiple options to alter and view the data. It primarily
enables the user to apply data preprocessing in an easy-to-understand manner, while also incorporating
the work of Yang et al. [32] to perform schema matching, automatically impute missing values, or check
constraints. Additionally, Vizier keeps track of every transformation done by the user.</p>
        <p>The mentioned work does not focus on automating preprocessing for clustering, but showcases how
better explainability and visualization of data could be accomplished. Therefore, C1-C4 are not addressed.
C5 is addressed, but still needs changes to fit the described process of Section 2. Therefore, the solutions
are considered as only partially addressing C5.</p>
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Summary</title>
        <p>Section 3 reviews existing research related to automated data preprocessing for clustering, categorizing
it into distinct categories. Each category addresses certain aspects of the challenges outlined in Section 2,
yet none comprehensively covers all necessary preprocessing steps and optimization requirements.
The preceding subsections summarize the detailed evaluation. This highlights research gaps and emphasizes the
need for robust solutions that entirely automate preprocessing in an unsupervised clustering scenario.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Future Directions of Automated Preprocessing for Clustering</title>
      <p>Section 3 showcases approaches, none of which are directly transferable to the described process (cf.
Section 2) or to automated preprocessing for clustering. However, they hint at directions or techniques
that could be applied or modified.</p>
      <p>
        Detecting data characteristics remains challenging. Some approaches utilize rule-based detection or
implicit detection through candidate generation or refinement. However, these approaches may not be
easily extended to a large search space, or doing so may imply a long runtime. Inspired by other AutoML
approaches [
        <xref ref-type="bibr" rid="ref17 ref7 ref9">7, 9, 17</xref>
        ], meta-learning is an interesting area since it can utilize knowledge from the past
[33], e.g., efficiently detecting characteristics based on previously seen datasets. To apply meta-learning,
well-selected meta-features are required, but it is unclear which meta-features should be considered to
identify a wide range of characteristics. Another step, especially when incorporating other information
such as domain knowledge, would be to use an embedding that represents the data [19]. However,
creating such an embedding efficiently is still challenging, and it may lack explainability.
      </p>
      <p>
        Handling the vast search space (C2) and efficiently refining it (C4) is a topic that others have
considered for different problems (e.g., AutoML for Clustering) [
        <xref ref-type="bibr" rid="ref17">17, 18, 19, 21, 22, 23, 24, 25, 26</xref>
        ]. The
current approaches show that many optimization techniques exist and should be used instead of an
exhaustive search. Various methods like Bayesian optimization [21, 22, 23, 24], genetic algorithms
[
        <xref ref-type="bibr" rid="ref17">17, 18, 26</xref>
        ], reinforcement learning [18, 19], or even LLMs [20] could be applied, while genetic or
evolutionary algorithms seem to be a preferred choice when automating preprocessing, mainly because
of their flexibility. Again, this still requires a detailed analysis of how such methods can be employed, and
also seems promising in combination with meta-learning to restrict the search space during optimization
and start with already promising candidates. LLMs are a widely discussed topic, but they struggle
to handle big tabular data [34] and tend to generalize when provided with a dataset. It seems more
likely that a good representation of the data and preprocessing pipelines should be developed before
providing it to an LLM, or that their use case primarily focuses on the context area and processing other
information than the raw data (e.g., domain knowledge). In order to utilize LLMs, a good encoding
strategy is still required, but it is not yet available for unsupervised preprocessing.
      </p>
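      <p>As a rough illustration of such an evolutionary search, consider the following sketch; random_pipeline, mutate, and score are assumed helpers (e.g., the candidate generation, mutation, and CVI scoring sketched in Section 2), and the split between retained parents and fresh pipelines reflects the exploration-exploitation balance from Section 2.4:</p>
      <preformat>
# Minimal evolutionary loop over preprocessing pipelines (sketch;
# random_pipeline, mutate, and score are assumed helpers).
import random

def evolve(score, random_pipeline, mutate, pop_size=20, generations=10):
    population = [random_pipeline() for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=score, reverse=True)
        parents = ranked[: pop_size // 4]              # exploitation
        children = [mutate(random.choice(parents))
                    for _ in range(pop_size - len(parents) - 2)]
        fresh = [random_pipeline() for _ in range(2)]  # exploration
        population = parents + children + fresh
    return max(population, key=score)
      </preformat>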
      <p>
        The evaluation with different approaches, such as utilizing internal CVIs [
        <xref ref-type="bibr" rid="ref17 ref7">17, 7</xref>
        ] or trying to predict
external CVIs [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], is considered in the area of AutoML for clustering. Yet, an eficient way to transfer this
to preprocessing without drastically increasing the runtime remains an open challenge. An interesting
direction to evaluate preprocessing pipelines for clustering would be to use a model that predicts
external CVIs or suggests internal CVIs. However, creating such a model is challenging because the
data, their characteristics, and consequently the recommendations based on these characteristics change
during preprocessing. Simply incorporating a CVI can lead to optimization towards this CVI instead of
a promising preprocessing pipeline.
      </p>
      <p>
        Lastly, many tools demonstrate how visualization and explainability can be achieved [27, 28].
However, they are mostly user-focused or require human input. While helpful information is available
in some approaches that automate preprocessing [
        <xref ref-type="bibr" rid="ref17">17, 26, 18</xref>
          ], it is not effectively utilized. A solution could build upon existing
visualization strategies and concepts. This could range from simple result
visualization to more extensive analysis, showing what was detected, how a pipeline was derived from
an optimization process, and what data changes it results in.
      </p>
      <p>To summarize, while there are many interesting parts and pieces, none of them fits perfectly
for automated preprocessing for clustering; further research and new or adapted ideas are needed to
create a well-working solution.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In conclusion, this work addresses the automation of data preprocessing for clustering, highlighting
the inherent complexities within unsupervised machine learning scenarios. By proposing a structured
conceptual process consisting of imperfection diagnosis, candidate pipeline generation, quality
estimation, selection, refinement, and result presentation, the paper systematically outlines the steps and
associated challenges involved.</p>
      <p>Key identified challenges include the need for diverse preprocessing techniques, the combinatorial
explosion of the search space, the selection of appropriate evaluation metrics, refinement strategies, and the
requirement for effective visualization and explainability. An assessment of existing approaches reveals
that while current research partially addresses these challenges, no existing solution comprehensively
addresses all of them.</p>
      <p>The clear definition of these challenges and the evaluation of current research not only reveal research
gaps but also outline promising future research directions. This contributes to the development of robust,
efficient, and explainable automated preprocessing solutions, thereby enhancing the effectiveness of
clustering analyses in practice.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author used Grammarly and Writefull to correct grammar
and spelling and to apply rephrasing. After using these services, the author reviewed and edited the
content as needed and takes full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <article-title>Data clustering: 50 years beyond k-means</article-title>
          ,
          <source>Pattern recognition letters 31</source>
          (
          <year>2010</year>
          )
          <fpage>651</fpage>
          -
          <lpage>666</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A. E.</given-names>
            <surname>Ezugwu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Ikotun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. O.</given-names>
            <surname>Oyelade</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Abualigah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. O.</given-names>
            <surname>Agushaka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. I.</given-names>
            <surname>Eke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Akinyelu</surname>
          </string-name>
          ,
          <article-title>A comprehensive survey of clustering algorithms: State-of-the-art machine learning applications</article-title>
          , taxonomy, challenges, and future research prospects,
          <source>Engineering Applications of Artificial Intelligence</source>
          <volume>110</volume>
          (
          <year>2022</year>
          )
          <fpage>104743</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>V. V.</given-names>
            <surname>Baligodugula</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Amsaad</surname>
          </string-name>
          ,
          <article-title>Unsupervised learning: Comparative analysis of clustering techniques on high-dimensional data</article-title>
          ,
          <source>arXiv preprint arXiv:2503.23215</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>I. B.</given-names>
            <surname>Mohamad</surname>
          </string-name>
          , D. Usman,
          <article-title>Research article standardization and its effects on k-means clustering algorithm</article-title>
          ,
          <source>Res J Appl Sci Eng Technol</source>
          <volume>6</volume>
          (
          <year>2013</year>
          )
          <fpage>3299</fpage>
          -
          <lpage>3303</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>U.</given-names>
            <surname>Fayyad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Piatetsky-Shapiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Smyth</surname>
          </string-name>
          ,
          <article-title>From data mining to knowledge discovery in databases</article-title>
          ,
          <source>AI</source>
          magazine
          <volume>17</volume>
          (
          <year>1996</year>
          )
          <fpage>37</fpage>
          -
          <lpage>37</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Wongoutong</surname>
          </string-name>
          ,
          <article-title>The impact of neglecting feature scaling in k-means clustering</article-title>
          ,
          <source>PloS one 19</source>
          (
          <year>2024</year>
          )
          <article-title>e0310839</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D.</given-names>
            <surname>Treder-Tschechlov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fritz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schwarz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mitschang</surname>
          </string-name>
          ,
          <article-title>Ml2dac: Meta-learning to democratize automl for clustering analysis</article-title>
          ,
          <source>Proceedings of the ACM on Management of Data</source>
          <volume>1</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>26</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <article-title>Autocluster: Meta-learning based ensemble method for automated unsupervised clustering</article-title>
          ,
          <source>in: Pacific-Asia Conference on Knowledge Discovery and Data Mining</source>
          , Springer,
          <year>2021</year>
          , pp.
          <fpage>246</fpage>
          -
          <lpage>258</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Poulakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Doulkeridis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kyriazis</surname>
          </string-name>
          ,
          <article-title>Autoclust: A framework for automated clustering based on cluster validity indices</article-title>
          ,
          <source>in: 2020 IEEE International Conference on Data Mining (ICDM)</source>
          , IEEE,
          <year>2020</year>
          , pp.
          <fpage>1220</fpage>
          -
          <lpage>1225</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D.</given-names>
            <surname>Tschechlov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fritz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schwarz</surname>
          </string-name>
          ,
          <article-title>Automl4clust: Efficient automl for clustering analyses</article-title>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>G. O.</given-names>
            <surname>Campos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zimek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sander</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Campello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Micenková</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Schubert</surname>
          </string-name>
          , I. Assent,
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Houle</surname>
          </string-name>
          ,
          <article-title>On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study</article-title>
          ,
          <source>Data mining and knowledge discovery</source>
          <volume>30</volume>
          (
          <year>2016</year>
          )
          <fpage>891</fpage>
          -
          <lpage>927</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Understanding of internal clustering validation measures</article-title>
          ,
          <source>in: 2010 IEEE international conference on data mining, Ieee</source>
          ,
          <year>2010</year>
          , pp.
          <fpage>911</fpage>
          -
          <lpage>916</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Rousseeuw</surname>
          </string-name>
          ,
          <article-title>Silhouettes: a graphical aid to the interpretation and validation of cluster analysis</article-title>
          ,
          <source>Journal of computational and applied mathematics 20</source>
          (
          <year>1987</year>
          )
          <fpage>53</fpage>
          -
          <lpage>65</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>D. L.</given-names>
            <surname>Davies</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. W.</given-names>
            <surname>Bouldin</surname>
          </string-name>
          ,
          <article-title>A cluster separation measure</article-title>
          ,
          <source>IEEE transactions on pattern analysis and machine intelligence</source>
          (
          <year>1979</year>
          )
          <fpage>224</fpage>
          -
          <lpage>227</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>C.</given-names>
            <surname>Blum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roli</surname>
          </string-name>
          ,
          <article-title>Metaheuristics in combinatorial optimization: Overview and conceptual comparison</article-title>
          ,
          <source>ACM computing surveys (CSUR)</source>
          <volume>35</volume>
          (
          <year>2003</year>
          )
          <fpage>268</fpage>
          -
          <lpage>308</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Chatzimparmpas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Martins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Jusufi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kerren</surname>
          </string-name>
          ,
          <article-title>A survey of surveys on the use of visualization for interpreting machine learning models</article-title>
          ,
          <source>Information Visualization</source>
          <volume>19</volume>
          (
          <year>2020</year>
          )
          <fpage>207</fpage>
          -
          <lpage>233</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>R.</given-names>
            <surname>ElShawi</surname>
          </string-name>
          , S. Sakr,
          <article-title>Tpe-autoclust: A tree-based pipeline ensemble framework for automated clustering</article-title>
          , in: 2022 IEEE International Conference on Data Mining Workshops (ICDMW), IEEE, 2022, pp. 1144–1153.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] S. Siddiqi, R. Kern, M. Boehm, Saga: A scalable framework for optimizing data cleaning pipelines for machine learning applications, Proceedings of the ACM on Management of Data 1 (2023) 1–26.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] H. Gao, S. Cai, T. T. A. Dinh, Z. Huang, B. C. Ooi, Ctxpipe: Context-aware data preparation pipeline construction for machine learning, Proceedings of the ACM on Management of Data 2 (2024) 1–27.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] F. Biester, M. Abdelaal, D. Del Gaudio, Llmclean: Context-aware tabular data cleaning via LLM-generated OFDs, in: European Conference on Advances in Databases and Information Systems, Springer, 2024, pp. 68–78.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] C. Thornton, F. Hutter, H. H. Hoos, K. Leyton-Brown, Auto-weka: Combined selection and hyperparameter optimization of classification algorithms, in: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2013, pp. 847–855.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] L. Kotthoff, C. Thornton, H. H. Hoos, F. Hutter, K. Leyton-Brown, Auto-weka 2.0: Automatic model selection and hyperparameter optimization in weka, Journal of Machine Learning Research 18 (2017) 1–5.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] M. Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum, F. Hutter, Efficient and robust automated machine learning, in: Advances in Neural Information Processing Systems 28 (2015), 2015, pp. 2962–2970.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] M. Feurer, K. Eggensperger, S. Falkner, M. Lindauer, F. Hutter, Auto-sklearn 2.0: Hands-free automl via meta-learning, arXiv:2007.04074 [cs.LG] (2020).</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] N. Erickson, J. Mueller, A. Shirkov, H. Zhang, P. Larroy, M. Li, A. Smola, Autogluon-tabular: Robust and accurate automl for structured data, arXiv preprint arXiv:2003.06505 (2020).</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] R. S. Olson, J. H. Moore, Tpot: A tree-based pipeline optimization tool for automating machine learning, in: Workshop on Automatic Machine Learning, PMLR, 2016, pp. 66–74.</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>[27] S. Kandel, A. Paepcke, J. Hellerstein, J. Heer, Wrangler: Interactive visual specification of data transformation scripts, in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2011, pp. 3363–3372.</mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>[28] M. Brachmann, C. Bautista, S. Castelo, S. Feng, J. Freire, B. Glavic, O. Kennedy, H. Müeller, R. Rampin, W. Spoth, et al., Data debugging and exploration with vizier, in: Proceedings of the 2019 International Conference on Management of Data, 2019, pp. 1877–1880.</mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>[29] Y. Poulakis, C. Doulkeridis, D. Kyriazis, A survey on automl methods and systems for clustering, ACM Transactions on Knowledge Discovery from Data 18 (2024) 1–30.</mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>[30] L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, A. Talwalkar, Hyperband: A novel bandit-based approach to hyperparameter optimization, Journal of Machine Learning Research 18 (2018) 1–52.</mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>[31] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.</mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>[32] Y. Yang, N. Meneghetti, R. Fehling, Z. H. Liu, O. Kennedy, Lenses: An on-demand approach to etl, Proceedings of the VLDB Endowment 8 (2015) 1578–1589.</mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>[33] P. Brazdil, J. N. Van Rijn, C. Soares, J. Vanschoren, Metalearning: Applications to automated machine learning and data mining, Springer Nature, 2022.</mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>[34] Y. Sui, M. Zhou, M. Zhou, S. Han, D. Zhang, Table meets llm: Can large language models understand structured table data? A benchmark and empirical study, in: Proceedings of the 17th ACM International Conference on Web Search and Data Mining, 2024, pp. 645–654.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>