An extensive comparison of preprocessing methods in the context of configuration space learning

Damian Garber*, Alexander Felfernig, Viet-Man Le, Tamim Burgstaller and Merfat El-Mansi
Graz University of Technology, Inffeldgasse 16b, Graz, Styria, Austria

ConfWS'24: 26th International Workshop on Configuration, September 2–3, 2024, Girona, Spain
* Corresponding author: dgarber@ist.tugraz.at (D. Garber). Further contacts: alexander.felfernig@ist.tugraz.at (A. Felfernig), vietman.le@ist.tugraz.at (V.-M. Le), tamim.burgstaller@ist.tugraz.at (T. Burgstaller), merfat.elmansi@un.org (M. El-Mansi)

Abstract

One of the core goals in the research field of configuration space learning is building precise predictive models that allow for reliably estimating the performance of a configuration without requiring costly tests. The models used for this purpose are usually machine learning-based. However, the models show significant deviations in their performance depending on the investigated Software Product Line (SPL), the applied data preprocessing, and the number of sample configurations collected. Thus, we investigate the impact of different preprocessing methods and their behavior when using different SPLs, machine learning models, and sample sizes. Performance comparisons on this scale are usually not conducted due to their prohibitively expensive execution time requirements, even for smaller SPLs. Thus, we used three fully enumerated spaces as our training data, which allows for more generalized results. Our results show that the average factors between the worst and best-performing preprocessing methods are 2.05 (BerkeleyDBC), 1.17 (7z), and 1.84 (VP9). Further, no single preprocessing method tested was able to outperform all others, nor was this the case within one specific SPL or model type. This underlines the importance of testing new approaches with multiple preprocessing methods.

Keywords: Configuration Space Learning, Machine Learning, Preprocessing

1. Introduction

The discovery of configurations that optimize the performance of any given Software Product Line (SPL) is one of the core goals of configuration space learning. The performance of a model can take many forms and relies heavily on the use case. For instance, one may optimize a SPL to perform a core task very efficiently or optimize for the size of the compiled SPL binary. This optimization usually takes place in steps. The first step is sampling configurations from the configuration space of the SPL and measuring the target property, which often entails compiling and running tests or benchmarks, a very time- and resource-intensive undertaking. One can use these samples to train a prediction model, which is then used to find a configuration that optimizes the target property.

In this paper, we focus on the creation and training of the prediction model. Many factors can impact the performance of a performance prediction model for SPLs, from the SPL itself to the sampling approach used to collect the training data. However, our focus lies on one of the factors often neglected in SPL performance prediction: preprocessing. We define the necessary terms used in this paper in Section 2. Preprocessing has proven itself in many other domains that employ machine learning-based prediction models. However, literature reviews such as Gong and Chen [1] show that less than half of the investigated studies within the field of configuration performance learning use preprocessing, as further discussed in Section 3. Accordingly, we conduct an in-depth investigation of the influence of preprocessing on performance prediction models for SPLs. To this end, we measure the performance of 4 preprocessing methods in the context of 3 SPLs, 5 machine learning models, and 20 different sizes of training sets. We discuss the details of the experimental evaluation in Section 4, followed by a discussion of the results in Section 5.

2. Definitions

Software Product Line (SPL). SPLs as a concept started to gain widespread popularity at the beginning of the 2000s [2]. Engström and Runeson [3] describe SPLs as the paradigm of forming derivative products from a set of generic components. A SPL has multiple features, each supporting an individual domain of values, which allows for the generation of diverse products using the same components.

Configuration. In the context of a SPL, a configuration defines for each feature the corresponding feature value. However, there may exist additional constraints within the SPL. Thus, we speak of a valid configuration if none of the assigned values is inconsistent with any of the constraints.

Configuration Space. The configuration space of a SPL describes the space spanned by all valid configurations of the SPL. The size of configuration spaces commonly grows exponentially with the number of features, and we speak of colossal configuration spaces if the size is ≫ 10^10 [4].
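To make the notion of a valid configuration concrete, the following minimal Python sketch represents a configuration as a feature-to-value mapping and checks it against explicit constraints. The feature names and the constraint shown are hypothetical and serve only as an illustration, not as part of the paper's datasets.

```python
from typing import Callable, Dict, List

Configuration = Dict[str, object]

# Each constraint is a predicate that a valid configuration must satisfy.
constraints: List[Callable[[Configuration], bool]] = [
    # Hypothetical rule: compression may only be enabled on 4K pages.
    lambda c: not c["COMPRESSION"] or c["PAGESIZE"] == "PS4K",
]

def is_valid(config: Configuration) -> bool:
    """A configuration is valid iff it violates none of the constraints."""
    return all(constraint(config) for constraint in constraints)

print(is_valid({"COMPRESSION": True, "PAGESIZE": "PS4K"}))  # True
print(is_valid({"COMPRESSION": True, "PAGESIZE": "PS1K"}))  # False
```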
3. Related Work

The three fully enumerated configuration spaces provided by Oh et al. [4] facilitate the comprehensive performance analysis and comparison we conducted. They use them to show that relatively simple approaches like uniform random sampling can outperform well-established tools like SPL Conqueror [5] in finding near-optimal configurations for SPLs. We build on this idea and conduct a comparison of different preprocessing methods. The reasoning behind conducting this comparison is the alarming result of Gong and Chen [1]. They performed a literature review on deep configuration performance learning and reported that 44 out of 85 investigated studies used the data as it was, without any preprocessing. This limited utilization implies a lack of awareness of the impact of preprocessing methods. This lack of awareness may then aggravate the difficulty of reproducing and validating the results of published works. Dacrema et al. [6], for example, investigated 18 recently published approaches, of which they could only reproduce 7. Of the 7 reproduced approaches, they showed that 6 can be outperformed by relatively simple alternative approaches.

The importance of preprocessing methods in many domains has long since been established. Wu et al. [7], for example, show that preprocessing improves the performance of streamflow forecasts. Rasekhi et al. [8] report improvements in the prediction of epileptic seizures by using preprocessing.

We further include in our tests different sample sizes, which allows us to investigate the reaction of the preprocessing methods to changing sample sizes. Acher et al. [9] sampled and measured 95854 Linux configurations, a minute fraction of the configuration space of 2^15000 (≈ 2.818 × 10^4515) configurations spanned by Linux. They reported having needed 15000 hours of computation time to collect the samples. Guo et al. [10] use tree-based models to predict configuration performances. Martin et al. [11] focus on using transfer learning across different versions of the Linux kernel to predict the performances of the versions. They mention preprocessing only for encoding configurations into formats compatible with their machine-learning approach.

The literature reviews of Gong and Chen [1] and Pereira et al. [12] named multiple data sampling approaches used in configuration performance learning. However, both identified random sampling as the most popular approach. Pereira et al. [13] conducted a dedicated study on sampling approaches for learning configuration spaces. They suggest using uniform random sampling as long as it is computationally feasible. Accordingly, we adopted it for our comparison.
4. Experimental Setup

This section discusses the exact experimental setup for data collection and which machine learning models, preprocessing methods, and datasets we used. All measurements were collected on the same machine, with specifications as listed in Table 1.

Table 1: Specifications of the machine used in the experiments

Name            Value
Vendor          Lenovo
Product         20N6001GGE
CPU             Intel Core i7-8665U (4× 1.90 GHz)
RAM             32 GB (DDR4)
OS              Manjaro Linux
Kernel version  6.1.80-1-MANJARO

We use the Mean Absolute Percentage Error (MAPE) to evaluate the model performances, which is one of the most commonly used metrics in the literature [1, 12, 13]. The code was implemented in Python using the widely used scikit-learn library [14] (https://scikit-learn.org/stable/index.html). We used Uniform Random Sampling (URS) to generate the training sets of different sizes for model learning. We perform URS by selecting configurations randomly from the set of valid configurations. The sizes of the training sets range from 50 to 1000 in steps of 50. Within the same iteration, the tests for all models and preprocessing methods use the same training set of a specific size. After the performance of all models using the preprocessing applied to the training sets is measured, these measurements are repeated 15 times, each time with new training sets selected with URS. The average of the resulting MAPE values over the 15 iterations is the value we use when we discuss the results.
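The protocol above can be summarized in a short sketch. This is not the authors' published code; it assumes the fully enumerated space is available as a NumPy feature matrix X with measured performances y, uses RF as a stand-in for any of the five models, and assumes each trained model is evaluated on all remaining configurations.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_percentage_error

rng = np.random.default_rng(42)

def average_mape(X, y, train_size, repetitions=15):
    """Mean MAPE over `repetitions` uniform random training samples (URS)."""
    scores = []
    for _ in range(repetitions):
        # URS: draw `train_size` configurations from the enumerated space.
        train = rng.choice(len(X), size=train_size, replace=False)
        test = np.setdiff1d(np.arange(len(X)), train)
        model = RandomForestRegressor().fit(X[train], y[train])
        scores.append(mean_absolute_percentage_error(y[test], model.predict(X[test])))
    return float(np.mean(scores))

# Training set sizes from 50 to 1000 in steps of 50, as in the experiments:
# results = {n: average_mape(X, y, n) for n in range(50, 1001, 50)}
```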
4.1. Datasets

For our comparison, we use datasets of fully enumerated configuration spaces. Thus, each dataset includes all valid configurations for a given SPL. In addition, a value representing the performance of each configuration, such as the execution time of a benchmark, was measured. We use three such datasets based on three configurable software projects: BerkeleyDBC (https://www.oracle.com/database/technologies/related/berkeleydb.html), 7z (https://www.7-zip.org/download.html), and VP9 (https://www.webmproject.org/vp9/). The datasets are used in the work of Oh et al. [4] and were made available in their resources (https://zenodo.org/records/7776627). Oh et al. [4] provided the following description of the three datasets.

BerkeleyDBC is an embedded database system with 9 variables and 2560 configurations. Benchmark response times were measured. We list the variable names and their domains in Table 2.

Table 2: BerkeleyDBC variable names and their respective domains

Name              Domain
DIAGNOSTIC        0 | 1
HAVE_STATISTICS   0 | 1
HAVE_REPLICATION  0 | 1
HAVE_CRYPTO       0 | 1
HAVE_SEQUENCE     0 | 1
HAVE_VERIFY       0 | 1
HAVE_HASH         0 | 1
CACHESIZE         CS16MB | CS32MB | CS64MB | CS512MB
PAGESIZE          PS1K | PS4K | PS8K | PS16K | PS32K

7z is a file archiver with 9 variables and 68640 configurations. Compression times were measured. We list the variable names and their domains in Table 3.

Table 3: 7z variable names and their respective domains

Name                  Domain
root                  0 | 1
CompressionMethod     LZMA | LZMA2 | PPMd | BZip2 | Deflate
x                     x_0 | x_2 | x_4 | x_6 | x_8 | x_10
BlockSize             BlockSize_1 | BlockSize_2 | BlockSize_4 | BlockSize_8 | BlockSize_16 | BlockSize_32 | BlockSize_64 | BlockSize_128 | BlockSize_256 | BlockSize_512 | BlockSize_1024 | BlockSize_2048 | BlockSize_4096
Files                 Files_0 | Files_10 | Files_20 | Files_30 | Files_40 | Files_50 | Files_60 | Files_70 | Files_80 | Files_90 | Files_100
tmOff                 0 | 1
mtOff                 0 | 1
HeaderCompressionOff  0 | 1
filterOff             0 | 1

VP9 is a video encoder with 12 variables and 216000 configurations. Video encoding times were measured. We list the variable names and their domains in Table 4.

Table 4: VP9 variable names and their respective domains

Name                      Domain
root                      0 | 1
lagInFrames               lagInFrames_0 | lagInFrames_8 | lagInFrames_16
endUsage                  variableBitrate | constantBitrate | constrainedQuality
AdaptiveQuantizationMode  off | variance | complexity | cyclicRefresh
TileColumns               TileColumns_0 | TileColumns_3 | TileColumns_6
cpuUsed                   cpuUsed_0 | cpuUsed_2 | cpuUsed_4 | cpuUsed_6 | cpuUsed_8
Threads                   Threads_2 | Threads_4 | Threads_6 | Threads_8 | Threads_10
bitRate                   bitRate_300 | bitRate_600 | bitRate_900 | bitRate_1200 | bitRate_1500
FrameBoost                0 | 1
lossless                  0 | 1
AutoAltRef                0 | 1
Quality                   good | realtime

Although these three configuration spaces do not approach the sizes of colossal configuration spaces like Linux, which spans a configuration space with 2^15000 configurations [9], they still have sizes where an enumeration is no longer an option, and thus they fall within the purview of the research field of configuration space learning. Depending on the complexity of the tests and the underlying system, procuring even very few samples may already be very costly. Acher et al. [9], for example, reported 15000 hours of computation to build and measure 95854 Linux configurations. We selected these datasets for two main advantages: the first is avoiding the extreme computation times of collecting such data, and the second is that using them allows us to test multiple iterations of training sets of different sizes.

4.2. Models

We selected five different types of machine-learning models, each representing a different general approach, to maximize the usefulness of our results. In our implementation, we used models from the scikit-learn Python library [14]. For the sake of reproducibility, we did not perform any parameter tuning on the models and used their respective default settings if not explicitly stated otherwise.

The first model is a Multi-Layer Perceptron (MLP), a feedforward neural network approach. We set the maximum number of iterations to 1000 and activated early stopping for our tests. MLPs are, according to Gong and Chen [1], the most popular approach when conducting deep configuration performance learning. However, the MLP is a very data-intensive approach that needs comparatively large training sets to perform well.

The second model is a K-Nearest Neighbors (KNN) model, a memory-based approach. The model finds the k nearest neighbors of a configuration in the training set, in our case the default value of k = 5. The KNN model predicts the performance of the configuration by calculating the average of the performances of the configuration's k nearest neighbors. In our case, the average was weighted by the distance between the neighbor and the configuration.

The third model is a Random Forest (RF), an ensemble method employing several decision trees generated from the training data to predict the performance of an unknown data point. We use bagged trees, which means we train all underlying decision trees to solve the problem using all features. In the context of classification, the final result is decided based on a majority vote. However, in our context of regression, we calculate the final result by taking the mean of all results produced by the decision trees.

The fourth model is a Support Vector Machine (SVM), a well-established model based on statistical learning frameworks. We use a radial basis function as our kernel type.

The final model is an ElasticNet (EN) model, a derivative of linear regression models. The model combines L1 and L2 priors as a regularizer.
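As a reference, the five models could be instantiated as follows. This is a minimal sketch, assuming the scikit-learn regressor variants of each model type; beyond the settings stated above, all parameters remain at their defaults.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import ElasticNet
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR

models = {
    "MLP": MLPRegressor(max_iter=1000, early_stopping=True),
    "KNN": KNeighborsRegressor(n_neighbors=5, weights="distance"),
    "RF":  RandomForestRegressor(),  # regression: mean over all bagged trees
    "SVM": SVR(kernel="rbf"),        # radial basis function kernel
    "EN":  ElasticNet(),             # combined L1/L2 regularization
}
```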
4.3. Preprocessing

We used several preprocessing methods to test their impact on the different models and training sizes. For the sake of this comparison, we do not distinguish between actual preprocessing methods like Standardization and encodings such as One Hot Encoding.

The first preprocessing method, which we call default (DEF), provides a baseline of mostly unaltered data and leaves numeric values untouched. Boolean values are, however, encoded with 0 and 1 for false and true, respectively. If all values of a domain can be converted into numbers, this is done (e.g., CS32MB = 32, Table 2). If this is not possible, the string values are encoded using label encoding [15, 16], which assigns an increasing numeric value to each unique string in a domain. This format is the default state of the data. Thus, we apply all preprocessing methods mentioned hereafter to the data in this format.

The second preprocessing method is Min-Max Scaling (MMS) [17, 18], which reduces the scale of a given feature to lie between 0 and 1. We achieve this by applying Equation (1) to every feature i of the configuration, where min_i and max_i are the minimum and maximum recorded values for this feature, respectively. When we apply this to features encoded using label encoding, the result is a derivative of the former called scaled label encoding [19, 20].

    f_i(x_i) = (x_i − min_i) / (max_i − min_i)    (1)

The third preprocessing method is Standardization (STD) [21, 22], which is achieved by calculating the mean μ_i and standard deviation σ_i of each feature and applying Equation (2).

    f_i(x_i) = (x_i − μ_i) / σ_i    (2)

This results in the mean of every feature in the training set being 0 and the standard deviation being 1.

The final preprocessing method is One Hot Encoding (OHE) [23, 24], which changes the domains of all features to boolean domains. We achieve this by increasing the dimensionality of the data through an encoding of the domain. If, for example, feature f has the domain {0, 6, 12}, it is replaced with the features f_0, f_6, and f_12; each of the three resulting boolean features is mutually exclusive and encodes one possible value assigned to feature f.
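The four data representations can be approximated with standard scikit-learn transformers. This is one plausible realization, not the authors' code: the DataFrame and its values are invented for illustration, and DEF's numeric extraction (e.g., CS32MB → 32) is omitted for brevity.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, OrdinalEncoder, StandardScaler

raw = pd.DataFrame({"CACHESIZE": ["CS16MB", "CS32MB", "CS64MB"],
                    "HAVE_HASH": [False, True, True]})

# DEF: booleans become 0/1, strings fall back to label encoding
# (here via OrdinalEncoder, which numbers the values of each domain).
default = OrdinalEncoder().fit_transform(raw)

# MMS: Equation (1) per feature; applied to label-encoded data this
# yields the scaled label encoding mentioned above.
min_max = MinMaxScaler().fit_transform(default)

# STD: Equation (2); zero mean and unit standard deviation per feature.
standardized = StandardScaler().fit_transform(default)

# OHE: one mutually exclusive boolean column per value in each domain.
one_hot = OneHotEncoder().fit_transform(raw).toarray()
```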
5. Results

In this section, we showcase the measurements collected as described in the experimental setup section and discuss them. To this end, we discuss the results of each dataset separately, along with the observations we made.

[Figure 1: The performances of each preprocessing method applied to each model, and a comparison between the top performers of each model applied to the BerkeleyDBC dataset.]

Firstly, we start with a discussion of our smallest SPL, BerkeleyDBC. All performance results are visualized in Figure 1. In the results for MLP, we see that OHE performs best among all preprocessing methods regardless of sample size. However, we can also see a shift in the performances of the preprocessing approaches. STD started as the worst-performing preprocessing method. Despite that, with increasing sample size, it outperformed MMS and DEF. Accordingly, the results of STD approached the results of the best performer, OHE, for the larger sample sizes. However, when looking at Figure 2 and Figure 3, we see that this behavior does not occur in the other datasets; rather, in contrast to other approaches, the results of STD remain either relatively stable or improve with increasing sample sizes. We see a similar behavior on a smaller scale with MMS and DEF. MMS performed initially worse than DEF, overtaking it as soon as sample sizes became larger than 200 and achieving similar results from sample sizes of 500 and larger. The other models, in contrast, showed more pronounced preferences for preprocessing methods. For the KNN model, DEF was the best-performing preprocessing method, followed by STD, MMS, and OHE. RF showed the best performance of all models, with almost indistinguishable differences of 0.01% between the preprocessing methods on average. For the SVM model, DEF performed the worst, with comparatively little improvement at larger sample sizes. The remaining preprocessing methods (from worst to best: MMS, STD, and OHE) show relatively similar results, improving with increasing sample sizes. For the EN model, OHE performs best, and the remaining three (from worst to best: MMS, DEF, and STD) show relatively similar results. The sample size has a comparatively small impact on the performances: results with sample sizes of 200 and larger show only a minor oscillation and remain otherwise stable. The comparison between models shows that RF outperforms the other models. All models except EN show significantly better performances with larger sample sizes. However, MLP was impacted the most by the sample size. It overtook the performance of SVM at a sample size of 450 and EN at 900.

[Figure 2: The performances of each preprocessing method applied to each model, and a comparison between the top performers of each model applied to the 7z dataset.]

Secondly, we discuss the next larger SPL, 7z. We provide the results for this dataset in Figure 2. The first observation we can make is that the overall quality of the predicted results decreased. This matches our expectations, since we are predicting the performance of a larger SPL using an equivalent setup. Another observation is that three of the five tested models showed strong oscillations in their performances or, for some preprocessing methods, a worsening of the performance with increasing sample size. The MLP model, for example, showed for the best and second-best performing preprocessing methods, MMS and STD, respectively, no significant changes with increasing sample sizes. DEF and OHE meanwhile showed a decrease in performance with increasing sample sizes. The EN and SVM models showed strong oscillations with increasing sample sizes. However, the performances of the preprocessing methods all follow the same pattern, which suggests that the cause may lie in the model or the SPL rather than the preprocessing approaches. In the case of the SVM, the performances remained very similar. The preprocessing performances in the EN model follow the same oscillation pattern while being displaced by a relatively constant margin along the y-axis, with STD performing best. The KNN model performed as expected, showing constant improvements with increasing sample sizes for all preprocessing methods. However, it is notable that the best performer on the BerkeleyDBC dataset, DEF, now performs the worst, with the former second-best performer STD taking its place as the best performer. The RF model remains the best performer by a significant margin. The preprocessing performances are again very similar, but STD performs significantly better for the smallest tested sample size, thus outperforming the others.
[Figure 3: The performances of each preprocessing method applied to each model, and a comparison between the top performers of each model applied to the VP9 dataset.]

Thirdly, we discuss the largest SPL we investigated, VP9. The results collected for VP9 are shown in Figure 3. Our first observation is that the results for VP9 are closer to the results from BerkeleyDBC. There are again some oscillations in the results of MLP, but they are comparatively minor and show a clear trend toward improvement with increasing sample sizes. We see again that the performance of the EN model remains unaffected by increasing sample sizes, except for some minor oscillations. The SVM model shows a similar pattern as it did with the BerkeleyDBC dataset. The DEF preprocessing method once more performs the worst and is the only method with no significant improvement with increasing sample size. KNN is once more consistent, showing stable improvement with increasing sample size, with STD again performing best. The RF model again performs best by a significant margin. The preprocessing methods have little impact on its performance, but some improve the prediction performance earlier, the best performing being MMS.

Finally, we discuss the results and our observations in general. To this end, we provide the average performances of all models and preprocessing approaches in Table 5.

Table 5: Average MAPE value over all tested sample sizes from 50 to 1000 in steps of 50

Model  Preprocessing  BerkeleyDBC  7z       VP9
MLP    DEF            25.68%       114.76%  273.94%
MLP    MMS            26.29%       99.97%   144.24%
MLP    STD            31.12%       99.98%   129.80%
MLP    OHE            10.27%       104.41%  117.11%
KNN    DEF            1.44%        119.01%  182.27%
KNN    MMS            4.29%        82.30%   158.79%
KNN    STD            2.85%        78.42%   124.31%
KNN    OHE            4.47%        79.99%   241.55%
RF     DEF            0.54%        9.51%    15.14%
RF     MMS            0.55%        9.54%    14.97%
RF     STD            0.55%        9.50%    15.23%
RF     OHE            0.54%        10.82%   18.32%
SVM    DEF            6.44%        91.46%   249.59%
SVM    MMS            6.07%        91.33%   101.45%
SVM    STD            6.04%        91.29%   100.35%
SVM    OHE            6.03%        91.36%   116.08%
EN     DEF            5.19%        176.36%  246.24%
EN     MMS            5.31%        177.54%  273.19%
EN     STD            5.09%        169.40%  267.77%
EN     OHE            2.54%        173.59%  226.60%

The first observation is that preprocessing methods have a significant impact on the prediction performances. BerkeleyDBC has an average factor of 2.05 between the best and worst-performing preprocessing methods. In comparison, 7z and VP9 have average factors of 1.17 and 1.84, respectively. This observation holds for all tested models, even for the best performer, RF. However, RF shows this impact only with the larger SPLs like 7z and VP9. In general, the differences are largest at low sample sizes and become smaller with increasing sample sizes. We observe a similar situation with the SVM model, except for DEF, which was largely unsuited: DEF performed the worst in two out of three tested SPLs, showing insignificant improvement with increasing sample sizes. For MLP, KNN, and EN, on the other hand, we can see significant performance differences at every sample size tested, with, in general, more pronounced differences at smaller sample sizes.

We also observe multiple occasions where a misrepresentation of performances could occur when conducting tests with only one preprocessing method. For instance, one could conclude that SVMs outperform MLPs on the BerkeleyDBC dataset for sample sizes smaller than or equal to 1000 when conducting tests only with DEF or MMS. However, when testing with STD or OHE, we see that MLP outperforms SVMs on the BerkeleyDBC dataset for sample sizes greater than 650 or 400, respectively. From this, we conclude that a sound comparison between two or more predictive models should compare their performances when using their best-performing preprocessing methods. Omitting the preprocessing method used may, by extension, lead to poorly reproducible results.
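The worst-to-best factors reported above can be reconstructed from Table 5. Assuming the factor is the ratio of the worst to the best average MAPE per model, averaged over the five models (an inferred aggregation, though it matches the reported numbers), the BerkeleyDBC column yields roughly 2.05:

```python
# Per-model average MAPE on BerkeleyDBC, copied from Table 5.
avg_mape = {
    "MLP": {"DEF": 25.68, "MMS": 26.29, "STD": 31.12, "OHE": 10.27},
    "KNN": {"DEF": 1.44, "MMS": 4.29, "STD": 2.85, "OHE": 4.47},
    "RF":  {"DEF": 0.54, "MMS": 0.55, "STD": 0.55, "OHE": 0.54},
    "SVM": {"DEF": 6.44, "MMS": 6.07, "STD": 6.04, "OHE": 6.03},
    "EN":  {"DEF": 5.19, "MMS": 5.31, "STD": 5.09, "OHE": 2.54},
}
# Worst/best ratio per model, then averaged over all models.
factors = [max(m.values()) / min(m.values()) for m in avg_mape.values()]
print(round(sum(factors) / len(factors), 2))  # -> 2.06 on the rounded
# table values; the reported 2.05 presumably stems from unrounded data.
```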
MLP worked best on average with OHE. The performance of this model strongly correlated with the sample size; it usually started with comparatively high MAPE scores that became more competitive with increasing sample sizes. Furthermore, it is prone to oscillation. KNN worked best on average with STD. It was one of the most stable and robust models, achieving constant improvement with increasing sample sizes, even in the context of SPLs like 7z that triggered oscillation in most other models. However, its prediction quality places it in the middle field. RF worked best on average with MMS. This model significantly outperformed every other model in every aspect we measured. Its worst performance using the smallest tested sample size of 50 outperforms, in all but two cases, the best performances of all other models. This performance then improved further with increasing sample size. The model usually reaches a plateau relatively early, on average at a sample size of 350, after which its improvement slows significantly. SVM worked best on average with STD. Like RF, the model improves steadily up to a sample size of around 600, after which its improvement starts to plateau, except for the already mentioned DEF. EN worked best on average with OHE. This model showed, on average, comparatively minor improvements with increased sample size.

6. Threats to validity

This paper compared multiple machine learning-based models and explicitly did not perform any parameter tuning for any of the models. Unless explicitly stated otherwise, we always used the default parameters defined by the scikit-learn library [14]. Thus, we must acknowledge that fine-tuning the model parameters, especially for the more complex models like MLP, would likely improve the performances of the models employed. However, the measured results are still valid and valuable for comparing the model performances with respect to the preprocessing methods and the sizes of the training sets employed.

7. Conclusion

We tested 15 scenarios of machine learning-based performance prediction in the context of SPLs by measuring the performance of five different machine learning models on three SPLs with training sets of increasing sizes. Except for two, all scenarios tested showed, in part, radical changes in prediction quality depending on the preprocessing method used. These changes were most pronounced when we measured the model performances with only a few samples as training sets and became less distinctive with training sets of increased size. On average, the disparity between the worst and the best performing preprocessing method amounted to factors of 2.05 (BerkeleyDBC), 1.17 (7z), and 1.84 (VP9). While we identified the on average best performing preprocessing methods for each model we tested, we also see, as visualized in Table 5, that no single method outperforms all others for each dataset, which holds as well if we only focus on a single model. Thus, having shown both the significant impact and the inconsistency in the performance of preprocessing methods, we draw the following conclusions. Results that do not state which, if any, preprocessing method was employed become hard to reproduce. Further, the disregard of preprocessing methods may pose a threat to the validity of results. In summary, preprocessing methods are a high-impact, low-effort, and inconsistent part of the field of SPL performance prediction, and all these properties make them essential to consider and test.

References

[1] J. Gong, T. Chen, Deep configuration performance learning: A systematic survey and taxonomy, 2024. arXiv:2403.03322.

[2] P. Clements, L. Northrop, Software Product Lines, Addison-Wesley, Boston, 2002.

[3] E. Engström, P. Runeson, Software product line testing – a systematic mapping study, Information and Software Technology 53 (2011) 2–13. URL: https://www.sciencedirect.com/science/article/pii/S0950584910001709. doi:10.1016/j.infsof.2010.05.011.

[4] J. Oh, D. Batory, R. Heradio, Finding near-optimal configurations in colossal spaces with statistical guarantees, ACM Trans. Softw. Eng. Methodol. 33 (2023). URL: https://doi.org/10.1145/3611663. doi:10.1145/3611663.

[5] N. Siegmund, M. Rosenmuller, C. Kastner, P. G. Giarrusso, S. Apel, S. S. Kolesnikov, Scalable prediction of non-functional properties in software product lines, in: 2011 15th International Software Product Line Conference, IEEE, 2011, pp. 160–169.

[6] M. F. Dacrema, P. Cremonesi, D. Jannach, Are we really making much progress? A worrying analysis of recent neural recommendation approaches, CoRR abs/1907.06902 (2019). URL: http://arxiv.org/abs/1907.06902. arXiv:1907.06902.

[7] C.-L. Wu, K.-W. Chau, Y.-S. Li, Predicting monthly streamflow using data-driven models coupled with data-preprocessing techniques, Water Resources Research 45 (2009).
[8] J. Rasekhi, M. R. K. Mollaei, M. Bandarabadi, C. A. Teixeira, A. Dourado, Preprocessing effects of 22 linear univariate features on the performance of seizure prediction methods, Journal of Neuroscience Methods 217 (2013) 9–16. URL: https://www.sciencedirect.com/science/article/pii/S0165027013001246. doi:10.1016/j.jneumeth.2013.03.019.

[9] M. Acher, H. Martin, J. A. Pereira, A. Blouin, J.-M. Jézéquel, D. E. Khelladi, L. Lesoil, O. Barais, Learning very large configuration spaces: What matters for Linux kernel sizes, Ph.D. thesis, Inria Rennes-Bretagne Atlantique, 2019.

[10] J. Guo, D. Yang, N. Siegmund, S. Apel, A. Sarkar, P. Valov, K. Czarnecki, A. Wasowski, H. Yu, Data-efficient performance learning for configurable systems, Empirical Software Engineering 23 (2018) 1826–1867.

[11] H. Martin, M. Acher, J. A. Pereira, L. Lesoil, J.-M. Jézéquel, D. E. Khelladi, Transfer learning across variants and versions: The case of Linux kernel size, IEEE Transactions on Software Engineering 48 (2022) 4274–4290. doi:10.1109/TSE.2021.3116768.

[12] J. A. Pereira, M. Acher, H. Martin, J. Jézéquel, G. Botterweck, A. Ventresque, Learning software configuration spaces: A systematic literature review, Journal of Systems and Software 182 (2021) 111044.

[13] J. Alves Pereira, M. Acher, H. Martin, J.-M. Jézéquel, Sampling effect on performance prediction of configurable systems: A case study, in: Proceedings of the ACM/SPEC International Conference on Performance Engineering, ICPE '20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 277–288. URL: https://doi.org/10.1145/3358960.3379137. doi:10.1145/3358960.3379137.

[14] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.

[15] M. Acher, H. Martin, L. Lesoil, A. Blouin, J.-M. Jézéquel, D. E. Khelladi, O. Barais, J. A. Pereira, Feature subset selection for learning huge configuration spaces: the case of Linux kernel size, in: Proceedings of the 26th ACM International Systems and Software Product Line Conference - Volume A, SPLC '22, Association for Computing Machinery, New York, NY, USA, 2022, pp. 85–96. URL: https://doi.org/10.1145/3546932.3546997. doi:10.1145/3546932.3546997.

[16] L. Bao, X. Liu, F. Wang, B. Fang, ACTGAN: Automatic configuration tuning for software systems with generative adversarial networks, in: 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2019, pp. 465–476. doi:10.1109/ASE.2019.00051.
[17] J. Gong, T. Chen, Predicting software performance with divide-and-learn, in: Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2023, Association for Computing Machinery, New York, NY, USA, 2023, pp. 858–870. URL: https://doi.org/10.1145/3611643.3616334. doi:10.1145/3611643.3616334.

[18] S. Fu, S. Gupta, R. Mittal, S. Ratnasamy, On the use of ML for blackbox system performance prediction, in: 18th USENIX Symposium on Networked Systems Design and Implementation (NSDI 21), USENIX Association, 2021, pp. 763–784. URL: https://www.usenix.org/conference/nsdi21/presentation/fu.

[19] Q. Cao, M.-O. Pun, Y. Chen, Deep learning in network-level performance prediction using cross-layer information, IEEE Transactions on Network Science and Engineering 9 (2022) 2364–2377. doi:10.1109/TNSE.2022.3163274.

[20] J. Cheng, C. Gao, Z. Zheng, HINNPerf: Hierarchical interaction neural network for performance prediction of configurable systems, ACM Trans. Softw. Eng. Methodol. 32 (2023). URL: https://doi.org/10.1145/3528100. doi:10.1145/3528100.

[21] K. Zhu, S. Ying, N. Zhang, D. Zhu, Software defect prediction based on enhanced metaheuristic feature selection optimization and a hybrid deep neural network, Journal of Systems and Software 180 (2021) 111026. URL: https://www.sciencedirect.com/science/article/pii/S0164121221001230. doi:10.1016/j.jss.2021.111026.

[22] D. Nemirovsky, T. Arkose, N. Markovic, M. Nemirovsky, O. Unsal, A. Cristal, A machine learning approach for performance prediction and scheduling on heterogeneous CPUs, in: 2017 29th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), 2017, pp. 121–128. doi:10.1109/SBAC-PAD.2017.23.

[23] K.-T. Ding, H.-S. Chen, Y.-L. Pan, H.-H. Chen, Y.-C. Lin, S.-H. Hung, Portable fast platform-aware neural architecture search for edge/mobile computing AI applications, ICSEA 2021 (2021) 108.

[24] Y. Gao, X. Gu, H. Zhang, H. Lin, M. Yang, Runtime performance prediction for deep learning models with graph neural network, in: 2023 IEEE/ACM 45th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), 2023, pp. 368–380. doi:10.1109/ICSE-SEIP58684.2023.00039.