<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>An extensive comparison of preprocessing methods in the context of configuration space learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Damian Garber</string-name>
          <email>dgarber@ist.tugraz.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexander Felfernig</string-name>
          <email>alexander.felfernig@ist.tugraz.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Viet-Man Le</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tamim Burgstaller</string-name>
          <email>tamim.burgstaller@ist.tugraz.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Merfat El-Mansi</string-name>
          <email>merfat.elmansi@un.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Configuration Space Learning, Machine Learning, Preprocessing</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ConfWS'24: 26th International Workshop on Configuration</institution>
          ,
          <addr-line>Sep 2-3</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Graz University of Technology</institution>
          ,
          <addr-line>Inffeldgasse 16b, Graz, Styria</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>45</volume>
      <issue>2009</issue>
      <fpage>85</fpage>
      <lpage>96</lpage>
      <abstract>
        <p>One of the core goals in the research field of configuration space learning is building precise predictive models that allow for reliably estimating the performance of a configuration without requiring costly tests. The models used for this purpose are usually machine learning-based. However, the models show significant deviations in their performance depending on the investigated Software Product Line (SPL), the applied data preprocessing, and the number of sample configurations collected. Thus, we investigate the impact of different preprocessing methods and their behavior when using different SPLs, machine learning models, and sample sizes. Performance comparisons on this scale are usually not conducted due to their prohibitively expensive execution time requirements, even for smaller SPLs. Thus, we used three fully enumerated configuration spaces as our training data, which allows for more generalized results. Our results show that the average factors between the worst and best-performing preprocessing methods are 2.05 (BerkeleyDBC), 1.17 (7z), and 1.84 (VP9). Further, no single preprocessing method tested was able to outperform all others, nor was this the case within one specific SPL or model type. This underlines the importance of testing new approaches with multiple preprocessing methods.</p>
      </abstract>
      <kwd-group>
        <kwd>Configuration Space Learning</kwd>
        <kwd>Machine Learning</kwd>
        <kwd>Preprocessing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <sec id="sec-2-1">
        <title>The discovery of configurations that optimize the perfor</title>
        <p>mance of any given Software Product Line (SPL) is one
of the core goals of configuration space learning. The
performance of a model can take many forms and rely
heavily on the use case. For instance, one may optimize a
SPL to perform a core task very eficiently or optimize for
the size of the compiled SPL binary. This optimization
usually takes place in steps. The first step is sampling
configurations from the configuration space of the SPL
and measuring the target property, which often entails
compiling and running tests or benchmarks, a very time
and resource-intensive undertaking. One can use these
samples to train a prediction model, which is then used
to find a configuration that optimizes the target property.</p>
      </sec>
      <sec id="sec-2-2">
        <title>In this paper, we focus on the creation and training of</title>
        <p>the prediction model. Many factors can impact the
performance of a performance prediction model for SPLs,
from the SPL itself to the sampling approach used to
collect the training data. However, our focus lies on one</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>2. Definitions</title>
      <p>
        Software Product Line (SPL). SPLs, as a concept, started to gain widespread popularity at the beginning of the 2000s [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Engström and Runeson [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] describe SPLs as the paradigm of forming derivative products from a set of generic components. A SPL has multiple features, each supporting an individual domain of values, which allows for the generation of diverse products using the same components.
      </p>
      <sec id="sec-3-1">
        <title>Configuration. In the context of a SPL, a configuration</title>
        <p>defines for each feature the corresponding feature value.</p>
      </sec>
      <sec id="sec-3-2">
        <title>However, there may exist additional constraints within the SPL. Thus, we speak of a valid configuration if none of the assigned values is inconsistent with any of the constraints.</title>
        <p>
          Configuration Space. The configuration space of a SPL
describes the space spanned by all valid configurations
of the SPL. The size of configuration spaces commonly
grows exponentially with the number of features, and we
speak of colossal configuration spaces if its size is ≫ 1010
[
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
        <sec id="sec-3-2-1">
          <title>Name</title>
          <p>Vendor
Product
CPU
RAM
OS
Kernel version</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>Value</title>
          <p>Lenovo
20N6001GGE
Intel Core i7-8665U (4x 1,90 GHz)
32GB (DDR4)</p>
          <p>Manjaro Linux
6.1.80.-1-MANJARO</p>
        </sec>
      </sec>
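      <p>To make the exponential growth concrete, the size of a configuration space (before constraints) is the product of the feature domain sizes. The following minimal sketch uses the BerkeleyDBC domains listed later in Table 2; it is illustrative only, and for this SPL the product happens to coincide with the 2560 configurations reported in Section 4.1.</p>
      <preformat>
# BerkeleyDBC (Table 2): seven boolean features, CACHESIZE with 4 values,
# and PAGESIZE with 5 values.
domain_sizes = [2] * 7 + [4, 5]

size = 1
for d in domain_sizes:
    size *= d  # multiply the domain sizes of all features

print(size)  # 2560 candidate configurations
      </preformat>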
    </sec>
    <sec id="sec-4">
      <title>3. Related Work</title>
      <p>
        The three fully enumerated configuration spaces provided by Oh et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] facilitate the comprehensive performance analysis and comparison we conducted. They use them to show that relatively simple approaches like uniform random sampling can outperform well-established tools like SPL Conqueror [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] to find near-optimal configurations for SPLs. We build on this idea and conduct a comparison of different preprocessing methods. The reasoning behind conducting this comparison is the alarming result of Gong and Chen [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. They performed a literature review on deep configuration performance learning and reported that 44 out of 85 investigated studies used the data as it was, without any preprocessing. This limited utilization implies a lack of awareness of the impact of preprocessing methods. This lack of awareness may then aggravate the difficulty of reproducing and validating the results of published works. Dacrema et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], for example, investigated 18 new approaches published recently, of which they could only reproduce 7. Of the 7 reproduced approaches, they showed that they can outperform 6 by using relatively simple other approaches.
      </p>
      <p>The importance of preprocessing methods in many domains has long since been established. Wu et al. [7], for example, show that preprocessing improves the performance of streamflow forecasts. Rasekhi et al. [8] report improvements in the prediction of epileptic seizures by using preprocessing.</p>
      <p>We further include in our tests different sample sizes, which allows us to investigate the reaction of the preprocessing methods to changing sample sizes. Acher et al. [9] sampled and measured 95854 Linux configurations, a minute fraction of the configuration space of 2<sup>15000</sup> (2.818 ∗ 10<sup>4515</sup>) configurations spanned by Linux. They reported to have needed 15000 hours of computation time to collect the samples. Guo et al. [10] use tree-based models to predict configuration performances. Martin et al. [11] focus on using transfer learning across different versions of the Linux kernel to predict the performances of the versions. They mention preprocessing only for encoding configurations into formats compatible with their machine-learning approach.</p>
      <p>
        The literature reviews of Gong and Chen [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and Pereira et al. [12] named multiple data sampling approaches used in configuration performance learning. However, both identified random sampling as the most popular approach. Pereira et al. [13] conducted a dedicated study on sampling approaches for learning configuration spaces. They suggest using uniform random sampling as long as it is computationally feasible. Accordingly, we adapted it for our comparison.
      </p>
    </sec>
    <sec id="sec-4-exp">
      <title>4. Experimental Setup</title>
      <p>
        This section will discuss the exact experimental setup for data collection and which machine learning models, preprocessing methods, and datasets we were using. All measurements were collected using the same machine with specifications as they are listed in Table 1. We use Mean Absolute Percentage Error (MAPE) to evaluate the model performances, which is one of the most commonly used metrics in literature [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] [12] [13]. The code was implemented in Python using the widely used scikit-learn library (https://scikit-learn.org/stable/index.html) [14]. We used Uniform Random Sampling (URS) to generate the training sets of different sizes for model learning. We can perform URS by selecting configurations randomly from the set of valid configurations. The sizes of the training sets range from 50 to 1000 in steps of 50. However, the tests for all models and preprocessing methods use, within the same iteration, the same training set of a specific size. After the performance of all models using the preprocessing applied to the training sets is measured, these measurements are repeated 15 times, each time with new training sets selected with URS. The average of the resulting MAPE values over the 15 iterations is the value we use when we discuss the results.
      </p>
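      <p>The following Python sketch illustrates this protocol. It is not the code used for the measurements: the function and variable names are ours, the data loading is assumed to have happened beforehand, and evaluating each model on all configurations not used for training is an assumption about the test split.</p>
      <preformat>
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

def evaluate(model, X, y, seed=0, sizes=range(50, 1001, 50), repetitions=15):
    """X, y: all valid configurations of an SPL and their measured performances."""
    rng = np.random.default_rng(seed)
    results = {}
    for size in sizes:
        scores = []
        for _ in range(repetitions):
            # Uniform Random Sampling (URS): draw `size` valid configurations.
            train = rng.choice(len(X), size=size, replace=False)
            test = np.setdiff1d(np.arange(len(X)), train)  # assumed test split
            model.fit(X[train], y[train])
            scores.append(mean_absolute_percentage_error(y[test],
                                                         model.predict(X[test])))
        # The reported value is the mean MAPE over the 15 iterations.
        results[size] = float(np.mean(scores))
    return results
      </preformat>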
      <sec id="sec-4-exp-1">
        <title>4.1. Datasets</title>
        <p>
          For our comparison, we use a dataset of fully enumerated configuration spaces. Thus, the dataset includes all valid configurations for a given SPL. In addition, a value, like execution times of benchmarks or similar, representing the performance of each configuration was measured. We use three such datasets based on three configurable software projects: BerkeleyDBC (https://www.oracle.com/database/technologies/related/berkeleydb.html), 7z (https://www.7-zip.org/download.html), and VP9 (https://www.webmproject.org/vp9/). The datasets are used in the work of Oh et al. [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], and they were made available in their resources (https://zenodo.org/records/7776627). Oh et al. [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] provided the following description of the three datasets.
        </p>
        <p>BerkeleyDBC is an embedded database system with 9 variables and 2560 configurations. Benchmark response times were measured. We visualize the variable names and their domains in Table 2.</p>
        <table-wrap id="tab2">
          <label>Table 2</label>
          <caption>
            <p>BerkeleyDBC variable names and their respective domains</p>
          </caption>
          <table>
            <thead>
              <tr><th>Name</th><th>Domain</th></tr>
            </thead>
            <tbody>
              <tr><td>DIAGNOSTIC</td><td>0 | 1</td></tr>
              <tr><td>HAVE_STATISTICS</td><td>0 | 1</td></tr>
              <tr><td>HAVE_REPLICATION</td><td>0 | 1</td></tr>
              <tr><td>HAVE_CRYPTO</td><td>0 | 1</td></tr>
              <tr><td>HAVE_SEQUENCE</td><td>0 | 1</td></tr>
              <tr><td>HAVE_VERIFY</td><td>0 | 1</td></tr>
              <tr><td>HAVE_HASH</td><td>0 | 1</td></tr>
              <tr><td>CACHESIZE</td><td>CS16MB | CS32MB | CS64MB | CS512MB</td></tr>
              <tr><td>PAGESIZE</td><td>PS1K | PS4K | PS8K | PS16K | PS32K</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>7z is a file archiver with 9 variables and 68640 configurations. Compression times were measured. We visualize the variable names and their domains in Table 3.</p>
        <p>VP9 is a video encoder with 12 variables and 216000 configurations. Video encoding times were measured. We visualize the variable names and their domains in Table 4.</p>
        <p>
          Although these three configuration spaces do not approach the sizes of colossal configuration spaces like Linux, which spans a configuration space with 2<sup>15000</sup> configurations [9], they still have sizes where an enumeration is no longer an option, and thus fall in the purview of the research field of configuration space learning. Depending on the complexity of the tests and the underlying system, procuring very few samples may already be very costly. Acher et al. [9], for example, reported 15000 hours of computation to build and measure 95854 Linux configurations. We selected these datasets due to two main advantages. The first is avoiding the extreme computation times of collecting such data, and the second is that using them allows us to test multiple iterations of training sets of different sizes.
        </p>
      </sec>
      <sec id="sec-4-exp-2">
        <title>4.2. Models</title>
        <p>We selected five different types of machine-learning models, each representing a different general approach, to maximize the usefulness of our results. In our implementations, we used models from the scikit-learn python library [14]. For the sake of reproducibility, we did not perform any parameter tuning on the models and used their respective default settings if not explicitly stated otherwise.</p>
        <p>
          The first model is a Multi-Layer Perceptron (MLP) model, a feedforward neural network approach. We set the maximum of iterations to 1000 and activated early stopping for our tests. MLPs are, according to Gong and Chen [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], the most popular approach when conducting deep configuration performance learning. However, it is a very data-intensive approach that needs comparatively large training sets to perform well.
        </p>
        <p>The second model is a K-Nearest Neighbors (KNN) model, a memory-based approach. The model finds the k-nearest neighbors to a configuration from the training set, in our case the default value 5. The KNN model predicts the performance of the configuration by calculating the average of the performances of the configuration’s k nearest neighbors. In our case, the average was weighted by the distance between the neighbor and the configuration.</p>
        <p>The third model is a Random Forest (RF), an ensemble method employing several decision trees generated using the training data to predict the performance of an unknown data point. We use bagged trees, which means we train all underlying decision trees to solve the problem using all features. The final result is, in the context of classification, decided based on a majority vote. However, in our context of regression, we calculate the final result by taking the mean of all results produced by the decision trees.</p>
          <p>The fourth model is a Support Vector Machine (SVM),
a well-established model based on statistical learning
frameworks. We use a radial basis function as our kernel
type.</p>
        <p>The final model is an ElasticNet (EN) model, a derivative of linear regression models. The model combines L1 and L2 priors as a regularizer.</p>
      </sec>
      <sec id="sec-4-exp-3">
        <title>4.3. Preprocessing</title>
        <p>We used several preprocessing methods to test their impact on the different models and training sizes. For the sake of this comparison, we do not distinguish between actual preprocessing methods like Standardization and encodings such as the One Hot Encoding.</p>
        <p>The first preprocessing method discussed we call default (DEF). It provides a baseline for mostly unaltered data and leaves numeric values untouched. The boolean values are, however, encoded with 0 and 1 for false and true, respectively. If all values of a domain can be converted into numbers, this is done (e.g. CS32MB = 32, Table 2). If this is not possible, the string values are encoded using label encoding [15, 16], which assigns an increasing numeric value for each unique string in a domain. This format is the default state of the data. Thus, we apply all preprocessing methods mentioned hereafter to the data in this format.</p>
          <p>The second preprocessing method is Min Max Scaling
(MMS) [17, 18], which reduces the scale of a given
feature to be between 0 and 1. We achieve this by applying
Equation 1 on every feature of the configuration, where
min and max are the minimum and maximum recorded
numbers for this feature, respectively. When we apply
this to the features encoded using label encoding, the
result is a derivative of the former called scaled label
encoding [19, 20].</p>
        <p>MMS(x) = (x − min) / (max − min)   (1)</p>
        <p>The third preprocessing method is Standardization (STD) [21, 22], which is achieved by calculating the mean and standard deviation of each feature and applying Equation 2.</p>
        <p>STD(x) = (x − μ) / σ   (2)</p>
        <p>This results in the mean of every feature in the training set being now 0 and the standard deviation being 1.</p>
        <p>The final preprocessing method is One Hot Encoding (OHE) [23, 24], which changes the domain of all features to a boolean domain. We achieve this by increasing the dimensions of the data by an encoding of the domain. Thus, if, for example, a feature f has the domain {0, 6, 12}, it would be replaced with the features f0, f6, and f12. Each of the three resulting boolean features is mutually exclusive and encodes one possible value assigned to feature f.</p>
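        <p>The following sketch shows how the five models with the settings stated above and the four preprocessing variants could be assembled with scikit-learn. It is a minimal illustration under our own assumptions, not the implementation used for the measurements: the helper names are ours, the column names come from Table 2, and the exact wiring of encoder, scaler, and model is assumed.</p>
        <preformat>
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import ElasticNet
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import (FunctionTransformer, MinMaxScaler,
                                    OneHotEncoder, OrdinalEncoder, StandardScaler)
from sklearn.svm import SVR

def default_encoding(string_columns):
    # DEF: label-encode string-valued features, leave numeric/boolean ones untouched.
    return ColumnTransformer([("label", OrdinalEncoder(), string_columns)],
                             remainder="passthrough")

preprocessors = {
    "DEF": FunctionTransformer(),                   # identity on the default encoding
    "MMS": MinMaxScaler(),                          # Equation 1: (x - min) / (max - min)
    "STD": StandardScaler(),                        # Equation 2: (x - mean) / std
    "OHE": OneHotEncoder(handle_unknown="ignore"),  # one boolean feature per value
}

models = {
    "MLP": MLPRegressor(max_iter=1000, early_stopping=True),
    "KNN": KNeighborsRegressor(weights="distance"),  # default k = 5
    "RF": RandomForestRegressor(),
    "SVM": SVR(kernel="rbf"),
    "EN": ElasticNet(),
}

# One pipeline per (preprocessing, model) combination, e.g. MMS with KNN for
# BerkeleyDBC, whose string-valued features are CACHESIZE and PAGESIZE:
pipeline = make_pipeline(default_encoding(["CACHESIZE", "PAGESIZE"]),
                         preprocessors["MMS"], models["KNN"])
        </preformat>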
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <sec id="sec-5-1">
        <title>In this section, we showcase the measurements collected</title>
        <p>as described in the experimental setup section and
discuss them. To this end, we will discuss the results of each
4.3. Preprocessing dataset separately and what observations we made.
Firstly, we start with a discussion of our smallest SPL,
We used several preprocessing methods to test their im- BerkeleyDBC. All performance results are visualized in
pact on the diferent models and training sizes. For the Figure 1. In the results for MLP, we see that OHE is
persake of this comparison, we do not distinguish between forming best among all preprocessing methods regardless
actual preprocessing methods like Standardization and of sample size. However, we can also see a shift in the
perencodings such as the One Hot Encoding. formances of the preprocessing approaches. STD started
The first preprocessing method discussed we call default as the worst-performing preprocessing method. Despite
(DEF). It provides a baseline for mostly unaltered data that, with increasing sample size, it outperformed MMS
and leaves numeric values untouched. The boolean val- and DEF. Accordingly, the results of STD approached the
ues are, however, encoded with 0 and 1 for false and true, results of the best performer OHE for the larger sample
respectively. If all values of a domain can be converted sizes. However, when looking at Figure 2 and Figure
3, we see that this behavior does not occur in the other except for EN show significantly better performances
datasets, but, in contrast to other approaches, the results with larger sample sizes. However, MLP was impacted
of STD remain either relatively stable or improve with the most by the sample size. It overtook the performance
increasing sample sizes. We see a similar behavior on a of SVM at a sample size of 450 and EN at 900.
smaller scale with MMS and DEF. MMS performed ini- Secondly, we will discuss the next larger SPL, 7z. We
tially worse than DEF, overtaking it as soon as sample provide the results for this dataset in Figure 2. The first
sizes became larger than 200 and achieving similar re- observation we can make is that the overall quality of
sults from sample sizes 500 and larger. The other models, the predicted results decreased. This matches our
exin contrast, showed more pronounced preferences for pectations, since we are predicting the performance of a
preprocessing methods. For the KNN model, DEF was larger SPL using an equivalent setup. Another
observathe best-performing preprocessing method, followed by tion we can make is that three of the five tested models
STD, MMS, and OHE. RF showed the best performance showed strong oscillations in their performances or, for
of all models with almost indistinguishable diferences some preprocessing methods, a worsening of the
perforof 0.01% between the preprocessing methods on average. mance with increasing sample size. The MLP model, for
For the SVM model, DEF performed the worst with com- example, showed for the best and second best performing
paratively little improvement with larger sample sizes. preprocessing methods, MMS and STD, respectively, no
The remaining preprocessing methods, from worst to significant changes with increasing sample sizes. DEF
best, MMS, STD, and OHE show relatively similar re- and OHE showed meanwhile a decrease in performance
sults, improving with increasing sample sizes. For the with increasing sample sizes. The EN and SVM models
EN model, OHE performs best, and the remaining three, showed strong oscillations with increasing sample sizes.
from worst to best, MMS, DEF, and STD show relatively However, the performances of the preprocessing
methsimilar results. The sample size has a comparatively small ods all follow that same pattern, which suggests that the
impact on the performances. Results with sample sizes cause for this may lie in the model or the SPL rather than
of 200 and larger only show a minor oscillation and re- the preprocessing approaches. In the case of the SVM, the
main otherwise stable. The comparison between models performances remained very similar. The preprocessing
shows that RF outperforms the other models. All models performances in the EN model follow the same oscillation
pattern while being displaced with a relatively constant best performing being MMS.
margin along the y-axis, with STD performing best. The Finally, we will discuss the results and our observations
KNN model performed as expected, showing constant in general. To this end, we provide the average
perforimprovements with increasing sample sizes for all pre- mances of all models and preprocessing approaches in
processing methods. However, it is notable that the best Table 5. The first observation must be that preprocessing
performer on the BerkelyDBC dataset DEF performs the methods have a significant impact on the prediction
perworst now, with the former second-best performing STD formances. BerkeleyDBC has an average factor of 2.05
taking its place as the best performer. The RF model re- between the best and worst-performing preprocessing
mains again the best performer with a significant margin. methods. In comparison, 7z and VP9 have an average
facThe preprocessing performances are again very similar, tor of 1.17 and 1.84 respectively. This observation holds
but STD performs significantly better for the smallest for all tested models, even for the best-performer RF.
tested sample size, thus outperforming the others. However, RF shows this impact only with the larger SPLs
Thirdly, we will discuss the largest SPL we investigated, like 7z and VP9. In general, the diferences are maximized
VP9. The results collected for VP9 are shown in Figure 3. at low sample sizes and become then smaller with
increasOur first observation is that the results for VP9 are closer ing sample sizes. We observe a similar situation with the
to the results from BerkeleyDBC. There are again some SVM model, except DEF, which was largely unsuited.
oscillations in the results of MLP, but they are compara- DEF was in two out of three tested SPLs performing the
tively minor and show a clear trend to improvement with worst, showing insignificant improvement with
increasincreasing sample sizes. We see again that the perfor- ing sample sizes. For MLP, KNN, and EN, on the other
mance of the EN model remains unafected by increasing hand, we can see significant performance diferences
sample sizes, except for some minor oscillations. The on every sample size tested, with, in general, more
proSVM model shows a similar pattern as it did with the nounced diferences when applying smaller sample sizes.
BerkeleyDBC dataset. The DEF preprocessing method We also observe multiple occasions where
misrepresenperforms once more the worst and shows as the only tation of performances could occur when conducting
method with no significant improvement with increas- tests with only one preprocessing method. For instance,
ing sample size. KNN shows to be once more consistent, one can conclude that SVMs outperform MLPs on the
showing stable improvement with increasing sample size, BerkeleyDBC dataset for sample sizes smaller or equal
STD performing best once more. The RF model performs to 1000 when conducting tests only with DEF or MMS.
once more best by a significant margin. The preprocess- However, when testing with STD or OHE, we see that
ing methods have little impact on its performance, but MLP outperforms SVMs on the BerkeleyDBC dataset for
some improve the prediction performance earlier, the sample sizes greater than 650 or 400, respectively. From
this, we conclude that a sound comparison between two ing the performance of five diferent machine learning
or more predictive models should compare their perfor- models on three SPLs with training sets of increasing
mances when using their best-performing preprocessing sizes. Except for two, all scenarios tested showed, in part,
methods. Omitting the preprocessing method used may, radical changes in prediction quality depending on the
by extension, lead to poorly reproducible results. preprocessing method used. These changes were most
MLP showed to work on average best with OHE. The pronounced when we measured the model performances
performance of this model strongly correlated with the with only a few samples to use as training sets and
besample size, and it usually started with comparatively came less distinctive with training sets of increased size.
high MAPE scores that became more competitive with On average, the disparity between the worst and the
increasing sample sizes. Furthermore, it is prone to os- best performing preprocessing method were factors of
cillation. KNN showed to work on average best with 2.05 (BerkeleyDBC), 1.17 (7z), and 1.84 (VP9). While we
STD. It was one of the most stable and robust models, identified the on average best performing preprocessing
achieving constant improvement with increasing sample methods for each model we tested, we also see, as
visusizes, even in the context of SPLs like 7z that triggered alized in Table 5, that no single method outperforms all
oscillation in most other models. However, its prediction others for each dataset, which holds as well if we only
quality places it in the middle field. RF showed to work focus on a single model. Thus, having shown both the
on average best with MMS. This model outperformed significant impact and the inconsistency in the
perforevery other model significantly in every aspect we mea- mance of preprocessing methods, we draw the following
sured. Its worst performance using the smallest tested conclusions. Results that do not state which, if any,
presample size of 50 outperforms, in all but two cases, the processing method was employed become hard to
reprobest performances of all other models. This performance duce. Further, the disregard of preprocessing methods
is then improved further with increasing sample size. The may pose a threat to the validity of results. In summary,
model usually reaches a plateau relatively early on aver- preprocessing methods are a high-impact, low-efort, and
age at a sample size of 350, after which its improvement inconsistent part of the field of SPL performance
predicslows significantly. SVM showed to work on average tion, and all these properties make them essential to be
best with STD. The model improves like RF on average considered and tested.
with a sample size up to 600 steadily, after which the
model starts to plateau in its improvement, except for the
already mentioned DEF. EN showed to work on average References
best with OHE. This model showed, on average,
comparatively minor improvements with increased sample
size.</p>
      </sec>
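      <p>For clarity, the factor between the worst- and best-performing preprocessing methods reported above can be read as a ratio of MAPE values. The following sketch shows one plausible way to compute it; the exact aggregation behind the reported numbers is not spelled out in the text, so this is an assumption, and the data structure is hypothetical.</p>
      <preformat>
import numpy as np

# Hypothetical results: mape[preprocessing] holds the average MAPE values of one
# model over the tested sample sizes (50, 100, ..., 1000). The factor is the ratio
# of the worst to the best mean MAPE; averaging it over all models of one SPL
# yields numbers in the range of the reported 2.05, 1.17, and 1.84.
def worst_to_best_factor(mape):
    means = {name: float(np.mean(values)) for name, values in mape.items()}
    return max(means.values()) / min(means.values())
      </preformat>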
    </sec>
    <sec id="sec-6">
      <title>6. Threats to validity</title>
      <p>This paper compared multiple machine learning-based models and explicitly did not perform any parameter tuning for any one of the models. We used, if not stated explicitly differently, always the default parameters defined by the scikit-learn library [14]. Thus, we must acknowledge that fine-tuning the model parameters, especially for the more complex models like MLP, will likely improve the performances of the models employed. However, the measured results are still valid and valuable for comparing the model performances concerning the preprocessing methods and the sizes of the training sets employed.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>We tested 15 scenarios of machine learning-based performance prediction in the context of SPLs by measuring the performance of five different machine learning models on three SPLs with training sets of increasing sizes. Except for two, all scenarios tested showed, in part, radical changes in prediction quality depending on the preprocessing method used. These changes were most pronounced when we measured the model performances with only a few samples to use as training sets and became less distinctive with training sets of increased size. On average, the disparity between the worst and the best performing preprocessing method were factors of 2.05 (BerkeleyDBC), 1.17 (7z), and 1.84 (VP9). While we identified the on average best performing preprocessing methods for each model we tested, we also see, as visualized in Table 5, that no single method outperforms all others for each dataset, which holds as well if we only focus on a single model. Thus, having shown both the significant impact and the inconsistency in the performance of preprocessing methods, we draw the following conclusions. Results that do not state which, if any, preprocessing method was employed become hard to reproduce. Further, the disregard of preprocessing methods may pose a threat to the validity of results. In summary, preprocessing methods are a high-impact, low-effort, and inconsistent part of the field of SPL performance prediction, and all these properties make them essential to be considered and tested.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Gong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Deep configuration performance learning: A systematic survey and taxonomy</article-title>
          ,
          <year>2024</year>
          . arXiv:2403.03322.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Clements</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Northrop</surname>
          </string-name>
          , Software product lines, Addison-Wesley Boston,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>E.</given-names>
            <surname>Engström</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Runeson</surname>
          </string-name>
          ,
          <article-title>Software product line testing - a systematic mapping study</article-title>
          ,
          <source>Information and Software Technology</source>
          <volume>53</volume>
          (
          <year>2011</year>
          )
          <fpage>2</fpage>
          -
          <lpage>13</lpage>
          . URL: https://www.sciencedirect.com/science/article/pii/S0950584910001709. doi:10.1016/j.infsof.2010.05.011.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Oh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Batory</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Heradio</surname>
          </string-name>
          ,
          <article-title>Finding near-optimal configurations in colossal spaces with statistical guarantees</article-title>
          ,
          <source>ACM Trans. Softw. Eng. Methodol</source>
          .
          <volume>33</volume>
          (
          <year>2023</year>
          ). URL: https://doi.org/10.1145/3611663. doi:10.1145/3611663.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>N.</given-names>
            <surname>Siegmund</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rosenmuller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Kastner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. G.</given-names>
            <surname>Giarrusso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Apel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          ,
          <article-title>Scalable prediction of non-functional properties in software product lines</article-title>
          ,
          <source>in: 2011 15th International Software Product Line Conference</source>
          , IEEE,
          <year>2011</year>
          , pp.
          <fpage>160</fpage>
          -
          <lpage>169</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M. F.</given-names>
            <surname>Dacrema</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cremonesi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jannach</surname>
          </string-name>
          ,
          <article-title>Are we really making much progress? A worrying analysis of recent neural recommendation approaches,</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>