<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>An extensive comparison of preprocessing methods in the context of configuration space learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Damian Garber</string-name>
          <email>dgarber@ist.tugraz.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexander Felfernig</string-name>
          <email>alexander.felfernig@ist.tugraz.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Viet-Man Le</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tamim Burgstaller</string-name>
          <email>tamim.burgstaller@ist.tugraz.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Merfat El-Mansi</string-name>
          <email>merfat.elmansi@un.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Configuration Space Learning, Machine Learning, Preprocessing</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ConfWS'24: 26th International Workshop on Configuration</institution>
          ,
          <addr-line>Sep 2-3</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Graz University of Technology</institution>
          ,
          <addr-line>Inffeldgasse 16b, Graz, Styria</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>45</volume>
      <issue>2009</issue>
      <fpage>85</fpage>
      <lpage>96</lpage>
      <abstract>
        <p>One of the core goals in the research field of configuration space learning is building precise predictive models that allow for reliably estimating the performance of a configuration without requiring costly tests. The models used for this purpose are usually machine learning-based. However, the models show significant deviations in their performance depending on the investigated Software Product Line (SPL), the applied data preprocessing, and the number of sample configurations collected. Thus, we investigate the impact of different preprocessing methods and their behavior when using different SPLs, machine learning models, and sample sizes. Performance comparisons on this scale are usually not conducted due to their prohibitively expensive execution time requirements, even for smaller SPLs. Thus, we used three fully enumerated configuration spaces as our training data, which allows for more generalized results. Our results show that the average factors between the worst and best-performing preprocessing methods are 2.05 (BerkeleyDBC), 1.17 (7z), and 1.84 (VP9). Further, no single preprocessing method tested was able to outperform all others, nor was this the case within one specific SPL or model type. This underlines the importance of testing new approaches with multiple preprocessing methods.</p>
      </abstract>
      <kwd-group>
        <kwd>Configuration Space Learning</kwd>
        <kwd>Machine Learning</kwd>
        <kwd>Preprocessing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <sec id="sec-2-1">
        <title>The discovery of configurations that optimize the perfor</title>
        <p>mance of any given Software Product Line (SPL) is one
of the core goals of configuration space learning. The
performance of a model can take many forms and rely
heavily on the use case. For instance, one may optimize a
SPL to perform a core task very eficiently or optimize for
the size of the compiled SPL binary. This optimization
usually takes place in steps. The first step is sampling
configurations from the configuration space of the SPL
and measuring the target property, which often entails
compiling and running tests or benchmarks, a very time
and resource-intensive undertaking. One can use these
samples to train a prediction model, which is then used
to find a configuration that optimizes the target property.</p>
      </sec>
      <sec id="sec-2-2">
        <title>In this paper, we focus on the creation and training of</title>
        <p>the prediction model. Many factors can impact the
performance of a performance prediction model for SPLs,
from the SPL itself to the sampling approach used to
collect the training data. However, our focus lies on one</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>2. Definitions</title>
      <p>
        Software Product Line (SPL). SPLs, as a concept, started to gain widespread popularity at the beginning of the 2000s [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Engström and Runeson [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] describe SPLs as the paradigm of forming derivative products from a set of generic components. A SPL has multiple features, each supporting an individual domain of values, which allows for the generation of diverse products using the same components.
      </p>
      <sec id="sec-3-1">
        <title>Configuration. In the context of a SPL, a configuration</title>
        <p>defines for each feature the corresponding feature value.</p>
      </sec>
      <sec id="sec-3-2">
        <title>However, there may exist additional constraints within the SPL. Thus, we speak of a valid configuration if none of the assigned values is inconsistent with any of the constraints.</title>
        <p>
          Configuration Space. The configuration space of a SPL
describes the space spanned by all valid configurations
of the SPL. The size of configuration spaces commonly
grows exponentially with the number of features, and we
speak of colossal configuration spaces if its size is ≫ 1010
[
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
        <sec id="sec-3-2-1">
          <title>Name</title>
          <p>Vendor
Product
CPU
RAM
OS
Kernel version</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>Value</title>
          <p>Lenovo
20N6001GGE
Intel Core i7-8665U (4x 1,90 GHz)
32GB (DDR4)</p>
          <p>Manjaro Linux
6.1.80.-1-MANJARO</p>
        </sec>
      </sec>
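      <p>To make the exponential growth concrete, the size of a configuration space (before constraints) is the product of the feature domain sizes. The following minimal sketch uses the BerkeleyDBC domains listed later in Table 2; it is illustrative only, and for this SPL the product happens to coincide with the 2560 configurations reported in Section 4.1.</p>
      <preformat>
# BerkeleyDBC (Table 2): seven boolean features, CACHESIZE with 4 values,
# and PAGESIZE with 5 values.
domain_sizes = [2] * 7 + [4, 5]

size = 1
for d in domain_sizes:
    size *= d  # multiply the domain sizes of all features

print(size)  # 2560 candidate configurations
      </preformat>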
    </sec>
    <sec id="sec-4">
      <title>3. Related Work</title>
      <p>
        The three fully enumerated configuration spaces provided by Oh et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] facilitate the comprehensive performance analysis and comparison we conducted. They use them to show that relatively simple approaches like uniform random sampling can outperform well-established tools like SPL Conqueror [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] to find near-optimal configurations for SPLs. We build on this idea and conduct a comparison of different preprocessing methods. The reasoning behind conducting this comparison is the alarming result of Gong and Chen [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. They performed a literature review on deep configuration performance learning and reported that 44 out of 85 investigated studies used the data as it was, without any preprocessing. This limited utilization implies a lack of awareness of the impact of preprocessing methods. This lack of awareness may then aggravate the difficulty of reproducing and validating the results of published works. Dacrema et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], for example, investigated 18 new approaches published recently, of which they could only reproduce 7. Of the 7 reproduced approaches, they showed that they can outperform 6 by using relatively simple other approaches.
      </p>
      <p>The importance of preprocessing methods in many domains has long since been established. Wu et al. [7], for example, show that preprocessing improves the performance of streamflow forecasts. Rasekhi et al. [8] report improvements in the prediction of epileptic seizures by using preprocessing.</p>
      <p>We further include in our tests different sample sizes, which allows us to investigate the reaction of the preprocessing methods to changing sample sizes. Acher et al. [9] sampled and measured 95854 Linux configurations, a minute fraction of the configuration space of 2<sup>15000</sup> (2.818 ∗ 10<sup>4515</sup>) configurations spanned by Linux. They reported to have needed 15000 hours of computation time to collect the samples. Guo et al. [10] use tree-based models to predict configuration performances. Martin et al. [11] focus on using transfer learning across different versions of the Linux kernel to predict the performances of the versions. They mention preprocessing only for encoding configurations into formats compatible with their machine-learning approach.</p>
      <p>
        The literature reviews of Gong and Chen [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and Pereira et al. [12] named multiple data sampling approaches used in configuration performance learning. However, both identified random sampling as the most popular approach. Pereira et al. [13] conducted a dedicated study on sampling approaches for learning configuration spaces. They suggest using uniform random sampling as long as it is computationally feasible. Accordingly, we adapted it for our comparison.
      </p>
    </sec>
    <sec id="sec-4-exp">
      <title>4. Experimental Setup</title>
      <p>
        This section will discuss the exact experimental setup for data collection and which machine learning models, preprocessing methods, and datasets we were using. All measurements were collected using the same machine with specifications as they are listed in Table 1. We use Mean Absolute Percentage Error (MAPE) to evaluate the model performances, which is one of the most commonly used metrics in literature [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] [12] [13]. The code was implemented in Python using the widely used scikit-learn library (https://scikit-learn.org/stable/index.html) [14]. We used Uniform Random Sampling (URS) to generate the training sets of different sizes for model learning. We can perform URS by selecting configurations randomly from the set of valid configurations. The sizes of the training sets range from 50 to 1000 in steps of 50. However, the tests for all models and preprocessing methods use, within the same iteration, the same training set of a specific size. After the performance of all models using the preprocessing applied to the training sets is measured, these measurements are repeated 15 times, each time with new training sets selected with URS. The average of the resulting MAPE values over the 15 iterations is the value we use when we discuss the results.
      </p>
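      <p>The following Python sketch illustrates this protocol. It is not the code used for the measurements: the function and variable names are ours, the data loading is assumed to have happened beforehand, and evaluating each model on all configurations not used for training is an assumption about the test split.</p>
      <preformat>
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

def evaluate(model, X, y, seed=0, sizes=range(50, 1001, 50), repetitions=15):
    """X, y: all valid configurations of an SPL and their measured performances."""
    rng = np.random.default_rng(seed)
    results = {}
    for size in sizes:
        scores = []
        for _ in range(repetitions):
            # Uniform Random Sampling (URS): draw `size` valid configurations.
            train = rng.choice(len(X), size=size, replace=False)
            test = np.setdiff1d(np.arange(len(X)), train)  # assumed test split
            model.fit(X[train], y[train])
            scores.append(mean_absolute_percentage_error(y[test],
                                                         model.predict(X[test])))
        # The reported value is the mean MAPE over the 15 iterations.
        results[size] = float(np.mean(scores))
    return results
      </preformat>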
      <sec id="sec-4-exp-1">
        <title>4.1. Datasets</title>
        <p>
          For our comparison, we use a dataset of fully enumerated configuration spaces. Thus, the dataset includes all valid configurations for a given SPL. In addition, a value, like execution times of benchmarks or similar, representing the performance of each configuration was measured. We use three such datasets based on three configurable software projects: BerkeleyDBC (https://www.oracle.com/database/technologies/related/berkeleydb.html), 7z (https://www.7-zip.org/download.html), and VP9 (https://www.webmproject.org/vp9/). The datasets are used in the work of Oh et al. [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], and they were made available in their resources (https://zenodo.org/records/7776627). Oh et al. [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] provided the following description of the three datasets.
        </p>
        <p>BerkeleyDBC is an embedded database system with 9 variables and 2560 configurations. Benchmark response times were measured. We visualize the variable names and their domains in Table 2.</p>
        <table-wrap id="tab2">
          <label>Table 2</label>
          <caption>
            <p>BerkeleyDBC variable names and their respective domains</p>
          </caption>
          <table>
            <thead>
              <tr><th>Name</th><th>Domain</th></tr>
            </thead>
            <tbody>
              <tr><td>DIAGNOSTIC</td><td>0 | 1</td></tr>
              <tr><td>HAVE_STATISTICS</td><td>0 | 1</td></tr>
              <tr><td>HAVE_REPLICATION</td><td>0 | 1</td></tr>
              <tr><td>HAVE_CRYPTO</td><td>0 | 1</td></tr>
              <tr><td>HAVE_SEQUENCE</td><td>0 | 1</td></tr>
              <tr><td>HAVE_VERIFY</td><td>0 | 1</td></tr>
              <tr><td>HAVE_HASH</td><td>0 | 1</td></tr>
              <tr><td>CACHESIZE</td><td>CS16MB | CS32MB | CS64MB | CS512MB</td></tr>
              <tr><td>PAGESIZE</td><td>PS1K | PS4K | PS8K | PS16K | PS32K</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>7z is a file archiver with 9 variables and 68640 configurations. Compression times were measured. We visualize the variable names and their domains in Table 3.</p>
        <p>VP9 is a video encoder with 12 variables and 216000 configurations. Video encoding times were measured. We visualize the variable names and their domains in Table 4.</p>
        <p>
          Although these three configuration spaces do not approach the sizes of colossal configuration spaces like Linux, which spans a configuration space with 2<sup>15000</sup> configurations [9], they still have sizes where an enumeration is no longer an option, and thus fall in the purview of the research field of configuration space learning. Depending on the complexity of the tests and the underlying system, procuring very few samples may already be very costly. Acher et al. [9], for example, reported 15000 hours of computation to build and measure 95854 Linux configurations. We selected these datasets due to two main advantages. The first is avoiding the extreme computation times of collecting such data, and the second is that using them allows us to test multiple iterations of training sets of different sizes.
        </p>
      </sec>
      <sec id="sec-4-exp-2">
        <title>4.2. Models</title>
        <p>We selected five different types of machine-learning models, each representing a different general approach, to maximize the usefulness of our results. In our implementations, we used models from the scikit-learn python library [14]. For the sake of reproducibility, we did not perform any parameter tuning on the models and used their respective default settings if not explicitly stated otherwise.</p>
        <p>
          The first model is a Multi-Layer Perceptron (MLP) model, a feedforward neural network approach. We set the maximum of iterations to 1000 and activated early stopping for our tests. MLPs are, according to Gong and Chen [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], the most popular approach when conducting deep configuration performance learning. However, it is a very data-intensive approach that needs comparatively large training sets to perform well.
        </p>
        <p>The second model is a K-Nearest Neighbors (KNN) model, a memory-based approach. The model finds the k-nearest neighbors to a configuration from the training set, in our case the default value 5. The KNN model predicts the performance of the configuration by calculating the average of the performances of the configuration’s k nearest neighbors. In our case, the average was weighted by the distance between the neighbor and the configuration.</p>
        <p>The third model is a Random Forest (RF), an ensemble method employing several decision trees generated using the training data to predict the performance of an unknown data point. We use bagged trees, which means we train all underlying decision trees to solve the problem using all features. The final result is, in the context of classification, decided based on a majority vote. However, in our context of regression, we calculate the final result by taking the mean of all results produced by the decision trees.</p>
          <p>The fourth model is a Support Vector Machine (SVM),
a well-established model based on statistical learning
frameworks. We use a radial basis function as our kernel
type.</p>
        <p>The final model is an ElasticNet (EN) model, a derivative of linear regression models. The model combines L1 and L2 priors as a regularizer.</p>
      </sec>
      <sec id="sec-4-exp-3">
        <title>4.3. Preprocessing</title>
        <p>We used several preprocessing methods to test their impact on the different models and training sizes. For the sake of this comparison, we do not distinguish between actual preprocessing methods like Standardization and encodings such as the One Hot Encoding.</p>
        <p>The first preprocessing method discussed we call default (DEF). It provides a baseline for mostly unaltered data and leaves numeric values untouched. The boolean values are, however, encoded with 0 and 1 for false and true, respectively. If all values of a domain can be converted into numbers, this is done (e.g. CS32MB = 32, Table 2). If this is not possible, the string values are encoded using label encoding [15, 16], which assigns an increasing numeric value for each unique string in a domain. This format is the default state of the data. Thus, we apply all preprocessing methods mentioned hereafter to the data in this format.</p>
          <p>The second preprocessing method is Min Max Scaling
(MMS) [17, 18], which reduces the scale of a given
feature to be between 0 and 1. We achieve this by applying
Equation 1 on every feature of the configuration, where
min and max are the minimum and maximum recorded
numbers for this feature, respectively. When we apply
this to the features encoded using label encoding, the
result is a derivative of the former called scaled label
encoding [19, 20].</p>
        <p>MMS(x) = (x − min) / (max − min)   (1)</p>
        <p>The third preprocessing method is Standardization (STD) [21, 22], which is achieved by calculating the mean and standard deviation of each feature and applying Equation 2.</p>
        <p>STD(x) = (x − μ) / σ   (2)</p>
        <p>This results in the mean of every feature in the training set being now 0 and the standard deviation being 1.</p>
        <p>The final preprocessing method is One Hot Encoding (OHE) [23, 24], which changes the domain of all features to a boolean domain. We achieve this by increasing the dimensions of the data by an encoding of the domain. Thus, if, for example, a feature f has the domain {0, 6, 12}, it would be replaced with the features f0, f6, and f12. Each of the three resulting boolean features is mutually exclusive and encodes one possible value assigned to feature f.</p>
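        <p>The following sketch shows how the five models with the settings stated above and the four preprocessing variants could be assembled with scikit-learn. It is a minimal illustration under our own assumptions, not the implementation used for the measurements: the helper names are ours, the column names come from Table 2, and the exact wiring of encoder, scaler, and model is assumed.</p>
        <preformat>
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import ElasticNet
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import (FunctionTransformer, MinMaxScaler,
                                    OneHotEncoder, OrdinalEncoder, StandardScaler)
from sklearn.svm import SVR

def default_encoding(string_columns):
    # DEF: label-encode string-valued features, leave numeric/boolean ones untouched.
    return ColumnTransformer([("label", OrdinalEncoder(), string_columns)],
                             remainder="passthrough")

preprocessors = {
    "DEF": FunctionTransformer(),                   # identity on the default encoding
    "MMS": MinMaxScaler(),                          # Equation 1: (x - min) / (max - min)
    "STD": StandardScaler(),                        # Equation 2: (x - mean) / std
    "OHE": OneHotEncoder(handle_unknown="ignore"),  # one boolean feature per value
}

models = {
    "MLP": MLPRegressor(max_iter=1000, early_stopping=True),
    "KNN": KNeighborsRegressor(weights="distance"),  # default k = 5
    "RF": RandomForestRegressor(),
    "SVM": SVR(kernel="rbf"),
    "EN": ElasticNet(),
}

# One pipeline per (preprocessing, model) combination, e.g. MMS with KNN for
# BerkeleyDBC, whose string-valued features are CACHESIZE and PAGESIZE:
pipeline = make_pipeline(default_encoding(["CACHESIZE", "PAGESIZE"]),
                         preprocessors["MMS"], models["KNN"])
        </preformat>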
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <sec id="sec-5-1">
        <title>In this section, we showcase the measurements collected</title>
        <p>as described in the experimental setup section and
discuss them. To this end, we will discuss the results of each
4.3. Preprocessing dataset separately and what observations we made.
Firstly, we start with a discussion of our smallest SPL,
We used several preprocessing methods to test their im- BerkeleyDBC. All performance results are visualized in
pact on the diferent models and training sizes. For the Figure 1. In the results for MLP, we see that OHE is
persake of this comparison, we do not distinguish between forming best among all preprocessing methods regardless
actual preprocessing methods like Standardization and of sample size. However, we can also see a shift in the
perencodings such as the One Hot Encoding. formances of the preprocessing approaches. STD started
The first preprocessing method discussed we call default as the worst-performing preprocessing method. Despite
(DEF). It provides a baseline for mostly unaltered data that, with increasing sample size, it outperformed MMS
and leaves numeric values untouched. The boolean val- and DEF. Accordingly, the results of STD approached the
ues are, however, encoded with 0 and 1 for false and true, results of the best performer OHE for the larger sample
respectively. If all values of a domain can be converted sizes. However, when looking at Figure 2 and Figure
3, we see that this behavior does not occur in the other except for EN show significantly better performances
datasets, but, in contrast to other approaches, the results with larger sample sizes. However, MLP was impacted
of STD remain either relatively stable or improve with the most by the sample size. It overtook the performance
increasing sample sizes. We see a similar behavior on a of SVM at a sample size of 450 and EN at 900.
smaller scale with MMS and DEF. MMS performed ini- Secondly, we will discuss the next larger SPL, 7z. We
tially worse than DEF, overtaking it as soon as sample provide the results for this dataset in Figure 2. The first
sizes became larger than 200 and achieving similar re- observation we can make is that the overall quality of
sults from sample sizes 500 and larger. The other models, the predicted results decreased. This matches our
exin contrast, showed more pronounced preferences for pectations, since we are predicting the performance of a
preprocessing methods. For the KNN model, DEF was larger SPL using an equivalent setup. Another
observathe best-performing preprocessing method, followed by tion we can make is that three of the five tested models
STD, MMS, and OHE. RF showed the best performance showed strong oscillations in their performances or, for
of all models with almost indistinguishable diferences some preprocessing methods, a worsening of the
perforof 0.01% between the preprocessing methods on average. mance with increasing sample size. The MLP model, for
For the SVM model, DEF performed the worst with com- example, showed for the best and second best performing
paratively little improvement with larger sample sizes. preprocessing methods, MMS and STD, respectively, no
The remaining preprocessing methods, from worst to significant changes with increasing sample sizes. DEF
best, MMS, STD, and OHE show relatively similar re- and OHE showed meanwhile a decrease in performance
sults, improving with increasing sample sizes. For the with increasing sample sizes. The EN and SVM models
EN model, OHE performs best, and the remaining three, showed strong oscillations with increasing sample sizes.
from worst to best, MMS, DEF, and STD show relatively However, the performances of the preprocessing
methsimilar results. The sample size has a comparatively small ods all follow that same pattern, which suggests that the
impact on the performances. Results with sample sizes cause for this may lie in the model or the SPL rather than
of 200 and larger only show a minor oscillation and re- the preprocessing approaches. In the case of the SVM, the
main otherwise stable. The comparison between models performances remained very similar. The preprocessing
shows that RF outperforms the other models. All models performances in the EN model follow the same oscillation
pattern while being displaced with a relatively constant best performing being MMS.
margin along the y-axis, with STD performing best. The Finally, we will discuss the results and our observations
KNN model performed as expected, showing constant in general. To this end, we provide the average
perforimprovements with increasing sample sizes for all pre- mances of all models and preprocessing approaches in
processing methods. However, it is notable that the best Table 5. The first observation must be that preprocessing
performer on the BerkelyDBC dataset DEF performs the methods have a significant impact on the prediction
perworst now, with the former second-best performing STD formances. BerkeleyDBC has an average factor of 2.05
taking its place as the best performer. The RF model re- between the best and worst-performing preprocessing
mains again the best performer with a significant margin. methods. In comparison, 7z and VP9 have an average
facThe preprocessing performances are again very similar, tor of 1.17 and 1.84 respectively. This observation holds
but STD performs significantly better for the smallest for all tested models, even for the best-performer RF.
tested sample size, thus outperforming the others. However, RF shows this impact only with the larger SPLs
Thirdly, we will discuss the largest SPL we investigated, like 7z and VP9. In general, the diferences are maximized
VP9. The results collected for VP9 are shown in Figure 3. at low sample sizes and become then smaller with
increasOur first observation is that the results for VP9 are closer ing sample sizes. We observe a similar situation with the
to the results from BerkeleyDBC. There are again some SVM model, except DEF, which was largely unsuited.
oscillations in the results of MLP, but they are compara- DEF was in two out of three tested SPLs performing the
tively minor and show a clear trend to improvement with worst, showing insignificant improvement with
increasincreasing sample sizes. We see again that the perfor- ing sample sizes. For MLP, KNN, and EN, on the other
mance of the EN model remains unafected by increasing hand, we can see significant performance diferences
sample sizes, except for some minor oscillations. The on every sample size tested, with, in general, more
proSVM model shows a similar pattern as it did with the nounced diferences when applying smaller sample sizes.
BerkeleyDBC dataset. The DEF preprocessing method We also observe multiple occasions where
misrepresenperforms once more the worst and shows as the only tation of performances could occur when conducting
method with no significant improvement with increas- tests with only one preprocessing method. For instance,
ing sample size. KNN shows to be once more consistent, one can conclude that SVMs outperform MLPs on the
showing stable improvement with increasing sample size, BerkeleyDBC dataset for sample sizes smaller or equal
STD performing best once more. The RF model performs to 1000 when conducting tests only with DEF or MMS.
once more best by a significant margin. The preprocess- However, when testing with STD or OHE, we see that
ing methods have little impact on its performance, but MLP outperforms SVMs on the BerkeleyDBC dataset for
some improve the prediction performance earlier, the sample sizes greater than 650 or 400, respectively. From
this, we conclude that a sound comparison between two ing the performance of five diferent machine learning
or more predictive models should compare their perfor- models on three SPLs with training sets of increasing
mances when using their best-performing preprocessing sizes. Except for two, all scenarios tested showed, in part,
methods. Omitting the preprocessing method used may, radical changes in prediction quality depending on the
by extension, lead to poorly reproducible results. preprocessing method used. These changes were most
MLP showed to work on average best with OHE. The pronounced when we measured the model performances
performance of this model strongly correlated with the with only a few samples to use as training sets and
besample size, and it usually started with comparatively came less distinctive with training sets of increased size.
high MAPE scores that became more competitive with On average, the disparity between the worst and the
increasing sample sizes. Furthermore, it is prone to os- best performing preprocessing method were factors of
cillation. KNN showed to work on average best with 2.05 (BerkeleyDBC), 1.17 (7z), and 1.84 (VP9). While we
STD. It was one of the most stable and robust models, identified the on average best performing preprocessing
achieving constant improvement with increasing sample methods for each model we tested, we also see, as
visusizes, even in the context of SPLs like 7z that triggered alized in Table 5, that no single method outperforms all
oscillation in most other models. However, its prediction others for each dataset, which holds as well if we only
quality places it in the middle field. RF showed to work focus on a single model. Thus, having shown both the
on average best with MMS. This model outperformed significant impact and the inconsistency in the
perforevery other model significantly in every aspect we mea- mance of preprocessing methods, we draw the following
sured. Its worst performance using the smallest tested conclusions. Results that do not state which, if any,
presample size of 50 outperforms, in all but two cases, the processing method was employed become hard to
reprobest performances of all other models. This performance duce. Further, the disregard of preprocessing methods
is then improved further with increasing sample size. The may pose a threat to the validity of results. In summary,
model usually reaches a plateau relatively early on aver- preprocessing methods are a high-impact, low-efort, and
age at a sample size of 350, after which its improvement inconsistent part of the field of SPL performance
predicslows significantly. SVM showed to work on average tion, and all these properties make them essential to be
best with STD. The model improves like RF on average considered and tested.
with a sample size up to 600 steadily, after which the
model starts to plateau in its improvement, except for the
already mentioned DEF. EN showed to work on average References
best with OHE. This model showed, on average,
comparatively minor improvements with increased sample
size.</p>
      </sec>
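      <p>For clarity, the factor between the worst- and best-performing preprocessing methods reported above can be read as a ratio of MAPE values. The following sketch shows one plausible way to compute it; the exact aggregation behind the reported numbers is not spelled out in the text, so this is an assumption, and the data structure is hypothetical.</p>
      <preformat>
import numpy as np

# Hypothetical results: mape[preprocessing] holds the average MAPE values of one
# model over the tested sample sizes (50, 100, ..., 1000). The factor is the ratio
# of the worst to the best mean MAPE; averaging it over all models of one SPL
# yields numbers in the range of the reported 2.05, 1.17, and 1.84.
def worst_to_best_factor(mape):
    means = {name: float(np.mean(values)) for name, values in mape.items()}
    return max(means.values()) / min(means.values())
      </preformat>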
    </sec>
    <sec id="sec-6">
      <title>6. Threats to validity</title>
      <p>This paper compared multiple machine learning-based models and explicitly did not perform any parameter tuning for any one of the models. We used, if not stated explicitly differently, always the default parameters defined by the scikit-learn library [14]. Thus, we must acknowledge that fine-tuning the model parameters, especially for the more complex models like MLP, will likely improve the performances of the models employed. However, the measured results are still valid and valuable for comparing the model performances concerning the preprocessing methods and the sizes of the training sets employed.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>We tested 15 scenarios of machine learning-based performance prediction in the context of SPLs by measuring the performance of five different machine learning models on three SPLs with training sets of increasing sizes. Except for two, all scenarios tested showed, in part, radical changes in prediction quality depending on the preprocessing method used. These changes were most pronounced when we measured the model performances with only a few samples to use as training sets and became less distinctive with training sets of increased size. On average, the disparity between the worst and the best performing preprocessing method were factors of 2.05 (BerkeleyDBC), 1.17 (7z), and 1.84 (VP9). While we identified the on average best performing preprocessing methods for each model we tested, we also see, as visualized in Table 5, that no single method outperforms all others for each dataset, which holds as well if we only focus on a single model. Thus, having shown both the significant impact and the inconsistency in the performance of preprocessing methods, we draw the following conclusions. Results that do not state which, if any, preprocessing method was employed become hard to reproduce. Further, the disregard of preprocessing methods may pose a threat to the validity of results. In summary, preprocessing methods are a high-impact, low-effort, and inconsistent part of the field of SPL performance prediction, and all these properties make them essential to be considered and tested.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Gong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Deep configuration performance learning: A systematic survey and taxonomy</article-title>
          ,
          <year>2024</year>
          . arXiv:2403.03322.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Clements</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Northrop</surname>
          </string-name>
          , Software product lines, Addison-Wesley Boston,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>E.</given-names>
            <surname>Engström</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Runeson</surname>
          </string-name>
          ,
          <article-title>Software product line testing - a systematic mapping study</article-title>
          ,
          <source>Information and Software Technology</source>
          <volume>53</volume>
          (
          <year>2011</year>
          )
          <fpage>2</fpage>
          -
          <lpage>13</lpage>
          . URL: https://www.sciencedirect.com/science/article/pii/S0950584910001709. doi:10.1016/j.infsof.2010.05.011.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Oh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Batory</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Heradio</surname>
          </string-name>
          ,
          <article-title>Finding near-optimal configurations in colossal spaces with statistical guarantees</article-title>
          ,
          <source>ACM Trans. Softw. Eng. Methodol</source>
          .
          <volume>33</volume>
          (
          <year>2023</year>
          ). URL: https://doi.org/10.1145/3611663. doi:10.1145/3611663.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>N.</given-names>
            <surname>Siegmund</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rosenmuller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Kastner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. G.</given-names>
            <surname>Giarrusso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Apel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          ,
          <article-title>Scalable prediction of non-functional properties in software product lines</article-title>
          ,
          <source>in: 2011 15th International Software Product Line Conference</source>
          , IEEE,
          <year>2011</year>
          , pp.
          <fpage>160</fpage>
          -
          <lpage>169</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M. F.</given-names>
            <surname>Dacrema</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cremonesi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jannach</surname>
          </string-name>
          ,
          <article-title>Are we really making much progress? A worrying analysis of recent neural recommendation approaches,</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>