An extensive comparison of preprocessing methods in the context of configuration space learning

Damian Garber*, Alexander Felfernig, Viet-Man Le, Tamim Burgstaller and Merfat El-Mansi
Graz University of Technology, Inffeldgasse 16b, Graz, Styria, Austria

ConfWS'24: 26th International Workshop on Configuration, September 2–3, 2024, Girona, Spain
* Corresponding author: dgarber@ist.tugraz.at (D. Garber). Further contacts: alexander.felfernig@ist.tugraz.at (A. Felfernig), vietman.le@ist.tugraz.at (V.-M. Le), tamim.burgstaller@ist.tugraz.at (T. Burgstaller), merfat.elmansi@un.org (M. El-Mansi)

Abstract

One of the core goals in the research field of configuration space learning is building precise predictive models that allow for reliably estimating the performance of a configuration without requiring costly tests. The models used for this purpose are usually machine learning-based. However, the models show significant deviations in their performance depending on the investigated Software Product Line (SPL), the applied data preprocessing, and the number of sample configurations collected. Thus, we investigate the impact of different preprocessing methods and their behavior when using different SPLs, machine learning models, and sample sizes. Performance comparisons on this scale are usually not conducted due to their prohibitively expensive execution time requirements, even for smaller SPLs. Thus, we used three fully enumerated spaces as our training data, which allows for more generalized results. Our results show that the average factors between the worst and best-performing preprocessing methods are 2.05 (BerkeleyDBC), 1.17 (7z), and 1.84 (VP9). Further, no single preprocessing method tested was able to outperform all others, nor was this the case within one specific SPL or model type. This underlines the importance of testing new approaches with multiple preprocessing methods.

Keywords: Configuration Space Learning, Machine Learning, Preprocessing

1. Introduction

The discovery of configurations that optimize the performance of any given Software Product Line (SPL) is one of the core goals of configuration space learning. The performance of a model can take many forms and relies heavily on the use case. For instance, one may optimize a SPL to perform a core task very efficiently or optimize for the size of the compiled SPL binary. This optimization usually takes place in steps. The first step is sampling configurations from the configuration space of the SPL and measuring the target property, which often entails compiling and running tests or benchmarks, a very time- and resource-intensive undertaking. One can use these samples to train a prediction model, which is then used to find a configuration that optimizes the target property.

In this paper, we focus on the creation and training of the prediction model. Many factors can impact the performance of a performance prediction model for SPLs, from the SPL itself to the sampling approach used to collect the training data. However, our focus lies on one of the factors often neglected in SPL performance prediction: preprocessing. We define the necessary terms used in this paper in Section 2. Preprocessing has proven itself in many other domains that employ machine learning-based prediction models. However, literature reviews such as Gong and Chen [1] show that less than half of the investigated studies within the field of configuration performance learning use preprocessing, as further discussed in Section 3. Accordingly, we conduct an in-depth investigation of the influence of preprocessing on performance prediction models for SPLs. To this end, we measure the performance of 4 preprocessing methods in the context of 3 SPLs, 5 machine learning models, and 20 different sizes of training sets. We discuss the details of the experimental evaluation in Section 4, followed by a discussion of the results in Section 5.

2. Definitions

Software Product Line (SPL). SPLs as a concept started to gain widespread popularity at the beginning of the 2000s [2]. Engström and Runeson [3] describe SPLs as the paradigm of forming derivative products from a set of generic components. A SPL has multiple features, each supporting an individual domain of values, which allows for the generation of diverse products using the same components.

Configuration. In the context of a SPL, a configuration defines for each feature the corresponding feature value. However, there may exist additional constraints within the SPL. Thus, we speak of a valid configuration if none of the assigned values is inconsistent with any of the constraints.

Configuration Space. The configuration space of a SPL describes the space spanned by all valid configurations of the SPL. The size of configuration spaces commonly grows exponentially with the number of features, and we speak of colossal configuration spaces if the size is ≫ 10^10 [4].
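To make the notion of a valid configuration concrete, the following minimal Python sketch represents a configuration as a feature-to-value mapping and checks it against explicit constraints. The feature names and the constraint shown are hypothetical and serve only as an illustration, not as part of the paper's datasets.

```python
from typing import Callable, Dict, List

Configuration = Dict[str, object]

# Each constraint is a predicate that a valid configuration must satisfy.
constraints: List[Callable[[Configuration], bool]] = [
    # Hypothetical rule: compression may only be enabled on 4K pages.
    lambda c: not c["COMPRESSION"] or c["PAGESIZE"] == "PS4K",
]

def is_valid(config: Configuration) -> bool:
    """A configuration is valid iff it violates none of the constraints."""
    return all(constraint(config) for constraint in constraints)

print(is_valid({"COMPRESSION": True, "PAGESIZE": "PS4K"}))  # True
print(is_valid({"COMPRESSION": True, "PAGESIZE": "PS1K"}))  # False
```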
3. Related Work

The three fully enumerated configuration spaces provided by Oh et al. [4] facilitate the comprehensive performance analysis and comparison we conducted. They use them to show that relatively simple approaches like uniform random sampling can outperform well-established tools like SPL Conqueror [5] in finding near-optimal configurations for SPLs. We build on this idea and conduct a comparison of different preprocessing methods. The reasoning behind conducting this comparison is the alarming result of Gong and Chen [1]. They performed a literature review on deep configuration performance learning and reported that 44 out of 85 investigated studies used the data as it was, without any preprocessing. This limited utilization implies a lack of awareness of the impact of preprocessing methods. This lack of awareness may then aggravate the difficulty of reproducing and validating the results of published works. Dacrema et al. [6], for example, investigated 18 recently published approaches, of which they could only reproduce 7. Of the 7 reproduced approaches, they showed that 6 can be outperformed by relatively simple alternative approaches.

The importance of preprocessing methods in many domains has long since been established. Wu et al. [7], for example, show that preprocessing improves the performance of streamflow forecasts. Rasekhi et al. [8] report improvements in the prediction of epileptic seizures by using preprocessing.

We further include in our tests different sample sizes, which allows us to investigate the reaction of the preprocessing methods to changing sample sizes. Acher et al. [9] sampled and measured 95854 Linux configurations, a minute fraction of the configuration space of 2^15000 (≈ 2.818 × 10^4515) configurations spanned by Linux. They reported having needed 15000 hours of computation time to collect the samples. Guo et al. [10] use tree-based models to predict configuration performances. Martin et al. [11] focus on using transfer learning across different versions of the Linux kernel to predict the performances of the versions. They mention preprocessing only for encoding configurations into formats compatible with their machine-learning approach.

The literature reviews of Gong and Chen [1] and Pereira et al. [12] named multiple data sampling approaches used in configuration performance learning. However, both identified random sampling as the most popular approach. Pereira et al. [13] conducted a dedicated study on sampling approaches for learning configuration spaces. They suggest using uniform random sampling as long as it is computationally feasible. Accordingly, we adopted it for our comparison.
4. Experimental Setup

This section discusses the exact experimental setup for data collection and which machine learning models, preprocessing methods, and datasets we used. All measurements were collected on the same machine, with specifications as listed in Table 1.

Table 1: Specifications of the machine used in the experiments

Name            Value
Vendor          Lenovo
Product         20N6001GGE
CPU             Intel Core i7-8665U (4× 1.90 GHz)
RAM             32 GB (DDR4)
OS              Manjaro Linux
Kernel version  6.1.80-1-MANJARO

We use the Mean Absolute Percentage Error (MAPE) to evaluate the model performances, which is one of the most commonly used metrics in the literature [1, 12, 13]. The code was implemented in Python using the widely used scikit-learn library [14] (https://scikit-learn.org/stable/index.html). We used Uniform Random Sampling (URS) to generate the training sets of different sizes for model learning. We perform URS by selecting configurations randomly from the set of valid configurations. The sizes of the training sets range from 50 to 1000 in steps of 50. Within the same iteration, the tests for all models and preprocessing methods use the same training set of a specific size. After the performance of all models using the preprocessing applied to the training sets is measured, these measurements are repeated 15 times, each time with new training sets selected with URS. The average of the resulting MAPE values over the 15 iterations is the value we use when we discuss the results.
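The protocol above can be summarized in a short sketch. This is not the authors' published code; it assumes the fully enumerated space is available as a NumPy feature matrix X with measured performances y, uses RF as a stand-in for any of the five models, and assumes each trained model is evaluated on all remaining configurations.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_percentage_error

rng = np.random.default_rng(42)

def average_mape(X, y, train_size, repetitions=15):
    """Mean MAPE over `repetitions` uniform random training samples (URS)."""
    scores = []
    for _ in range(repetitions):
        # URS: draw `train_size` configurations from the enumerated space.
        train = rng.choice(len(X), size=train_size, replace=False)
        test = np.setdiff1d(np.arange(len(X)), train)
        model = RandomForestRegressor().fit(X[train], y[train])
        scores.append(mean_absolute_percentage_error(y[test], model.predict(X[test])))
    return float(np.mean(scores))

# Training set sizes from 50 to 1000 in steps of 50, as in the experiments:
# results = {n: average_mape(X, y, n) for n in range(50, 1001, 50)}
```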
4.1. Datasets

For our comparison, we use datasets of fully enumerated configuration spaces. Thus, each dataset includes all valid configurations for a given SPL. In addition, a value representing the performance of each configuration, such as the execution time of a benchmark, was measured. We use three such datasets based on three configurable software projects: BerkeleyDBC (https://www.oracle.com/database/technologies/related/berkeleydb.html), 7z (https://www.7-zip.org/download.html), and VP9 (https://www.webmproject.org/vp9/). The datasets are used in the work of Oh et al. [4] and were made available in their resources (https://zenodo.org/records/7776627). Oh et al. [4] provided the following description of the three datasets.

BerkeleyDBC is an embedded database system with 9 variables and 2560 configurations. Benchmark response times were measured. We list the variable names and their domains in Table 2.

Table 2: BerkeleyDBC variable names and their respective domains

Name              Domain
DIAGNOSTIC        0 | 1
HAVE_STATISTICS   0 | 1
HAVE_REPLICATION  0 | 1
HAVE_CRYPTO       0 | 1
HAVE_SEQUENCE     0 | 1
HAVE_VERIFY       0 | 1
HAVE_HASH         0 | 1
CACHESIZE         CS16MB | CS32MB | CS64MB | CS512MB
PAGESIZE          PS1K | PS4K | PS8K | PS16K | PS32K

7z is a file archiver with 9 variables and 68640 configurations. Compression times were measured. We list the variable names and their domains in Table 3.

Table 3: 7z variable names and their respective domains

Name                  Domain
root                  0 | 1
CompressionMethod     LZMA | LZMA2 | PPMd | BZip2 | Deflate
x                     x_0 | x_2 | x_4 | x_6 | x_8 | x_10
BlockSize             BlockSize_1 | BlockSize_2 | BlockSize_4 | BlockSize_8 | BlockSize_16 | BlockSize_32 | BlockSize_64 | BlockSize_128 | BlockSize_256 | BlockSize_512 | BlockSize_1024 | BlockSize_2048 | BlockSize_4096
Files                 Files_0 | Files_10 | Files_20 | Files_30 | Files_40 | Files_50 | Files_60 | Files_70 | Files_80 | Files_90 | Files_100
tmOff                 0 | 1
mtOff                 0 | 1
HeaderCompressionOff  0 | 1
filterOff             0 | 1

VP9 is a video encoder with 12 variables and 216000 configurations. Video encoding times were measured. We list the variable names and their domains in Table 4.

Table 4: VP9 variable names and their respective domains

Name                      Domain
root                      0 | 1
lagInFrames               lagInFrames_0 | lagInFrames_8 | lagInFrames_16
endUsage                  variableBitrate | constantBitrate | constrainedQuality
AdaptiveQuantizationMode  off | variance | complexity | cyclicRefresh
TileColumns               TileColumns_0 | TileColumns_3 | TileColumns_6
cpuUsed                   cpuUsed_0 | cpuUsed_2 | cpuUsed_4 | cpuUsed_6 | cpuUsed_8
Threads                   Threads_2 | Threads_4 | Threads_6 | Threads_8 | Threads_10
bitRate                   bitRate_300 | bitRate_600 | bitRate_900 | bitRate_1200 | bitRate_1500
FrameBoost                0 | 1
lossless                  0 | 1
AutoAltRef                0 | 1
Quality                   good | realtime

Although these three configuration spaces do not approach the sizes of colossal configuration spaces like Linux, which spans a configuration space with 2^15000 configurations [9], they still have sizes where an enumeration is no longer an option, and thus they fall within the purview of the research field of configuration space learning. Depending on the complexity of the tests and the underlying system, procuring even very few samples may already be very costly. Acher et al. [9], for example, reported 15000 hours of computation to build and measure 95854 Linux configurations. We selected these datasets for two main advantages: the first is avoiding the extreme computation times of collecting such data, and the second is that using them allows us to test multiple iterations of training sets of different sizes.

4.2. Models

We selected five different types of machine-learning models, each representing a different general approach, to maximize the usefulness of our results. In our implementation, we used models from the scikit-learn Python library [14]. For the sake of reproducibility, we did not perform any parameter tuning on the models and used their respective default settings if not explicitly stated otherwise.

The first model is a Multi-Layer Perceptron (MLP), a feedforward neural network approach. We set the maximum number of iterations to 1000 and activated early stopping for our tests. MLPs are, according to Gong and Chen [1], the most popular approach when conducting deep configuration performance learning. However, the MLP is a very data-intensive approach that needs comparatively large training sets to perform well.

The second model is a K-Nearest Neighbors (KNN) model, a memory-based approach. The model finds the k nearest neighbors of a configuration in the training set, in our case the default value of k = 5. The KNN model predicts the performance of the configuration by calculating the average of the performances of the configuration's k nearest neighbors. In our case, the average was weighted by the distance between the neighbor and the configuration.

The third model is a Random Forest (RF), an ensemble method employing several decision trees generated from the training data to predict the performance of an unknown data point. We use bagged trees, which means we train all underlying decision trees to solve the problem using all features. In the context of classification, the final result is decided based on a majority vote. However, in our context of regression, we calculate the final result by taking the mean of all results produced by the decision trees.

The fourth model is a Support Vector Machine (SVM), a well-established model based on statistical learning frameworks. We use a radial basis function as our kernel type.

The final model is an ElasticNet (EN) model, a derivative of linear regression models. The model combines L1 and L2 priors as a regularizer.
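As a reference, the five models could be instantiated as follows. This is a minimal sketch, assuming the scikit-learn regressor variants of each model type; beyond the settings stated above, all parameters remain at their defaults.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import ElasticNet
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR

models = {
    "MLP": MLPRegressor(max_iter=1000, early_stopping=True),
    "KNN": KNeighborsRegressor(n_neighbors=5, weights="distance"),
    "RF":  RandomForestRegressor(),  # regression: mean over all bagged trees
    "SVM": SVR(kernel="rbf"),        # radial basis function kernel
    "EN":  ElasticNet(),             # combined L1/L2 regularization
}
```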
4.3. Preprocessing

We used several preprocessing methods to test their impact on the different models and training sizes. For the sake of this comparison, we do not distinguish between actual preprocessing methods like Standardization and encodings such as One Hot Encoding.

The first preprocessing method, which we call default (DEF), provides a baseline of mostly unaltered data and leaves numeric values untouched. Boolean values are, however, encoded with 0 and 1 for false and true, respectively. If all values of a domain can be converted into numbers, this is done (e.g., CS32MB = 32, Table 2). If this is not possible, the string values are encoded using label encoding [15, 16], which assigns an increasing numeric value to each unique string in a domain. This format is the default state of the data. Thus, we apply all preprocessing methods mentioned hereafter to the data in this format.

The second preprocessing method is Min-Max Scaling (MMS) [17, 18], which reduces the scale of a given feature to lie between 0 and 1. We achieve this by applying Equation (1) to every feature i of the configuration, where min_i and max_i are the minimum and maximum recorded values for this feature, respectively. When we apply this to features encoded using label encoding, the result is a derivative of the former called scaled label encoding [19, 20].

    f_i(x_i) = (x_i − min_i) / (max_i − min_i)    (1)

The third preprocessing method is Standardization (STD) [21, 22], which is achieved by calculating the mean μ_i and standard deviation σ_i of each feature and applying Equation (2).

    f_i(x_i) = (x_i − μ_i) / σ_i    (2)

This results in the mean of every feature in the training set being 0 and the standard deviation being 1.

The final preprocessing method is One Hot Encoding (OHE) [23, 24], which changes the domains of all features to boolean domains. We achieve this by increasing the dimensionality of the data through an encoding of the domain. If, for example, feature f has the domain {0, 6, 12}, it is replaced with the features f_0, f_6, and f_12; each of the three resulting boolean features is mutually exclusive and encodes one possible value assigned to feature f.
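The four data representations can be approximated with standard scikit-learn transformers. This is one plausible realization, not the authors' code: the DataFrame and its values are invented for illustration, and DEF's numeric extraction (e.g., CS32MB → 32) is omitted for brevity.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, OrdinalEncoder, StandardScaler

raw = pd.DataFrame({"CACHESIZE": ["CS16MB", "CS32MB", "CS64MB"],
                    "HAVE_HASH": [False, True, True]})

# DEF: booleans become 0/1, strings fall back to label encoding
# (here via OrdinalEncoder, which numbers the values of each domain).
default = OrdinalEncoder().fit_transform(raw)

# MMS: Equation (1) per feature; applied to label-encoded data this
# yields the scaled label encoding mentioned above.
min_max = MinMaxScaler().fit_transform(default)

# STD: Equation (2); zero mean and unit standard deviation per feature.
standardized = StandardScaler().fit_transform(default)

# OHE: one mutually exclusive boolean column per value in each domain.
one_hot = OneHotEncoder().fit_transform(raw).toarray()
```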
5. Results

In this section, we showcase the measurements collected as described in the experimental setup section and discuss them. To this end, we discuss the results of each dataset separately, along with the observations we made.

[Figure 1: The performances of each preprocessing method applied to each model, and a comparison between the top performers of each model applied to the BerkeleyDBC dataset.]

Firstly, we start with a discussion of our smallest SPL, BerkeleyDBC. All performance results are visualized in Figure 1. In the results for MLP, we see that OHE performs best among all preprocessing methods regardless of sample size. However, we can also see a shift in the performances of the preprocessing approaches. STD started as the worst-performing preprocessing method. Despite that, with increasing sample size, it outperformed MMS and DEF. Accordingly, the results of STD approached the results of the best performer, OHE, for the larger sample sizes. However, when looking at Figure 2 and Figure 3, we see that this behavior does not occur in the other datasets; rather, in contrast to other approaches, the results of STD remain either relatively stable or improve with increasing sample sizes. We see a similar behavior on a smaller scale with MMS and DEF. MMS performed initially worse than DEF, overtaking it as soon as sample sizes became larger than 200 and achieving similar results from sample sizes of 500 and larger. The other models, in contrast, showed more pronounced preferences for preprocessing methods. For the KNN model, DEF was the best-performing preprocessing method, followed by STD, MMS, and OHE. RF showed the best performance of all models, with almost indistinguishable differences of 0.01% between the preprocessing methods on average. For the SVM model, DEF performed the worst, with comparatively little improvement at larger sample sizes. The remaining preprocessing methods (from worst to best: MMS, STD, and OHE) show relatively similar results, improving with increasing sample sizes. For the EN model, OHE performs best, and the remaining three (from worst to best: MMS, DEF, and STD) show relatively similar results. The sample size has a comparatively small impact on the performances: results with sample sizes of 200 and larger show only a minor oscillation and remain otherwise stable. The comparison between models shows that RF outperforms the other models. All models except EN show significantly better performances with larger sample sizes. However, MLP was impacted the most by the sample size. It overtook the performance of SVM at a sample size of 450 and EN at 900.

[Figure 2: The performances of each preprocessing method applied to each model, and a comparison between the top performers of each model applied to the 7z dataset.]

Secondly, we discuss the next larger SPL, 7z. We provide the results for this dataset in Figure 2. The first observation we can make is that the overall quality of the predicted results decreased. This matches our expectations, since we are predicting the performance of a larger SPL using an equivalent setup. Another observation is that three of the five tested models showed strong oscillations in their performances or, for some preprocessing methods, a worsening of the performance with increasing sample size. The MLP model, for example, showed for the best and second-best performing preprocessing methods, MMS and STD, respectively, no significant changes with increasing sample sizes. DEF and OHE meanwhile showed a decrease in performance with increasing sample sizes. The EN and SVM models showed strong oscillations with increasing sample sizes. However, the performances of the preprocessing methods all follow the same pattern, which suggests that the cause may lie in the model or the SPL rather than the preprocessing approaches. In the case of the SVM, the performances remained very similar. The preprocessing performances in the EN model follow the same oscillation pattern while being displaced by a relatively constant margin along the y-axis, with STD performing best. The KNN model performed as expected, showing constant improvements with increasing sample sizes for all preprocessing methods. However, it is notable that the best performer on the BerkeleyDBC dataset, DEF, now performs the worst, with the former second-best performer STD taking its place as the best performer. The RF model remains the best performer by a significant margin. The preprocessing performances are again very similar, but STD performs significantly better for the smallest tested sample size, thus outperforming the others.
[Figure 3: The performances of each preprocessing method applied to each model, and a comparison between the top performers of each model applied to the VP9 dataset.]

Thirdly, we discuss the largest SPL we investigated, VP9. The results collected for VP9 are shown in Figure 3. Our first observation is that the results for VP9 are closer to the results from BerkeleyDBC. There are again some oscillations in the results of MLP, but they are comparatively minor and show a clear trend toward improvement with increasing sample sizes. We see again that the performance of the EN model remains unaffected by increasing sample sizes, except for some minor oscillations. The SVM model shows a similar pattern as it did with the BerkeleyDBC dataset. The DEF preprocessing method once more performs the worst and is the only method with no significant improvement with increasing sample size. KNN is once more consistent, showing stable improvement with increasing sample size, with STD again performing best. The RF model again performs best by a significant margin. The preprocessing methods have little impact on its performance, but some improve the prediction performance earlier, the best performing being MMS.

Finally, we discuss the results and our observations in general. To this end, we provide the average performances of all models and preprocessing approaches in Table 5.

Table 5: Average MAPE value over all tested sample sizes from 50 to 1000 in steps of 50

Model  Preprocessing  BerkeleyDBC  7z       VP9
MLP    DEF            25.68%       114.76%  273.94%
MLP    MMS            26.29%       99.97%   144.24%
MLP    STD            31.12%       99.98%   129.80%
MLP    OHE            10.27%       104.41%  117.11%
KNN    DEF            1.44%        119.01%  182.27%
KNN    MMS            4.29%        82.30%   158.79%
KNN    STD            2.85%        78.42%   124.31%
KNN    OHE            4.47%        79.99%   241.55%
RF     DEF            0.54%        9.51%    15.14%
RF     MMS            0.55%        9.54%    14.97%
RF     STD            0.55%        9.50%    15.23%
RF     OHE            0.54%        10.82%   18.32%
SVM    DEF            6.44%        91.46%   249.59%
SVM    MMS            6.07%        91.33%   101.45%
SVM    STD            6.04%        91.29%   100.35%
SVM    OHE            6.03%        91.36%   116.08%
EN     DEF            5.19%        176.36%  246.24%
EN     MMS            5.31%        177.54%  273.19%
EN     STD            5.09%        169.40%  267.77%
EN     OHE            2.54%        173.59%  226.60%

The first observation is that preprocessing methods have a significant impact on the prediction performances. BerkeleyDBC has an average factor of 2.05 between the best and worst-performing preprocessing methods. In comparison, 7z and VP9 have average factors of 1.17 and 1.84, respectively. This observation holds for all tested models, even for the best performer, RF. However, RF shows this impact only with the larger SPLs like 7z and VP9. In general, the differences are largest at low sample sizes and become smaller with increasing sample sizes. We observe a similar situation with the SVM model, except for DEF, which was largely unsuited: DEF performed the worst in two out of three tested SPLs, showing insignificant improvement with increasing sample sizes. For MLP, KNN, and EN, on the other hand, we can see significant performance differences at every sample size tested, with, in general, more pronounced differences at smaller sample sizes.

We also observe multiple occasions where a misrepresentation of performances could occur when conducting tests with only one preprocessing method. For instance, one could conclude that SVMs outperform MLPs on the BerkeleyDBC dataset for sample sizes smaller than or equal to 1000 when conducting tests only with DEF or MMS. However, when testing with STD or OHE, we see that MLP outperforms SVMs on the BerkeleyDBC dataset for sample sizes greater than 650 or 400, respectively. From this, we conclude that a sound comparison between two or more predictive models should compare their performances when using their best-performing preprocessing methods. Omitting the preprocessing method used may, by extension, lead to poorly reproducible results.
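The worst-to-best factors reported above can be reconstructed from Table 5. Assuming the factor is the ratio of the worst to the best average MAPE per model, averaged over the five models (an inferred aggregation, though it matches the reported numbers), the BerkeleyDBC column yields roughly 2.05:

```python
# Per-model average MAPE on BerkeleyDBC, copied from Table 5.
avg_mape = {
    "MLP": {"DEF": 25.68, "MMS": 26.29, "STD": 31.12, "OHE": 10.27},
    "KNN": {"DEF": 1.44, "MMS": 4.29, "STD": 2.85, "OHE": 4.47},
    "RF":  {"DEF": 0.54, "MMS": 0.55, "STD": 0.55, "OHE": 0.54},
    "SVM": {"DEF": 6.44, "MMS": 6.07, "STD": 6.04, "OHE": 6.03},
    "EN":  {"DEF": 5.19, "MMS": 5.31, "STD": 5.09, "OHE": 2.54},
}
# Worst/best ratio per model, then averaged over all models.
factors = [max(m.values()) / min(m.values()) for m in avg_mape.values()]
print(round(sum(factors) / len(factors), 2))  # -> 2.06 on the rounded
# table values; the reported 2.05 presumably stems from unrounded data.
```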
MLP worked best on average with OHE. The performance of this model strongly correlated with the sample size; it usually started with comparatively high MAPE scores that became more competitive with increasing sample sizes. Furthermore, it is prone to oscillation. KNN worked best on average with STD. It was one of the most stable and robust models, achieving constant improvement with increasing sample sizes, even in the context of SPLs like 7z that triggered oscillation in most other models. However, its prediction quality places it in the middle field. RF worked best on average with MMS. This model significantly outperformed every other model in every aspect we measured. Its worst performance using the smallest tested sample size of 50 outperforms, in all but two cases, the best performances of all other models. This performance then improved further with increasing sample size. The model usually reaches a plateau relatively early, on average at a sample size of 350, after which its improvement slows significantly. SVM worked best on average with STD. Like RF, the model improves steadily up to a sample size of around 600, after which its improvement starts to plateau, except for the already mentioned DEF. EN worked best on average with OHE. This model showed, on average, comparatively minor improvements with increased sample size.

6. Threats to validity

This paper compared multiple machine learning-based models and explicitly did not perform any parameter tuning for any of the models. Unless explicitly stated otherwise, we always used the default parameters defined by the scikit-learn library [14]. Thus, we must acknowledge that fine-tuning the model parameters, especially for the more complex models like MLP, would likely improve the performances of the models employed. However, the measured results are still valid and valuable for comparing the model performances with respect to the preprocessing methods and the sizes of the training sets employed.

7. Conclusion

We tested 15 scenarios of machine learning-based performance prediction in the context of SPLs by measuring the performance of five different machine learning models on three SPLs with training sets of increasing sizes. Except for two, all scenarios tested showed, in part, radical changes in prediction quality depending on the preprocessing method used. These changes were most pronounced when we measured the model performances with only a few samples as training sets and became less distinctive with training sets of increased size. On average, the disparity between the worst and the best performing preprocessing method amounted to factors of 2.05 (BerkeleyDBC), 1.17 (7z), and 1.84 (VP9). While we identified the on average best performing preprocessing methods for each model we tested, we also see, as visualized in Table 5, that no single method outperforms all others for each dataset, which holds as well if we only focus on a single model. Thus, having shown both the significant impact and the inconsistency in the performance of preprocessing methods, we draw the following conclusions. Results that do not state which, if any, preprocessing method was employed become hard to reproduce. Further, the disregard of preprocessing methods may pose a threat to the validity of results. In summary, preprocessing methods are a high-impact, low-effort, and inconsistent part of the field of SPL performance prediction, and all these properties make them essential to consider and test.

References

[1] J. Gong, T. Chen, Deep configuration performance learning: A systematic survey and taxonomy, 2024. arXiv:2403.03322.

[2] P. Clements, L. Northrop, Software Product Lines, Addison-Wesley, Boston, 2002.

[3] E. Engström, P. Runeson, Software product line testing – a systematic mapping study, Information and Software Technology 53 (2011) 2–13. URL: https://www.sciencedirect.com/science/article/pii/S0950584910001709. doi:10.1016/j.infsof.2010.05.011.

[4] J. Oh, D. Batory, R. Heradio, Finding near-optimal configurations in colossal spaces with statistical guarantees, ACM Trans. Softw. Eng. Methodol. 33 (2023). URL: https://doi.org/10.1145/3611663. doi:10.1145/3611663.

[5] N. Siegmund, M. Rosenmuller, C. Kastner, P. G. Giarrusso, S. Apel, S. S. Kolesnikov, Scalable prediction of non-functional properties in software product lines, in: 2011 15th International Software Product Line Conference, IEEE, 2011, pp. 160–169.

[6] M. F. Dacrema, P. Cremonesi, D. Jannach, Are we really making much progress? A worrying analysis of recent neural recommendation approaches, CoRR abs/1907.06902 (2019). URL: http://arxiv.org/abs/1907.06902. arXiv:1907.06902.

[7] C.-L. Wu, K.-W. Chau, Y.-S. Li, Predicting monthly streamflow using data-driven models coupled with data-preprocessing techniques, Water Resources Research 45 (2009).
[8] J. Rasekhi, M. R. K. Mollaei, M. Bandarabadi, C. A. Teixeira, A. Dourado, Preprocessing effects of 22 linear univariate features on the performance of seizure prediction methods, Journal of Neuroscience Methods 217 (2013) 9–16. URL: https://www.sciencedirect.com/science/article/pii/S0165027013001246. doi:10.1016/j.jneumeth.2013.03.019.

[9] M. Acher, H. Martin, J. A. Pereira, A. Blouin, J.-M. Jézéquel, D. E. Khelladi, L. Lesoil, O. Barais, Learning very large configuration spaces: What matters for Linux kernel sizes, Ph.D. thesis, Inria Rennes-Bretagne Atlantique, 2019.

[10] J. Guo, D. Yang, N. Siegmund, S. Apel, A. Sarkar, P. Valov, K. Czarnecki, A. Wasowski, H. Yu, Data-efficient performance learning for configurable systems, Empirical Software Engineering 23 (2018) 1826–1867.

[11] H. Martin, M. Acher, J. A. Pereira, L. Lesoil, J.-M. Jézéquel, D. E. Khelladi, Transfer learning across variants and versions: The case of Linux kernel size, IEEE Transactions on Software Engineering 48 (2022) 4274–4290. doi:10.1109/TSE.2021.3116768.

[12] J. A. Pereira, M. Acher, H. Martin, J. Jézéquel, G. Botterweck, A. Ventresque, Learning software configuration spaces: A systematic literature review, Journal of Systems and Software 182 (2021) 111044.

[13] J. Alves Pereira, M. Acher, H. Martin, J.-M. Jézéquel, Sampling effect on performance prediction of configurable systems: A case study, in: Proceedings of the ACM/SPEC International Conference on Performance Engineering, ICPE '20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 277–288. URL: https://doi.org/10.1145/3358960.3379137. doi:10.1145/3358960.3379137.

[14] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.

[15] M. Acher, H. Martin, L. Lesoil, A. Blouin, J.-M. Jézéquel, D. E. Khelladi, O. Barais, J. A. Pereira, Feature subset selection for learning huge configuration spaces: the case of Linux kernel size, in: Proceedings of the 26th ACM International Systems and Software Product Line Conference - Volume A, SPLC '22, Association for Computing Machinery, New York, NY, USA, 2022, pp. 85–96. URL: https://doi.org/10.1145/3546932.3546997. doi:10.1145/3546932.3546997.

[16] L. Bao, X. Liu, F. Wang, B. Fang, ACTGAN: Automatic configuration tuning for software systems with generative adversarial networks, in: 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2019, pp. 465–476. doi:10.1109/ASE.2019.00051.
[17] J. Gong, T. Chen, Predicting software performance with divide-and-learn, in: Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2023, Association for Computing Machinery, New York, NY, USA, 2023, pp. 858–870. URL: https://doi.org/10.1145/3611643.3616334. doi:10.1145/3611643.3616334.

[18] S. Fu, S. Gupta, R. Mittal, S. Ratnasamy, On the use of ML for blackbox system performance prediction, in: 18th USENIX Symposium on Networked Systems Design and Implementation (NSDI 21), USENIX Association, 2021, pp. 763–784. URL: https://www.usenix.org/conference/nsdi21/presentation/fu.

[19] Q. Cao, M.-O. Pun, Y. Chen, Deep learning in network-level performance prediction using cross-layer information, IEEE Transactions on Network Science and Engineering 9 (2022) 2364–2377. doi:10.1109/TNSE.2022.3163274.

[20] J. Cheng, C. Gao, Z. Zheng, HINNPerf: Hierarchical interaction neural network for performance prediction of configurable systems, ACM Trans. Softw. Eng. Methodol. 32 (2023). URL: https://doi.org/10.1145/3528100. doi:10.1145/3528100.

[21] K. Zhu, S. Ying, N. Zhang, D. Zhu, Software defect prediction based on enhanced metaheuristic feature selection optimization and a hybrid deep neural network, Journal of Systems and Software 180 (2021) 111026. URL: https://www.sciencedirect.com/science/article/pii/S0164121221001230. doi:10.1016/j.jss.2021.111026.

[22] D. Nemirovsky, T. Arkose, N. Markovic, M. Nemirovsky, O. Unsal, A. Cristal, A machine learning approach for performance prediction and scheduling on heterogeneous CPUs, in: 2017 29th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), 2017, pp. 121–128. doi:10.1109/SBAC-PAD.2017.23.

[23] K.-T. Ding, H.-S. Chen, Y.-L. Pan, H.-H. Chen, Y.-C. Lin, S.-H. Hung, Portable fast platform-aware neural architecture search for edge/mobile computing AI applications, ICSEA 2021 (2021) 108.

[24] Y. Gao, X. Gu, H. Zhang, H. Lin, M. Yang, Runtime performance prediction for deep learning models with graph neural network, in: 2023 IEEE/ACM 45th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), 2023, pp. 368–380. doi:10.1109/ICSE-SEIP58684.2023.00039.