=Paper=
{{Paper
|id=Vol-3135/darliap_paper11
|storemode=property
|title=Predicting job execution time on a high-performance computing cluster using a hierarchical data-driven methodology
|pdfUrl=https://ceur-ws.org/Vol-3135/darliap_paper11.pdf
|volume=Vol-3135
|authors=Paolo Bethaz,Bartolomeo Vacchetti,Enrica Capitelli,Vladi Nosenzo,Luca Chiosso,Tania Cerquitelli
|dblpUrl=https://dblp.org/rec/conf/edbt/BethazVCNCC22
}}
==Predicting job execution time on a high-performance computing cluster using a hierarchical data-driven methodology==
Paolo Bethaz¹, Bartolomeo Vacchetti¹, Enrica Capitelli², Vladi Nosenzo², Luca Chiosso³ and Tania Cerquitelli¹

¹ Department of Control and Computer Engineering, Politecnico di Torino, Italy
² Iveco Group, Torino, Italy
³ NPO Torino Srl, Torino, Italy
Abstract

Nowadays, evaluating the performance of a vehicle before the production phase is both challenging and important. In the automotive industry, many virtual simulations are needed to model the vehicle behavior in the best possible way. However, these simulations require a lot of time, and the user does not know their runtime in advance. Knowing the required time in advance would allow the user to manage the simulations more effectively and to choose the best strategy for using the available computational resources. For this reason, we present an innovative, hierarchical data-driven method to estimate the execution time of these jobs in advance. Our approach integrates unsupervised techniques, such as constrained k-means clustering, with classification and regression algorithms based on tree structures. Numerous experiments were conducted on a real dataset to verify the effectiveness of the proposed approach, and the experimental results show that the proposed method is promising.

Keywords

Execution-time prediction, data-driven model, hierarchical model
Published in the Workshop Proceedings of the EDBT/ICDT 2022 Joint Conference (March 29-April 1, 2022), Edinburgh, UK.
Contacts: paolo.bethaz@polito.it (P. Bethaz); bartolomeo.vacchetti@polito.it (B. Vacchetti); enrica.capitelli@external.cnhind.com (E. Capitelli); vladi.nosenzo@ivecogroup.com (V. Nosenzo); luca.chiosso@external.nposervices.com (L. Chiosso); tania.cerquitelli@polito.it (T. Cerquitelli)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

Today, more and more manufacturing industries rely on either online data centers or physical HPC clusters to run a large variety of tasks that perform analyses and simulations. These tasks, or jobs, range from simulating individual mechanical components to analyzing entire manufacturing processes. In this way, it is possible to shorten the time to market of the final product while reducing the number of errors made during the process. However, the execution of these jobs often requires resources that may not be immediately available, thus delaying the job execution and increasing the time needed to obtain the final results. In this paper, we present a data-driven methodology for predicting the jobs' execution time. We focus on predicting the execution time of simulations and analyses because it directly affects the waiting time of other jobs before they are submitted, and the problem of unknown waiting time can lead to wasted cluster resources. Since the number of analyses and simulations that must be performed for a single product is considerable, it is important to find a method that avoids wasting cluster resources and increasing delays. This issue is relevant in the context of software application development for the industrial domain, and we address it with an innovative methodology based on data analysis and machine learning techniques. To predict the execution time of jobs, we take into account not only the HPC resources required by the different jobs, but also the settings of the different solvers, i.e., the various kinds of software used for analysis and simulation. Each simulation is characterized by a series of parameters (specific to each solver) that describe its configuration. These parameters are entered manually by the user when the job is submitted and are then extracted automatically from the server used to run the simulations. The proposed approach is based on a hierarchical classification model. Our methodology relies on three separate models. The first model performs a preliminary binary classification and divides the data accordingly. The other two models then classify the two resulting portions of the data, one portion per model. In this way we can classify the data into four different classes while reducing the task from a multiclass problem to a binary one at each step.
The rest of the paper is organized as follows. Section 2 reviews the literature related to HPC, from resource allocation to runtime prediction. Section 3 describes the proposed methodology, while Section 4 presents our results so far. Finally, in Section 5, we discuss our methods and future steps to continue this research.

2. Literature Review

There are several approaches and studies investigating how to improve HPC resource allocation. Research activities can be classified as: (i) predicting job failures, (ii) developing scheduling algorithms based on machine learning techniques, and (iii) using simulation execution time estimation to predict the waiting time of jobs that have yet to be submitted.

The first research strand's goal is to estimate whether a specific job will fail or complete its execution. The intent is to stop prematurely those jobs that the algorithms predict as failures [8, 6, 9]. By learning which jobs are more likely to fail and stopping them, the HPC is able to improve its performance while saving energy that would otherwise be wasted [12, 7]. However, this type of approach requires a lot of data in order to identify a pattern between the attributes and the target variables. The most significant variables can be extracted in different ways, either through some feature engineering process or from different databases. Even so, parsing and transformation operations have proven extremely useful for obtaining better prediction results [5]. The issue here is that some failures are rare, hence not easy to predict, yet they still consume a lot of resources [10]. For example, Liu et al. [8] integrated two algorithms to estimate whether a job fails or not. The first algorithm is a clustering one: it measures the correlation among jobs from different contexts. The other one is a multitask learning algorithm trained on correlated jobs. Since our data is not sufficient to achieve meaningful results, in this paper we do not tackle this problem.

Another research approach focuses on the optimization of the available resources performed by a scheduler. After analyzing the behavior of successful and failed jobs, Jassas et al. [6] investigated scheduling algorithms intended to optimize the reliability and availability of cloud applications. Since the number of resources cannot be unlimited, large-scale HPC clusters exploit waiting queues. However, it has been proved by Nurmi et al. [11] that, regardless of the performance of a resource scheduler, one of the main factors that impacts the prediction efficiency is the amount of time that a job has to wait before being submitted. Following this idea, the authors in [2] also try to predict the waiting time of a job using a hierarchical classification approach. However, the estimation of the waiting time is impacted by many factors, such as HPC specifics. Technical specifics aside, one factor that impacts the waiting time of every simulation is the execution time of the jobs that are running on the HPC. Some studies have focused on this approach, such as [4, 14]. While we agree on the centrality of the execution time, we have adopted a different strategy compared to the studies mentioned before. As a matter of fact, we use pretty much the same toolbox, i.e., clustering for data preprocessing and classification and/or regression to estimate the execution time; however, as far as we know, we differentiate ourselves from previous work through the use of a hierarchical approach, which will be explained in detail in the following section.

3. Data-driven methodology

The building blocks of the proposed methodology, shown in Figure 1, are as follows: i) data cleaning, ii) model building and iii) model evaluation. Each of these steps is described in the related subsection.

After the data collection phase, the generated dataset contains a record for each submitted job. These jobs may have been executed by different solvers. Here, with solver we mean the software used to run the simulation, each of which is characterized by different model variables. Due to the different parameters that characterize each solver, and due to the very different execution times between solvers, we decided to consider only one solver at a time, thus avoiding working with a dataset that is too sparse.

Figure 1: Schema of the proposed methodology

3.1. Data cleaning

Since the parameters characterizing each job are entered manually by the user during the job submission phase, it is essential to check the correctness of the available data before using them for subsequent analysis.

The data cleaning phase started with a collaboration with domain experts, who, thanks to their knowledge, helped us define the admissibility ranges for each collected variable, including the execution time. This phase helped us to better understand the available data.
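As an illustration, the admissibility screening described above can be sketched as a plain range filter. The variable names and bounds below are hypothetical placeholders, not the actual parameters collected from the PBS server:

```python
import pandas as pd

# Hypothetical admissibility ranges, as a domain expert might provide them:
# variable name -> (min admissible value, max admissible value)
ADMISSIBLE = {
    "n_cores": (1, 512),          # requested CPU cores
    "memory_gb": (0.5, 1024.0),   # requested memory
    "exec_time_s": (1, 604800),   # execution time: 1 second to 7 days
}

def drop_inadmissible(jobs: pd.DataFrame) -> pd.DataFrame:
    """Keep only jobs whose parameters fall inside every admissibility range."""
    mask = pd.Series(True, index=jobs.index)
    for col, (lo, hi) in ADMISSIBLE.items():
        mask &= jobs[col].between(lo, hi)
    return jobs[mask]

jobs = pd.DataFrame({
    "n_cores": [4, 0, 16],             # 0 cores is inadmissible
    "memory_gb": [8.0, 4.0, 2048.0],   # 2048 GB exceeds the range
    "exec_time_s": [120, 300, 90],
})
clean = drop_inadmissible(jobs)        # only the first job survives
```

Such a filter only enforces expert-defined bounds; the statistical thresholds discussed next refine it for the execution time.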
This analysis also led us to the decision of eliminating all jobs associated with anomalous values of execution time. Specifically, all jobs with an execution time that was too low or too high compared to the execution times of other jobs run by the same solver were eliminated. To define when an execution time should be considered too high or too low, we tried the following three approaches to define appropriate thresholds, beyond which a point must be considered an outlier [15]:

Interquartile Range (IQR): the IQR is the difference between the values ranking at 25% (Q1) and 75% (Q3) of a data set. The IQR thresholding strategy calculates the thresholds as follows:

• min threshold: Q1 - (1.5*IQR)
• max threshold: Q3 + (1.5*IQR)

95th centile: the minimum threshold is set by taking the value ranking at 5% and the maximum threshold is set by taking the value ranking at 95%;

99th centile: the strategy is identical to that described in the previous point, but here the thresholds are defined so that fewer outliers are identified: the minimum threshold is set by taking the value ranking at 1% and the maximum threshold is set by taking the value ranking at 99%.

We compared the number of jobs labeled as outliers by each of the 3 approaches, and in subsequent experiments we evaluated how the performance of a predictive model varies depending on the preprocessing used.

3.2. Model Building

The task of our model is to predict the execution time of a simulation running on an HPC cluster. Since the goal is to predict a time, which is a continuous value, this could be tagged as a regression task. However, treating this task as a regression one led to poor experimental results. This behavior can be justified by the fact that the data collection phase is quite recent, so the data available at the moment are not numerous. For this reason, we decided to treat it as a classification problem; this was made possible by categorizing the available runtimes, dividing them into classes representing contiguous time intervals. After this categorization, a classification algorithm can then try to predict in which range of values the execution time of the analyzed job will fall. In other words, we have a multiclass problem. However, due to the scarce amount of data in our possession, the performance of our initial model was not sufficient: when the amount of data is limited, it is difficult for a classification algorithm to make good predictions in a multiclass context. On the other hand, we did not want to oversimplify the problem by reducing the number of classes considered. Thus, by implementing a hierarchical classification approach, we were able to improve the goodness of the predictions without sacrificing too much detail. The good results obtained with the hierarchical classification approach pushed us to also try a mix of classification and regression algorithms. However, these results, while in some cases better than the plain regression approach, were not satisfying. This led us to choose the hierarchical classification approach. The following subsections describe in more detail the structures, algorithms and techniques that were used for both the regression and the classification approaches. Because the amount of data in our possession is limited, we decided to rely on the XGBoost method for both the mixed regression and the classification approach. XGBoost is a tree model that relies on the Extreme Gradient Boosting technique [3].

3.2.1. Classification Approach

Unlike a single-level classification, where the model is trained only once on all available data, the hierarchical approach uses a binary tree structure in which each node of the tree corresponds to a binary classification, where a model predicts to which of the two classes the job belongs. The depth of the tree depends on the total number of classes that we want to obtain (each of which represents a temporal range of values). The hierarchical binary structure we used in this methodology has two levels of depth, to which correspond 3 predictive models (nodes) and 4 total classes (leaves). Solutions with different depths have also been tested experimentally, but the one with four classes demonstrated the best compromise between good model performance and sufficiently detailed classes. An example of how the structure looks is reported in Figure 2. In this way every classifier has to deal with a binary classification problem, but overall the classes considered are four.

Figure 2: Hierarchical Structure Schema

From the figure, it is evident that the key to the entire structure is the three conditions that determine into which sub-branch the obtained prediction must go. To each of these conditions corresponds a subdivision of the dataset (or of a portion of it) into two classes that represent different time intervals of the execution time of the jobs.
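A minimal sketch of this two-level binary hierarchy, on synthetic data and with scikit-learn's gradient boosting standing in for XGBoost; the features and time thresholds are illustrative, not the paper's actual configuration:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))            # synthetic job features
runtime = np.exp(X[:, 0]) * 60           # synthetic execution times (seconds)

# Illustrative time thresholds splitting runtimes into 4 contiguous classes.
t_root, t_left, t_right = np.percentile(runtime, [50, 25, 75])

# Root node: binary split between the "fast" and "slow" halves.
y_root = (runtime > t_root).astype(int)
root = GradientBoostingClassifier().fit(X, y_root)

# Second level: one binary classifier per sub-branch, trained only on its portion.
fast, slow = runtime <= t_root, runtime > t_root
left = GradientBoostingClassifier().fit(X[fast], (runtime[fast] > t_left).astype(int))
right = GradientBoostingClassifier().fit(X[slow], (runtime[slow] > t_right).astype(int))

def predict_class(x):
    """Route a job through the tree; returns one of 4 leaf classes (0..3)."""
    x = x.reshape(1, -1)
    if root.predict(x)[0] == 0:          # predicted fast half
        return 0 + left.predict(x)[0]    # class 0 or 1
    return 2 + right.predict(x)[0]       # class 2 or 3

pred = predict_class(X[0])
```

Each of the three models thus faces only a binary problem, while the routing produces four leaf classes overall.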
The overall performance of the classification method is greatly influenced by the identified thresholds, so it is important to define them as well as possible. To do this, two different approaches have been tested:

• balanced approach: in each division of the identified hierarchical structure, the available data is divided into two classes, each containing the same number of jobs. This technique prevents any unbalanced-class problem, allowing the predictive model to be trained on balanced classes;
• k-means approach: the classes to use for training a predictive model in each node of the structure are chosen automatically through a clustering algorithm. In particular, the k-means algorithm is used, with K=2. In addition, to prevent the proposed solution from leading to an unbalanced-classes problem, we used a constrained version of the k-means algorithm, in which a minimum size for each cluster can be specified. In particular, the constraint we used here is that each identified class had to contain at least 40% of the total jobs.

3.2.2. Mixed Approach

Several regression algorithms (XGBoost [3], RandomForest [1], Lasso [13]) were tested and compared with each other, evaluating their performance based on the R2 value obtained. Lasso Regression is a regression analysis method that enhances its prediction accuracy by combining variable selection and regularization techniques, while RandomForest and XGBoost are both tree-based algorithms exploiting several decision trees, differing from each other in how the trees are constructed and how the results are combined.

Using a regression algorithm has the advantage of yielding a punctual value of the estimated runtime. However, due to the limited availability of initial data, a predictive model built on a single level is not very robust. For this reason, we decided to test a mixed hierarchical regression approach. With mixed approach we mean a combination of classification and regression: at the first prediction layer we have a classifier, while at the second prediction layer we use two regressors. The two regressors at the second prediction layer thus have to estimate the execution time of jobs that belong to two different time intervals, one interval for each model. In this way we simplified the problem while keeping the final prediction as a continuous value. Figure 3 shows the scheme of the proposed mixed approach. Regarding the techniques used to obtain the two classes in the first level of classification, both approaches described in the previous classification case were tested.

Figure 3: Mixed Approach Structure

3.3. Model Evaluation

For the evaluation of every algorithm used in all the performed experiments, we exploited the Leave-One-Out Cross Validation (LOOCV) technique. Even if it is a computationally expensive technique, LOOCV results in a reliable and unbiased estimate of model performance. Moreover, this method turns out to be particularly useful when the available data are limited, as in our case, in which the data collection phase has begun only recently.

LOOCV is an extreme case of k-fold validation, where the value of k is equal to the number of available items (N). Its operations can be summarized in the following steps:

1. Split the dataset into N disjoint groups, where each group contains a single element;
2. For each group:
• Take the element in the considered group as the test set
• Take the remaining N-1 groups as the training set
• Fit a model on the training set and evaluate it on the test set
• Retain the evaluation of the model and then discard it
3. The model performance is estimated as the average of the N experiments executed.

4. Preliminary experimental results

The experiments presented here show the actual results obtained, which motivate us to rely on the hierarchical classification approach. Section 4.1 offers insight into the data that we have used to train our models. Section 4.2 discusses the effectiveness of the proposed techniques in correctly identifying outliers and how they impact the dataset cardinality.
Section 4.3 shows the performance of the classifier models, both single-level and hierarchical. Section 4.4 presents the results obtained with the different regression algorithms and the hierarchical mixed approach.

4.1. Dataset Description

Our data was extracted from a PBS (Portable Batch System) server used to run simulations of various natures, from aerodynamics to virtual crash tests, related to the automotive context. Our data belong to two main categories, i.e., explicit and implicit jobs. The implicit methods use a "step by step" algorithm, in which an appropriate convergence criterion allows the analysis to continue or not, reducing the time increment, depending on the accuracy of the results at the end of each step. The explicit methods do not have problems of non-convergence, since in this case the time increment is defined at the beginning and remains constant during the calculation.

After the data collection phase, the dataset contains about 6000 records, each of which represents a job submitted to the cluster. Jobs can be performed by five different solvers, depending on the type of analysis to be run. A summary of the solvers contained in our dataset is given in Table 1. Due to the different parameters that characterize each solver, and due to the very different execution times between solvers, the analyses described below consider only one solver at a time.

Table 1: Solvers
Type | Solver name | Application Field
Explicit | Adams | Multi body approach
Explicit | Radioss | Crash Simulation
Implicit | Abaqus | Linear/Non linear analysis
Implicit | Nastran | FE Analysis
Implicit | Optistruct | FE Analysis and optimization

As a demonstration of the differences between the solvers, Figure 4 shows the kernel density estimate (KDE) plot for the execution times of an explicit solver (Adams) and an implicit solver (Optistruct). The KDE represents the data using a continuous probability density curve, and the x-axis in the figure shows how the jobs belonging to the two solvers occupy very different ranges of execution times, with much greater times for the explicit solver.

Figure 4: KDE plot for execution times of two different solvers

4.2. Data Cleaning

In this preprocessing phase we tried to remove all the jobs having an anomalous execution time compared to the runtimes of the other jobs. To do this, 3 different methodologies were tested, as discussed in Section 3.1, based on: i) the interquartile range (IQR), ii) the 95th centile, and iii) the 99th centile. The percentage of jobs labeled as outliers by each of the three techniques, separately by solver, is shown in Table 2.

Table 2: The percentage of outlier jobs for each solver
Solver | IQR | 95th centile | 99th centile
Adams | 14% | 10% | 2%
Radioss | 1% | 10% | 2%
Nastran | 11% | 10% | 3%
Optistruct | 8% | 11% | 3%
Abaqus | 11% | 10% | 2%

Since we cannot know in advance which of the 3 techniques will lead to greater benefits, in the following analysis we compared the results obtained with the different preprocessing techniques, evaluating then the best of them. However, from Table 2 we can see that the IQR and 95th centile techniques show rather similar results (with an average difference of about 3%), while the 99th centile technique often identifies a very low percentage of outliers compared to the other techniques. For this reason, in the following analyses we will consider only the IQR and the 99th centile techniques, comparing the results obtained with these two different approaches.

4.3. Classification Model Evaluation

We have conducted a series of experiments with the proposed hierarchical approach, testing different preprocessing techniques and different strategies for identifying thresholds.

Table 3 contains the F-score values obtained using the 99th centile as the preprocessing step and the k-means for threshold identification, since this is the configuration with which the best results were obtained.
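Under the same kind of assumptions as above (synthetic data, a simple scikit-learn classifier instead of XGBoost), the LOOCV protocol of Section 3.3 combined with per-class F-scores can be sketched as:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 4))                                # small synthetic job dataset
y = (X[:, 0] + 0.1 * rng.normal(size=60) > 0).astype(int)   # binary runtime class

# Leave-One-Out: N folds, each holding out exactly one job as the test set.
y_pred = np.empty_like(y)
for train_idx, test_idx in LeaveOneOut().split(X):
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X[train_idx], y[train_idx])
    y_pred[test_idx] = model.predict(X[test_idx])

# One F-score per class, computed over the N held-out predictions.
f_per_class = f1_score(y, y_pred, average=None)
```

Aggregating the N single-element evaluations in this way is what makes LOOCV usable even on the small per-solver datasets considered here.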
For each solver, the first two columns of Table 3 show the F-score values obtained for the first level of classification (where the first two classes are identified), while the remaining columns contain the F-score values obtained in the second level of classification (classes 1 and 2 for the left sub-branch, classes 3 and 4 for the right sub-branch).

Table 3: F-score for Hierarchical Classification
Solver | 1st level: 1 | 1st level: 2 | 2nd level: 1 | 2nd level: 2 | 2nd level: 3 | 2nd level: 4
Adams | 0.90 | 0.82 | 0.91 | 0.86 | 0.82 | 0.66
Radioss | 0.71 | 0.72 | 0.76 | 0.25 | 0.79 | 0.61
Nastran | 0.85 | 0.79 | 0.89 | 0.86 | 0.93 | 0.90
Optistruct | 0.91 | 0.94 | 0.89 | 0.81 | 0.82 | 0.64
Abaqus | 0.82 | 0.69 | 0.81 | 0.67 | 0.88 | 0.80

To better illustrate our methodology, a focus on a specific solver is also represented in Figure 5, which shows the hierarchical approach specifically for the Nastran solver, indicating the relevance of the identified classes in a real context. The predictive model used in all the nodes is XGBoost, and the figure reports the F-score values for each prediction. Here, the implemented model is able to predict quite efficiently whether the execution time of an analyzed job will be less than 8 minutes, between 8 and 16 minutes, between 16 and 30 minutes, or greater than 30 minutes. For solvers with different characteristics, different thresholds will obviously be obtained; however, the results in Table 3 indicate that, except for the Radioss solver (where the results obtained are below the average behavior), for all solvers we are able to predict quite accurately which class a job should belong to.

Figure 5: Nastran Hierarchical Approach

In order to validate the results obtained with the hierarchical approach, we also conducted a series of experiments with a more traditional methodology and compared them with our results. The traditional methodology consists in a "single" level approach, in which the algorithm used is always an XGBoost, but there is no hierarchical structure. The results in Table 4 represent the baseline against which we want to compare our methodology. The table shows the results obtained with both preprocessing techniques (IQR vs 99th centile) and with both strategies to define the subdivision into classes (k-means vs balanced approach). For reasons of space, each solver is indicated in this table only by its initials (Ad, R, N, O, Ab); moreover, 'K-M' stands for k-means, while 'Bal' indicates the balanced approach. Comparing the F-score values obtained in the two methodologies, it is evident how the hierarchical structure obtains better performance than the baseline approach, whatever preprocessing technique is used.

Table 4: Baseline Classification F-score
Solver | IQR: 1 | IQR: 2 | IQR: 3 | IQR: 4 | 99%: 1 | 99%: 2 | 99%: 3 | 99%: 4
Ad. K-M | 0.79 | 0.52 | 0.28 | 0.75 | 0.80 | 0.63 | 0.46 | 0.60
Ad. Bal | 0.75 | 0.64 | 0.49 | 0.72 | 0.78 | 0.71 | 0.59 | 0.71
R. K-M | 0.44 | 0.44 | 0.23 | 0.14 | 0.44 | 0.48 | 0.23 | 0.62
R. Bal | 0.26 | 0.55 | 0.54 | 0.57 | 0.21 | 0.35 | 0.42 | 0.64
N. K-M | 0.66 | 0.49 | 0.87 | 0.57 | 0.85 | 0.75 | 0.66 | 0.57
N. Bal | 0.83 | 0.83 | 0.73 | 0.67 | 0.79 | 0.69 | 0.71 | 0.65
O. K-M | 0.67 | 0.84 | 0.68 | 0.68 | 0.67 | 0.64 | 0.73 | 0.84
O. Bal | 0.64 | 0.60 | 0.56 | 0.76 | 0.72 | 0.54 | 0.70 | 0.77
Ab. K-M | 0.56 | 0.77 | 0.58 | 0.55 | 0.76 | 0.45 | 0.65 | 0.75
Ab. Bal | 0.60 | 0.61 | 0.57 | 0.62 | 0.55 | 0.69 | 0.60 | 0.64

4.4. Mixed Model Evaluation

In the approach that exploits regression algorithms to estimate the execution time, the output of the predictive model is a continuous numerical value that indicates the runtime of the considered job. Table 5 contains the R2 values obtained with three different regressors: the RandomForest Regressor (RF), the XGBoost Regressor and the Lasso Regressor. The results obtained with RandomForest and XGBoost are very similar, both of them much better than the results obtained with the Lasso Regressor, which does not perform well. Furthermore, from the baseline table, we see that the results obtained after removing the outliers through the 99th centile are on average higher than those obtained using the IQR. For this reason we use the 99th centile technique for testing our approach.

Currently, the mixed approach is still experimental and for now it has been tested on only one solver. The results are shown in Figure 6, where the considered solver is Nastran and the algorithm used is XGBoost (both for classification and regression). We can see that the first classification level is the same obtained in Figure 5, with the same values of F1-score. Then, unlike the classification approach, in the mixed approach we used two regression models in the second level (one for each sub-branch), able to predict the value of the execution time of the considered job.
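A minimal sketch of such a mixed hierarchy, again on synthetic data and with scikit-learn's gradient boosting standing in for XGBoost: a first-level classifier routes each job to one of two second-level regressors, which return a continuous runtime estimate. The features and the median split are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))            # synthetic job features
runtime = np.exp(X[:, 0]) * 60           # synthetic execution times (seconds)

split = np.median(runtime)               # illustrative first-level time threshold
is_slow = (runtime > split).astype(int)

# Level 1: binary classifier deciding the time interval of a job.
clf = GradientBoostingClassifier().fit(X, is_slow)

# Level 2: one regressor per interval, each trained only on its own portion.
reg_fast = GradientBoostingRegressor().fit(X[is_slow == 0], runtime[is_slow == 0])
reg_slow = GradientBoostingRegressor().fit(X[is_slow == 1], runtime[is_slow == 1])

def predict_runtime(x):
    """Route through the classifier, then regress within the chosen interval."""
    x = x.reshape(1, -1)
    reg = reg_slow if clf.predict(x)[0] == 1 else reg_fast
    return float(reg.predict(x)[0])

est = predict_runtime(X[0])
```

Each regressor only has to model a narrower range of runtimes, which is the simplification the mixed approach relies on, while the output remains a continuous value.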
Table 5: Baseline Regression R2
Solver | RF IQR | RF 99% | XGBoost IQR | XGBoost 99% | Lasso IQR | Lasso 99%
Adams | 0.31 | 0.70 | 0.36 | 0.69 | 0.23 | 0.27
Radioss | 0.27 | 0.27 | 0.27 | 0.28 | 0.09 | 0.12
Nastran | 0.38 | 0.39 | 0.40 | 0.37 | 0.08 | 0.09
Optistruct | 0.63 | 0.45 | 0.60 | 0.44 | 0.40 | 0.43
Abaqus | 0.36 | 0.41 | 0.38 | 0.42 | 0.10 | 0.09

By having a first classification level that can distinguish two categories of jobs, the R2 values obtained in the second regression level are now higher than those obtained with the Nastran solver using the baseline approach. Moreover, the mean absolute error (MAE) values shown in Figure 6 indicate that the average error associated with jobs that last less than 16 minutes is just over one minute (69 seconds), therefore an acceptable error in our use case. The mean absolute error associated with jobs that last longer than 16 minutes, instead, turns out to be about 11 minutes, so a bit more impactful.

Figure 6: Nastran Mixed Approach

The error grows as execution times increase. So, for the considered solver, the better choice could be to adopt regression techniques to estimate low execution times, and to use a classification approach instead to predict the classes to which a job will belong when dealing with longer execution times.

5. Discussion and Future Research Direction

We have presented a hierarchical classification model which is able to address a multiclass problem by dividing its complexity among multiple prediction layers. In this way every classifier has to deal with a binary classification problem instead of a multiclass one. Even if the amount of data in the considered use case is limited, our approach has shown promising performance also compared to single-prediction-level approaches. Especially in the classification context, the hierarchical approach performs better than the normal approach. We have also tried some experiments in which more than one solver is taken into account. However, due to the fact that every solver takes into account a different set of variables, the resulting dataset is very sparse. This issue, combined with the limited amount of data, leads to poor predictions with both the hierarchical approach and a single-prediction-level method. Once the data in our possession reaches a higher numerosity, it will allow us to investigate whether or not our hierarchical approach can outperform more classical approaches in a more complex environment: a higher amount of data means that the impact of the different variables taken into account will be reduced. Currently we are still working on this project and we already have different improvements that we want to address. We will keep gathering more data that will allow us to build more stable and robust models. We also intend to further investigate the possibility of building a model that is able to make predictions on the whole set of solvers, instead of relying on a different model for every solver. Finally, we intend to integrate our model inside the HPC structure in order to help it assess the pending time of newly submitted jobs.

References

[1] Leo Breiman. 2001. Random forests. Machine Learning 45, 1 (2001), 5–32.
[2] Fabio Carfi, Enrica Capitelli, Vladi Massimo Nosenzo, and Tania Cerquitelli. 2021. Estimating the job's pending time on a High-Performance Computing cluster through a hierarchical data-driven methodology. In DOLAP, EDBT 2021, Nicosia, Cyprus.
[3] Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16). Association for Computing Machinery, New York, NY, USA, 785–794. https://doi.org/10.1145/2939672.2939785
[4] Mariza Ferro, Vinicius P Klôh, Matheus Gritz, Vitor de Sá, and Bruno Schulze. 2021. Predicting Runtime in HPC Environments for an Efficient Use of Computational Resources. In Anais do XXII Simpósio em Sistemas Computacionais de Alto Desempenho (WSCAD 2021). SBC, Belo Horizonte, 72–83.
[5] S. Ganguly, A. Consul, A. Khan, B. Bussone, J. Richards, and A. Miguel. 2016. A Practical Approach to Hard Disk Failure Prediction in Cloud Platforms: Big Data Model for Failure Management in Datacenters. In 2016 IEEE Second International Conference on Big Data Computing Service and Applications (BigDataService). Oxford, United Kingdom, 105–116. https://doi.org/10.1109/BigDataService.2016.10
[6] M. Jassas and Q. H. Mahmoud. 2018. Failure Analysis and Characterization of Scheduling Jobs in Google Cluster Trace. In IECON 2018 - 44th Annual Conference of the IEEE Industrial Electronics Society. Omni Shoreham, United States, 3102–3107. https://doi.org/10.1109/IECON.2018.8592822
[7] P. Li, B. Zhang, Y. Weng, and R. Rajagopal. 2017. A Sparse Linear Model and Significance Test for Individual Consumption Prediction. IEEE Transactions on Power Systems 32, 6 (2017), 4489–4500. https://doi.org/10.1109/TPWRS.2017.2679110
[8] Chunhong Liu, Liping Dai, Yi Lai, Guinbing Lai, and Wentao Mao. 2020. Failure prediction of tasks in the cloud at an earlier stage: a solution based on domain information mining. Computing 102 (2020), 2001–2023. https://doi.org/10.1007/s00607-020-00800-1
[9] C. Liu, J. Han, Y. Shang, C. Liu, B. Cheng, and J. Chen. 2017. Predicting of Job Failure in Compute Cloud Based on Online Extreme Learning Machine: A Comparative Study. IEEE Access 5 (2017), 9359–9368. https://doi.org/10.1109/ACCESS.2017.2706740
[10] J. M. Navarro, G. H. A. Parada, and J. C. Dueñas. 2014. System Failure Prediction through Rare-Events Elastic-Net Logistic Regression. In 2014 2nd International Conference on Artificial Intelligence, Modelling and Simulation. IEEE, Madrid, Spain, 120–125. https://doi.org/10.1109/AIMS.2014.19
[11] D. Nurmi, A. Mandal, J. Brevik, C. Koelbel, R. Wolski, and K. Kennedy. 2006. Evaluation of a Workflow Scheduler Using Integrated Performance Modelling and Batch Queue Wait Time Prediction. In SC '06: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing. IEEE, Tampa, FL, USA. https://doi.org/10.1109/SC.2006.29
[12] A. Rosà, L. Y. Chen, and W. Binder. 2017. Failure Analysis and Prediction for Big-Data Systems. IEEE Transactions on Services Computing 10, 6 (2017), 984–998. https://doi.org/10.1109/TSC.2016.2543718
[13] Robert Tibshirani. 2011. Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society Series B 73 (06 2011), 273–282. https://doi.org/10.2307/41262671
[14] Hao Wang, Yi-Qin Dai, Jie Yu, and Yong Dong. 2021. Predicting running time of aerodynamic jobs in HPC system by combining supervised and unsupervised learning method. Advances in Aerodynamics 3 (03 2021). https://doi.org/10.21203/rs.3.rs-360961/v1
[15] Jiawei Yang, Susanto Rahardja, and Pasi Fränti. 2019. Outlier Detection: How to Threshold Outlier Scores?. In Proceedings of the International Conference on Artificial Intelligence, Information Processing and Cloud Computing (AIIPCC '19). Association for Computing Machinery, New York, NY, USA, Article 37, 6 pages. https://doi.org/10.1145/3371425.3371427