<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Recommender Systems, September</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Predictions using Neural Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Roan Schellingerhout</string-name>
          <email>roan.schellingerhout@maastrichtuniversity.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Volodymyr Medentsiy</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maarten Marx</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Advanced Computing Sciences</institution>
          ,
          <addr-line>Paul-Henri Spaaklaan 1, 6229 EN, Maastricht</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>IRLab Informatics Institute</institution>
          ,
          <addr-line>Science Park 904, 1098 XH, Amsterdam</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Randstad Groep Nederland</institution>
          ,
          <addr-line>Diemermere 25, 1112 TC, Diemen</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>1</volume>
      <fpage>8</fpage>
      <lpage>23</lpage>
      <abstract>
        <p>Career path prediction aims to determine a potential employee's next job, based on the jobs they have had until now. While good performance on this task has been achieved in recent years, the models making career predictions often function as black boxes. By integrating components of explainable artificial intelligence (XAI), this paper aims to make these predictions explainable and understandable. To study the effects of explainability on performance, three non-explainable baselines were compared to three similar, but explainable, alternatives. Furthermore, user testing was performed with recruiters in order to determine the sensibility of the explanations generated by the models. Results show that the explainable alternatives perform on par with their non-explainable counterparts. In addition, recruiters judged the explainable models to provide understandable and useful explanations.</p>
      </abstract>
      <kwd-group>
        <kwd>Career path prediction</kwd>
        <kwd>Explainable AI</kwd>
        <kwd>Sequence classification</kwd>
        <kwd>Neural networks</kwd>
        <kwd>User testing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        With the rise of the modern gig economy, it has become
more difficult for job seekers to find stable positions of
employment [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In addition, due to the average
education level of the workforce having increased
considerably in recent years, potential employees are faced
with more competition [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. This has made it significantly more difficult for job
seekers, and employment agencies alike, to find positions
that fit their needs. To address this, increasing use is
being made of computer-aided HR matchmaking (e.g.,
machine learning) to find suitable positions for
individuals, and capable employees for companies [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. This task is called career path prediction, which
aims to predict a person’s next position of employment,
given their career up until this point.
      </p>
      <sec id="sec-1-1">
        <title>Previous research on automated career path prediction</title>
        <p>
          tends to share a common flaw: a lack of explainability
[
          <xref ref-type="bibr" rid="ref4 ref5 ref6 ref7">4, 5, 6, 7</xref>
          ]. While deep learning tends to deliver good
performance, these models often function as a black box.
        </p>
        <p>Although good results that are dificult to interpret are
acceptable in many use cases, choosing a new career is
such an impactful event in a person’s life that it is
unrealistic to expect users to blindly trust the models. This
is why explainability is such a crucial requirement for
nEvelop-O</p>
        <p>predictions explainable impact performance?
• RQ3: Which explainable model is the most useful</p>
        <p>for recommending jobs to candidates?
This paper is structured as follows: first, an overview
2.1. Career path predictions
of the current state of the art in terms of model per- another multivariate sequence classification task (gold
formance and explainability is given. Then, Randstad’s price forecasting), outperforming every alternative
archidataset is described in detail. Afterwards, the methods tecture tested.
used to answer the research questions are explained. Sub- While the aforementioned models make up the current
sequently, the research questions are answered, after state of the art for career path predictions, they all share
which their answers are discussed. a common flaw: they function as black boxes. As a result,
their outputs are hard to interpret for both recruiters and
job seekers. Considering the impact a career change can
2. Related Work have on an individual’s life, this can make the models
dificult to use in real-world scenarios.</p>
        <p>
          The goal of career path prediction is to determine what
position of employment is a logical next step given a job
seeker’s career [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. Considering the number of different career
opportunities and factors which have an influence on the
career steps (e.g., previous job experiences, educational
background, interests of a job seeker), the career
prediction problem is incredibly difficult to model by hand.
        </p>
        <p>
          In recent years a lot of progress has been achieved
within the field of career path prediction. The first
notable paper to use machine learning for career path
prediction was that by Liu et al. [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. In this paper, Liu et al. scraped individuals’ social
media profiles to generate a dataset, after which they
predicted when an employee would be ready to move to a
higher-paying position within their current field (e.g.
moving from junior software developer to senior
software developer). Meng et al. [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] then extended this task by not just considering
within-field switches, but general job mobility. Their
custom LSTM, the hierarchical career-path-aware neural
network (HCPNN), was thus tasked to predict
individuals’ next employer, regardless of their current field of
employment. The HCPNN has shown impressive results,
outperforming every model that preceded it.
        </p>
        <p>
          Similarly, He et al. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] attempted to predict individuals’ next job based on
features they extracted from their resumes. Unlike Meng
et al., they made use of a convolutional neural network
(CNN) for the predictions. With this CNN they tried to
implement a multi-purpose model that could not only
predict talents’ next job position, but also their salary
and the size of the company they would be working at.
Out of those three tasks, their CNN proved to perform
the best on career path prediction.
        </p>
        <p>
          At their core, Meng et al.’s LSTM and He et al.’s CNN
are simply feature extractors which feed their output into
a dense layer. While both perform well on their own, it
is common to combine these two architectures within
the field of sequence classification [
          <xref ref-type="bibr" rid="ref10 ref11 ref12">10, 11, 12</xref>
          ]. Although such an architecture has not yet been used
for career path prediction specifically, it has been shown
to perform exceedingly well on other multivariate
sequence classification problems [
          <xref ref-type="bibr" rid="ref13 ref14 ref15">13, 14, 15</xref>
          ]. Especially the CNN-LSTM of Livieris et al. [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] has shown good results on another multivariate
sequence classification task (gold price forecasting),
outperforming every alternative architecture tested.
        </p>
        <p>
          While the aforementioned models make up the current
state of the art for career path prediction, they all share
a common flaw: they function as black boxes. As a result,
their outputs are hard to interpret for both recruiters and
job seekers. Considering the impact a career change can
have on an individual’s life, this can make the models
difficult to use in real-world scenarios.
        </p>
      </sec>
      <sec id="sec-1-1-2">
        <title>2.2. Explainability in deep learning</title>
        <p>
          Explainability and performance are often considered
inverses of each other in the field of AI. A simple, easy
to explain model is likely to perform mediocre at best,
while a complex, difficult to explain model is more likely
to perform well [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. A common example of this inverse relationship can
be seen in the difference between decision trees and
random forests: random forests are based on decision
trees, but with a higher degree of complexity, which
strongly increases performance at the cost of
explainability.
        </p>
        <p>
          However, with the increasing interest in explainable AI,
more and more solutions have been brought up that can
make even the most complex deep learning models
explainable to a degree [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. Most commonly, this explainability takes the shape
of visualizations of the networks’ behaviour. Saliency
maps and attention distributions are capable of
visualizing the importance of different variables, usually
through some type of colour scheme indicating higher or
lower feature importance. Initially, Springenberg et al. [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] used guided backpropagation to visualize the features
learned by convolutional layers. Extending past guided
backpropagation, Selvaraju et al. [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] created Grad-CAM, which could not only visualize
general learned features, but also determine which
features were important for a specific predicted class. Since
these post-hoc interpretability techniques merely look at
the behaviour of the model, they do not alter its
performance. However, it is often necessary to make
alterations to a model’s architecture to allow good
explanations to be generated (e.g., they only work on
convolutional layers, and preferably only on the final
convolutional layer of a model) [
          <xref ref-type="bibr" rid="ref17 ref18">17, 18</xref>
          ]. As a result, such techniques either do not change
performance at all, or decrease it slightly. In contrast,
while both aforementioned methods were created for
computer vision, Vaswani et al. [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] proposed ‘attention mechanisms’ for natural language
processing. These attention mechanisms cause the
models to predict the importance of each feature per time
step (or the importance of a given time step in general),
which can then be visualized. As a result, Vaswani et al.
made it possible for different model architectures to
become explainable, while simultaneously improving their
performance.
        </p>
      </sec>
      <sec id="sec-1-1-3">
        <title>2.3. Explainability in sequence classification</title>
        <p>
          Sequence classification brings an additional factor into
the mix: the temporal dimension. Simply visualizing
which features garner the most attention thus becomes
insufficient in this scenario. While a given variable might
be highly important to the network initially, it could
become less relevant as time progresses. Thus, to make
explainable sequence classifications, not only should
there be an explanation of which variables contributed
the most to the final prediction, but also at what moment
their values were most decisive [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]. Nonetheless, saliency maps are still useful in this
scenario, as a multivariate sequence can be treated as a
2-dimensional image of shape (features × length).
However, these saliency maps do not necessarily reach the
level of finesse required to generate understandable
explanations for sequences. As a result, saliency maps are
often combined with attention mechanisms. By
combining saliency maps with attention distributions, it is
possible to improve the quality of the explanations [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ].
        </p>
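        <p>To make this combination concrete, the following minimal sketch
computes a gradient-based saliency map for one multivariate sequence
and re-weights it with a per-time-step attention distribution. The model
interface and the elementwise combination rule are illustrative
assumptions, not the exact procedure of any of the cited works.</p>
        <preformat>import torch

# Assumed interface: `model` maps a (features x length) sequence to class
# logits and also returns a per-time-step attention distribution.
def combined_explanation(model, sequence, target_class):
    sequence = sequence.clone().requires_grad_(True)
    logits, attention = model(sequence)      # attention: (length,)
    logits[target_class].backward()          # gradients w.r.t. the input

    # Saliency map: absolute input gradient, shaped like the 2D "image"
    # view of the sequence (features x length).
    saliency = sequence.grad.abs()

    # Re-weight each time step's saliency column by its attention weight,
    # so features only light up at moments the model attended to.
    return saliency * attention.unsqueeze(0)</preformat>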
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Description of the Data</title>
      <sec id="sec-2-1">
        <title>3.2. Data imbalance</title>
        <sec id="sec-2-1-1">
          <title>Randstad’s dataset consists of over two million jobs relating to more than 500 thousand individuals. These jobs span over multiple decades, going back as far as the early twentieth century. Although Randstad is a multinational</title>
        </sec>
        <sec id="sec-2-1-2">
          <title>The data on which the models were trained, configured,</title>
          <p>and tested, was provided by Randstad NV (Randstad). There is a huge imbalance in work experience and
edDue to the nature of Randstad’s operations, they have an ucation levels of candidates present in the data. The
exhaustive data lake consisting of temporal employee- imbalance in work experience occurs in job positions,
related data. which are represented by ISCO job types and job
functions (see Figure 1a and 1b respectively), and the number
3.1. Overview of the datasets of positions candidates have had (see Figure 2). We
addressed the skew in the number of jobs a candidate had
by limiting the job history to the 25 most recent jobs.</p>
          <p>The imbalance in education levels (see Figure 3) is less</p>
        </sec>
        <sec id="sec-2-1-3">
          <title>1https://www.ilo.org/public/english/bureau/stat/isco/isco08/</title>
          <p>The unstructured data is represented by curriculum vitaes
(CVs), which are user-generated documents.
impactful, as the education level of candidates is merely jobs were zero-padded to prevent mismatched sequence
a predictor, unlike the ISCO job types and job functions, lengths. This section outlines how candidates’ careers
both of which could be used as the actual labels to be were converted into sequences, as well as how those
predicted. To construct the final dataset we sequences were fed into diferent models.</p>
          <p>Lastly, an overview of the models used is given. The
• limited the job history of candidates to the 25 used models can be split into three separate categories:
most recent jobs; non-neural baselines, non-explainable neural models3,
• dropped candidates with fewer than two jobs in and explainable neural models. 80% of the data was used
the dataset, due to the inability to convert their as a training set, 10% of the data was used as a validation
careers to a sequence; set, on which the optimal hyperparameters were
deter• balanced class labels distribution through mined, and the last 10% of the data was used as a test
weighted sampling during training. set to evaluate model performance on unseen data. We
used weighted sampling during training to address the
imbalance within the class labels distribution.</p>
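        <p>As an illustration of the weighted sampling step, a PyTorch
sampler can draw candidates inversely proportional to the frequency of
their next-job label. The toy tensors below are placeholders for
Randstad’s actual features and labels.</p>
        <preformat>import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Placeholder data: 1000 candidates, 25 time steps, 8 features, 5 classes.
features = torch.randn(1000, 25, 8)
labels = torch.randint(0, 5, (1000,))

# Inverse-frequency weights: rare ISCO job types are drawn more often,
# so every class is seen roughly equally often during training.
class_counts = torch.bincount(labels).float()
sample_weights = 1.0 / class_counts[labels]

sampler = WeightedRandomSampler(sample_weights,
                                num_samples=len(labels),
                                replacement=True)
loader = DataLoader(TensorDataset(features, labels),
                    batch_size=128, sampler=sampler)</preformat>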
        <p>This resulted in our final dataset consisting of the
careers of 113724 candidates, each being limited to the 25
most recent jobs they had. For each job, the (normalized)
time spent working there, the ISCO function level of the
job, the highest education enjoyed up until then, the
company for which the candidate worked, the specific job
function ID, the ISCO job type, and the most recent CV
were stored. Normalization was done through
Z-transformation in order to maintain a common scale
for all features. Additionally, the zip code, obtained
certificates, mastered languages, skills, and driving licenses
of candidates were stored as static variables, since they
rarely changed in between jobs.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Methodology</title>
      <p>In order to make career path predictions, candidates’
profiles were turned into sequences which could be fed
into diferent (deep learning) models. For each
candidate we used the last 25 jobs along with profile-specific
features as input for the models, after which the models
would predict their next job in the form of its ISCO job
type. Candidate profiles that consisted of fewer than 25</p>
      <sec id="sec-3-1">
        <title>4.1. Data prepocessing</title>
        <p>
          Due to the availability of temporal data, candidates’
career paths were turned into sequences. For these
sequences, each job held by a candidate was considered
to be one time step. The order of the time steps was
determined by the date at which the candidate started
the position. As a result, every career was turned into
a sequence, in which each time step was a candidate’s
current job, combined with their location and the skills,
certificates, languages, and education they had achieved
at the time of starting the position. To also include
candidates’ curriculum vitaes (CVs) at each time step, the
most recent CV uploaded by a candidate at each time
step was converted to numerical features using averaged
Word2Vec [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] embeddings and combined with the other
features.
        </p>
        <p>
          Candidates’ career paths were turned into sequences
2Normalization was done through Z-transformation in order to main- 3The neural models were created in PyTorch and trained on an
tain a common scale for all features. NVIDIA tesla K80 GPU [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ].
x as follows:
        </p>
        <p>x = [x(1), ..., x(T)],
 ℎ
x(t) = [xj(ot)b; xstructured; x(CtV)]</p>
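        <p>A minimal sketch of how one element x(t) can be assembled is
given below. The helper names and the feature pipeline are simplified
stand-ins, and a gensim-style Word2Vec keyed-vector object is assumed.</p>
        <preformat>import numpy as np

def embed_cv(cv_text, w2v):
    # Averaged Word2Vec: the mean of the embeddings of all known words,
    # yielding one fixed-size vector per CV (300 dimensions in this paper).
    words = [w for w in cv_text.split() if w in w2v]
    if not words:
        return np.zeros(w2v.vector_size)
    return np.mean([w2v[w] for w in words], axis=0)

def build_timestep(job_features, structured_features, cv_text, w2v):
    # x(t) = [x_job(t); x_structured(t); x_CV(t)], as in Equation (2).
    return np.concatenate([job_features, structured_features,
                           embed_cv(cv_text, w2v)])

def build_sequence(timesteps, dim, T=25):
    # x = [x(1), ..., x(T)], zero-padded at the front to length T = 25.
    x = np.zeros((T, dim))
    kept = timesteps[-T:]                 # keep the 25 most recent jobs
    x[T - len(kept):] = kept
    return x</preformat>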
      </sec>
      <sec id="sec-3-2">
        <title>4.2. Baselines and Models</title>
        <p>Considering the fact that careers do not necessarily
follow a logical trend, they can be rather difficult to
model properly. For example, a person might have a job
for a while not because they want to, but because they
are forced to do so in order to support themselves. A
person going from a position as a software engineer to a
store clerk does not constitute a logical progression, but
can obviously occur in the real world whenever someone
gets laid off and needs to work a temporary job while
they search for new alternatives. This makes career path
prediction a notoriously difficult problem for deep
learning models [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. To evaluate the added value of using such models,
and to allow for better contextualization, baselines were
set with three non-deep learning (but coincidentally
highly explainable) models. The first one is a simple
majority class baseline, which always predicts the most
common job in the dataset. The second baseline is the
majority switch, which always predicts the most common
job following the current job of the candidate. The last
simple baseline is more sophisticated: k-nearest
neighbors based on the dynamic time warping distance
between candidates that had the same previous job
(KNN-DTW). This baseline uses dynamic time warping [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] to determine which candidates have had the most
similar careers, and then uses k-nearest neighbors to make
a prediction based on these similarities. For each
candidate, all candidates that had the same previous job were
compared in terms of DTW distance (i.e., the numerical
distance between the sequences); the k-nearest neighbors
algorithm was then used to find the closest neighbors for
each candidate, based on whom a prediction would be
made.</p>
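        <p>A sketch of the KNN-DTW baseline is shown below; the DTW
implementation is the standard dynamic program, while the numeric
encoding of careers and the variable names are simplifying assumptions.</p>
        <preformat>import numpy as np
from collections import Counter

def dtw_distance(a, b):
    # Classic dynamic-time-warping distance between two career encodings.
    n, m = len(a), len(b)
    d = np.full((n + 1, m + 1), np.inf)
    d[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i, j] = cost + min(d[i - 1, j], d[i, j - 1], d[i - 1, j - 1])
    return d[n, m]

def knn_dtw_predict(candidate, neighbours, k=100):
    # `neighbours` holds (career_sequence, next_job) pairs for all other
    # candidates whose previous job matches the query candidate's.
    dists = sorted((dtw_distance(candidate, seq), nxt)
                   for seq, nxt in neighbours)
    top = [nxt for _, nxt in dists[:k]]
    # Majority vote among the k most similar careers.
    return Counter(top).most_common(1)[0][0]</preformat>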
        <sec id="sec-3-2-1">
          <title>4.2.1. RQ1 - State of the art</title>
          <p>To study the impact of explainability mechanisms on
model performance, three state-of-the-art models, each
with a unique architecture (Section 2.1), were trained
and tested on Randstad’s dataset. The performance of
these models will function as a non-explainable baseline,
with which the performance of the explainable
alternatives can be compared. The following models were used:</p>
          <p>LSTM : The LSTM-based model used in this paper is
based on the HCPNN by Meng et al. [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. While the original HCPNN combines
candidate-specific data with company-specific data, its modular
architecture allows for the removal of some of the
model’s components. As a result, the HCPNN was
implemented using only candidate-specific features.
This results in a model that takes embedded position
features, feeds them into an LSTM, runs the LSTM’s
output through an attention layer, and combines that
output with a candidate’s embedded static features,
after which a fully-connected layer makes a prediction.</p>
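          <p>The reduced, candidate-only variant of the HCPNN described
above can be sketched in PyTorch as follows. Layer sizes are
placeholders, and the additive attention used here may differ from Meng
et al.’s exact formulation.</p>
          <preformat>import torch
import torch.nn as nn

class CandidateLSTM(nn.Module):
    # Sketch: embedded positions -> LSTM -> attention over time steps ->
    # concatenation with static features -> fully-connected prediction.
    def __init__(self, pos_dim, static_dim, hidden, n_classes):
        super().__init__()
        self.lstm = nn.LSTM(pos_dim, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)
        self.out = nn.Linear(hidden + static_dim, n_classes)

    def forward(self, positions, static):
        h, _ = self.lstm(positions)                   # (B, T, hidden)
        weights = torch.softmax(self.attn(h), dim=1)  # (B, T, 1)
        context = (weights * h).sum(dim=1)            # (B, hidden)
        return self.out(torch.cat([context, static], dim=1))</preformat>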
          <p>CNN : The CNN-based model used in this paper is
that of He et al. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. This architecture feeds the input data into a 2D
convolutional layer, followed by a pooling layer. The
output is then flattened and run through a drop-out
layer. Lastly, a fully-connected layer is used to make
the final prediction.</p>
          <p>CNN-LSTM : The CNN-LSTM-based model used in
this paper is based on the model created by
Livieris et al. [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. It uses two sequential 2D convolutional layers,
followed by a pooling layer. The pooled features
then get fed into an LSTM, after which a
fully-connected layer is responsible for the final
predictions of the model.</p>
          <p>To evaluate performance, accuracy@k (k ∈ {1, 5, 10})
was used, which shows how often the correct answer
was within the top k predictions given by the model [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]. Considering the fact that candidates might not be
interested in a specific job type (e.g. no open vacancies,
not interesting enough, it pays too little), recruiters are
expected to provide multiple recommendations for the
candidate, allowing them to choose between and consider
multiple options. As a result, the models provide multiple
predictions, which can be evaluated using accuracy@k.</p>
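          <p>Accuracy@k can be computed directly from the models’ logits,
as in this minimal sketch:</p>
          <preformat>import torch

def accuracy_at_k(logits, targets, k):
    # logits: (N, n_classes); targets: (N,). A prediction counts as
    # correct if the true class is among the k highest-scoring classes.
    topk = logits.topk(k, dim=1).indices             # (N, k)
    hits = (topk == targets.unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()

# Evaluated at the cut-offs used in this paper: k in (1, 5, 10).</preformat>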
        <sec id="sec-3-1-1">
          <title>Although the explainable models’ architectures difer slightly from the aforementioned state-of-the-art models to allow for improved explainability, they are largely identical.</title>
          <p>4.2.1. RQ1 - State of the art</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>To study the impact of explainability mechanisms on</title>
          <p>
            model performance, three state-of-the-art models, each
with a unique architecture (Section 2.1), were trained and
Explainable LSTM : The explainable LSTM -based
model (eLSTM) used in this paper is based on
the spatiotemporal attention LSTM (STA-LSTM)
by Ding et al. [
            <xref ref-type="bibr" rid="ref26">26</xref>
            ]. This architecture starts of by
determining spatial attention; it runs each individ- catenated to the last hidden state of the LSTM,
ual time step through a linear layer, after which and fed into fully-connected layer, which makes
the Hadamard product between the linear layer’s the final prediction.
output and the features per time step is taken to
determine the importance of each feature at each 4.2.3. RQ3 - Real-world utility
time step. The output hereof is then fed into an
LSTM, after which the temporal attention is cal- To measure the adequacy of the explanations generated
culated. This is done by flattening the output of by the models, user testing was performed. Potential
the LSTM and running it through another linear users of the models (e.g. Randstad’s recruiters), were
layer. This calculates a normalized importance of tasked to determine which variables were most relevant
each time step, based on that step’s hidden values. for a prediction made by the system. Six recruiters were
The dot product between the linear layer’s output split into three groups based on their recruiting expertise
and the LSTM’s hidden output is then calculated, (finance, customer support, health care), and shown three
which is fed into a fully-connected layer to make separate predictions within that industry (one per model).
the final predictions. For each predictions, they were tasked to distribute 100
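          <p>A sketch of the eLSTM’s two attention stages, following the
description above; the exact layer shapes and normalizations in Ding
et al.’s STA-LSTM may differ.</p>
          <preformat>import torch
import torch.nn as nn

class ELSTM(nn.Module):
    # Sketch: per-time-step spatial (feature) attention, an LSTM, then
    # temporal attention over the flattened hidden states.
    def __init__(self, n_features, seq_len, hidden, n_classes):
        super().__init__()
        self.spatial = nn.Linear(n_features, n_features)
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.temporal = nn.Linear(seq_len * hidden, seq_len)
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, x):                              # x: (B, T, F)
        spatial_w = torch.softmax(self.spatial(x), dim=2)
        h, _ = self.lstm(spatial_w * x)                # Hadamard product
        flat = h.flatten(start_dim=1)                  # (B, T * hidden)
        temporal_w = torch.softmax(self.temporal(flat), dim=1)
        # Weighted sum (dot product) of the hidden states over time.
        context = torch.einsum('bt,bth->bh', temporal_w, h)
        # Both attention maps double as the model's explanations.
        return self.out(context), spatial_w, temporal_w</preformat>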
          <p>Explainable CNN : The explainable CNN-based
model (eCNN) used in this paper is based on the
explainable convolutional neural network for
multivariate time series classification (XCM) by
Fauvel et al. [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ]. It makes use of two stages which run in parallel.
The first stage (top) uses a 2D convolutional layer
with kernel size (window × 1) that generates M1
feature maps. A (1 × 1) 2D convolutional layer is
then used to summarize those M1 feature maps into
a single feature map. The other stage (bottom),
running independently, uses a 1D convolutional layer
with kernel size (window × features) and also
generates M1 feature maps, which are summarized by
a (1 × 1) 1D convolutional layer. The two feature
maps generated by the two stages are then
concatenated in the feature dimension, after which a 1D
convolutional layer with kernel size (window ×
(features + 1)) generates M2 feature maps. These
feature maps are then run through a pooling layer,
which is also responsible for the predictions. M1,
M2, and window are three separate hyperparameters
for this model.</p>
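          <p>The two parallel stages of the eCNN can be sketched as follows.
The tensor layout, activation functions, and the final linear head are
assumptions based on the description above, not a faithful reproduction
of XCM.</p>
          <preformat>import torch
import torch.nn as nn

class ECNN(nn.Module):
    # Sketch: a 2D stage over (features x time) and a 1D stage over time
    # run in parallel; their summarized maps are concatenated and merged.
    def __init__(self, n_features, window, m1, m2, n_classes):
        super().__init__()
        pad = window // 2
        self.conv2d = nn.Conv2d(1, m1, (1, window), padding=(0, pad))
        self.squeeze2d = nn.Conv2d(m1, 1, 1)           # (1 x 1) summary
        self.conv1d = nn.Conv1d(n_features, m1, window, padding=pad)
        self.squeeze1d = nn.Conv1d(m1, 1, 1)           # (1 x 1) summary
        self.merge = nn.Conv1d(n_features + 1, m2, window, padding=pad)
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.out = nn.Linear(m2, n_classes)

    def forward(self, x):                              # x: (B, F, T)
        top = self.squeeze2d(torch.relu(self.conv2d(x.unsqueeze(1))))
        top = top.squeeze(1)                           # (B, F, T)
        bottom = self.squeeze1d(torch.relu(self.conv1d(x)))  # (B, 1, T)
        merged = torch.cat([top, bottom], dim=1)       # (B, F + 1, T)
        z = self.pool(torch.relu(self.merge(merged))).flatten(1)
        return self.out(z)</preformat>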
          <p>Explainable CNN-LSTM : The explainable
CNN-LSTM-based model (eCNN-LSTM) used in this
paper is based on that of Schockaert et al. [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]. This model runs the input data through a 2D
convolutional layer with kernel size (length × 1),
whose output gets concatenated to the original
sequential data. This combined output gets fed into an
LSTM. All but the last hidden state of the LSTM get
passed through a temporal attention mechanism. This
temporal attention mechanism runs each hidden state
through a fully-connected layer which attributes it a
given amount of attention. These attention values are
then normalized, after which the dot product of the
attention vector and the hidden states is calculated to
create a context vector. This context vector is then
concatenated to the last hidden state of the LSTM, and
fed into a fully-connected layer, which makes the final
prediction.</p>
        </sec>
        <sec id="sec-3-2-3">
          <title>4.2.3. RQ3 - Real-world utility</title>
          <p>To measure the adequacy of the explanations
generated by the models, user testing was performed.
Potential users of the models (e.g. Randstad’s recruiters)
were tasked to determine which variables were most
relevant for a prediction made by the system. Six
recruiters were split into three groups based on their
recruiting expertise (finance, customer support, health
care), and shown three separate predictions within that
industry (one per model). For each prediction, they were
tasked to distribute 100 ‘relevance points’ over all of the
features used by the models (previous jobs, education,
skills, etc.), after which their distribution was compared
to that of the models. In order to determine the
sensibility of the models’ explanations, the Pearson correlation,
root mean squared error (RMSE), and mean absolute
error (MAE) of each model’s distribution compared to
the recruiters’ distributions were calculated. Furthermore,
the recruiters were presented with the explanations
generated by each model, and tasked to judge each part of
the explanations (spatial/feature attention, temporal
attention, and spatiotemporal attention), as well as the
general usefulness of the explanations for finding a suitable
position for a candidate. By averaging the scores given
by the recruiters, the real-world utility of each
explanation was determined.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-results">
      <title>5. Results</title>
      <p>While similar architectures were used for the
explainable and non-explainable models, different
hyperparameter configurations led to different performance for
each architecture. The results shown in Table 1 only
indicate the performance given by the best hyperparameter
configuration found for each model. For a full overview
of hyperparameter configurations and their related
performance see Appendix B.</p>
      <p>[Table 1: accuracy@1, accuracy@5, and accuracy@10 of
the majority switch baseline and of each non-explainable
and explainable model under its best hyperparameter
configuration.]</p>
      <sec id="sec-results-1">
        <title>5.1. RQ1 - State of the art</title>
        <p>To better convey the performance gained by using
deep learning models, the score of each model will be
directly compared to that of the best-performing
baseline. Of the three simple baselines, the majority switch
baseline performed the best, reaching 19.1% accuracy@1,
46.6% accuracy@5, and 61.3% accuracy@10. KNN-DTW
performed worse initially, but converged to the majority
switch baseline as the number of neighbors (k)
approached infinity. With low values of k, e.g. 5, it failed
to break even 10% accuracy@1. However, using a higher
value for k, e.g. 100, greatly improved this score,
reaching 18.1% accuracy@1, 46.4% accuracy@5, and 58.1%
accuracy@10, showing a sub-linear performance gain as
k increased. The majority class baseline performed
significantly worse, only reaching 10.5% accuracy@1, 36.8%
accuracy@5, and 49.1% accuracy@10. As a result, the
performance of the deep learning models was compared
against the scores achieved by the majority switch
baseline.</p>
      </sec>
      <sec id="sec-results-2">
        <title>5.2. RQ2 - Explainable models</title>
        <p>Out of all the models, the CNN-LSTMs performed the
best. Unlike what was hypothesized, the explainable
models were not inferior to their non-explainable
counterparts (Table 1). In fact, the eLSTM provides a higher
accuracy than the non-explainable LSTM by a slight
margin, although this difference falls within the confidence
intervals of the scores, and is therefore not significant
(p &gt; .05). The explainable CNN took a slight (but
statistically significant) hit in performance in exchange for
the increase in explainability, especially suffering at
higher values of k.</p>
      </sec>
      <sec id="sec-3-2">
        <title>5.3. RQ3 - Real-world utility</title>
        <sec id="sec-3-2-1">
          <title>Each explainable model is able to generate three sepa</title>
          <p>rate explanations for a prediction: (i) the weight of each
feature, (ii) the weight of each time step, and (iii) a time Models
step/feature interaction map (spatiotemporal attention). Recruiters
The way in which these explanations are generated dif- 15 10 5 Feature im0portance 5 10 15
fers per model, but the final visualizations are the same,
regardless of the method used to generate them (Figure 10, Figure 4: Average distribution of feature importance of the
11, and 12 in Appendix E). three explainable models compared to that of Randstad’s
reIn order to verify the integrity of these explanations, cruiters (N = 18).</p>
        <p>To measure the sensibility of each model’s
explanations, three metrics were calculated for each of them:
RMSE, MAE, and Pearson correlation. This was done
by calculating the difference between the average score
that recruiters gave to each feature and the attention put
towards that feature by the models (RMSE and MAE),
as well as the correlation between the models’ values
and the recruiters’ values (Pearson correlation). The
results can be seen in Table 2.</p>
        <table-wrap id="tbl2">
          <label>Table 2</label>
          <caption>
            <p>The Pearson correlation (higher is better), RMSE, and
MAE (lower is better) of each model compared to the scores
given by the recruiters (N = 6). For each feature, both the
models and the recruiters gave a score; the metrics are
calculated based on those two scores.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Model</th><th>Pearson’s r ⇑</th><th>RMSE ⇓</th><th>MAE ⇓</th></tr>
            </thead>
            <tbody>
              <tr><td>eLSTM</td><td>0.142</td><td>4.661</td><td>4.094</td></tr>
              <tr><td>eCNN-LSTM</td><td>0.436</td><td>6.014</td><td>4.847</td></tr>
              <tr><td>eCNN</td><td>0.152</td><td>5.594</td><td>4.518</td></tr>
            </tbody>
          </table>
        </table-wrap>
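        <p>The three agreement metrics can be computed from the two
relevance distributions as in this minimal sketch:</p>
        <preformat>import numpy as np

def explanation_agreement(model_points, recruiter_points):
    # Both inputs: one relevance score per feature (100 points in total).
    m = np.asarray(model_points, dtype=float)
    r = np.asarray(recruiter_points, dtype=float)
    pearson = np.corrcoef(m, r)[0, 1]
    rmse = np.sqrt(np.mean((m - r) ** 2))
    mae = np.mean(np.abs(m - r))
    return pearson, rmse, mae</preformat>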
        <p>Additionally, the recruiters were asked how sensible
they found the models’ explanations, as well as how
useful they considered the models (including their
explanations) for helping candidates find a new job. The
averaged scores for each model are shown in Table 3.</p>
        <p>[Table 3: Average recruiter grades for the feature,
temporal, and spatiotemporal explanations and for general
usability, per model (eLSTM, eCNN, eCNN-LSTM).]</p>
        <p>In general, the recruiters showed a preference for the
feature explanations, and to a lesser extent the
spatiotemporal explanations. The temporal explanations were
considered the least sensible, failing to reach a sufficient
grade (i.e., above a 5.5/10 on average). While the eCNN
was judged to deliver the worst explanations, receiving
barely a 5/10 on average, the eCNN-LSTM’s and eLSTM’s
explanations were considered sufficient by the recruiters.</p>
        <p>Out of these two, the eCNN-LSTM was determined to
provide the best explanations, scoring the highest
average rating in each category. Regardless of the
insufficient grades reached by some explanations/models, all
three models were considered generally useful for
recommending a job to a candidate.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>6. Discussion and conclusion</title>
      <sec id="sec-4-1">
        <title>6.1. Interpretation of the results</title>
        <sec id="sec-4-1-1">
          <title>6.1.1. State of the art performance</title>
          <p>Although career path prediction is a notoriously
difficult problem in deep learning, the state-of-the-art
models used on Randstad’s dataset ended up performing
commendably. All three models ended up achieving
significantly (p &lt; .05) higher scores than the majority
switch baseline, which already performed well. However,
this improvement is relatively small for the CNN and
LSTM.</p>
          <p>This marginal increase over the baseline is largely in
line with the results found in previous research. Meng
et al. [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] found that the HCPNN outperformed non-neural
baselines by about 20% on their dataset, improving from
6.0% to 7.3% accuracy@1. Although this is a larger
improvement than that of the HCPNN compared to the
majority switch baseline presented in this paper (14.6%
increase in accuracy@1), this result can still be
considered a confirmation of Meng et al.’s findings. The
smaller relative improvement could in part be caused by
the fact that Randstad’s dataset includes data that has
been manually input by candidates themselves. This data,
as opposed to that input by Randstad’s recruiters, has
not been verified, and could therefore include errors, a
substantial amount of missing values, etc. While these
data points could have been removed from the dataset
to improve performance, a conscious decision was made
not to. Removing all data entered by candidates
themselves would get rid of more than half the dataset, in
exchange for a relatively minor improvement in
performance (in the neighborhood of 5-10%, absolute).
Additionally, in real-world use, providing candidates with the
ability to enter their own career into Randstad’s system
and instantly being able to receive job recommendations
is very valuable.</p>
          <p>As opposed to the CNN and LSTM, the CNN-LSTM
showed a major improvement over the baseline. This
is in accordance with the results found by Livieris et al.
[
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], who showed that their CNN-LSTM significantly
outperformed a bare LSTM baseline. Considering the
fact that both the convolutional layers and LSTM layers
are used as feature extractors, this result is expected. By
combining the two layer types, the model is able to learn
more abstract representations of the data, allowing it to
generalize better [
          <xref ref-type="bibr" rid="ref28 ref29 ref30">28, 29, 30</xref>
          ].</p>
        </sec>
        <sec id="sec-4-1-2">
          <title>6.1.2. Explainability’s impact on performance</title>
          <p>Though it was initially expected that the inclusion
of explainability mechanisms would impact model
performance to a degree [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], the experiments have shown that this is not the
case. While for Grad-CAM (CNN) this result might seem
obvious, considering this technique does not alter the
model, but merely looks at the model’s gradients, this is
still surprising. Despite the fact that the technique itself
is not intrusive, the model’s architecture still needed to
be altered in order to create sensible explanations (e.g.
the eCNN’s parallel design), as shown by Fauvel et al. [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ]. Regardless of this architectural change, however, the
explainable model still performed on par with its
counterpart. Similarly, the explainable CNN-LSTM, which
uses not only guided backpropagation, but also an
attention mechanism, showed roughly equal performance to
the non-explainable CNN-LSTM.</p>
          <p>For the LSTM, the addition of explainability even
improved the model’s performance (in terms of
accuracy@1), although this improvement was not
statistically significant. Thus, the experiments show that
explainability mechanisms can be used in deep learning
models for career path prediction without hindering the
models’ predictive powers. For the most part, this is in
line with the results of previous research on the topic [
          <xref ref-type="bibr" rid="ref17 ref18">17, 18</xref>
          ]. However, the fact that the attention mechanisms used
in the eCNN-LSTM and eLSTM did not improve model
accuracy in a statistically significant manner is at odds
with the results found by Schockaert et al. [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] and Ding et al. [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ]. This is likely caused by the differences between
their datasets and the one provided by Randstad. For
example, the majority of candidates in Randstad’s
dataset only had one job on record. In such a scenario,
temporal attention adds no value, as all attention will be
directed towards that single time step.</p>
        </sec>
        <sec id="sec-4-1-3">
          <title>6.1.3. Real-world utility</title>
          <p>User testing showed that recruiters consider the
explainable models usable in a real-world scenario.
Although they were quite critical, giving mostly sufficient
(but not outstanding) grades, they determined that each
model type would at least be helpful to a degree in
finding a job for a candidate. The individual explanation
types tended to score lower than the models as a whole,
indicating that the current implementation of the models’
explanations (i.e. the visualizations in Appendix E) might
require some tuning or extra clarification in order to be
used efficiently by recruiters. Regardless, the recruiters
did indicate that they considered the current
implementation useful as is. Considering the environment for user
testing is quite bare-bones (Appendix D), this is a
positive indication for the actual usability of the models’
explanations. Thus, to allow further capitalization on
the explanations, a more user-friendly interface (e.g.
interactive explanations, clear textual descriptions of the
data) could be used. In doing so, the models might also
become usable by candidates themselves. Considering
the inference time of the models (less than a second),
candidates could enter their careers into Randstad’s
system, and instantly be provided a list of job
recommendations, accompanied by explanations. However, more
research will need to be done to determine if this is
preferable for candidates over having recruiters interpret
the models’ predictions.</p>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>6.2. Potential biases</title>
        <p>While the models performed commendably, and the
explanations were determined to be satisfactory, it is
important to consider the impact of biases in the training
data on the predictions. Although protected features,
such as gender, race, and age were removed from the
dataset, correlation between such features and input
features may still have caused discrimination [31]. For
example, while age was not explicitly present in the data,
the models could still roughly determine a candidate’s
age based on their total number of days worked across
all jobs (a person with a few hundred total days worked
is likely to be in their twenties, while someone with over
ten thousand days worked is probably nearing
retirement). The models’ ability to ‘retrieve’ such protected
features may have negatively affected the
recommendations for specific candidates. Future research could look
into the extent to which this occurs, as well as methods
to alleviate this effect.</p>
      </sec>
      <sec id="sec-4-4">
        <title>6.3. Limitations and expansion</title>
        <p>Due to the lack of a publicly available dataset,
determining state-of-the-art performance is complicated for
career path prediction. Even within Randstad’s own
dataset, performance could be increased by simply
filtering out data entered by candidates. To advance the field
of career path prediction, future research should focus
on creating a general dataset that can be used to directly
compare model performance within the field (in the
same vein as ImageNet for image classification,
https://www.image-net.org/, and TREC for text retrieval,
https://trec.nist.gov/data.html). This benchmarking
dataset should consist of relatively clean, GDPR-compliant,
exhaustive career data of a large variety of candidates.
Using this dataset, future research will be able to better
gauge the performance of different architectures used
for career path prediction (e.g. LSTMs, CNNs, temporal
graphs) and draw direct comparisons between models.
Thus, having a clear and definite state of the art will
most certainly advance the field as a whole.</p>
        <p>Another limitation posed in this paper is the lack of
hardware resources. The NVIDIA Tesla K80 used to
train the models fell short when training the CNN-based
models. Because of the low CUDA core count of 2496,
and the limited 12 gigabytes of VRAM, the convolutional
models had to be limited in terms of kernel size, output
channels, embedding sizes, epochs, and batch sizes to
decrease VRAM usage and keep training time reasonable.</p>
        <p>Consequently, not all possible hyperparameter
configurations could be tested, possibly leaving better model
configurations unexplored.</p>
        <p>Furthermore, the small sample size used for user
testing is an important limitation to acknowledge. Because
the participating recruiters were on payroll, it was
difficult to get their managers’ approval, as well as to schedule
a moment to perform the tests. Subsequently, the results
gathered by the user testing are subject to high variance
and are therefore difficult to use as conclusive evidence.</p>
        <p>Increasing the sample size by also performing user
testing on candidates themselves would have helped solve
this issue and might have provided additional insights.</p>
        <p>Also, improving the clarity of the UI used for user testing
and the models’ explanations could have led to lower
variance, making the results more conclusive.</p>
        <p>Additionally, while only including career switches in
the training data strongly improved the models’ usability,
it also hinders individuals who are looking for new work
within their current field from receiving
recommendations. To account for such candidates, future work could
expand upon the current pipeline by including a
recommendation on whether a candidate should stay within
their current field, or pursue a position with a different
function. For individuals who get recommended to stay
within their profession, the models could, for example, be
altered to recommend a next employer within the field.</p>
      </sec>
      <sec id="sec-4-2">
        <title>6.4. Conclusion</title>
        <p>In the span of this paper, it was shown that career path
predictions made by deep learning models can be made
explainable to a high degree. While different types of
explanations made by the models can differ in terms
of how understandable they are to humans, all of them
turned out to be useful for recruiters nonetheless. Due to
the fact that these explainability mechanisms do not lead
to a decrease in performance, they form a good addition
to existing career path prediction models. This goes
especially for CNN-LSTMs, as those perform the best
as explainable and non-explainable models, while also
providing the best explanations according to recruiters.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>7. Appendix</title>
      <sec id="sec-5-1">
        <title>All code used in the experiments can be found on https: //github.com/Roan-Schellingerhout/MSc_thesis.</title>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>A. Encoding and indexing</title>
      <p>64 feature maps. The 3D max-pooling used a
(64 × 1 × 1) kernel with (1 × 1 × 1) stride.</p>
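      <p>The indexing scheme can be sketched as follows: per candidate,
only the time steps at which an attribute set changes are stored, and
the most recent version is looked up during batch construction. The
dictionary layout below is an illustrative assumption.</p>
      <preformat>import bisect

# Per candidate: the time steps at which an attribute changed, plus the
# attribute values from that step onward. Duplicates across the 25 time
# steps are never materialized, which keeps memory usage low.
index = {
    "candidate_42": {
        "steps": [0, 3, 7],                   # sorted change points
        "skills": [["forklift"],
                   ["forklift", "python"],
                   ["forklift", "python", "sql"]],
    },
}

def attributes_at(candidate, t, field):
    # Retrieve the attribute version that was most recent at time step t
    # (a slight time cost, traded for a large memory saving).
    entry = index[candidate]
    pos = bisect.bisect_right(entry["steps"], t) - 1
    return entry[field][pos]

# attributes_at("candidate_42", 5, "skills") -> ["forklift", "python"]</preformat>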
      <sec id="sec-6-1">
        <title>CNN-LSTM : The CNN-LSTM used a batch size of 128</title>
        <p>and reached optimal performance after 20 epochs.</p>
        <p>The first 2D convolutional layer used a ( 1 × 1)
kernel, with a (1 × 1) stride and half padding,
and generated 32 feature maps. The second 2D
convolutional layer made use of the same kernel
size, stride, and padding, but generated 64 feature
maps. The following 3D average-pooling layer
used a (64 × 1 × 1) kernel and a (1 × 1 × 1) stride.</p>
        <p>Lastly, the model used a single LSTM layer with
hidden size 1000.</p>
        <p>With over 100 thousand careers, each spanning 25 time
steps, and over 1000 features per time step (embedding
values for skills, certificates, previous jobs, previous
companies, addresses, and spoken languages, as well as 300
w2v dimensions per CV), feeding the data into deep
learning models as is, turned out to be infeasible. Making use
of sparse vectors to lower memory usage also was
impossible, due to the incompatibility between CUDA and The optimal hyperparameters found for the
explainsparse vectors/matrices [32]. However, considering the able models are as follows:
large amount of duplicate data (a candidate’s
skills/certificates/CVs do not change at every time step, and can eLSTM : The explainable LSTM used a batch size of 128
therefore often be repeated), use was made of indices and reached optimal performance after 5 epochs.
in order to lower memory usage, at the cost of a slight It used a single LSTM layer with hidden size 1000.
time complexity increase. For each candidate, a loca- eCNN : The explainable CNN used a batch size of 128
tion within each index was created that contained their and reached optimal performance after 2 epochs.
unique attributes, and the time steps from which those The top part used a 2D convolutional layer with
attributes became the most recent ones. By then retriev- a (5 × 1) kernel (thus,    = 5 ), a (1 ×
ing the relevant attributes for each candidate in a batch 1) stride, half padding, and generated 8 feature
during training, the required memory usage was lowered maps (thus,  1 = 8 ). For the bottom part, the 1D
drastically. convolutional layer used a (5 ×    ) kernel,
a (1 × 1) stride, half padding, and also generated
B. Hyperparameters 8 feature maps. The final 1D convolutional layer
used a kernel size of (5 × (   + 1) ), a (1 ×
1) stride, half padding, and generated 32 feature
maps (thus,  2 = 32 ). These 32 feature maps
were then ran through an 3D average-pooling
layer with kernel size (32 × 1 × 1) and a (1 × 1 × 1)
stride.</p>
      </sec>
      <sec id="sec-6-2">
        <title>All hyperparameter tuning results can be found on</title>
        <p>GitHub. For each configuration, the models were ran
for 3 epochs. Based on the results after those 3 epochs,
the best performing configuration was ran for 20 epochs
to find the optimal number of epochs. Not every intended
hyperparameter configuration could be tested due to
hardware/time constraints. For example, the CNN-based
models needed to be limited to small kernels and output
channels to prevent running out of VRAM. Additionally,
the eCNN was only trained for a total of 3 epochs, due
to time constraints (as each epoch took nearly 8 hours).</p>
        <p>All models were optimized using the Adam optimizer
[33] (learning rate = 1 ∗ 10−3) with cross-entropy loss.</p>
        <p>The hyperparameters used for the results of the non- C. Recruiter vs. model
explainable models in Table 1 were the following:
eCNN-LSTM : The explainable CNN-LSTM used a
batch size of 2048 and reached optimal
performance after 15 epochs. Its 2D convolutional layer
used a kernel of size ( ℎ × 1) and half
padding, and was followed by a single LSTM with
hidden size 1000.
distributions</p>
      </sec>
      <sec id="sec-6-3">
        <title>LSTM : The HCPNN used a batch size of 512 and reached optimal performance after 18 epochs. It used a single LSTM layer with hidden size 1000.</title>
      </sec>
      <sec id="sec-6-4">
        <title>The distributions of feature importance on which Table 2</title>
        <p>is based can be seen in Figures 5a, 5b, and 5c. Each model
distribution is based on the average feature importance
CNN : The CNN used a batch size of 128 and reached determined by the models across the three categories
optimal performance after 11 epochs. The 2D (finance, health care, and customer support). For the
convolutional layer consisted of a (5 × 5) kernel, recruiter distribution, the average is taken over the three
with (1 × 1) padding and stride, and generated industries, as well as all recruiters within those industries
(as a result,  = 6 for all recruiter distributions).
23 25
Sequence length (log )
27</p>
        <p>29</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>D. User testing</title>
      <p>User testing was conducted using a web environment
accessible by the recruiters. The web app was hosted
using Amazon EC2 in combination with Docker, and built
using Flask, jQuery, Jinja, and AJAX. The recruiters were
tasked to enter their e-mail address (to allow follow-up
questions if needed) and select their expertise (finance,
health care, customer support). Afterwards, they were
shown the three predictions for their industry (one per
model) and the accompanying explanations to judge, as
described in Section 4.2.3.</p>
    </sec>
    <sec id="sec-8">
      <title>E. Explanation examples</title>
      <sec id="sec-8-1">
        <title>The explanations provided by the three diferent models</title>
        <p>for the same candidate can be found in Figures 10, 11,
and 12. The correct label for this candidate was Survey
and market research interviewer.</p>
        <p>3
Time step
4</p>
        <p>5
3
Time step
1
3
Time step
1
3
Time step</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Parigi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <article-title>The gig economy</article-title>
          ,
          <source>XRDS: Crossroads, The ACM Magazine for Students</source>
          <volume>23</volume>
          (
          <year>2016</year>
          )
          <fpage>38</fpage>
          -
          <lpage>41</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hanson</surname>
          </string-name>
          ,
          <article-title>Educational attainment statistics [2022]: Levels by demographic</article-title>
          ,
          <year>2021</year>
          . URL: https://educationdata.org/education-attainment-statistics.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T.</given-names>
            <surname>Zimmermann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kotschenreuther</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <article-title>Data-driven HR: résumé analysis based on natural language processing and machine learning</article-title>
          ,
          <source>arXiv preprint arXiv:1606.05611</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <article-title>A hierarchical career-path-aware neural network for job mobility prediction</article-title>
          ,
          <source>in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>14</fpage>
          -
          <lpage>24</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Nie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Rosenblum</surname>
          </string-name>
          ,
          <article-title>Fortune teller: predicting your career path</article-title>
          ,
          <source>in: Proceedings of the AAAI conference on artificial intelligence</source>
          , volume
          <volume>30</volume>
          ,
          <year>2016</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>16</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kokkodis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. G.</given-names>
            <surname>Ipeirotis</surname>
          </string-name>
          ,
          <article-title>Demand-aware career path recommendations: A reinforcement learning approach</article-title>
          ,
          <source>Management Science</source>
          <volume>67</volume>
          (
          <year>2021</year>
          )
          <fpage>4362</fpage>
          -
          <lpage>4383</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Career trajectory prediction based on cnn</article-title>
          ,
          <source>in: 2019 IEEE International Conference on Service Operations and Logistics, and Informatics (SOLI)</source>
          , IEEE,
          <year>2019</year>
          , pp.
          <fpage>22</fpage>
          -
          <lpage>26</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Gunning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stefik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Stumpf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.-Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Xai-explainable artificial intelligence</article-title>
          ,
          <source>Science Robotics</source>
          <volume>4</volume>
          (
          <year>2019</year>
          )
          <elocation-id>eaay7120</elocation-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <collab>Statista Research Department</collab>
          ,
          <article-title>Staffing industry: Leading companies worldwide</article-title>
          ,
          <year>2022</year>
          . URL: https://www.statista.com/statistics/257876/staffing-companies-worldwide-by-revenue/.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vidal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Kristjanpoller</surname>
          </string-name>
          ,
          <article-title>Gold volatility prediction using a cnn-lstm approach</article-title>
          ,
          <source>Expert Systems with Applications</source>
          <volume>157</volume>
          (
          <year>2020</year>
          )
          <fpage>113481</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>W.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>A cnn-lstm-based model to forecast stock prices</article-title>
          ,
          <source>Complexity</source>
          <volume>2020</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>R.</given-names>
            <surname>Rick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Berton</surname>
          </string-name>
          ,
          <article-title>Energy forecasting model based on cnn-lstm-ae for many time series with unequal lengths</article-title>
          ,
          <source>Engineering Applications of Artificial Intelligence</source>
          <volume>113</volume>
          (
          <year>2022</year>
          )
          <fpage>104998</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>T.-Y.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-B.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <article-title>Predicting residential energy consumption using cnn-lstm neural networks</article-title>
          ,
          <source>Energy</source>
          <volume>182</volume>
          (
          <year>2019</year>
          )
          <fpage>72</fpage>
          -
          <lpage>81</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>I. E.</given-names>
            <surname>Livieris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Pintelas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pintelas</surname>
          </string-name>
          ,
          <article-title>A cnn-lstm model for gold price time-series forecasting</article-title>
          ,
          <source>Neural computing and applications</source>
          <volume>32</volume>
          (
          <year>2020</year>
          )
          <fpage>17351</fpage>
          -
          <lpage>17360</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>H.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. P.</given-names>
            <surname>Lim</surname>
          </string-name>
          ,
          <article-title>Evolving cnn-lstm models for time series prediction using enhanced grey wolf optimizer</article-title>
          ,
          <source>IEEE Access</source>
          <volume>8</volume>
          (
          <year>2020</year>
          )
          <fpage>161519</fpage>
          -
          <lpage>161541</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J.</given-names>
            <surname>Choo</surname>
          </string-name>
          , S. Liu,
          <article-title>Visual analytics for explainable deep learning</article-title>
          ,
          <source>IEEE computer graphics and applications</source>
          <volume>38</volume>
          (
          <year>2018</year>
          )
          <fpage>84</fpage>
          -
          <lpage>92</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J. T.</given-names>
            <surname>Springenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Brox</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Riedmiller</surname>
          </string-name>
          ,
          <article-title>Striving for simplicity: The all convolutional net</article-title>
          ,
          <source>arXiv preprint arXiv:1412.6806</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>R. R.</given-names>
            <surname>Selvaraju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cogswell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Vedantam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Parikh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Batra</surname>
          </string-name>
          ,
          <article-title>Grad-cam: Visual explanations from deep networks via gradient-based localization</article-title>
          ,
          <source>in: Proceedings of the IEEE international conference on computer vision</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>618</fpage>
          -
          <lpage>626</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ł.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>30</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>T.</given-names>
            <surname>Rojat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Puget</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Filliat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Del</given-names>
            <surname>Ser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Gelin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Díaz-Rodríguez</surname>
          </string-name>
          ,
          <article-title>Explainable artificial intelligence (xai) on time series data: A survey</article-title>
          ,
          <source>arXiv preprint arXiv:2104.00950</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>C.</given-names>
            <surname>Schockaert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Leperlier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moawad</surname>
          </string-name>
          ,
          <article-title>Attention mechanism for multivariate time series recurrent model interpretability applied to the ironmaking industry</article-title>
          ,
          <source>arXiv preprint arXiv:2007.12617</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>A.</given-names>
            <surname>Paszke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gross</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Massa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lerer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bradbury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Chanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Killeen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Gimelshein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Antiga</surname>
          </string-name>
          , et al.,
          <article-title>Pytorch: An imperative style, high-performance deep learning library</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>32</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Corrado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <article-title>Efficient estimation of word representations in vector space</article-title>
          ,
          <source>arXiv preprint arXiv:1301.3781</source>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Berndt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clifford</surname>
          </string-name>
          ,
          <article-title>Using dynamic time warping to find patterns in time series</article-title>
          ,
          <source>in: KDD workshop</source>
          , volume
          <volume>10</volume>
          , Seattle, WA, USA,
          <year>1994</year>
          , pp.
          <fpage>359</fpage>
          -
          <lpage>370</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Powers</surname>
          </string-name>
          ,
          <article-title>Evaluation: from precision, recall and f-measure to roc, informedness, markedness and correlation</article-title>
          ,
          <source>arXiv preprint arXiv:2010.16061</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <article-title>Interpretable spatio-temporal attention lstm model for flood forecasting</article-title>
          ,
          <source>Neurocomputing</source>
          <volume>403</volume>
          (
          <year>2020</year>
          )
          <fpage>348</fpage>
          -
          <lpage>359</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>K.</given-names>
            <surname>Fauvel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Masson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>É.</given-names>
            <surname>Fromont</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Termier</surname>
          </string-name>
          ,
          <article-title>Xcm: An explainable convolutional neural network for multivariate time series classification</article-title>
          ,
          <source>Mathematics</source>
          <volume>9</volume>
          (
          <year>2021</year>
          )
          <fpage>3137</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>R.</given-names>
            <surname>Eldan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Shamir</surname>
          </string-name>
          ,
          <article-title>The power of depth for feedforward neural networks</article-title>
          ,
          <source>in: Conference on learning theory, PMLR</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>907</fpage>
          -
          <lpage>940</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>A.</given-names>
            <surname>Subasi</surname>
          </string-name>
          , Chapter 5
          <article-title>- other classification examples</article-title>
          , in: A.
          <string-name>
            <surname>Subasi</surname>
          </string-name>
          (Ed.),
          <source>Practical Machine Learning for Data Analysis Using Python</source>
          , Academic Press,
          <year>2020</year>
          , pp.
          <fpage>323</fpage>
          -
          <lpage>390</lpage>
          . URL: https://www.sciencedirect.com/science/article/pii/B9780128213797000059. doi:https://doi.org/10.1016/B978-0-12-821379-7.00005-9
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <source>0.05 0.10 Gradient 0.15 0.20 0.05 0.10 0.15 0.20 0.25 0.30 0</source>
          .
          <fpage>35</fpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>