=Paper=
{{Paper
|id=Vol-2967/paper8
|storemode=property
|title=conSultantBERT: Fine-tuned Siamese Sentence-BERT for Matching Jobs and Job Seekers
|pdfUrl=https://ceur-ws.org/Vol-2967/paper_8.pdf
|volume=Vol-2967
|authors=Dor Lavi,Volodymyr Medentsiy,David Graus
|dblpUrl=https://dblp.org/rec/conf/hr-recsys/LaviMG21
}}
==conSultantBERT: Fine-tuned Siamese Sentence-BERT for Matching Jobs and Job Seekers==
conSultantBERT: Fine-tuned Siamese Sentence-BERT for Matching Jobs and Job Seekers
Dor Lavi, Volodymyr Medentsiy, David Graus
dor.lavi@randstadgroep.nl, volodymyr.medentsiy@randstadgroep.nl, david.graus@randstadgroep.nl
Randstad Groep Nederland, Diemen, The Netherlands
RecSys in HR 2021, October 1, 2021, Amsterdam, the Netherlands. Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
In this paper we focus on constructing useful embeddings of textual information in vacancies and resumes, which we aim to incorporate as features into job to job seeker matching models alongside other features. We explain our task, where noisy data from parsed resumes, the heterogeneous nature of the different sources of data, and cross-linguality and multilinguality present domain-specific challenges.

We address these challenges by fine-tuning a Siamese Sentence-BERT (SBERT) model, which we call conSultantBERT, using a large-scale, real-world, and high-quality dataset of over 270,000 resume-vacancy pairs labeled by our staffing consultants. We show how our fine-tuned model significantly outperforms unsupervised and supervised baselines that rely on TF-IDF-weighted feature vectors and BERT embeddings. In addition, we find our model successfully matches cross-lingual and multilingual textual content.

CCS CONCEPTS
• Computing methodologies → Ranking; • Information systems → Content analysis and feature selection; Similarity measures; Language models.

KEYWORDS
job matching, BERT, fine-tuning

1 INTRODUCTION
Randstad is the global leader in the HR services industry. We support people and organizations in realizing their true potential by combining the power of today's technology with our passion for people. In 2020, we helped more than two million job seekers find a meaningful job with our 236,100 clients. Randstad is active in 38 markets around the world and has top-three positions in almost half of these. In 2020, Randstad had on average 34,680 corporate employees and generated revenue of €20.7 billion.

Each day, at Randstad, we employ industry-scale recommender systems to recommend thousands of job seekers to our clients, and the other way around: vacancies to job seekers. Our job recommender system is based on a heterogeneous collection of input data: curriculum vitaes (resumes) of job seekers, vacancy texts (job descriptions), and structured data (e.g., the location of a job seeker or vacancy). The goal of our system is to recommend the best job seekers to each open vacancy. In this paper we explore methods for constructing useful embeddings of textual information in vacancies and resumes.

The main requirements of the model are (i) to be able to operate on multiple languages at the same time, and (ii) to be usable to efficiently compare a vacancy with a large dataset of available resumes.

Our end goal is to incorporate these embeddings, or features derived from them, in a larger recommender system that combines a heterogeneous feature set, spanning, e.g., categorical features and real-valued features.

1.1 Problem setting
Several challenges arise when matching jobs to job seekers through textual resume and vacancy data.

First, the data we work with is inherently noisy. On the one hand, resumes are user-generated data, usually (but not always) in PDF format. Parsing those files to plain text is a challenge in itself and is therefore out of scope for this paper. On the other hand, vacancies are usually structured, formatted text.

Second, the nature of the data differs. Most NLP research in text similarity is based on the assumption that two pieces of information are the same but written differently [3]. However, in our case the two documents do not express the same information, but complement each other like pieces of a puzzle. Our goal is to match two complementary pieces of textual information that may not exhibit direct overlap or similarity.

Third, as a multinational corporation that operates across the globe, developing separate models for each market and language does not scale. Therefore, a desired property of our system is multilinguality: a system that supports as many languages as possible. In addition, as it is common to have English resumes in non-English countries (e.g., in the Dutch market around 10% of resumes are in English), cross-linguality is another desired property, e.g., being able to match English resumes to Dutch vacancies.

This paper is structured as follows. First, in Section 2 we summarize related work on the use of neural embeddings of textual information in the recruitment/HR domain. Next, in Section 3.1 we describe how we leverage our internal history of job seeker placements to create a labeled resume-vacancy pairs dataset. Then, in Section 3.2 we describe how we fine-tune a multilingual BERT with a bi-encoder structure [12] over this dataset, by adding a cosine similarity loss layer. Finally, in Section 4 we describe how this architecture helps us overcome most of the challenges described above, and how it enables us to build a maintainable and scalable pipeline to match resumes and vacancies.
2 RELATED WORK
Neural embeddings are widely used for content-based retrieval, and embedding models have become an essential component in modern recommender system pipelines [8] [6]. The main focus of our work is to construct embeddings of the textual information in vacancies and resumes, which can then be incorporated into a job-job seeker matching model and used along with other features, e.g., location and other categorical features. We therefore focus on reviewing methods of embedding vacancies and resumes.

The task of embedding resumes and vacancies can be posed as creating domain-specific document embeddings. Although context-aware embeddings have proven to outperform bag-of-words approaches on most NLP tasks in academia, the latter are still widely used in industry.

Bian et al. propose to construct two sub-models with a co-teaching mechanism to combine the predictions of those models. The first sub-model encodes relational information of resume and vacancy, and the second sub-model, which is related to our work, encodes the textual information in resume and vacancy. Documents are processed per sentence, with every sentence being encoded using the CLS token of the BERT model. A Hierarchical Transformer is applied on top of the sentence embeddings to get the document embeddings. The final match prediction is obtained by applying a fully-connected layer with sigmoid activation on the concatenated embeddings of resume and vacancy. Bhatia et al. propose to fine-tune BERT on a sequence pair classification task to predict whether two job experiences belong to one person or not. The proposed method does not require a dataset of labeled resume-vacancy pairs. The fine-tuned model is used to embed both the job description of the vacancy and the current job experiences of the job seeker. Zhao et al. process words in resumes and vacancies using word2vec embeddings and a domain-specific vocabulary. The word embeddings are fed into a stack of convolutional blocks of different kernel sizes, on top of which Zhao et al. apply attention to get the context vector and project it into the embedding space using a fully-connected layer. The model is trained using the binary cross-entropy loss on the task of predicting a match between job seeker and vacancy. Ramanath et al. use supervised and unsupervised embeddings for their ranking model to recommend candidates to recruiter queries. The unsupervised method does not use the unstructured textual data but relies on the data stored in the LinkedIn Economic Graph [15], which represents entities such as skills, educational institutions, employers, employees, and relations among them; candidates and queries are embedded using graph neural models. The supervised method embeds the textual information in recruiters' queries and candidates' profiles using the DSSM [7] model. The DSSM operates on character trigrams and is composed of two separate models to embed queries and candidates. The DSSM model is trained on historical data of recruiters' interactions with candidates.

Zhu et al. utilize skip-gram embeddings of various dimensionalities (64 for the resume and 256 for the vacancy) to encode words, which are then passed through two convolutional layers. A pooling operation (max pooling for the resume and mean pooling for the vacancy) is applied to the output of the convolutional layers to get the embeddings of resume and vacancy. The model is optimized using the cosine similarity loss. Qin et al. divide job postings into sections of ability requirements, and resumes into sections of experiences. The words are encoded using pre-trained embeddings and processed with a bi-LSTM. To get the embeddings of job postings and resumes, they propose a two-step hierarchical pipeline: every section is encoded using an attention mechanism, and finally, to get the embeddings of job postings and resumes, they run a bi-LSTM on top of the section encodings and aggregate the bi-LSTM outputs using the attention mechanism. Additionally, when processing the information in resumes, Qin et al. propose to add encodings of job postings to emphasize the skills in a resume that are relevant for a specific job.

We cannot directly compare our work with these approaches, because of the different datasets used. For example, Bhatia et al. and Qin et al. assume well-structured resumes, which is not the case in our situation. Ramanath et al. build an embedding model for recruiters' queries, which are shorter than vacancies and lack the context provided in the vacancy. Bhatia et al. propose a solution for when a limited amount of data is available. We, on the other hand, work in a setting of abundant data of a heterogeneous nature, but at the same time our resumes lack consistent structure, while vacancies are given in a more structured way. Additional issues that are not considered by most of the reviewed works are (i) cross-linguality, where we aim to match English resumes to Dutch vacancies if there is a potential match, and (ii) multilinguality, where we aim to serve a single model for multiple languages. We address both issues with our approach. Next, Qin et al. and Zhu et al. observe that an embedding model may benefit from constructing parallel pipelines to process resume and vacancy. Our approach relies on a shared embedding model for resumes and vacancies.

While many organizations are capable of training off-the-shelf transformers, few have an abundance of high-quality labeled data available. As a global market leader, we are in a unique position: we have both a rich history of high-quality interactions between consultants, candidates, and vacancies, and the content to represent those candidates and vacancies.

3 METHOD
Here we describe our method. More specifically, in Section 3.1 we describe how we acquire our labeled dataset of resume-vacancy pairs. Next, in Section 3.2 we describe our multilingual SBERT with bi-encoder structure and cosine similarity score as output layer.

3.1 Dataset creation
We have a rich history of interactions between consultants (recruiters) and job seekers (candidates). We define as a positive signal any point of contact between a job seeker and a consultant (e.g., a phone call, interview, job offer, etc.). Negative signals are defined by job seekers who submit their profile, but get rejected by a consultant without any interaction (i.e., the consultant looks at the job seeker's profile, and rejects). In addition, since we have an unbalanced dataset, for each vacancy we add random negative samples, which we draw randomly from our job seeker pool. This is done in the spirit of other works, which also complement historical data with random negative pairs [2, 17, 18].
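To make the pairing step concrete, the following is a minimal sketch of how consultant-labeled pairs can be complemented with randomly drawn negatives. The column names (candidate_id, vacancy_id, label) are hypothetical and this is a sketch, not our production pipeline.

```python
import numpy as np
import pandas as pd

def add_random_negatives(interactions: pd.DataFrame,
                         n_random: int = 38_004,
                         seed: int = 42) -> pd.DataFrame:
    """interactions holds consultant-labeled pairs with hypothetical columns
    [candidate_id, vacancy_id, label]: label=1 for any point of contact,
    label=0 for a rejection without interaction."""
    rng = np.random.default_rng(seed)
    candidates = interactions["candidate_id"].unique()
    vacancies = interactions["vacancy_id"].unique()
    known = set(zip(interactions["candidate_id"], interactions["vacancy_id"]))

    sampled = []
    while len(sampled) < n_random:
        c, v = rng.choice(candidates), rng.choice(vacancies)
        if (c, v) not in known:  # keep only pairs never seen by a consultant
            sampled.append({"candidate_id": c, "vacancy_id": v, "label": 0})
            known.add((c, v))

    return pd.concat([interactions, pd.DataFrame(sampled)], ignore_index=True)
```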
Our dataset consists of 274,407 resume-vacancy pairs, of which 126,679 are positive samples, 109,724 are negative samples as defined by actual recruiters, and 38,004 are randomly drawn negative samples. We have 156,256 unique resumes and 23,080 unique vacancy texts, which implies that one vacancy can be paired with multiple resumes. Figure 1 shows the histogram of the number of resume-vacancy samples per vacancy. We see that for the majority of vacancies we have a small number of paired resumes, with approximately 10.5% of our vacancies being paired with a single resume, and approximately 30% of our vacancies being paired with at most three resumes.

Figure 1: Histogram of the number of resumes per vacancy.

Resumes are user-generated PDF documents, which we parse with Apache Tika (https://tika.apache.org/). Overall, these parsed resumes can be considered quite noisy input to our model; there is a wide variation in the format and structure of resumes, where common challenges include the ordering of different textual blocks, the diversity of their content (e.g., spanning any type of information from personalia, education, and work experience to hobbies), and the parsing of tables and bullet points. At the same time, vacancies are usually well-structured and standardized documents. They consist of on average 2,100 tokens (while resumes are on average longer, comprising 2,500 tokens), and are roughly structured according to the following sections: job title, job description, job requirements (including required skills), job benefits (including compensation and other aspects), and company description, which usually includes information about the industry of the job offered in the vacancy.

3.2 Architecture
Our method, dubbed conSultantBERT (as it is fine-tuned using labels provided by our consultants), utilizes the multilingual BERT [5] model pre-trained on Wikipedia pages of 100 languages (we use the bert-base-multilingual-cased model from the HuggingFace library [16]). We fine-tune it using Siamese networks, as proposed by Reimers and Gurevych. This method of fine-tuning BERT employs the bi-encoder structure, which is effective for matching a vacancy against a large pool of available resumes.

The original formulation of the Sentence-BERT (SBERT) model takes a pair of sentences as input, with the words in every sentence independently encoded using BERT and aggregated by pooling to get the sentence embeddings. After that, it either optimizes the regression loss, which is the MSE loss between the cosine similarity score and the true similarity label, or optimizes the classification loss, namely the cross-entropy loss.

Whereas SBERT is aimed at computing pairwise semantic similarity, we show that it can be applied effectively to our task of matching two heterogeneous types of data. Our fine-tuning pipeline is illustrated in Figure 2: we pass resume and vacancy pairs to the Siamese SBERT and experiment with classification and regression objectives.

3.2.1 Document representation. Most transformer models, including SBERT, aim to model sentences as input, while we are interested in modeling documents.

We experimented with several methods for document representation using the embeddings of the pre-trained BERT model. First, we attempted to split our input documents into sentences and to encode each sentence. We tried different sentence representations: first, we used the pooled layer output from BERT to represent each sentence, which we then average to represent the document. Here, we experimented with both simply averaging sentences and weighted averaging (by sentence length).

Next, we tried to take the mean of the last 4 layers of the [CLS] token, which is a special token placed at the start of each sentence and considered a suitable representation of the sentence (according to Devlin et al.). With these sentence representations, too, we represented the underlying document through averaging, both weighted and simple.

Our final approach, however, was simpler. We ended up treating the first 512 tokens of each input document as input for the SBERT model, ignoring sentence boundaries. To avoid trimming too much content, we pre-processed and cleaned our input documents by, e.g., removing non-textual content that resulted from parsing errors.
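As an illustration of one of the representations we compared (the mean of the last 4 layers of the [CLS] token, averaged over sentences), the sketch below assumes the HuggingFace transformers API and a naive split on periods; it is not the representation we ultimately adopted.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
bert = AutoModel.from_pretrained("bert-base-multilingual-cased", output_hidden_states=True)

def cls_embedding(sentence: str) -> torch.Tensor:
    """Mean of the last 4 hidden layers at the [CLS] position (768-dim)."""
    encoded = tokenizer(sentence, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden_states = bert(**encoded).hidden_states  # embedding layer + 12 encoder layers
    return torch.stack([h[0, 0] for h in hidden_states[-4:]]).mean(dim=0)

def document_embedding(document: str) -> torch.Tensor:
    """Unweighted average of sentence-level [CLS] embeddings (naive split on '.')."""
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    return torch.stack([cls_embedding(s) for s in sentences]).mean(dim=0)
```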
3.2.2 Fine-tuning method. We fine-tune our model on our 80% training data split for 5 epochs, using a batch size of 4 pairs and mean pooling, which were found to be the optimal parameters during hyperparameter tuning on our validation set. To fine-tune the pre-trained BERT model, we need to trim the content of every resume and vacancy to 512 tokens, as the base model limits the maximum number of tokens to 512.
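The fine-tuning setup above maps directly onto the sentence-transformers library of Reimers and Gurevych [12]. The sketch below shows the regression (cosine similarity) objective with the hyperparameters reported above; load_pairs is a hypothetical helper that yields (resume text, vacancy text, label) tuples, and the classification variant would swap in losses.SoftmaxLoss with integer labels.

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses, models

# Bi-encoder: multilingual BERT + mean pooling, inputs truncated to 512 tokens.
word_embedding = models.Transformer("bert-base-multilingual-cased", max_seq_length=512)
pooling = models.Pooling(word_embedding.get_word_embedding_dimension(), pooling_mode="mean")
model = SentenceTransformer(modules=[word_embedding, pooling])

# load_pairs() is a hypothetical helper returning (resume_text, vacancy_text, label)
# tuples, with label 1.0 for a match and 0.0 otherwise.
train_examples = [InputExample(texts=[resume, vacancy], label=float(label))
                  for resume, vacancy, label in load_pairs("train")]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=4)

# Regression objective: MSE between the cosine similarity and the 0/1 label.
# For the classification objective, use losses.SoftmaxLoss(..., num_labels=2)
# with integer labels instead.
train_loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=5,
          warmup_steps=100)  # warmup_steps is an illustrative choice

# At inference time, both sides are embedded independently and compared by cosine similarity.
```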
4 EXPERIMENTAL SETUP
In this section, we describe the final datasets we use for training, validation, and testing in Section 4.1, the baselines and why we employ them in Section 4.2, our proposed fine-tuned conSultantBERT approach in Section 4.3, and finally, in Section 4.4 we explain our evaluation metrics and statistical testing methodology.

4.1 Dataset
As described in Section 3.1, our dataset consists of 274,407 resume-vacancy pairs. We split this dataset into 80% train (219,525 samples), 10% validation, and 10% test (27,441 samples each).
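A minimal sketch of this split, assuming the labeled pairs live in a pandas DataFrame named pairs (a hypothetical name):

```python
from sklearn.model_selection import train_test_split

# 80% train, then split the remaining 20% evenly into validation and test.
train_df, rest_df = train_test_split(pairs, test_size=0.2, random_state=42)
val_df, test_df = train_test_split(rest_df, test_size=0.5, random_state=42)
```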
Figure 2: Our conSultantBERT architectures with classification objective (left) and regression objective (right), with resume
text input on the left, and vacancy text input on the right-hand side. Image adopted from [12]. Note that vacancy and resume
differ at the level of words (vocabulary), but also format, structure, and semantics.
We use our training set to (i) fine-tune the embedding models and (ii) train the supervised random forest classifiers; we use the validation set for the hyperparameter search of the SBERT hyperparameters described in Section 3.2.2, and we report performance on our test set.

4.2 Baselines
As this paper revolves around constructing useful feature representations (i.e., embeddings) of the textual information in vacancies and resumes, we compare several approaches to generating these embeddings.

4.2.1 Unsupervised. Our first baselines rely on unsupervised feature representations. More specifically, we represent both our vacancies and resumes as either (i) TF-IDF-weighted vectors (TFIDF), or (ii) pre-trained BERT embeddings (BERT).

We then compute cosine similarities between pairs of resumes and vacancies, and consider the cosine similarity as the predicted “matching score.” Our TF-IDF vectors have 768 dimensions, which is equal to the dimensionality of the BERT embeddings (we experimented with increasing the dimensionality of the TF-IDF vectors, but this did not give any substantial increase in performance: increasing the dimensionality 3-fold improved ROC-AUC by +0.7%). We fitted our TF-IDF weights on the training set, comprising both vacancy and resume data. As described in Section 3.2, we rely on BERT models pre-trained on Wikipedia from the HuggingFace library [16].

These unsupervised baselines help us assess the extent to which the vocabulary gap is problematic: if vacancies and resumes use completely different words, both bag-of-words-based approaches such as TF-IDF weighting and embedding representations of these different words will likely show low similarity. Formulated differently, if word overlap or proximity between vacancies and resumes is meaningful, these baselines should be able to perform decently.

4.2.2 Supervised. Next, we present our two supervised baselines, where we employ a random forest classifier that is trained on top of the feature representations described above (TFIDF+RF, BERT+RF). These supervised baselines are trained on our 80% train split, using the default parameters given by the scikit-learn library [9].

We add these supervised methods and compare them to the previously described unsupervised baselines to further establish the extent of the aforementioned vocabulary gap, and the extent to which the heterogeneous nature of the two types of documents mentioned in Section 1.1 plays a role. That is to say, if there is a direct mapping that can be learned from the words in one source of data (vacancy or resume) to the other source, supervised baselines should be able to pick this up and outperform the unsupervised baselines.
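As a sketch of the baselines in this section (assuming parallel Python lists resumes, vacancies, and labels for the training split), the TF-IDF variant is shown below; the BERT variant would substitute pre-trained embeddings for the TF-IDF vectors. How the two document vectors are combined before the random forest is not spelled out above, so the concatenation used here is our own simplification.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Assumes parallel lists: resumes, vacancies (strings) and labels (0/1) for the train split.
tfidf = TfidfVectorizer(max_features=768)   # match the 768-dim BERT embeddings
tfidf.fit(resumes + vacancies)              # fitted on both sources, training set only

resume_vecs = tfidf.transform(resumes).toarray()
vacancy_vecs = tfidf.transform(vacancies).toarray()

# Unsupervised baseline (TFIDF+Cosine): cosine similarity as the matching score.
unsupervised_scores = np.array([cosine_similarity(r[None, :], v[None, :])[0, 0]
                                for r, v in zip(resume_vecs, vacancy_vecs)])

# Supervised baseline (TFIDF+RF): a random forest on top of the pair representation.
# Concatenating the two vectors is a simplification; the exact pair features may differ.
pair_features = np.hstack([resume_vecs, vacancy_vecs])
clf = RandomForestClassifier(random_state=42)  # otherwise scikit-learn default parameters
clf.fit(pair_features, labels)
supervised_scores = clf.predict_proba(pair_features)[:, 1]
```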
4.3 conSultantBERT
Finally, we present our fine-tuned embedding model, conSultantBERT (the capitalized S in conSultant is a reference to SBERT), which we fine-tune using both the classification objective and the regression objective, as explained in Section 3.2 and illustrated in Figure 2.
# Model ROC-AUC precision recall f1
1. BERT+Cosine 0.5325 0.5057 0.5006 0.3379
2. TFIDF+Cosine 0.5730∗ 0.5890 0.5016 0.3564
3. BERT+RF 0.6978 0.6421 0.6363 0.6360
4. TFIDF+RF 0.7174 0.6581 0.6526 0.6527
5. conSultantBERTClassifier+Cosine 0.7474 0.7001 0.6426 0.5994
6. conSultantBERTClassifier+RF 0.8366∗ 0.7642 0.7643 0.7643
7. conSultantBERTRegressor+Cosine 0.8459∗ 0.7714 0.7664 0.7677
8. conSultantBERTRegressor+RF 0.8389 0.7684 0.7658 0.7666
Table 1: Results of different runs. ∗ denotes statistically significant difference from the alternative in the same group at 𝛼 = 0.01.
Best-performing run in bold face.
Our intuition for using the classification objective comes from our task and dataset, which consist of binary class labels, making the classification objective the most obvious choice. At the same time, our work revolves around searching for meaningful embeddings, not necessarily solving the end task in the best possible way, for which, we hypothesize, the model may benefit from the more fine-grained information that is present in the cosine similarity optimization metric.

Similarly to our baselines, we consider (i) our fine-tuned model's direct output layer (i.e., the cosine similarity output layer) as the “matching score” between a candidate and a vacancy (conSultantBERTRegressor+Cosine, conSultantBERTClassifier+Cosine), in addition to (ii) the predictions made by a supervised random forest classifier which we train on the embedding layer's output (see Figure 2), yielding the following supervised models: conSultantBERTRegressor+RF and conSultantBERTClassifier+RF. We explore the latter since we aim to incorporate our model in a production system alongside other models and features.

4.4 Evaluation
In order to compare our different methods and baselines, we turn to standard evaluation metrics for classification problems. We consider ROC-AUC as our main metric, as it is insensitive to thresholding and scale-invariant. In addition, we turn to macro-averaged (since our dataset is fairly balanced) precision, recall, and F1 scores for a deeper understanding of the specific behaviors of the different methods.

Finally, we perform independent Student's t-tests for statistical significance testing, and set the 𝛼-level at 0.01.
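The evaluation boils down to a handful of scikit-learn and SciPy calls. In the sketch below, the 0.5 decision threshold and the use of bootstrap or per-fold samples of the metric for the t-test are our own assumptions.

```python
import numpy as np
from scipy import stats
from sklearn.metrics import precision_recall_fscore_support, roc_auc_score

def evaluate(y_true: np.ndarray, scores: np.ndarray, threshold: float = 0.5) -> dict:
    """ROC-AUC on the raw scores plus macro-averaged precision/recall/F1
    on thresholded predictions (the 0.5 threshold is an assumption here)."""
    y_pred = (scores >= threshold).astype(int)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    return {"roc_auc": roc_auc_score(y_true, scores),
            "precision": precision, "recall": recall, "f1": f1}

# Significance testing between two runs, e.g., over bootstrap samples of ROC-AUC.
def significantly_different(samples_a, samples_b, alpha: float = 0.01) -> bool:
    _, p_value = stats.ttest_ind(samples_a, samples_b)
    return p_value < alpha
```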
5 RESULTS
See Table 1 for the results of our baselines and fine-tuned model. We first turn to our main evaluation metric, ROC-AUC, below, and next to the precision and recall scores in Section 5.2.

5.1 Overall performance
5.1.1 Baselines. As expected, using BERT embeddings or TF-IDF vectors in an unsupervised manner, i.e., by computing cosine similarity as a matching score, does not perform well. This confirms our observations about the challenging nature of our domain; word overlap or semantic proximity/similarity does not seem a sufficiently strong signal for identifying which resumes match a given vacancy.

In rows 3 and 4 we show the methods where a supervised classifier is trained using the input representations described above (TFIDF+RF and BERT+RF). Here, we see that with a supervised classifier, both TF-IDF-weighted vectors as feature representation and pre-trained BERT embeddings vastly improve over the unsupervised baselines, with a +31.0% improvement for BERT and a +25.2% improvement for the TF-IDF-weighted vectors. This suggests that, to some extent, a mapping can be learned between the two types of documents, as we hypothesized in Section 4.2.

5.1.2 conSultantBERT. Next, we turn to our conSultantBERT runs in rows 5 through 8. First, we consider the models trained with the classification objective, in rows 5 and 6. We note that fine-tuning BERT brings huge gains over the results of the unsupervised pre-trained embeddings (BERT+Cosine), which is in line with previous findings [14]. Even the method that relies on direct cosine similarity computations on the embeddings learned with the classification objective (conSultantBERTClassifier+Cosine) outperforms our supervised random forest baselines in rows 3 and 4, with a 7.1% and 4.2% increase in ROC-AUC respectively. Adding a random forest classifier on top of those fine-tuned embeddings (conSultantBERTClassifier+RF) further increases performance, with a +11.9% increase in ROC-AUC over our supervised baseline with pre-trained embeddings.

As our primary goal is to get high-quality embeddings for the downstream application of generating useful feature representations for recommendation, alongside which we can train whatever model we want with additional feature types such as categorical, binary, or real-valued features, we are more interested in the former, unsupervised approach than in the latter approach with a random forest classifier.

Finally, we turn to our fine-tuned conSultantBERT with the regression objective (conSultantBERTRegressor), to study the direct cosine similarity optimization objective. Looking at the last two rows, we find that conSultantBERT with the regression objective performs similarly to conSultantBERT trained with the classification objective in the supervised approach with a random forest classifier on top.
In fact, the small difference turns out not to be statistically significant, with a 𝑝-value of 0.2.

What stands out most is that the approach with cosine similarity outperforms all other runs, including significantly outperforming the conSultantBERTClassifier+RF approach. We explain this difference by the fact that the architecture with the classification objective has a learnable last layer (the softmax classifier in Figure 2). We drop this layer after the fine-tuning phase to yield our embeddings, losing information in the process. The architecture with the regression objective, on the other hand, has a non-learnable last layer, namely simply the cosine similarity, which means all the necessary information has to propagate to the embedding layer.

5.2 Precision & Recall
When we zoom in on the more fine-grained precision and recall scores, we observe the following. First, our unsupervised baselines in rows 1 and 2 show how TFIDF-based cosine similarity yields substantially higher precision compared to cosine similarity with pre-trained BERT embeddings, while keeping similar recall.

Next, these same baselines with a supervised random forest classifier on top (rows 3 and 4) show that in both cases the classifiers seem to perform somewhat similarly, irrespective of whether they are trained with TFIDF-weighted feature vectors or pre-trained BERT embeddings. With only a slight increase across precision and recall (around +2.5%) for the TFIDF-based method, we conclude that the feature space in itself does not provide substantially different signals for separating the classes.

Then, comparing our fine-tuned models, rows 5 and 6 show that the SBERT model fine-tuned with the classification objective yields marginally higher precision compared to the random forest classifier trained on the SBERT output layer. Rows 7 and 8 show similarly stable improvements across the board compared to the methods in rows 1–4.

Overall, we conclude that all of the supervised approaches (rows 3 through 8) show roughly similar behavior, with precision and recall balanced and substantial improvements over the unsupervised baselines.

6 ANALYSIS AND DISCUSSION
After analyzing the results in the previous section, in this section we take a closer look at the different methods, by studying the distributions of prediction scores across the two labels (positive and negative matches) in Section 6.1, and by zooming in on another desired property of our embedding space: multilinguality (Section 6.2).

6.1 Distributions
In Figure 3 we plot the distributions of predicted scores per run, split out for positive and negative labels. The predicted scores correspond to cosine similarities in the case of TFIDF, BERT, and conSultantBERT*, and to probabilities of the random forest classifier in the case of TFIDF+RF, BERT+RF, and conSultantBERT*+RF.

Figure 3: Density plots showing the distribution of Cosine similarity (denoted Cosine) scores (top row) and Random forest classifier (denoted RF) probability scores (bottom row) per label. Blue lines show the distributions for the negative class, and orange show distributions for the positive class.

A few things stand out. First, it is clear that fine-tuning the BERT embeddings in our conSultantBERT models yields a clearer separation in the predictions per class. The left-most plots (TFIDF, both with Cosine similarities) show largely overlapping distributions.
(Figure 4 shows three heatmap panels, from left to right: TFIDF, BERT, and conSultantBERTRegressor. Rows are resume sentences: "I worked in a warehouse for 10 years", "I've been an educator for the last 10 years", "Ik heb 10 jaar in een magazijn gewerkt", "Ik ben de afgelopen 10 jaar leraar geweest". Columns are vacancy sentences: "We are looking for a talented educator", "We are looking for a talented logistic worker", "We zijn op zoek naar een getalenteerde logistiek medewerker", "We zijn op zoek naar een getalenteerde docent".)
Figure 4: Heatmaps of cosine similarity between sentences from resumes and sentences from vacancies. The English sentences
(first two rows, first two columns) and Dutch sentences (last two rows, last two columns) are each other’s direct translations.
The second column of plots (denoted BERT) shows largely similar patterns, with as the main difference that the unsupervised BERT has a bias towards higher scores compared to the unsupervised TFIDF. At the same time, while the distributions of scores for each label from the classifier with TF-IDF-weighted vectors or pre-trained BERT embeddings do seem rather close, the model does succeed in separating the classes more often than not (as can be seen from the score improvements over the non-supervised baselines in Table 1).

Observing the latter, though, makes clear that unsupervised similarity scores using default embeddings or TF-IDF weights do not provide a strong enough signal to separate both classes, which can be witnessed by the largely overlapping score distributions.

In Section 5.1 we observed the difference in performance between our regression-optimized conSultantBERT and the classification-optimized one. We now turn to comparing the prediction distributions of both models. We note how the classification-optimized conSultantBERT (top row, third plot from the left) seems to yield less separable cosine similarity scores compared to the regression-optimized one (top right plot). Compared to the three other conSultantBERT approaches, the former stands out. The random forest classifier trained on top of the embeddings, though, effectively learns again to separate between the classes, suggesting the embeddings keep distinguishing information.

6.2 Multilinguality
As a multinational corporation that operates across the globe, developing a model per language is not scalable in our case. Therefore, a desired property of our system is multilinguality: a system that supports as many languages as possible.

In some of the countries where we operate, there is a high percentage of job seekers who are not native to that country. For example, many of the job descriptions in the Netherlands are in Dutch, yet around 10% of the resumes are in English. Because of this, another expected property of our solution is that it is cross-lingual [13].

Classic text models, like TF-IDF and word2vec, capture information within one language, but hardly connect between languages. Simply put, even if trained on multiple languages, each language will have its own cluster in the embedding space. So "logistics" in English and "logistiek" in Dutch are embedded at completely different points in space, even though their meaning is the same.

Furthermore, we know that the language of a resume correlates with nationality and can therefore act as a proxy discriminator. Due to the impact of these systems and the risks of unintended algorithmic bias and discrimination, HR is marked as a high-risk domain in the recently published EC Artificial Intelligence Act [4]. To avoid discriminating on nationality, we would like to recommend a candidate for a vacancy no matter which language the resume is written in; that is, of course, only if language is not a requirement for that vacancy.

In Figure 4 we show a few examples of sentences that a candidate might write ("I worked ...") and a few examples of vacancy sentences ("We are looking for ..."). To demonstrate cross-lingual and multilingual matching, the same examples are written in both English and Dutch.

On the left-hand side of Figure 4 (TFIDF) we can see that there is no match at all. This is due to the vocabulary gap: by definition, TFIDF cannot match "warehouse" with "logistic worker".

To solve the vocabulary gap we introduced BERT (center of Figure 4), which we see does find similarity between candidate and vacancy sentences.
However, it also hardly separates between the positive and negative pairs. Moreover, we can see a slight clustering around languages: Dutch sentences have comparatively higher similarity to Dutch sentences, and likewise English sentences are more similar to each other.

On the right-hand side of Figure 4 (conSultantBERTRegressor) we observe that the vocabulary gap is bridged, cross-lingual sentences are paired properly (e.g., "I worked in a warehouse for 10 years" and "We zijn op zoek naar een getalenteerde logistiek medewerker" have a high score), and finally, both Dutch-to-Dutch and English-to-English sentences are properly scored, too, thus achieving our desired property of multilinguality.
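The cross-lingual behavior in Figure 4 can be reproduced with a few lines against the fine-tuned bi-encoder; the model path below is a placeholder for wherever the fine-tuned conSultantBERT is stored.

```python
from sentence_transformers import SentenceTransformer, util

# The path is hypothetical; it stands for our fine-tuned model from Section 3.2.
consultantbert = SentenceTransformer("path/to/consultantbert-regressor")

resume_sentences = [
    "I worked in a warehouse for 10 years",
    "I've been an educator for the last 10 years",
    "Ik heb 10 jaar in een magazijn gewerkt",        # Dutch translations of the two above
    "Ik ben de afgelopen 10 jaar leraar geweest",
]
vacancy_sentences = [
    "We are looking for a talented logistic worker",
    "We are looking for a talented educator",
    "We zijn op zoek naar een getalenteerde logistiek medewerker",
    "We zijn op zoek naar een getalenteerde docent",
]

# Embed both sides independently (bi-encoder) and compute a Figure-4-style similarity matrix.
resume_emb = consultantbert.encode(resume_sentences, convert_to_tensor=True)
vacancy_emb = consultantbert.encode(vacancy_sentences, convert_to_tensor=True)
similarity_matrix = util.cos_sim(resume_emb, vacancy_emb)  # shape: (4, 4)
print(similarity_matrix)
```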
7 CONCLUSION
In this work we experimented with various ways to construct embeddings of resume and vacancy texts.

We propose to fine-tune the BERT model using the Siamese SBERT framework on our large real-world dataset with high-quality labels for resume-vacancy matches derived from our consultants' decisions. We show our model beats our unsupervised and supervised baselines based on TF-IDF features and pre-trained BERT embeddings. Furthermore, we show it can be applied for multilingual (e.g., English-to-English alongside Dutch-to-Dutch) and cross-lingual matching (e.g., English-to-Dutch and vice versa). Finally, we show that using a regression objective to optimize for cosine similarity yields more useful embeddings in our scenario, where we aim to apply the learned embeddings as feature representation in a broader job recommender system.

ACKNOWLEDGMENTS
Special thanks to the rest of the SmartMatch team: Adam, Evelien, Najeeb, Sandra, Wilco, Wojciech, and Zeki. And Sepideh for her helpful comments.

REFERENCES
[1] Vedant Bhatia, Prateek Rawat, Ajit Kumar, and Rajiv Ratn Shah. 2019. End-to-End Resume Parsing and Finding Candidates for a Job Description using BERT. https://arxiv.org/abs/1910.03089
[2] Shuqing Bian, Xu Chen, Wayne Xin Zhao, Kun Zhou, Yupeng Hou, Yang Song, Tao Zhang, and Ji-Rong Wen. 2020. Learning to Match Jobs with Resumes from Sparse Interaction Data using Multi-View Co-Teaching Network. https://arxiv.org/abs/2009.13299
[3] Dhivya Chandrasekaran and Vijay Mago. 2021. Evolution of Semantic Similarity—A Survey. ACM Comput. Surv. 54, 2, Article 41 (Feb. 2021), 37 pages. https://doi.org/10.1145/3440755
[4] European Commission. 2021. Proposal for a Regulation of the European Parliament and of the Council Laying Down Harmonised Rules on Artificial Intelligence (Artificial Intelligence Act) and Amending Certain Union Legislative Acts: {SEC(2021) 167 Final} - {SWD(2021) 84 Final} - {SWD(2021) 85 Final}. European Commission. https://books.google.nl/books?id=ofxxzgEACAAJ
[5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805 (2018).
[6] Hebatallah A. Mohamed Hassan, Giuseppe Sansonetti, Fabio Gasparetti, A. Micarelli, and J. Beel. 2019. BERT, ELMo, USE and InferSent Sentence Encoders: The Panacea for Research-Paper Recommendation?. In RecSys.
[7] Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning Deep Structured Semantic Models for Web Search Using Clickthrough Data. 2333–2338. https://doi.org/10.1145/2505515.2505665
[8] Cataldo Musto, Giovanni Semeraro, Marco de Gemmis, and Pasquale Lops. 2016. Learning Word Embeddings from Wikipedia for Content-Based Recommender Systems. In Advances in Information Retrieval, Nicola Ferro, Fabio Crestani, Marie-Francine Moens, Josiane Mothe, Fabrizio Silvestri, Giorgio Maria Di Nunzio, Claudia Hauff, and Gianmaria Silvello (Eds.). Springer International Publishing, Cham, 729–734.
[9] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
[10] Chuan Qin, Hengshu Zhu, Tong Xu, Chen Zhu, Liang Jiang, Enhong Chen, and Hui Xiong. 2018. Enhancing Person-Job Fit for Talent Recruitment: An Ability-aware Neural Network Approach. The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval (2018).
[11] Rohan Ramanath, Hakan Inan, Gungor Polatkan, Bo Hu, Q. Guo, C. Ozcaglar, Xianren Wu, K. Kenthapadi, and S. C. Geyik. 2018. Towards Deep and Representation Learning for Talent Search at LinkedIn. Proceedings of the 27th ACM International Conference on Information and Knowledge Management (2018).
[12] Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, 3982–3992. https://doi.org/10.18653/v1/D19-1410
[13] Sebastian Ruder, Ivan Vulić, and Anders Søgaard. 2019. A Survey of Cross-Lingual Word Embedding Models. J. Artif. Int. Res. 65, 1 (May 2019), 569–630. https://doi.org/10.1613/jair.1.11640
[14] Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang. 2019. How to Fine-Tune BERT for Text Classification?. In Chinese Computational Linguistics, Maosong Sun, Xuanjing Huang, Heng Ji, Zhiyuan Liu, and Yang Liu (Eds.). Springer International Publishing, Cham, 194–206.
[15] Jeff Weiner. 2012. The Future of LinkedIn and the Economic Graph. (2012).
[16] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. HuggingFace's Transformers: State-of-the-art Natural Language Processing. arXiv:1910.03771 [cs.CL]
[17] Jing Zhao, Jingya Wang, Madhav Sigdel, Bopeng Zhang, Phuong Hoang, Mengshu Liu, and Mohammed Korayem. 2021. Embedding-based Recommender System for Job to Candidate Matching on Scale. https://arxiv.org/abs/2107.00221
[18] Chen Zhu, Hengshu Zhu, Hui Xiong, Chao Ma, Fang Xie, Pengliang Ding, and Pan Li. 2018. Person-Job Fit: Adapting the Right Talent for the Right Job with Joint Representation Learning. https://arxiv.org/abs/1810.04040