=Paper=
{{Paper
|id=Vol-2967/paper8
|storemode=property
|title=conSultantBERT: Fine-tuned Siamese Sentence-BERT for Matching Jobs and Job Seekers
|pdfUrl=https://ceur-ws.org/Vol-2967/paper_8.pdf
|volume=Vol-2967
|authors=Dor Lavi,Volodymyr Medentsiy,David Graus
|dblpUrl=https://dblp.org/rec/conf/hr-recsys/LaviMG21
}}
==conSultantBERT: Fine-tuned Siamese Sentence-BERT for Matching Jobs and Job Seekers==
conSultantBERT: Fine-tuned Siamese Sentence-BERT for Matching Jobs and Job Seekers
Dor Lavi, Volodymyr Medentsiy, David Graus
dor.lavi@randstadgroep.nl, volodymyr.medentsiy@randstadgroep.nl, david.graus@randstadgroep.nl
Randstad Groep Nederland, Diemen, The Netherlands
RecSys in HR 2021, October 1, 2021, Amsterdam, the Netherlands. Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
In this paper we focus on constructing useful embeddings of textual information in vacancies and resumes, which we aim to incorporate as features into job to job seeker matching models alongside other features. We explain our task, where noisy data from parsed resumes, the heterogeneous nature of the different sources of data, and cross-linguality and multilinguality present domain-specific challenges.

We address these challenges by fine-tuning a Siamese Sentence-BERT (SBERT) model, which we call conSultantBERT, using a large-scale, real-world, and high-quality dataset of over 270,000 resume-vacancy pairs labeled by our staffing consultants. We show how our fine-tuned model significantly outperforms unsupervised and supervised baselines that rely on TF-IDF-weighted feature vectors and BERT embeddings. In addition, we find our model successfully matches cross-lingual and multilingual textual content.

CCS CONCEPTS
• Computing methodologies → Ranking; • Information systems → Content analysis and feature selection; Similarity measures; Language models.

KEYWORDS
job matching, BERT, fine-tuning

1 INTRODUCTION
Randstad is the global leader in the HR services industry. We support people and organizations in realizing their true potential by combining the power of today's technology with our passion for people. In 2020, we helped more than two million job seekers find a meaningful job with our 236,100 clients. Randstad is active in 38 markets around the world and has top-three positions in almost half of these. In 2020, Randstad had on average 34,680 corporate employees and generated revenue of €20.7 billion.

Each day, at Randstad, we employ industry-scale recommender systems to recommend thousands of job seekers to our clients, and the other way around: vacancies to job seekers. Our job recommender system is based on a heterogeneous collection of input data: curriculum vitaes (resumes) of job seekers, vacancy texts (job descriptions), and structured data (e.g., the location of a job seeker or vacancy). The goal of our system is to recommend the best job seekers to each open vacancy. In this paper we explore methods for constructing useful embeddings of textual information in vacancies and resumes.

The main requirements of the model are (i) to be able to operate on multiple languages at the same time, and (ii) to be usable to efficiently compare a vacancy with a large dataset of available resumes.

Our end goal is to incorporate these embeddings, or features derived from them, in a larger recommender system that combines a heterogeneous feature set, spanning, e.g., categorical features and real-valued features.

1.1 Problem setting
Several challenges arise when matching jobs to job seekers through textual resume and vacancy data.

First, the data we work with is inherently noisy. On the one hand, resumes are user-generated data, usually (but not always) in PDF format. Parsing those files to plain text is a challenge in itself and is therefore out of scope for this paper. On the other hand, vacancies are usually structured, formatted text.

Second, the nature of the data differs. Most NLP research in text similarity is based on the assumption that two pieces of information are the same but written differently [3]. However, in our case the two documents do not express the same information, but complement each other like pieces of a puzzle. Our goal is to match two complementary pieces of textual information that may not exhibit direct overlap or similarity.

Third, as a multinational corporation that operates across the globe, developing separate models for each market and language does not scale. Therefore, a desired property of our system is multilinguality: a system that supports as many languages as possible. In addition, as it is common to have English resumes in non-English countries (e.g., in the Dutch market around 10% of resumes are in English), cross-linguality is another desired property, e.g., being able to match English resumes to Dutch vacancies.

This paper is structured as follows. First, in Section 2 we summarize related work on the use of neural embeddings of textual information in the recruitment/HR domain. Next, in Section 3.1 we describe how we leverage our internal history of job seeker placements to create a labeled resume-vacancy pairs dataset. Then, in Section 3.2 we describe how we fine-tune a multilingual BERT with a bi-encoder structure [12] over this dataset, by adding a cosine similarity loss layer. Finally, in Section 4 we describe how this architecture helps us overcome most of the challenges described above, and how it enables us to build a maintainable and scalable pipeline to match resumes and vacancies.
2 RELATED WORK
Neural embeddings are widely used for content-based retrieval, and embedding models have become an essential component in modern recommender system pipelines [8] [6]. The main focus of our work is to construct embeddings of the textual information in vacancies and resumes, which can then be incorporated into a job-job seeker matching model and used along with other features, e.g., location and other categorical features. We therefore focus on reviewing methods of embedding vacancies and resumes.

The task of embedding resumes and vacancies can be posed as creating domain-specific document embeddings. Although context-aware embeddings have proven to outperform bag-of-words approaches on most NLP tasks in academia, the latter are still widely used in industry.

Bian et al. propose to construct two sub-models with a co-teaching mechanism to combine the predictions of those models. The first sub-model encodes relational information of resume and vacancy, and the second sub-model, which is related to our work, encodes the textual information in resume and vacancy. Documents are processed per sentence, with every sentence being encoded using the CLS token of the BERT model. A Hierarchical Transformer is applied on top of the sentence embeddings to get the document embeddings. The final match prediction is obtained by applying a fully-connected layer with sigmoid activation on the concatenated embeddings of resume and vacancy. Bhatia et al. propose to fine-tune BERT on a sequence pair classification task to predict whether two job experiences belong to one person or not. The proposed method does not require a dataset of labeled resume-vacancy pairs. The fine-tuned model is used to embed both the job description of the vacancy and the current job experiences of the job seeker. Zhao et al. process words in resumes and vacancies using word2vec embeddings and a domain-specific vocabulary. The word embeddings are fed into a stack of convolutional blocks of different kernel sizes, on top of which Zhao et al. apply attention to get the context vector and project it into the embedding space using a fully-connected layer. The model is trained using the binary cross-entropy loss on the task of predicting a match between job seeker and vacancy. Ramanath et al. use supervised and unsupervised embeddings for their ranking model to recommend candidates to recruiter queries. The unsupervised method does not use the unstructured textual data but relies on the data stored in the LinkedIn Economic Graph [15], which represents entities such as skills, educational institutions, employers, employees, and relations among them; candidates and queries are embedded using graph neural models. The supervised method embeds the textual information in recruiters' queries and candidates' profiles using the DSSM [7] model. The DSSM operates on character trigrams and is composed of two separate models to embed queries and candidates. The DSSM model is trained on historical data of recruiters' interactions with candidates.

Zhu et al. utilize skip-gram embeddings of various dimensionalities (64 for the resume and 256 for the vacancy) to encode words, which are then passed through two convolutional layers. A pooling operation (max pooling for the resume and mean pooling for the vacancy) is applied to the output of the convolutional layers to get the embeddings of resume and vacancy. The model is optimized using the cosine similarity loss. Qin et al. divide job postings into sections of ability requirements, and resumes into sections of experiences. The words are encoded using pre-trained embeddings and processed with a bi-LSTM. To get the embeddings of job postings and resumes, they propose a two-step hierarchical pipeline: every section is encoded using an attention mechanism, and finally, to get the embeddings of job postings and resumes, they run a bi-LSTM on top of the section encodings and aggregate the bi-LSTM outputs using the attention mechanism. Additionally, when processing the information in resumes, Qin et al. propose to add encodings of job postings to emphasize the skills in a resume that are relevant for a specific job.

We cannot directly compare our work with these approaches, because of the different datasets used. For example, Bhatia et al. and Qin et al. assume well-structured resumes, which is not the case in our situation. Ramanath et al. build an embedding model for recruiters' queries, which are shorter than vacancies and lack the context provided in the vacancy. Bhatia et al. propose a solution for when a limited amount of data is available. We, on the other hand, work in a setting of abundant data of a heterogeneous nature, but at the same time our resumes lack consistent structure, while vacancies are given in a more structured way. Additional issues that are not considered by most of the reviewed works are (i) cross-linguality, where we aim to match English resumes to Dutch vacancies if there is a potential match, and (ii) multilinguality, where we aim to serve a single model for multiple languages. We address both issues with our approach. Next, Qin et al. and Zhu et al. observe that an embedding model may benefit from constructing parallel pipelines to process resume and vacancy. Our approach relies on a shared embedding model for resumes and vacancies.

While many organizations are capable of training off-the-shelf transformers, few have an abundance of high-quality labeled data available. As a global market leader, we are in a unique position: we have both a rich history of high-quality interactions between consultants, candidates, and vacancies, and the content to represent those candidates and vacancies.

3 METHOD
Here we describe our method. More specifically, in Section 3.1 we describe how we acquire our labeled dataset of resume-vacancy pairs. Next, in Section 3.2 we describe our multilingual SBERT with bi-encoder structure and cosine similarity score as output layer.

3.1 Dataset creation
We have a rich history of interactions between consultants (recruiters) and job seekers (candidates). We define as a positive signal any point of contact between a job seeker and a consultant (e.g., a phone call, interview, job offer, etc.). Negative signals are defined by job seekers who submit their profile, but get rejected by a consultant without any interaction (i.e., the consultant looks at the job seeker's profile, and rejects). In addition, since we have an unbalanced dataset, for each vacancy we add random negative samples, which we draw randomly from our job seeker pool. This is done in the spirit of other works, which also complement historical data with random negative pairs [2, 17, 18].
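To make the pairing step concrete, the following is a minimal sketch of how consultant-labeled pairs can be complemented with randomly drawn negatives. The column names (candidate_id, vacancy_id, label) are hypothetical and this is a sketch, not our production pipeline.

```python
import numpy as np
import pandas as pd

def add_random_negatives(interactions: pd.DataFrame,
                         n_random: int = 38_004,
                         seed: int = 42) -> pd.DataFrame:
    """interactions holds consultant-labeled pairs with hypothetical columns
    [candidate_id, vacancy_id, label]: label=1 for any point of contact,
    label=0 for a rejection without interaction."""
    rng = np.random.default_rng(seed)
    candidates = interactions["candidate_id"].unique()
    vacancies = interactions["vacancy_id"].unique()
    known = set(zip(interactions["candidate_id"], interactions["vacancy_id"]))

    sampled = []
    while len(sampled) < n_random:
        c, v = rng.choice(candidates), rng.choice(vacancies)
        if (c, v) not in known:  # keep only pairs never seen by a consultant
            sampled.append({"candidate_id": c, "vacancy_id": v, "label": 0})
            known.add((c, v))

    return pd.concat([interactions, pd.DataFrame(sampled)], ignore_index=True)
```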
Our dataset consists of 274,407 resume-vacancy pairs, of which 126,679 are positive samples, 109,724 are negative samples as defined by actual recruiters, and 38,004 are randomly drawn negative samples. We have 156,256 unique resumes and 23,080 unique vacancy texts, which implies that one vacancy can be paired with multiple resumes. Figure 1 shows the histogram of the number of resume-vacancy samples per vacancy. We see that for the majority of vacancies we have a small number of paired resumes, with approximately 10.5% of our vacancies being paired with a single resume, and approximately 30% of our vacancies being paired with at most three resumes.

Figure 1: Histogram of the number of resumes per vacancy.

Resumes are user-generated PDF documents, which we parse with Apache Tika (https://tika.apache.org/). Overall, these parsed resumes can be considered quite noisy input to our model; there is a wide variation in the format and structure of resumes, where common challenges include the ordering of different textual blocks, the diversity of their content (e.g., spanning any type of information from personalia, education, and work experience to hobbies), and the parsing of tables and bullet points. At the same time, vacancies are usually well-structured and standardized documents. They consist of on average 2,100 tokens (while resumes are on average longer, comprising 2,500 tokens), and are roughly structured according to the following sections: job title, job description, job requirements (including required skills), job benefits (including compensation and other aspects), and company description, which usually includes information about the industry of the job offered in the vacancy.

3.2 Architecture
Our method, dubbed conSultantBERT (as it is fine-tuned using labels provided by our consultants), utilizes the multilingual BERT [5] model pre-trained on Wikipedia pages of 100 languages (we use the bert-base-multilingual-cased model from the HuggingFace library [16]). We fine-tune it using Siamese networks, as proposed by Reimers and Gurevych. This method of fine-tuning BERT employs the bi-encoder structure, which is effective for matching a vacancy against a large pool of available resumes.

The original formulation of the Sentence-BERT (SBERT) model takes a pair of sentences as input, with the words in every sentence independently encoded using BERT and aggregated by pooling to get the sentence embeddings. After that, it either optimizes the regression loss, which is the MSE loss between the cosine similarity score and the true similarity label, or optimizes the classification loss, namely the cross-entropy loss.

Whereas SBERT is aimed at computing pairwise semantic similarity, we show that it can be applied effectively to our task of matching two heterogeneous types of data. Our fine-tuning pipeline is illustrated in Figure 2: we pass resume and vacancy pairs to the Siamese SBERT and experiment with classification and regression objectives.

3.2.1 Document representation. Most transformer models, including SBERT, aim to model sentences as input, while we are interested in modeling documents.

We experimented with several methods for document representation using the embeddings of the pre-trained BERT model. First, we attempted to split our input documents into sentences and to encode each sentence. We tried different sentence representations: first, we used the pooled layer output from BERT to represent each sentence, which we then average to represent the document. Here, we experimented with both simply averaging sentences and weighted averaging (by sentence length).

Next, we tried to take the mean of the last 4 layers of the [CLS] token, which is a special token placed at the start of each sentence and considered a suitable representation of the sentence (according to Devlin et al.). With these sentence representations, too, we represented the underlying document through averaging, both weighted and simple.

Our final approach, however, was simpler. We ended up treating the first 512 tokens of each input document as input for the SBERT model, ignoring sentence boundaries. To avoid trimming too much content, we pre-processed and cleaned our input documents by, e.g., removing non-textual content that resulted from parsing errors.
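As an illustration of one of the representations we compared (the mean of the last 4 layers of the [CLS] token, averaged over sentences), the sketch below assumes the HuggingFace transformers API and a naive split on periods; it is not the representation we ultimately adopted.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
bert = AutoModel.from_pretrained("bert-base-multilingual-cased", output_hidden_states=True)

def cls_embedding(sentence: str) -> torch.Tensor:
    """Mean of the last 4 hidden layers at the [CLS] position (768-dim)."""
    encoded = tokenizer(sentence, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden_states = bert(**encoded).hidden_states  # embedding layer + 12 encoder layers
    return torch.stack([h[0, 0] for h in hidden_states[-4:]]).mean(dim=0)

def document_embedding(document: str) -> torch.Tensor:
    """Unweighted average of sentence-level [CLS] embeddings (naive split on '.')."""
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    return torch.stack([cls_embedding(s) for s in sentences]).mean(dim=0)
```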
3.2.2 Fine-tuning method. We fine-tune our model on our 80% training data split for 5 epochs, using a batch size of 4 pairs and mean pooling, which were found to be the optimal parameters during hyperparameter tuning on our validation set. To fine-tune the pre-trained BERT model, we need to trim the content of every resume and vacancy to 512 tokens, as the base model limits the maximum number of tokens to 512.
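The fine-tuning setup above maps directly onto the sentence-transformers library of Reimers and Gurevych [12]. The sketch below shows the regression (cosine similarity) objective with the hyperparameters reported above; load_pairs is a hypothetical helper that yields (resume text, vacancy text, label) tuples, and the classification variant would swap in losses.SoftmaxLoss with integer labels.

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses, models

# Bi-encoder: multilingual BERT + mean pooling, inputs truncated to 512 tokens.
word_embedding = models.Transformer("bert-base-multilingual-cased", max_seq_length=512)
pooling = models.Pooling(word_embedding.get_word_embedding_dimension(), pooling_mode="mean")
model = SentenceTransformer(modules=[word_embedding, pooling])

# load_pairs() is a hypothetical helper returning (resume_text, vacancy_text, label)
# tuples, with label 1.0 for a match and 0.0 otherwise.
train_examples = [InputExample(texts=[resume, vacancy], label=float(label))
                  for resume, vacancy, label in load_pairs("train")]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=4)

# Regression objective: MSE between the cosine similarity and the 0/1 label.
# For the classification objective, use losses.SoftmaxLoss(..., num_labels=2)
# with integer labels instead.
train_loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=5,
          warmup_steps=100)  # warmup_steps is an illustrative choice

# At inference time, both sides are embedded independently and compared by cosine similarity.
```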
4 EXPERIMENTAL SETUP
In this section, we describe the final datasets we use for training, validation, and testing in Section 4.1, the baselines and why we employ them in Section 4.2, our proposed fine-tuned conSultantBERT approach in Section 4.3, and finally, in Section 4.4 we explain our evaluation metrics and statistical testing methodology.

4.1 Dataset
As described in Section 3.1, our dataset consists of 274,407 resume-vacancy pairs. We split this dataset into 80% train (219,525 samples), 10% validation, and 10% test (27,441 samples each).
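A minimal sketch of this split, assuming the labeled pairs live in a pandas DataFrame named pairs (a hypothetical name):

```python
from sklearn.model_selection import train_test_split

# 80% train, then split the remaining 20% evenly into validation and test.
train_df, rest_df = train_test_split(pairs, test_size=0.2, random_state=42)
val_df, test_df = train_test_split(rest_df, test_size=0.5, random_state=42)
```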
Figure 2: Our conSultantBERT architectures with classification objective (left) and regression objective (right), with resume
text input on the left, and vacancy text input on the right-hand side. Image adopted from [12]. Note that vacancy and resume
differ at the level of words (vocabulary), but also format, structure, and semantics.
We use our training set to (i) fine-tune the embedding models and (ii) train the supervised random forest classifiers; we use the validation set for the hyperparameter search of the SBERT hyperparameters described in Section 3.2.2, and we report performance on our test set.

4.2 Baselines
As this paper revolves around constructing useful feature representations (i.e., embeddings) of the textual information in vacancies and resumes, we compare several approaches to generating these embeddings.

4.2.1 Unsupervised. Our first baselines rely on unsupervised feature representations. More specifically, we represent both our vacancies and resumes as either (i) TF-IDF-weighted vectors (TFIDF), or (ii) pre-trained BERT embeddings (BERT).

We then compute cosine similarities between pairs of resumes and vacancies, and consider the cosine similarity as the predicted “matching score.” Our TF-IDF vectors have 768 dimensions, which is equal to the dimensionality of the BERT embeddings (we experimented with increasing the dimensionality of the TF-IDF vectors, but this did not give any substantial increase in performance: increasing the dimensionality 3-fold improved ROC-AUC by +0.7%). We fitted our TF-IDF weights on the training set, comprising both vacancy and resume data. As described in Section 3.2, we rely on BERT models pre-trained on Wikipedia from the HuggingFace library [16].

These unsupervised baselines help us assess the extent to which the vocabulary gap is problematic: if vacancies and resumes use completely different words, both bag-of-words-based approaches such as TF-IDF weighting and embedding representations of these different words will likely show low similarity. Formulated differently, if word overlap or proximity between vacancies and resumes is meaningful, these baselines should be able to perform decently.

4.2.2 Supervised. Next, we present our two supervised baselines, where we employ a random forest classifier that is trained on top of the feature representations described above (TFIDF+RF, BERT+RF). These supervised baselines are trained on our 80% train split, using the default parameters given by the scikit-learn library [9].

We add these supervised methods and compare them to the previously described unsupervised baselines to further establish the extent of the aforementioned vocabulary gap, and the extent to which the heterogeneous nature of the two types of documents mentioned in Section 1.1 plays a role. That is to say, if there is a direct mapping that can be learned from the words in one source of data (vacancy or resume) to the other source, supervised baselines should be able to pick this up and outperform the unsupervised baselines.
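As a sketch of the baselines in this section (assuming parallel Python lists resumes, vacancies, and labels for the training split), the TF-IDF variant is shown below; the BERT variant would substitute pre-trained embeddings for the TF-IDF vectors. How the two document vectors are combined before the random forest is not spelled out above, so the concatenation used here is our own simplification.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Assumes parallel lists: resumes, vacancies (strings) and labels (0/1) for the train split.
tfidf = TfidfVectorizer(max_features=768)   # match the 768-dim BERT embeddings
tfidf.fit(resumes + vacancies)              # fitted on both sources, training set only

resume_vecs = tfidf.transform(resumes).toarray()
vacancy_vecs = tfidf.transform(vacancies).toarray()

# Unsupervised baseline (TFIDF+Cosine): cosine similarity as the matching score.
unsupervised_scores = np.array([cosine_similarity(r[None, :], v[None, :])[0, 0]
                                for r, v in zip(resume_vecs, vacancy_vecs)])

# Supervised baseline (TFIDF+RF): a random forest on top of the pair representation.
# Concatenating the two vectors is a simplification; the exact pair features may differ.
pair_features = np.hstack([resume_vecs, vacancy_vecs])
clf = RandomForestClassifier(random_state=42)  # otherwise scikit-learn default parameters
clf.fit(pair_features, labels)
supervised_scores = clf.predict_proba(pair_features)[:, 1]
```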
4.3 conSultantBERT
Finally, we present our fine-tuned embedding model, conSultantBERT (the capitalized S in conSultant is a reference to SBERT), which we fine-tune using both the classification objective and the regression objective, as explained in Section 3.2 and illustrated in Figure 2.
# Model ROC-AUC precision recall f1
1. BERT+Cosine 0.5325 0.5057 0.5006 0.3379
2. TFIDF+Cosine 0.5730∗ 0.5890 0.5016 0.3564
3. BERT+RF 0.6978 0.6421 0.6363 0.6360
4. TFIDF+RF 0.7174 0.6581 0.6526 0.6527
5. conSultantBERTClassifier+Cosine 0.7474 0.7001 0.6426 0.5994
6. conSultantBERTClassifier+RF 0.8366∗ 0.7642 0.7643 0.7643
7. conSultantBERTRegressor+Cosine 0.8459∗ 0.7714 0.7664 0.7677
8. conSultantBERTRegressor+RF 0.8389 0.7684 0.7658 0.7666
Table 1: Results of different runs. ∗ denotes statistically significant difference from the alternative in the same group at 𝛼 = 0.01.
Best-performing run in bold face.
Our intuition for using the classification objective comes from our task and dataset, which consist of binary class labels, making the classification objective the most obvious choice. At the same time, our work revolves around searching for meaningful embeddings, not necessarily solving the end task in the best possible way, for which, we hypothesize, the model may benefit from the more fine-grained information that is present in the cosine similarity optimization metric.

Similarly to our baselines, we consider (i) our fine-tuned model's direct output layer (i.e., the cosine similarity output layer) as the “matching score” between a candidate and a vacancy (conSultantBERTRegressor+Cosine, conSultantBERTClassifier+Cosine), in addition to (ii) the predictions made by a supervised random forest classifier which we train on the embedding layer's output (see Figure 2), yielding the following supervised models: conSultantBERTRegressor+RF and conSultantBERTClassifier+RF. We explore the latter since we aim to incorporate our model in a production system alongside other models and features.

4.4 Evaluation
In order to compare our different methods and baselines, we turn to standard evaluation metrics for classification problems. We consider ROC-AUC as our main metric, as it is insensitive to thresholding and scale-invariant. In addition, we turn to macro-averaged (since our dataset is fairly balanced) precision, recall, and F1 scores for a deeper understanding of the specific behaviors of the different methods.

Finally, we perform independent Student's t-tests for statistical significance testing, and set the 𝛼-level at 0.01.
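The evaluation boils down to a handful of scikit-learn and SciPy calls. In the sketch below, the 0.5 decision threshold and the use of bootstrap or per-fold samples of the metric for the t-test are our own assumptions.

```python
import numpy as np
from scipy import stats
from sklearn.metrics import precision_recall_fscore_support, roc_auc_score

def evaluate(y_true: np.ndarray, scores: np.ndarray, threshold: float = 0.5) -> dict:
    """ROC-AUC on the raw scores plus macro-averaged precision/recall/F1
    on thresholded predictions (the 0.5 threshold is an assumption here)."""
    y_pred = (scores >= threshold).astype(int)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    return {"roc_auc": roc_auc_score(y_true, scores),
            "precision": precision, "recall": recall, "f1": f1}

# Significance testing between two runs, e.g., over bootstrap samples of ROC-AUC.
def significantly_different(samples_a, samples_b, alpha: float = 0.01) -> bool:
    _, p_value = stats.ttest_ind(samples_a, samples_b)
    return p_value < alpha
```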
5 RESULTS
See Table 1 for the results of our baselines and fine-tuned model. We first turn to our main evaluation metric, ROC-AUC, below, and next to the precision and recall scores in Section 5.2.

5.1 Overall performance
5.1.1 Baselines. As expected, using BERT embeddings or TF-IDF vectors in an unsupervised manner, i.e., by computing cosine similarity as a matching score, does not perform well. This confirms our observations about the challenging nature of our domain; word overlap or semantic proximity/similarity does not seem a sufficiently strong signal for identifying which resumes match a given vacancy.

In rows 3 and 4 we show the methods where a supervised classifier is trained using the input representations described above (TFIDF+RF and BERT+RF). Here, we see that with a supervised classifier, both TF-IDF-weighted vectors as feature representation and pre-trained BERT embeddings vastly improve over the unsupervised baselines, with a +31.0% improvement for BERT and a +25.2% improvement for the TF-IDF-weighted vectors. This suggests that, to some extent, a mapping can be learned between the two types of documents, as we hypothesized in Section 4.2.

5.1.2 conSultantBERT. Next, we turn to our conSultantBERT runs in rows 5 through 8. First, we consider the models trained with the classification objective, in rows 5 and 6. We note that fine-tuning BERT brings huge gains over the results of the unsupervised pre-trained embeddings (BERT+Cosine), which is in line with previous findings [14]. Even the method that relies on direct cosine similarity computations on the embeddings learned with the classification objective (conSultantBERTClassifier+Cosine) outperforms our supervised random forest baselines in rows 3 and 4, with a 7.1% and 4.2% increase in ROC-AUC respectively. Adding a random forest classifier on top of those fine-tuned embeddings (conSultantBERTClassifier+RF) further increases performance, with a +11.9% increase in ROC-AUC over our supervised baseline with pre-trained embeddings.

As our primary goal is to get high-quality embeddings for the downstream application of generating useful feature representations for recommendation, alongside which we can train whatever model we want with additional feature types such as categorical, binary, or real-valued features, we are more interested in the former, unsupervised approach than in the latter approach with a random forest classifier.

Finally, we turn to our fine-tuned conSultantBERT with the regression objective (conSultantBERTRegressor), to study the direct cosine similarity optimization objective. Looking at the last two rows, we find that conSultantBERT with the regression objective performs similarly to conSultantBERT trained with the classification objective in the supervised approach with a random forest classifier on top.
In fact, the small difference turns out not to be statistically significant, with a 𝑝-value of 0.2.

What stands out most is that the approach with cosine similarity outperforms all other runs, including significantly outperforming the conSultantBERTClassifier+RF approach. We explain this difference by the fact that the architecture with the classification objective has a learnable last layer (the softmax classifier in Figure 2). We drop this layer after the fine-tuning phase to yield our embeddings, losing information in the process. The architecture with the regression objective, on the other hand, has a non-learnable last layer, namely simply the cosine similarity, which means all the necessary information has to propagate to the embedding layer.

5.2 Precision & Recall
When we zoom in on the more fine-grained precision and recall scores, we observe the following. First, our unsupervised baselines in rows 1 and 2 show how TFIDF-based cosine similarity yields substantially higher precision compared to cosine similarity with pre-trained BERT embeddings, while keeping similar recall.

Next, these same baselines with a supervised random forest classifier on top (rows 3 and 4) show that in both cases the classifiers seem to perform somewhat similarly, irrespective of whether they are trained with TFIDF-weighted feature vectors or pre-trained BERT embeddings. With only a slight increase across precision and recall (around +2.5%) for the TFIDF-based method, we conclude that the feature space in itself does not provide substantially different signals for separating the classes.

Then, comparing our fine-tuned models, rows 5 and 6 show that the SBERT model fine-tuned with the classification objective yields marginally higher precision compared to the random forest classifier trained on the SBERT output layer. Rows 7 and 8 show similarly stable improvements across the board compared to the methods in rows 1–4.

Overall, we conclude that all of the supervised approaches (rows 3 through 8) show roughly similar behavior, with precision and recall balanced and substantial improvements over the unsupervised baselines.

6 ANALYSIS AND DISCUSSION
After analyzing the results in the previous section, in this section we take a closer look at the different methods, by studying the distributions of prediction scores across the two labels (positive and negative matches) in Section 6.1, and by zooming in on another desired property of our embedding space: multilinguality (Section 6.2).

6.1 Distributions
In Figure 3 we plot the distributions of predicted scores per run, split out for positive and negative labels. The predicted scores correspond to cosine similarities in the case of TFIDF, BERT, and conSultantBERT*, and to probabilities of the random forest classifier in the case of TFIDF+RF, BERT+RF, and conSultantBERT*+RF.

Figure 3: Density plots showing the distribution of Cosine similarity (denoted Cosine) scores (top row) and Random forest classifier (denoted RF) probability scores (bottom row) per label. Blue lines show the distributions for the negative class, and orange show distributions for the positive class.

A few things stand out. First, it is clear that fine-tuning the BERT embeddings in our conSultantBERT models yields a clearer separation in the predictions per class. The left-most plots (TFIDF, both with Cosine similarities) show largely overlapping distributions.
(Figure 4 shows three heatmap panels, from left to right: TFIDF, BERT, and conSultantBERTRegressor. Rows are resume sentences: "I worked in a warehouse for 10 years", "I've been an educator for the last 10 years", "Ik heb 10 jaar in een magazijn gewerkt", "Ik ben de afgelopen 10 jaar leraar geweest". Columns are vacancy sentences: "We are looking for a talented educator", "We are looking for a talented logistic worker", "We zijn op zoek naar een getalenteerde logistiek medewerker", "We zijn op zoek naar een getalenteerde docent".)
Figure 4: Heatmaps of cosine similarity between sentences from resumes and sentences from vacancies. The English sentences
(first two rows, first two columns) and Dutch sentences (last two rows, last two columns) are each other’s direct translations.
The second column of plots (denoted BERT) shows largely similar patterns, with as the main difference that the unsupervised BERT has a bias towards higher scores compared to the unsupervised TFIDF. At the same time, while the distributions of scores for each label from the classifier with TF-IDF-weighted vectors or pre-trained BERT embeddings do seem rather close, the model does succeed in separating the classes more often than not (as can be seen from the score improvements over the non-supervised baselines in Table 1).

Observing the latter, though, makes clear that unsupervised similarity scores using default embeddings or TF-IDF weights do not provide a strong enough signal to separate both classes, which can be witnessed by the largely overlapping score distributions.

In Section 5.1 we observed the difference in performance between our regression-optimized conSultantBERT and the classification-optimized one. We now turn to comparing the prediction distributions of both models. We note how the classification-optimized conSultantBERT (top row, third plot from the left) seems to yield less separable cosine similarity scores compared to the regression-optimized one (top right plot). Compared to the three other conSultantBERT approaches, the former stands out. The random forest classifier trained on top of the embeddings, though, effectively learns again to separate between the classes, suggesting the embeddings keep distinguishing information.

6.2 Multilinguality
As a multinational corporation that operates across the globe, developing a model per language is not scalable in our case. Therefore, a desired property of our system is multilinguality: a system that supports as many languages as possible.

In some of the countries where we operate, there is a high percentage of job seekers who are not native to that country. For example, many of the job descriptions in the Netherlands are in Dutch, yet around 10% of the resumes are in English. Because of this, another expected property of our solution is that it is cross-lingual [13].

Classic text models, like TF-IDF and word2vec, capture information within one language, but hardly connect between languages. Simply put, even if trained on multiple languages, each language will have its own cluster in the embedding space. So "logistics" in English and "logistiek" in Dutch are embedded at completely different points in space, even though their meaning is the same.

Furthermore, we know that the language of a resume correlates with nationality and can therefore act as a proxy discriminator. Due to the impact of these systems and the risks of unintended algorithmic bias and discrimination, HR is marked as a high-risk domain in the recently published EC Artificial Intelligence Act [4]. To avoid discriminating on nationality, we would like to recommend a candidate for a vacancy no matter which language the resume is written in; that is, of course, only if language is not a requirement for that vacancy.

In Figure 4 we show a few examples of sentences that a candidate might write ("I worked ...") and a few examples of vacancy sentences ("We are looking for ..."). To demonstrate cross-lingual and multilingual matching, the same examples are written in both English and Dutch.

On the left-hand side of Figure 4 (TFIDF) we can see that there is no match at all. This is due to the vocabulary gap: by definition, TFIDF cannot match "warehouse" with "logistic worker".

To solve the vocabulary gap we introduced BERT (center of Figure 4), which we see does find similarity between candidate and vacancy sentences.
However, it also hardly separates between the positive and negative pairs. Moreover, we can see a slight clustering around languages: Dutch sentences have comparatively higher similarity to Dutch sentences, and likewise English sentences are more similar to each other.

On the right-hand side of Figure 4 (conSultantBERTRegressor) we observe that the vocabulary gap is bridged, cross-lingual sentences are paired properly (e.g., "I worked in a warehouse for 10 years" and "We zijn op zoek naar een getalenteerde logistiek medewerker" have a high score), and finally, both Dutch-to-Dutch and English-to-English sentences are properly scored, too, thus achieving our desired property of multilinguality.
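The cross-lingual behavior in Figure 4 can be reproduced with a few lines against the fine-tuned bi-encoder; the model path below is a placeholder for wherever the fine-tuned conSultantBERT is stored.

```python
from sentence_transformers import SentenceTransformer, util

# The path is hypothetical; it stands for our fine-tuned model from Section 3.2.
consultantbert = SentenceTransformer("path/to/consultantbert-regressor")

resume_sentences = [
    "I worked in a warehouse for 10 years",
    "I've been an educator for the last 10 years",
    "Ik heb 10 jaar in een magazijn gewerkt",        # Dutch translations of the two above
    "Ik ben de afgelopen 10 jaar leraar geweest",
]
vacancy_sentences = [
    "We are looking for a talented logistic worker",
    "We are looking for a talented educator",
    "We zijn op zoek naar een getalenteerde logistiek medewerker",
    "We zijn op zoek naar een getalenteerde docent",
]

# Embed both sides independently (bi-encoder) and compute a Figure-4-style similarity matrix.
resume_emb = consultantbert.encode(resume_sentences, convert_to_tensor=True)
vacancy_emb = consultantbert.encode(vacancy_sentences, convert_to_tensor=True)
similarity_matrix = util.cos_sim(resume_emb, vacancy_emb)  # shape: (4, 4)
print(similarity_matrix)
```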
7 CONCLUSION
In this work we experimented with various ways to construct embeddings of resume and vacancy texts.

We propose to fine-tune the BERT model using the Siamese SBERT framework on our large real-world dataset with high-quality labels for resume-vacancy matches derived from our consultants' decisions. We show our model beats our unsupervised and supervised baselines based on TF-IDF features and pre-trained BERT embeddings. Furthermore, we show it can be applied for multilingual (e.g., English-to-English alongside Dutch-to-Dutch) and cross-lingual matching (e.g., English-to-Dutch and vice versa). Finally, we show that using a regression objective to optimize for cosine similarity yields more useful embeddings in our scenario, where we aim to apply the learned embeddings as feature representation in a broader job recommender system.

ACKNOWLEDGMENTS
Special thanks to the rest of the SmartMatch team: Adam, Evelien, Najeeb, Sandra, Wilco, Wojciech, and Zeki. And Sepideh for her helpful comments.

REFERENCES
[1] Vedant Bhatia, Prateek Rawat, Ajit Kumar, and Rajiv Ratn Shah. 2019. End-to-End Resume Parsing and Finding Candidates for a Job Description using BERT. https://arxiv.org/abs/1910.03089
[2] Shuqing Bian, Xu Chen, Wayne Xin Zhao, Kun Zhou, Yupeng Hou, Yang Song, Tao Zhang, and Ji-Rong Wen. 2020. Learning to Match Jobs with Resumes from Sparse Interaction Data using Multi-View Co-Teaching Network. https://arxiv.org/abs/2009.13299
[3] Dhivya Chandrasekaran and Vijay Mago. 2021. Evolution of Semantic Similarity—A Survey. ACM Comput. Surv. 54, 2, Article 41 (Feb. 2021), 37 pages. https://doi.org/10.1145/3440755
[4] European Commission. 2021. Proposal for a Regulation of the European Parliament and of the Council Laying Down Harmonised Rules on Artificial Intelligence (Artificial Intelligence Act) and Amending Certain Union Legislative Acts: {SEC(2021) 167 Final} - {SWD(2021) 84 Final} - {SWD(2021) 85 Final}. European Commission. https://books.google.nl/books?id=ofxxzgEACAAJ
[5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805 (2018).
[6] Hebatallah A. Mohamed Hassan, Giuseppe Sansonetti, Fabio Gasparetti, A. Micarelli, and J. Beel. 2019. BERT, ELMo, USE and InferSent Sentence Encoders: The Panacea for Research-Paper Recommendation?. In RecSys.
[7] Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning Deep Structured Semantic Models for Web Search Using Clickthrough Data. 2333–2338. https://doi.org/10.1145/2505515.2505665
[8] Cataldo Musto, Giovanni Semeraro, Marco de Gemmis, and Pasquale Lops. 2016. Learning Word Embeddings from Wikipedia for Content-Based Recommender Systems. In Advances in Information Retrieval, Nicola Ferro, Fabio Crestani, Marie-Francine Moens, Josiane Mothe, Fabrizio Silvestri, Giorgio Maria Di Nunzio, Claudia Hauff, and Gianmaria Silvello (Eds.). Springer International Publishing, Cham, 729–734.
[9] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
[10] Chuan Qin, Hengshu Zhu, Tong Xu, Chen Zhu, Liang Jiang, Enhong Chen, and Hui Xiong. 2018. Enhancing Person-Job Fit for Talent Recruitment: An Ability-aware Neural Network Approach. The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval (2018).
[11] Rohan Ramanath, Hakan Inan, Gungor Polatkan, Bo Hu, Q. Guo, C. Ozcaglar, Xianren Wu, K. Kenthapadi, and S. C. Geyik. 2018. Towards Deep and Representation Learning for Talent Search at LinkedIn. Proceedings of the 27th ACM International Conference on Information and Knowledge Management (2018).
[12] Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, 3982–3992. https://doi.org/10.18653/v1/D19-1410
[13] Sebastian Ruder, Ivan Vulić, and Anders Søgaard. 2019. A Survey of Cross-Lingual Word Embedding Models. J. Artif. Int. Res. 65, 1 (May 2019), 569–630. https://doi.org/10.1613/jair.1.11640
[14] Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang. 2019. How to Fine-Tune BERT for Text Classification?. In Chinese Computational Linguistics, Maosong Sun, Xuanjing Huang, Heng Ji, Zhiyuan Liu, and Yang Liu (Eds.). Springer International Publishing, Cham, 194–206.
[15] Jeff Weiner. 2012. The Future of LinkedIn and the Economic Graph. (2012).
[16] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. HuggingFace's Transformers: State-of-the-art Natural Language Processing. arXiv:1910.03771 [cs.CL]
[17] Jing Zhao, Jingya Wang, Madhav Sigdel, Bopeng Zhang, Phuong Hoang, Mengshu Liu, and Mohammed Korayem. 2021. Embedding-based Recommender System for Job to Candidate Matching on Scale. https://arxiv.org/abs/2107.00221
[18] Chen Zhu, Hengshu Zhu, Hui Xiong, Chao Ma, Fang Xie, Pengliang Ding, and Pan Li. 2018. Person-Job Fit: Adapting the Right Talent for the Right Job with Joint Representation Learning. https://arxiv.org/abs/1810.04040