Model Threshold Optimization for Segmented
Job-Jobseeker Recommendation System
Yichao Jin, Anirudh Alampally, Dheeraj Toshniwal, Zhiming Xu and Ankush Girdhar
Indeed.com
{jinyichao, aalampally, dtoshniwal, zxu, ankush}@indeed.com


Abstract

Recently, job-jobseeker recommendation systems have played an important role in helping people get more timely and suitable jobs in the domain of HR technology. Most existing recommender systems propose a unified model to serve all jobs and jobseekers from different backgrounds, while very little work, if any, has paid attention to the possible performance gap among different segments. In this work, we use occupation data to define job segments, and study the segment-level performance of an existing recommendation system within our organization. We then identify the possible causes, and make multiple attempts to deal with the problem. Finally, we adopt the most feasible approach: per-segment model threshold optimization. In particular, we formulate a constrained optimization problem, and propose an efficient algorithm to speed up the threshold optimization process. Our prototype implementation enables online A/B tests. The experimental results from real online products indicate significant performance improvements in terms of both recommendation quality and coverage on a list of selected segments.

Keywords

Job-jobseeker Recommendation, Segmentation, Threshold Optimization



1. Introduction

Nowadays, online job marketplaces such as Indeed.com, CareerBuilder, and LinkedIn serve hundreds of millions of jobseekers by connecting them to the right job opportunities. The target jobseekers of such recommender systems should not be limited to any specific segments or groups. Instead, we should try to help all jobseekers, with their variety of profiles, get their best jobs in an efficient and scalable manner.

The job-jobseeker recommendation platform is one of the most important engines that we use to help people get jobs within our organization. There are multiple ways that we recommend either jobs or jobseekers to the other side across different surfaces. Specifically, on the jobseeker-facing side, we send invite-to-apply emails or app notifications to jobseekers, and we also display a list of recommended jobs on the homepage. On the employer-facing side, we provide instant candidate recommendations to employers as soon as they publish a new job post.

Underneath the recommendation platform, we have multiple match providers, where each provider has its own way to retrieve and rank matches. In this paper, we mainly focus on the ranking stage of our longest-lived probabilistic match provider, which uses Logistic Regression models. Specifically, we have a set of Logistic Regression models to predict each step along the application funnel for every single job-jobseeker pair. In particular, we have three models in tandem. The first model predicts the probability of receiving a response (either positive or negative) from the jobseeker, given that the recommendation is made. The second one predicts the probability of getting a positive response (e.g., apply or enquiry), given that a jobseeker response is received. The third one predicts the probability of having a positive employer response (e.g., an interview schedule or a hire decision), given that the application is made by the jobseeker. Each model has its own threshold to filter out matches with low scores, and the product of all the model scores is used to rank the remaining matches.

Currently, we have only one set of models for all types of jobs and jobseekers, yet we found the performance gap is huge across different job segments in terms of their occupation. Although we already use some segment-specific data (e.g., job title, industry, etc.) as model features, the data did not seem to be good enough to represent all the explicit or implicit features associated with each segment. There are certainly many ways to improve per-segment performance, including adding more segment-specific features and training dedicated models for each segment. However, choosing the best cut-off threshold score per segment turned out to be the most practicable and effective way to achieve the goal.

In this work, we propose an efficient approach to optimize segmented job-jobseeker recommendation performance by tuning the per-segment model thresholds. Specifically, we formulate a constrained optimization problem to identify the potential improvement space per segment.

RecSys in HR'22: The 2nd Workshop on Recommender Systems for Human Resources, in conjunction with the 16th ACM Conference on Recommender Systems, September 18–23, 2022, Seattle, USA.
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)
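The three-model tandem described above, with a per-model cutoff threshold and a product score, can be sketched as follows. This is a minimal illustration with made-up linear scores and thresholds, not the production implementation.

```python
import math

def sigmoid(z):
    """Logistic function mapping a linear model score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def chain_score(linear_scores, thresholds):
    """Multiply the stage probabilities of the model chain; return 0.0
    as soon as any stage's probability falls below its cutoff threshold,
    so the pair is filtered out of the ranking."""
    total = 1.0
    for z, theta in zip(linear_scores, thresholds):
        p = sigmoid(z)
        if p <= theta:
            return 0.0  # fails this stage's cutoff
        total *= p
    return total

# Hypothetical linear outputs of the three stage models for one pair
# (jobseeker response, positive response, positive employer outcome)
scores = [2.0, 0.5, -0.2]
thresholds = [0.5, 0.3, 0.2]
print(round(chain_score(scores, thresholds), 4))  # → 0.2468
```

As the product form suggests, one low stage probability is enough to zero out a match, which is exactly the higher-precision, lower-recall behavior discussed later in Section 3.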
Three attempts are discussed, and the most feasible one is adopted in production. We also apply a greedy search algorithm to speed up the segment-specific threshold tuning process. Our prototype implementation and the corresponding A/B tests on selected segments suggested considerable improvements in terms of better recommendation quality and a higher volume of both applies and positive outcomes from the job applications.

In summary, the main contributions of this paper are as follows. We hope our work can provide a reference for similar problems in the industry.

     • We report an occupation-based segment-level investigation using real-world data from our organization.
     • We formulate a constrained optimization problem to facilitate our segmentation work in the job-jobseeker recommendation system.
     • We propose an effective way to optimize per-segment performance by tuning the thresholds of the different models.
     • We implement an automated model threshold tuning module in the pipeline, and the online experimental results from real products indicate promising performance improvements in both recommendation quality and coverage.

The rest of the paper is organized as follows. In Section 2, we discuss and review related work. In Section 3, we provide an overview of our existing recommendation platform. In Section 4, we describe the segment-level model threshold tuning that we use to optimize the recommendation performance. In Section 5, we illustrate the evaluation results on three selected segments. Finally, Section 6 concludes this work.

2. Related Work

Many existing works have studied the overall framework of efficient job recommendation systems. Kenthapadi et al. [1] discussed candidate selection, personalized relevance models, and match redistribution as the three main sub-systems in the job recommendation system at LinkedIn. Lu et al. [2] presented a hybrid ranking system that combines interaction-based and content-based features from both jobs and jobseekers, and calculates a ranking score accordingly. Shalaby et al. [3] built a graph-based job recommendation framework at CareerBuilder.com, using a similar hybrid approach that combines behavior-based and content-based data into weighted scores for ranking purposes. Diaby et al. [4] proposed a taxonomy-based job recommender system that segments both jobs and jobseekers into a taxonomy system using their occupation data.

There were also works focusing on jointly examining the resumes from jobseekers and the job descriptions from the job side, mostly for high-tech job profiles. Malinowski et al. [5] presented a probabilistic CV and job recommender that relied extensively on structured resume data from a limited number (i.e., 100) of high-skilled jobseekers. Javad et al. [6] used named entity recognition (NER) to explicitly extract skills from resumes, and further used them to facilitate the recommender system. Qin et al. [7, 8] proposed a neural network based representation to embed the skills from resumes and job descriptions, and ranked the matches based on vector similarities. Luo et al. [9] introduced adversarial learning to learn more expressive representations from similar sources. However, the majority of jobseekers in the labor market (e.g., truck drivers, retail sellers, etc.) do not have properly written resumes, if they have a resume at all. Consequently, such methods might not work well for these jobseekers.

While most existing job recommendation systems [10, 11, 12] tried to have one model work for all different job profiles, very limited work noticed the significant differences among these job and jobseeker profiles. This work, on the contrary, attempts to identify such differences and make operational optimizations correspondingly, with the objective of improving the overall recommendation performance.

3. Overview of Our Match Recommendation Platform

This section presents an overview of the match recommendation platform, and justifies why segment-level optimization is needed. In particular, we first present our probabilistic models, which are still driving a significant number of recommendations within our organization. We then study the feature distribution on both the job and jobseeker sides, and identify the performance gap across a variety of segments.

3.1. Probabilistic-based Models

Our recommendation match provider is built on top of a series of probabilistic models. Each of them takes care of a single step along the application funnel. In particular, as depicted in Figure 1, each model takes a subset of features from the job, the employer, and the jobseeker's contents (e.g., resumes, questionnaires, etc.) and behaviors (e.g., apply history, feedback from previous applications, responses to previous recommendations, inferred interests, etc.) as its input features, and outputs the probability for its own step.

More specifically, the first model focuses on whether the jobseeker responds to the recommendation, given that the
recommendation had been sent out. It can be any response, such as clicking the "apply job" or "not interested" button, unsubscribing, replying, or giving a rating. The second model focuses on whether the jobseeker actually applied for the job, given that he/she made any kind of response to our recommendation. The third model deals with the probability further along on the employer side, focusing on whether the employer sends any positive outcome for the submitted application, such as follow-up conversations to further understand the applicant, interview arrangements, or even making an offer.

Figure 1: There are three models in tandem that construct the probabilistic filtering and ranking module for job-jobseeker recommendation.

    p(posOut|sent) = p(jsClick|sent) · p(jsApply|jsClick) · p(posOut|jsApply)    (1)

We conduct both the scoring, as shown in Eq. 1, and the filtering, as shown in Eq. 2, based on this model chain. In particular, each logistic regression model follows the sigmoid function to generate the probability output p(s|g), where g is the ground truth and s refers to the stage that the model is dealing with. f(x_n) = Σ w_n x_n, where x_n refers to the input feature vector and w_n refers to the weights to be trained for each feature. At the same time, there is a customized threshold θ for each model, to filter out the matches having a low probability at that stage. Eventually, only the matches that pass all three cutoff thresholds are assigned a non-zero score representing the probability of having a positive outcome given a sent recommendation, p(posOut|sent). Finally, this score is passed into the ranking and aggregation module as the next step. It is easy to see that the multiplication leads to higher precision but lower recall, because a job-jobseeker pair is filtered out as soon as it gets a low score at any stage in the chain.

    p(s|g) = 1 / (1 + e^(−f(x_n)))  if 1 / (1 + e^(−f(x_n))) > θ,  and 0 otherwise    (2)

Figure 2: Performance difference between recommendation and other organic channels across different segments.

3.2. Performance Gap among Segments

From our historical data, we found significant performance gaps across different segments in terms of their occupation, as we used to have one unified set of models, with the same threshold values, to serve all jobs and jobseekers from different backgrounds. This observation is based on a segment-level comparison of the recommendation performance against other organic channels, where jobseekers search for and find jobs by themselves. Ideally, we expected the recommender system to consistently perform better, because it should provide better matches with higher accuracy. However, such an assumption is not always true.

Figure 2 studies the performance gap in terms of Apply Rate (AR) and Positive outcome over Apply (PoA) for the 16 biggest occupations from our real-world data. The AR metric indicates the quality of jobseeker engagement, while the PoA metric indicates the quality of employer engagement. It is clear that a group of segments (mostly blue-collar jobs) have low AR but okay PoA, indicating that the model gets higher precision but lower recall there, while a few other segments (mostly white-collar jobs) suffer from low PoA but okay AR, indicating that the model gets higher recall but lower precision. Overall, we believe all these segments have considerable room for improvement, but from different initial locations and in different directions towards the top-right corner shown in Figure 2.

Note that we cannot directly compare the absolute metrics across different segments, because the performance can be affected by the nature of the segment rather than by the recommendation quality.

This examination motivates us to further check whether these segments are big enough to try segment-level optimization. If so, we also would like to understand the reasons that lead to such performance gaps.
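The AR and PoA metrics above can be computed per segment with a simple offline aggregation. A hedged sketch over made-up interaction records (the field names and the exact metric definitions here are illustrative assumptions, not the production schema):

```python
from collections import defaultdict

def segment_metrics(events):
    """Per-segment Apply Rate (AR = applies / recommendations sent) and
    Positive outcome over Apply (PoA = positive outcomes / applies)."""
    sent = defaultdict(int)
    applies = defaultdict(int)
    positive = defaultdict(int)
    for e in events:
        seg = e["occupation"]
        sent[seg] += 1
        if e["applied"]:
            applies[seg] += 1
            if e["positive_outcome"]:
                positive[seg] += 1
    return {
        seg: {
            "AR": applies[seg] / sent[seg],
            "PoA": positive[seg] / applies[seg] if applies[seg] else 0.0,
        }
        for seg in sent
    }

# Made-up interaction records for two hypothetical segments
events = [
    {"occupation": "security_guard", "applied": True,  "positive_outcome": True},
    {"occupation": "security_guard", "applied": False, "positive_outcome": False},
    {"occupation": "software_dev",   "applied": True,  "positive_outcome": False},
    {"occupation": "software_dev",   "applied": True,  "positive_outcome": True},
]
print(segment_metrics(events))
```

Plotting the resulting (AR, PoA) pairs per occupation is essentially how a chart like Figure 2 is produced.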
Figure 3: There are multiple top-level occupations in the job market. Each occupation accounts for a certain percentage, but no single one clearly dominates the whole population.

3.3. Segment-level Investigation

Figure 3 shows the segment distribution in terms of the number of active jobs, based on their top-level occupation, within our organization in 2022H1. We clearly serve a full spectrum of jobs and jobseekers from a variety of occupations, without any single occupation clearly dominating the whole population. Every occupation-based segment occupies a certain portion of the job market. As a result, segment-level optimization can reasonably be expected to benefit the overall performance.

We next want to examine whether the performance gap originates from the different feature distributions among different segments. Specifically, we look at a mixture of blue-collar and white-collar jobs, on both the job and jobseeker sides. As expected, blue-collar jobseekers (e.g., delivery drivers, retail sellers, etc.) tend to have much shorter resumes, which in turn makes skill and experience extraction, or even resume embedding, less representative than for white-collar jobseekers (e.g., software developers, technical managers, etc.). A similar pattern can be observed on the job side too, where white-collar jobs tend to list more job requirements in terms of hard skills and experience, while blue-collar jobs tend to focus more on licences and soft skills.

These observations lead us to reconsider whether our existing approach, using the same model set with the same threshold setting, is good enough to handle all these cases. Although we already use some segment-specific data (e.g., job title, industry) as model features, we suspect they might not be representative enough to properly differentiate the specific requirements. As a result, we work on a few different approaches to segment-level optimization, and discuss their feasibility based on our real-world experience in the next section.

4. Segment-level Optimization

There are a number of possible ways to do segment-level optimization for our probabilistic recommendation system. In this section, we report three different attempts that we have tried. For each attempt, we evaluate not only its effectiveness, but also its scalability in the long run.

4.1. First Attempt: dedicated models per segment

The most intuitive solution that first came to us was to build dedicated sets of models for each segment. We selected a list of low-performing occupation-based segments (i.e., Security Guard, Retail Store Manager, and Quick Service Server) according to Figure 2, and trained a dedicated set of models for each segment. As a result, every segment got three different models as shown in Figure 1, trained using only the historical dataset from that specific segment.

Surprisingly, the initial experimental results did not align well with our expectations, showing mixed signals in terms of recommendation quality and volume. In particular, for all three experimented segments, we observed significant decreases in the Applystart Rate (AR) or Positive outcome over Apply (PoA), ranging from -9.8% to -15.6%, alongside improvements in the number of apply starts and positive outcomes ranging from 6.0% to 13.8%. However, our expectation for the dedicated models was to see considerable improvements on all the key metrics at the same time.

After a close examination of the approach and the corresponding models, we found three major issues that led to the disappointing results. First, we did not set up a formulation to properly represent the overall objective. Consequently, we did not even have a clearly defined expectation and target for the optimization at the very beginning. Second, we over-emphasized the model training part, whereas we missed the fact that the cutoff thresholds play an even more important role in trading off precision and recall. Therefore, we believe the dedicated models still need careful threshold tuning to maximize their benefit. Lastly, we noticed that many other ongoing initiatives (such as an alternative way of embedding features, or adding new features, etc.) from other members of our organization kept improving the baseline models, while our treatment models were kept unchanged during the experiment. This made the experimental comparison inconsistent over time. More importantly, we could not fix this easily, because large-scale model auto-updates together with parameter fine-tuning would be too expensive in terms of both the initial engineering effort and the subsequent infrastructural maintenance.
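Conceptually, this first attempt partitions the training history by occupation and fits one model set per segment. A minimal sketch of that partitioning, where the constant-rate trainer is only a hypothetical stand-in for the real per-stage Logistic Regression training, and the record fields are assumptions:

```python
from collections import defaultdict

def train_stage_model(rows):
    """Stand-in trainer: memorizes the positive-label rate of its rows.
    The real pipeline fits a logistic regression per funnel stage."""
    rate = sum(r["label"] for r in rows) / len(rows)
    return lambda features: rate  # constant-probability "model"

def train_per_segment(history):
    """Build a dedicated set of funnel-stage models for each occupation
    segment, trained only on that segment's own history."""
    by_segment = defaultdict(lambda: defaultdict(list))
    for row in history:
        by_segment[row["occupation"]][row["stage"]].append(row)
    return {
        seg: {stage: train_stage_model(rows) for stage, rows in stages.items()}
        for seg, stages in by_segment.items()
    }

# Made-up training history covering one funnel stage for two segments
history = [
    {"occupation": "security_guard", "stage": "response", "label": 1},
    {"occupation": "security_guard", "stage": "response", "label": 0},
    {"occupation": "software_dev",   "stage": "response", "label": 1},
]
models = train_per_segment(history)
print(models["security_guard"]["response"]({}))  # → 0.5
```

The sketch also hints at the data-volume problem mentioned above: small segments yield small per-segment training sets, which is one reason this attempt showed mixed results.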
4.2. Second Attempt: online reinforcement learning with multi-armed bandits

Learning the lessons from our first attempt, we wanted to formulate an optimization problem that appropriately captures our task and objective. In particular, we want to simultaneously improve both recommendation quality and volume on all the key metrics, including applystart volume, positive outcome volume, applystart rate, and positive outcome over apply, while focusing slightly more on AR for the low-AR segments, or on positive outcomes for the low-PoA segments. The control variables that we can operate on are the thresholds of each model per segment.

    max_θ   Σ_{i ∈ {a, ar, p, poa}} λ_i Δ_i(θ)                                   (3)
    s.t.    0 < λ_i < 1,  i ∈ {a, ar, p, poa}                                    (4)
            Σ_{i ∈ {a, ar, p, poa}} λ_i = 1                                      (5)
            Δ_i(θ) = (new_i(θ) − base_i) / base_i > 0,  i ∈ {a, ar, p, poa}      (6)
            slo_j(θ) < SLO_j − ε_j,  j ∈ {unsub, neg}                            (7)

As a result, we formulate a constrained optimization problem as shown in Equations 3 to 7. Specifically, the objective function aims to maximize the weighted combination of all the key metrics, including apply start volume a, apply start rate ar, positive outcome volume p, and positive outcome over apply poa. For each metric, λ_i represents the weight that lets us shift focus between low-AR and low-PoA segments, and Δ_i indicates the corresponding performance improvement. Meanwhile, there are a few Service Level Objectives (SLOs) that we must meet, including that the unsubscription rate must be lower than 0.05%, and that the negative feedback ratio from jobseekers must be lower than 25% of all feedback. These SLOs are hard requirements, so we even add a marginal buffer ε to each constraint. Both the Δ and slo metrics are affected by the threshold setting θ. With these definitions, our task is to find the optimal model thresholds for each segment that maximize the objective function while fulfilling all the constraints.

With the clearly defined objective (or reward) function and constraints, one possible way is to adopt multi-armed bandits as a reinforcement learning approach to find the optimal solution in the online environment. Specifically, we can set up multiple test groups in production, where each group has a different threshold setting. We then keep monitoring the performance on the objective value and the constraints, and gradually shift the traffic allocation towards the better-performing variants. Sampling methods such as Thompson sampling could speed up the convergence rate to some extent.

However, a list of issues still prevented us from running efficient multi-armed bandit tests for our segmented threshold optimization. First, the underlying baseline models were being iterated on in parallel, resulting in inconsistent and unreliable comparisons among different treatment groups with fixed threshold settings. Second, the online reinforcement learning could take a long time to converge, especially for the target segments with small sample sizes. Last but not least, we also suffered from delayed data from the upstream data sources, considering that signals from the employer side (e.g., interview schedules and results, etc.) could take up to a few weeks to come back after an application had been made. Consequently, this attempt is unfortunately also impractical for our problem.

4.3. Third Attempt: offline threshold tuning per segment

Learning from the previous two failed attempts, we confirmed that fine-tuning the thresholds for each segment could be the feasible solution to optimize the performance, but it was not practical to find the optimal solution through a reinforcement learning approach over online iterations. As a result, we came up with our third attempt: a proper offline evaluation algorithm based on the historical job and jobseeker interaction data from all the channels.

Algorithm 1 describes the proposed greedy search process to find the optimal threshold settings per segment in an efficient manner. Specifically, the algorithm takes a few different inputs, including the models (i.e., the Jobseeker Response (JR) model, the Jobseeker Apply (JA) model, and the Positive Outcome (PO) model) as discussed in Section 3, the historical data for per-segment model performance evaluation, and a default threshold setting. We select an upper bound u and a lower bound l for each model respectively, by adding and subtracting a range around the default value. We then follow a greedy search process to find the optimal settings that achieve the best performance P = (O, a, ar, p, poa) in terms of the objective value O and the four key metrics defined in Equations 3 to 7. The expected outputs are the optimal threshold settings θ_o per segment, which achieve no worse performance than the default ones.

The greedy part originates from the fact that the JA model threshold correlates well with the applystart rate, and the same pattern applies to the PO model threshold and the positive outcome over apply. On the other hand, when we increase any model threshold, the applystart and positive outcome volumes can only go down or at most stay flat. As a result, if we want to improve both quality and volume on
the key metrics at the same time, we need to search the JA and PO model thresholds in different directions, starting from the default values. In addition, once we reach a boundary, by either observing a volume smaller than the default when increasing a threshold, or a ratio smaller than the default when decreasing a threshold, we do not need to search further in that direction. However, the JR model does not display a clear relationship with our targeted key metrics; therefore, we still do a full grid search over the JR threshold. The search breaks early upon observing an adverse volume or ratio change, or SLO violations. In this way, the proposed algorithm can be 10x to 15x faster than the full grid search, by skipping these regions. The speed-up factor could be even bigger when we have limited knowledge about a new segment, thus requiring a wider search boundary with smaller steps. The greedy algorithm enables us not only to run the optimal threshold search process more frequently and efficiently for every model update, but also to scale the optimization process up to all the segments.

Algorithm 1 Greedy Threshold Searching Algorithm

Require: models JR, JA, PO
Require: historical job-jobseeker interactions
Require: default threshold set θ_d

function greedySearch(θ, P, m1, m2)
    for θ_m1 in range(l_m1, θ_d(m1), s_m1) do
        θ_t ← (θ_jr, θ_m1, θ_d(m2))
        if getVol(θ_t)(m1) < P(m1) then
            break
        end if
        for θ_m2 in range(u_m2, θ_d(m2), −s_m2) do
            θ_t ← (θ_jr, θ_m1, θ_m2)
            if getRatio(θ_t)(m2) < P(m2) then
                break
            end if
            if meetSlo(θ_t) and getObj(θ_t) > P(O) then
                θ ← θ_t
                P ← getPerf(θ)
            end if
        end for
    end for
    return θ, P
end function

θ_o ← θ_d
a ← getVol(θ_d)(JA), ar ← getRatio(θ_d)(JA)
p ← getVol(θ_d)(PO), poa ← getRatio(θ_d)(PO)
O ← getObj(θ_d)
P ← (O, a, ar, p, poa)
for θ_jr in range(l_jr, u_jr, s_jr) do
    θ_o, P ← greedySearch(θ_o, P, JA, PO)
    θ_o, P ← greedySearch(θ_o, P, PO, JA)
end for
return optimal threshold set θ_o per segment

Figure 4: A 3D illustration of the greedy threshold search regions on the Security Guard segment, with the JA and PO model thresholds as the x and y axes, and the objective value as the z-axis. It is clear that we only need to search the region with non-zero objective values, and can skip all the zero regions.

5. Performance Evaluation

Following the third attempt discussed in the previous section, we further integrated the algorithm into our model training pipeline as a prototype implementation. The performance is evaluated through proper online A/B testing on real recommendation products. The experimental results demonstrate promising signals for all three selected segments. Moreover, the approach is generally applicable to all the segments.
search on the JR model in the outer loop.                         5.1. Prototype Implementation
   Figure 4 illustrates an example of searching the optimal
threshold settings for Security Guard over three dimen-           Under our current model pipeline, the unified model set
sions (i.e., JA and PO model threshold serve x-axis and           is retrained upon either the regular daily update or a
y-axis, while the objective value is set along z-axis). It is     production release on various other model improvement
obvious that we only need to check the areas with non-            initiatives. Previously, the retained models would be
zero objective values. Whereas most of the areas with             put into the production model storage, thus they can be
zero value can be ignored due to the negative volume              directly invoked by the online recommendation system
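As a concrete illustration, the search loop of Algorithm 1 can be sketched in Python. This is a minimal sketch under stated assumptions, not the production implementation: getVol, getRatio, getObj, meetSlo, and getPerf stand in for the evaluation routines that replay historical job-jobseeker interactions, and are replaced here by synthetic closed-form proxies so the sketch runs end to end.

```python
import numpy as np

# Synthetic stand-ins (illustrative assumptions) for the evaluation helpers
# of Algorithm 1: lower thresholds send more matches (volume), higher
# thresholds improve the quality ratio.
def get_vol(theta, m):
    return 1.0 - theta[m]

def get_ratio(theta, m):
    return theta[m]

def get_obj(theta):
    # Toy concave objective; the JR threshold has no effect here, mirroring
    # its unclear relationship with the key metrics.
    return max(0.0, 1.0 - (theta["JA"] - 0.3) ** 2 - (theta["PO"] - 0.7) ** 2)

def meet_slo(theta):
    return True  # serving-constraint placeholder

def get_perf(theta):
    return {"obj": get_obj(theta),
            "vol": {m: get_vol(theta, m) for m in ("JA", "PO")},
            "ratio": {m: get_ratio(theta, m) for m in ("JA", "PO")}}

def greedy_search(t_jr, best, perf, default, lower, upper, step, m1, m2):
    """Inner routine: sweep the m1 threshold up from its lower bound and the
    m2 threshold down from its upper bound, pruning a direction as soon as
    volume (m1) or ratio (m2) drops below the best performance so far."""
    for t1 in np.arange(lower[m1], default[m1], step[m1]):
        theta = {"JR": t_jr, m1: t1, m2: default[m2]}
        if get_vol(theta, m1) < perf["vol"][m1]:
            break
        for t2 in np.arange(upper[m2], default[m2], -step[m2]):
            theta[m2] = t2
            if get_ratio(theta, m2) < perf["ratio"][m2]:
                break
            if meet_slo(theta) and get_obj(theta) > perf["obj"]:
                best, perf = dict(theta), get_perf(theta)
    return best, perf

def optimize_segment(default, lower, upper, step):
    """Outer loop: full grid over the JR threshold, greedy over JA and PO."""
    best, perf = dict(default), get_perf(default)
    for t_jr in np.arange(lower["JR"], upper["JR"], step["JR"]):
        best, perf = greedy_search(t_jr, best, perf, default, lower, upper,
                                   step, "JA", "PO")
        best, perf = greedy_search(t_jr, best, perf, default, lower, upper,
                                   step, "PO", "JA")
    return best, perf

best, perf = optimize_segment(
    default={"JR": 0.5, "JA": 0.5, "PO": 0.5},
    lower={"JR": 0.3, "JA": 0.2, "PO": 0.2},
    upper={"JR": 0.7, "JA": 0.8, "PO": 0.8},
    step={"JR": 0.1, "JA": 0.1, "PO": 0.1},
)
```

The volume and ratio pruning in the two `break` statements corresponds to the zero-value regions skipped in Figure 4, which is where the speed-up over a full grid search comes from.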
                 Segment                   Positive Outcomes    POs/Applies    Apply Starts    ASs/Sends
                 Security Guard                +17.29%            +20.61%        +2.46%        +14.87% ↑
                 Retail Store Manager          +86.49% ↑           +8.16%       +95.51% ↑      +38.05% ↑
                 Quick Service Server           +1.02%             +4.30%       -11.54%        +30.72% ↑
Table 1
Final performance evaluation via online A/B experiments on the selected segments with threshold optimization, where ↑ denotes a
statistically significant increase (𝛼 = 0.05 with a two-sided t-test). Almost all segments showed promising improvements compared
with the baseline group, which uses a default threshold for all segments.
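The objective getObj maximized by the search can be read as a weighted sum of the relative changes in the four key metrics against the default-threshold baseline, with all four weights λ set equally to 0.25 in our experiments. The sketch below is an illustrative assumption of that interface: the metric keys and sample values are invented, not the production code.

```python
# Illustrative sketch of a weighted objective over the four key metric
# changes tracked in Algorithm 1 (JA volume/ratio, PO volume/ratio).
# Metric keys and baseline numbers are hypothetical; the equal weights
# lambda = 0.25 follow the experiment setup.
WEIGHTS = {"ja_vol": 0.25, "ja_ratio": 0.25, "po_vol": 0.25, "po_ratio": 0.25}

def relative_change(candidate, baseline):
    return (candidate - baseline) / baseline

def objective(candidate, baseline, weights=WEIGHTS):
    """Positive score: the candidate thresholds improve the weighted mix of
    volume and quality metrics over the default-threshold baseline."""
    return sum(w * relative_change(candidate[k], baseline[k])
               for k, w in weights.items())

baseline = {"ja_vol": 1000.0, "ja_ratio": 0.20, "po_vol": 150.0, "po_ratio": 0.05}
candidate = {"ja_vol": 1100.0, "ja_ratio": 0.22, "po_vol": 165.0, "po_ratio": 0.05}
score = objective(candidate, baseline)  # three metrics up 10%, one flat
```

Different weight settings would steer the search toward volume or quality; the equal-weight setting treats the four metric changes as equally important.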



In that setup, a fixed set of default threshold values was used for all segments.

Figure 5: The functional workflow of integrating our segmented auto-threshold tuning module into the model training and adaptation pipeline.

   Figure 5 describes the conceptual design of our prototype implementation on top of the existing pipeline. Specifically, we introduce a new stand-alone segment auto-threshold tuning module to host our algorithm, and plug it into the offline stage. The module generates the optimal cutoff threshold settings per segment for the newly trained models, before the online recommender system actually uses them. As a result, we make sure the online matches always proceed with the optimal thresholds for each segment. In the experiment, we set the weight 𝜆 equally at 0.25 for all four key metric changes. Different 𝜆 settings could have various impacts on the objective values, and thus on the optimal threshold settings, but due to the page limitation, we do not dig deeper into this point. Note that we take the offline historical interactions for all the job-jobseeker pairs from all other channels as the inputs to our algorithm, because this allows us not only to check the potential impact if we increase the threshold, but also to estimate the impact if we decrease the threshold for certain models.

5.2. Online A/B Test Results

We continue to focus on the same three low-performing segments (i.e., Security Guard, Retail Store Manager, and Quick Service Server), but with a more rigorous online A/B testing plan. In particular, we run the online A/B experiments for two weeks, with the power analysis indicating that at least the apply start volume or ratio should be able to reach statistical significance given a 15% estimated effect size within this period.
   Table 1 elaborates the experimental results from the online A/B testing, with ↑ denoting a statistically significant change for that metric. In particular, we find that all three segments show significant improvements in terms of positive outcome volume, positive outcomes over applies, and apply start ratio. Such observations align with our offline evaluation. However, it is worth noting that while the Security Guard and Retail Store Manager segments also have considerable improvements in apply start volume, the Quick Service Server segment displays a negative signal at around -11%. Although we believe our offline evaluation is overall reliable enough for most segments, there are still differences between the online and offline data due to the usage of historical data from other sources.
   While low-performing segments are more likely to have bigger improvement space, as we observed in the experiment, the proposed approach is generally applicable to all segments. In particular, the optimization framework can first figure out whether a specific segment can be improved by threshold tuning. In the worst case, the existing threshold settings are already in the optimal range, and our proposed algorithm can quickly confirm this. But if there is an opportunity, we can accurately identify it, and further find the optimal settings accordingly.


6. Conclusion

In this work, we presented an effective solution to improve the performance of our job-jobseeker recommendation system. Specifically, we started by identifying the performance gap among different segments, followed by segment-level investigations. We then reported three different attempts, and arrived at the most feasible approach: tuning the model thresholds per segment. The detailed solution was presented, including a proper formulation as a constrained optimization problem, an efficient algorithm to speed up the threshold optimization process, and the prototype implementation. Finally, online A/B tests on real products demonstrated performance improvements in terms of both recommendation quality and quantity.
   Our future work will mainly follow three avenues. First, we are going to scale up the auto threshold optimization to more segments, and also figure out how to minimize the difference between the offline evaluation and the actual online performance. Second, we will evaluate whether similar segmentation work can benefit other match providers that are based on more sophisticated models (e.g., neural networks, or deep collaborative filtering). Third, we will extend our segmentation optimization work for the same match provider to our international markets, where user behaviors and job requirements can differ even for the same occupation across different countries and markets.