<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Pseudo-online Measurement of Retrieval Recall for Job Recommendations - A case study at Indeed</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Liyasi Wu</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yi Wei Pang</string-name>
          <email>ywpang@indeed.com</email>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Indeed.com</string-name>
        </contrib>
      </contrib-group>
      <abstract>
        <p>A typical large-scale recommender system in production involves several key stages: retrieval, filtering, scoring, and ordering. As the process unfolds, the quantity of recommendations decreases, ideally enhancing their quality. While many metrics such as NDCG, recall@k, and precision@k have been employed in offline evaluations, there have been observations that improvements in such offline metrics do not lead to gains in common online metrics, such as click-through and conversion rates, especially at the early stages of the recommender system. In this paper, we introduce a case study at Indeed where we designed a pseudo-online metric adapted from the traditional recall@k to measure the effectiveness of the retrieval stage within our job recommender system.</p>
      </abstract>
      <kwd-group>
        <kwd>Recommendation system</kwd>
        <kwd>Retrieval efficiency</kwd>
        <kwd>Recall</kwd>
        <kwd>Funnel analysis</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Indeed sends millions of job recommendations daily to job
seekers through various channels (e.g. onsite job
recommendation feed, job recommendation emails). These job
matches are generated by leveraging job seekers’ profile,
behavioral data on Indeed, and job data to predict the most
relevant jobs for each individual. Following an
industrywide pattern, our recommendation system utilizes a
multistage process that includes retrieval, filtering, scoring, and
ordering [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The retrieval stage consists of multiple
services, termed match providers, each employing different
strategies to efficiently retrieve matches. Matches retrieved
are processed and filtered through layers of business logic
before the final ranking stage where they are scored and
ranked before the top N matches are selected to be sent to
the job seeker.
      </p>
      <p>RecSys in HR’24: The 4th Workshop on Recommender Systems for Human
Resources, in conjunction with the 18th ACM Conference on Recommender Systems.</p>
      <p>
        To continuously measure the effectiveness of our job
recommendation system, business metrics such as
click-through rate (apply button clicks) and conversion rate (job
applications submitted) have been used for performance
tracking and A/B experiments in online settings. While
simple and effective, these metrics do not consider intermediate steps
in the recommendation process and are often heavily
influenced by the final ranking model. Improvements made at
the retrieval stage could be undermined by biases in
the final ranking models, and as observed in other studies
[
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ], we have also encountered scenarios where offline
evaluation results of individual retrieval strategies do not
translate to improvements in online performance. It is
therefore imperative to find better ways to evaluate and measure the
effectiveness of our retrieval strategies.
      </p>
      <p>
        Recall is a metric commonly used to measure recommendation
quality in offline evaluations [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], where it considers the
proportion of relevant matches produced by a recommendation
algorithm. It is suitable for measuring early stages in the
recommendation system, where the main focus is to ensure
that the most relevant matches are selected from a large pool
of candidates. In this paper, we present how recall was
adapted and employed as a business metric for measuring
the efficiency of our retrieval strategies in an online setting,
and discuss how it has provided us with better
explainability and insights into our recommendation system. As our
approach involves performing some post-aggregation and
processing of collected online signals, we describe it as a
pseudo-online measurement of recall.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Definition of Retrieval Recall and its Variants</title>
      <p>The concept of recall we explore in this paper is an
adaptation of the traditional recall metric commonly used in the
field of recommender systems:</p>
      <p>Recall = (No. of recommended relevant matches) / (Total number of relevant matches) (1)</p>
      <p>In our job recommendation system, a relevant match
refers to a job match that receives positive feedback from
the job seeker, such as an apply button click on Indeed. A
recommended match refers to a job match that was retrieved
by a match provider or one that passes a certain step of the
recommendation pipeline. We note that relevant matches
on Indeed do not solely come from our recommendation
system, but also include other channels, such as external
links, searches, social media, etc. Indeed is ranked as the
#1 job site worldwide1 and attracts over 350 million unique
visitors globally each month2. Given this scale, Indeed’s
application data serves as a reasonably representative proxy
for relevant matches in general.</p>
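      <p>As a minimal illustration of equation (1) (this is not our production code; the (job seeker, job) pair identifiers are hypothetical), recall can be computed from two sets of matches:</p>

```python
# Minimal sketch of equation (1): recall as the fraction of relevant
# matches that were also recommended. Pair IDs are illustrative only.
def recall(recommended: set, relevant: set) -> float:
    """Recall = |recommended ∩ relevant| / |relevant|."""
    if not relevant:
        return 0.0
    return len(recommended.intersection(relevant)) / len(relevant)

# (job_seeker_id, job_id) pairs
recommended = {("js1", "jobA"), ("js1", "jobB"), ("js2", "jobC")}
relevant = {("js1", "jobA"), ("js2", "jobC"), ("js2", "jobD"), ("js3", "jobE")}

print(recall(recommended, relevant))  # 2 of 4 relevant matches covered -> 0.5
```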
      <sec id="sec-2-3">
        <p>When considering the total number of relevant matches,
appropriate time bounds for determining the pool of
relevant matches have to be set. There are two possible
definitions: forward recall and backward recall.</p>
        <p>For simplicity, we introduce the definition of forward and
backward recall in context of match providers, but these
definitions can be easily generalized to other steps of the
recommendation pipeline, such as filtering, scoring and
ordering. We will see an example of this in section 4.3.</p>
        <sec id="sec-2-3-1">
          <title>2.1. Forward Recall</title>
          <p>In this definition, we aim to measure the immediate impact
of the recommendations generated by the match provider.
Starting with the job matches recommended by a match
provider on a particular day, we check them against all
relevant job matches in the subsequent few days:</p>
        </sec>
      </sec>
      <sec id="sec-2-5">
        <p>Forward Recall = (No. of relevant matches recommended on a single day) / (Total number of relevant matches in the next x days) (2)</p>
        <p>This approach allows us to assess how effectively each
match provider can capture potential relevant matches
shortly after their generation.</p>
        <sec id="sec-2-5-1">
          <title>2.2. Backward Recall</title>
          <p>Conversely, backward recall assesses a match provider’s
historical ability to predict and generate relevant job matches.
Starting from the set of relevant matches recorded on a
particular day, we check for the matches that have been
recommended by each match provider in the prior few days.</p>
        </sec>
      </sec>
      <sec id="sec-2-7">
        <p>Backward Recall = (No. of relevant matches recommended in the past x days) / (Total number of relevant matches on a single day) (3)</p>
        <p>This measurement provides insights into the lasting
relevance of the recommendations made by the match provider.</p>
        <sec id="sec-2-7-1">
          <title>2.3. Comparison</title>
          <p>The main difference between the above two definitions lies
in how the samples for the numerator and denominator are
obtained for the recall calculation. We note that each approach
will yield a different value of recall, and results from
different definitions should not be compared with one another.
However, in our practice, when we rank match providers
by recall, those with high recall in one method, whether
forward or backward, also rank high in the other method,
leading to the same order.</p>
          <p>Additionally, the parameter x is used to limit the amount
of data under consideration. Its main purpose is to account
for potential delays between the time a job match is
generated and the time a job seeker might respond to the
recommendation, which could take a few days, especially for our
email recommendation channel. We chose x = 7 days (a week)
in our implementation.</p>
          <p>1: Comscore, Total Visits, June 2023. 2: Indeed Internal Data, average monthly Unique Visitors, April - July 2023.</p>
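          <p>The two variants can be sketched over dated (day, match) events as follows. This is an illustrative reading of equations (2) and (3) with the x = 7 day window, not our production implementation; the event data and names are hypothetical.</p>

```python
from datetime import date, timedelta

# Sketch of forward recall (eq. 2) and backward recall (eq. 3) over
# dated (day, match) events, with the x = 7 day window from section 2.3.
def forward_recall(recommended, relevant, day, x=7):
    """Matches recommended on `day`, checked against relevant matches
    appearing in the next x days."""
    recs = {m for (d, m) in recommended if d == day}
    rel = {m for (d, m) in relevant if (d - day).days in range(1, x + 1)}
    return len(recs.intersection(rel)) / len(rel) if rel else 0.0

def backward_recall(recommended, relevant, day, x=7):
    """Relevant matches on `day`, checked against matches recommended
    in the prior x days."""
    rel = {m for (d, m) in relevant if d == day}
    recs = {m for (d, m) in recommended if (day - d).days in range(1, x + 1)}
    return len(recs.intersection(rel)) / len(rel) if rel else 0.0

d0 = date(2024, 1, 10)
recommended = [(d0, "jobA"), (d0, "jobB"), (d0 - timedelta(days=2), "jobC")]
relevant = [(d0 + timedelta(days=1), "jobA"), (d0 + timedelta(days=3), "jobD"),
            (d0, "jobC"), (d0, "jobE")]

print(forward_recall(recommended, relevant, d0))   # jobA of {jobA, jobD} -> 0.5
print(backward_recall(recommended, relevant, d0))  # jobC of {jobC, jobE} -> 0.5
```

          <p>As noted above, the two functions sample different pools, so their values are not comparable with each other, even on the same data.</p>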
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Implementation</title>
      <sec id="sec-3-1">
        <title>3.1. Building the Dataset for Retrieval Recall</title>
        <p>To measure the retrieval recall in our recommendation
system, we made use of the backward recall definition above.
On a given day, relevant matches logged on the Indeed
site (i.e. matches that received an apply button click) were
picked as the base event and a left join was done with the
retrieved matches logged in the previous 7 days.</p>
        <p>The resulting dataset was made available as a table on our
internal data analytics platform, and could be easily queried
for analysis on retrieval strategies. The key columns in our
table for analyzing retrieval recall in our job seeker to job
recommendation system are as follows:
• Job and job seeker ID
• Metadata: Additional information relating to the job
or job seeker which could be useful in filtering the
data for segmented analysis (e.g. locale information)
• Apply button click flag: Boolean flag that
indicates that the match received an apply button click
by the job seeker. With the existing definition of a
relevant match, this will be true for all records.
However, this could change if the definition of a relevant
match expands to incorporate other user signals in
the future.
• Match providers covering the match: An
array of match providers that were able to generate
the particular relevant match in the previous 7 days.</p>
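        <p>The left join described above can be sketched with pandas as follows. This is a simplified sketch, not our internal pipeline; the table and column names are hypothetical.</p>

```python
import pandas as pd

# Sketch of the backward-recall dataset build (section 3.1): relevant
# matches (apply clicks) on a given day, left-joined against matches
# retrieved in the prior 7 days. All names and rows are illustrative.
clicks = pd.DataFrame({
    "job_seeker_id": ["js1", "js1", "js2"],
    "job_id": ["jobA", "jobB", "jobC"],
    "apply_click": [True, True, True],  # true for all records by definition
})
retrieved = pd.DataFrame({
    "job_seeker_id": ["js1", "js1", "js2"],
    "job_id": ["jobA", "jobA", "jobD"],
    "match_provider": ["embedding", "collaborative", "embedding"],
})

# Collapse retrieval logs to one row per match, with the array of
# match providers that covered it in the window.
providers = (retrieved.groupby(["job_seeker_id", "job_id"])["match_provider"]
             .apply(list).reset_index(name="match_providers"))

dataset = clicks.merge(providers, on=["job_seeker_id", "job_id"], how="left")
covered = dataset["match_providers"].notna()
print(f"retrieval recall: {covered.mean():.2f}")  # 1 of 3 clicks covered -> 0.33
```

        <p>Rows where <italic>match_providers</italic> is empty correspond to relevant matches that no provider retrieved, which is what the "missed recall" analysis in section 4.3 examines.</p>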
      </sec>
      <sec id="sec-3-3">
        <title>3.2. Delayed Signals</title>
        <p>Indeed collects several other delayed positive signals along
the job application funnel after the initial apply button click
(e.g. job application submitted, positive response from
employers, etc.). We incorporated these signals into our dataset
as well by joining the resulting dataset with the
corresponding event logs. This enabled tracking and measuring recall
of additional user signals associated with job
recommendations. We will introduce some delayed signals in section
4.1.</p>
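        <p>Incorporating a delayed signal amounts to another left join against the corresponding event log. A minimal sketch, again with hypothetical names and data:</p>

```python
import pandas as pd

# Sketch of section 3.2: enrich the recall dataset with a delayed
# signal (application submitted) by joining the matching event log.
dataset = pd.DataFrame({
    "job_seeker_id": ["js1", "js1", "js2"],
    "job_id": ["jobA", "jobB", "jobC"],
    "apply_click": [True, True, True],
})
applications = pd.DataFrame({
    "job_seeker_id": ["js1"],
    "job_id": ["jobA"],
    "application_submitted": [True],
})

enriched = dataset.merge(applications, on=["job_seeker_id", "job_id"], how="left")
# Matches with no application event get False rather than missing.
enriched["application_submitted"] = enriched["application_submitted"].fillna(False)
print(enriched["application_submitted"].tolist())  # [True, False, False]
```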
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Real-world Applications</title>
      <sec id="sec-4-1">
        <title>4.1. Example 1: Better Indication of Retrieval Stage Performance in an Online Setting</title>
        <p>Common business metrics (i.e. click-through rate,
conversion rate) used in online experiments for recommender
systems only capture the impact of matches that were
actually delivered to the user. In many cases, we may not
be able to see the actual improvement from a positive change
made to a match provider, possibly due to the business logic,
filtering, and ranking stages that occur subsequently. In one
of our online experiments, we tested a product change to an
embedding-based match provider supporting Indeed’s job
recommendation email channel. Note that there are several
delayed signals tracked for this channel, including the
number of applications, the number of positive connections,
and the number of positive outcomes.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.2. Example 2: Tracking Match Providers’ Recall Change Over Time</title>
        <p>Apart from analyzing the performance of the entire retrieval
stage, we also made use of retrieval recall to track
improvements for individual match providers.</p>
        <p>For example, we had implemented multiple technical
enhancements in one of our match providers, denoted by M.
Each rollout was backed by online A/B experiments, and
we anticipated seeing the actual improvements once all
enhancements were deployed to production. We observed
that our measurement of retrieval recall was able to better
capture the impact of these improvements.</p>
        <p>To illustrate this, we compare two different methods
(figure 2):
1. Application Share Change Over Time: Application
share is defined as the number of applications in job
recommendation emails attributed to M divided
by the total number of applications in job
recommendation emails. Despite the multiple enhancements,
the share of applications of M did not increase. This
can be attributed to other changes being made across
the recommender system, such as the addition of
new match providers whose matches might replace
those retrieved by M, given the limits on the number
of recommendations we can deliver to each user.
2. Recall Change Over Time: Recall is defined as the
number of applications covered by M divided by the
total number of applications on Indeed. We observe
a clear increasing trend in M’s recall. This indicates
that the match provider is effectively identifying a
higher number of relevant matches over time, even if
this improvement is not reflected in the application
share.</p>
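        <p>The contrast between the two methods can be sketched as below. The weekly counts are invented for illustration and are not Indeed figures; they are chosen only to show a flat application share alongside a rising recall.</p>

```python
# Sketch contrasting the two tracking methods of section 4.2 for a
# match provider M. All counts are illustrative; in practice they are
# aggregated from daily application logs.
weeks = [
    # before vs. after the enhancements to M
    {"apps_from_M_email": 200, "total_apps_email": 1000,
     "apps_covered_by_M": 2000, "total_apps_indeed": 100000},
    {"apps_from_M_email": 210, "total_apps_email": 1100,
     "apps_covered_by_M": 3500, "total_apps_indeed": 100000},
]

for w in weeks:
    # Method 1: M's share of applications within the email channel.
    share = w["apps_from_M_email"] / w["total_apps_email"]
    # Method 2: fraction of all applications on Indeed covered by M.
    recall = w["apps_covered_by_M"] / w["total_apps_indeed"]
    print(f"share={share:.3f} recall={recall:.3f}")
# share stays roughly flat (email slots are capped) while recall rises
```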
      </sec>
      <sec id="sec-4-6">
        <title>4.3. Example 3: Job-Recommendation Email Funnel’s Effectiveness at Each Stage</title>
        <p>As our recall measurement covers the first stage of the
recommendation pipeline, we can further analyze the drop-off
of relevant retrieved matches through the subsequent stages.
Ideally, while recall decreases through the recommendation
funnel, precision should improve. By calculating the
recall at each stage and comparing it with precision, we can
identify improvement opportunities in the recommendation
process.</p>
        <p>In this example, we focus on job matches with positive
outcomes on Indeed. Positive outcomes (PO), defined as per
section 4.1, are a composition of multiple types of positive
interactions between jobs and job seekers, such as interviews.
The total number of matches with a positive outcome
returned by our match providers serves as the numerator for
the recall metric. Under this definition, the recall at each
filter stage is 31%, 10%, 9%, and 3%, respectively.</p>
        <p>By combining this analysis with quality improvement at
each stage, we gain a clear understanding of the funnel’s
efectiveness. We can define the quality of each stage as
precision, which is the number of relevant matches retrieved
at this stage divided by the total number of matches at that
stage. While we expected precision to improve as recall
decreases through the funnel, we observed that the precision
drops from 0.04% to 0.02% at step 1 (figure 4), while recall
drops to 31% (figure 3). This suggests the need for further
investigation and quality improvement at this stage.</p>
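        <p>The per-stage computation can be sketched as below. The match counts are illustrative placeholders (chosen only to echo the recall sequence 31%, 10%, 9%, 3% and a precision drop between the first two stages), not Indeed's actual figures.</p>

```python
# Sketch of the funnel analysis in section 4.3: recall and precision
# at each stage, computed from per-stage match counts. Counts are
# illustrative only.
total_relevant = 10_000  # all matches with a positive outcome on Indeed

stages = [
    # (stage name, matches surviving the stage, relevant matches surviving)
    ("retrieval", 7_750_000, 3_100),
    ("filter 1",  5_000_000, 1_000),
    ("filter 2",  3_000_000,   900),
    ("ranking",     300_000,   300),
]

for name, n_matches, n_relevant in stages:
    recall = n_relevant / total_relevant   # coverage of relevant matches
    precision = n_relevant / n_matches     # quality of surviving matches
    print(f"{name}: recall={recall:.0%} precision={precision:.4%}")
```

        <p>Scanning the two columns together surfaces the anomaly described above: a stage where precision falls alongside recall is a candidate for quality improvement.</p>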
        <p>By applying this method, we can identify bottlenecks
and opportunities for enhancement of the recommendation
funnel, ensuring continuous improvement and higher
satisfaction for job seekers.</p>
        <p>We are also able to look into the ‘missed recall’ of our
recommendation system. This represents relevant matches that
were not delivered to job seekers through our
recommendation system, but were found when the job seekers
manually searched for jobs on Indeed. In one study,
we sampled the dataset of retrieval recall described in
section 3.1 and analyzed 735 matches which were relevant but
not successfully retrieved by the match providers in our
email recommendation system. By analyzing the patterns
of missed relevant matches (figure 5), we found
opportunities to optimize our retrieval strategy across features such
as job age and job title.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Future Directions</title>
      <p>In this section we list a few directions for future exploration.
• While forward and backward recall are powerful
tools for us to better understand the performance
of the retrieval stage and other intermediate steps, in
practice we monitor not only these recall variants
but also traditional online business metrics, such
as click-through rate, to make sure online business
metrics are not harmed. Studying the correlation
between recall and business metrics could help
reduce the total number of metrics in an experiment
and simplify the decision-making process.
• As mentioned earlier, one important assumption
behind this work is that relevant matches do not
solely come from our recommendation system, but
also include other channels, such as external links,
searches, social media, etc. This assumption would be
violated if relevant matches delivered by the
recommendation system dominated the pool of all relevant
matches. It would be interesting to explore different
methods of defining the relevance of a match regardless
of the delivery channel.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this work, we presented a case study on how we adapted
the recall@k metric for use in an online setting to
measure the effectiveness of the early stages within our
recommender system for job recommendations. Compared to
business metrics, which usually only capture the impact of
recommendations that were actually delivered to users, this
nuanced measurement of performance at each step ensures
that enhancements to the system are accurately targeted. It
also serves as a diagnostic tool that identifies missed
opportunities and bottlenecks within the recommendation funnel.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] <string-name><given-names>K.</given-names> <surname>Higley</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>Oldridge</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Ak</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Rabhi</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>de Souza Pereira Moreira</surname></string-name>,
          <article-title>Building and deploying a multistage recommender system with Merlin</article-title>,
          <source>in: Proceedings of the 16th ACM Conference on Recommender Systems, RecSys ’22</source>,
          Association for Computing Machinery, New York, NY, USA,
          <year>2022</year>, pp. <fpage>632</fpage>-<lpage>635</lpage>.
          URL: https://doi.org/10.1145/3523227.3551468. doi:10.1145/3523227.3551468.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2] <string-name><given-names>J.</given-names> <surname>Beel</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Genzmehr</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Langer</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Nürnberger</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Gipp</surname></string-name>,
          <article-title>A comparative analysis of offline and online evaluations and discussion of research paper recommender system evaluation</article-title>,
          <source>in: Proceedings of the International Workshop on Reproducibility and Replication in Recommender Systems Evaluation, RepSys ’13</source>,
          Association for Computing Machinery, New York, NY, USA,
          <year>2013</year>, pp. <fpage>7</fpage>-<lpage>14</lpage>.
          URL: https://doi.org/10.1145/2532508.2532511. doi:10.1145/2532508.2532511.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3] <string-name><given-names>K.</given-names> <surname>Krauth</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Dean</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Zhao</surname></string-name>,
          <string-name><given-names>W.</given-names> <surname>Guo</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Curmei</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Recht</surname></string-name>,
          <string-name><given-names>M. I.</given-names> <surname>Jordan</surname></string-name>,
          <article-title>Do offline metrics predict online performance in recommender systems?</article-title>,
          <year>2020</year>. URL: https://arxiv.org/abs/2011.07931. arXiv:2011.07931.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] <string-name><given-names>L.</given-names> <surname>Carnevali</surname></string-name>,
          <article-title>Evaluation measures in information retrieval</article-title>,
          <year>2023</year>. URL: https://www.pinecone.io/learn/offline-evaluation/.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5] <string-name><given-names>P.</given-names> <surname>Agrawal</surname></string-name>,
          <article-title>Building a large-scale recommendation system: People you may know</article-title>,
          <year>2024</year>. URL: https://www.linkedin.com/blog/engineering/recommendations/building-a-large-scale-recommendation-system-people-you-may-know.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>