<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evaluating LLM Behavior in Hiring: Implicit Weights, Fairness Across Groups, and Alignment with Human Preferences</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Morgane Hofmann</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emma Joufroy</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Warren Jouanneau</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marc Palyart</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Charles Pebereau</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Malt</institution>
          ,
          <addr-line>75009 Paris</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>General-purpose Large Language Models (LLMs) show significant potential in recruitment applications, where decisions require reasoning over unstructured text, balancing multiple criteria, and inferring fit and competence from indirect productivity signals. Yet, it is still uncertain how LLMs assign importance to each attribute and whether such assignments are in line with economic principles, recruiter preferences or broader societal norms. We propose a framework to evaluate an LLM's decision logic in recruitment, by drawing on established economic methodologies for analyzing human hiring behavior. We build synthetic datasets from real freelancer profiles and project descriptions from a major European online freelance marketplace and apply a full factorial design to estimate how an LLM weighs different match-relevant criteria when evaluating freelancer-project fit. We identify which attributes the LLM prioritizes and analyze how these weights vary across project contexts and demographic subgroups. Finally, we explain how a comparable experimental setup could be implemented with human recruiters to assess alignment between model and human decisions. Our findings reveal that the LLM weighs core productivity signals, such as skills and experience, but interprets certain features beyond their explicit matching value. While the LLM shows minimal average discrimination against minority groups, intersectional effects reveal that productivity signals carry different weights between demographic groups.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Person-job Fit</kwd>
        <kwd>Fairness</kwd>
        <kwd>Interpretability</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Generative Large Language Models (LLMs) are increasingly
used in hiring for diverse tasks such as candidate evaluation,
job matching, applicant ranking, and skill assessment, both
for ad-hoc decisions by individual recruiters and at scale
within recruitment pipelines. Their ability to process
unstructured text, such as résumés and job descriptions, makes
them particularly suited for recruitment tasks, where subtle
signals of suitability, such as employer prestige, inferred
skills, or contextual fit, matter.</p>
      <p>Despite their promise, LLMs raise concerns about
interpretability, fairness, and alignment with human
decision-making. Recruitment decisions involve complex trade-offs
between multiple factors such as wages, competence, and
work arrangement, and it remains unclear whether LLMs
weigh these factors in ways that reflect recruiter preferences
or societal values. This early-stage position paper addresses
that gap by proposing a framework to analyze an LLM’s
implicit decision logic in recruitment contexts. This
contribution is intended as a first step in a broader research
effort to analyze the implicit reasoning of LLMs and their
alignment with humans in hiring decisions.</p>
      <p>The paper addresses the following research questions:</p>
      <sec id="sec-1-1">
        <title>RQ1: Which aspects of candidate profiles and job context does the LLM emphasize, downplay, or overlook in its hiring recommendations?</title>
      </sec>
      <sec id="sec-1-2">
        <title>RQ2: Does the LLM’s decision logic systematically vary across socio-demographic groups in ways that could result in unequal hiring outcomes?</title>
      </sec>
      <sec id="sec-1-3">
        <title>RQ3: How can our experimental framework enable systematic comparison between LLM and human recruiter decision logic in future studies?</title>
        <p>We investigate these questions using data from a large
European freelance marketplace where recommender systems
help clients<sup>1</sup> identify relevant freelancers. Our
methodology adapts a standard experimental approach from labor
economics used to study human recruiters’ hiring decisions
by placing the LLM in the role of a recruiter. Using a full
factorial design, we generate a synthetic dataset by
independently varying attributes (e.g., skills, daily rates, work
arrangements) from real freelancer profiles, project
descriptions, and recruiter characteristics, and then prompt an LLM
to evaluate freelancer-project matches. This setup allows
us to estimate the causal impact of each attribute on the
model’s evaluation behavior. Our methodology is general
and applies to any context in which recruiters evaluate
candidates using unstructured text such as résumés and job
descriptions. Using this framework, we analyze the results
both in aggregate and across subgroups, based on variations
in freelancer, recruiter, and brief characteristics, to better
understand the heterogeneity in the LLM’s decision logic.</p>
        <p>Our main findings are as follows. The LLM’s scoring
behavior broadly aligns with standard economic theory: it
rewards close matches between freelancer profiles and project
briefs, as well as signals of trust and competence such as
platform reputation and relevant industry experience. The
strongest penalties are applied to freelancers with
insufficient experience or no prior activity on the platform. In
contrast, the LLM places minimal weight on socio-demographic
characteristics (gender, ethnicity, education) and former
employer characteristics like firm size or industry. However,
heterogeneity analysis reveals that LLM scoring varies with
perceived gender, ethnicity, and educational background,
suggesting that it uses demographic information to infer
underlying competence or suitability for the position.</p>
        <p>The remainder of this paper is organized as follows.
Section 2 reviews related work. Section 3 describes our
experimental methodology, including the platform context, design
principles, and LLM scoring procedure. Section 4 presents
our main findings on LLM decision-making patterns and
subgroup analyses, and discusses how the methodology can be
extended for systematic comparison with human recruiters.
Section 5 concludes with implications and directions for
future research.</p>
        <p><sup>1</sup> [...] interchangeably to refer to individuals or companies seeking to hire freelancers on
the platform. Similarly, “freelancers” and “candidates” refer to those
offering their services.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        A rich economics literature conceptualizes hiring as the
interpretation of signals such as education, experience, or
reputation to infer candidate productivity under uncertainty
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In online labor markets, additional information such as
ratings or platform reputation also comes into play [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ].
Meanwhile, persistent discrimination is well documented
in the economic literature, with several field experiments
showing that even when qualifications and experiences are
held constant, demographic attributes such as gender or
ethnic origin can affect both labor market outcomes and
recruiters’ attention [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]. While these studies have shown
how human recruiters use signals and where biases arise,
less is known about how automated systems and especially
LLMs interpret these same attributes in the recruitment
context. Our work explores whether an LLM relies on
similar signals by testing a diverse set of features identified
by this literature, including productivity indicators
(education, experience, and skills), demographic markers
(gender-associated names, ethnic origin), and platform-specific
signals (ratings or reputation badges).
      </p>
      <p>
        As LLMs are increasingly deployed in sensitive domains
like recruitment, concerns about transparency and fairness
have led to the development of auditing protocols, mainly
in the machine learning literature. Empirical studies reveal
that LLMs can detect demographic signals even in the
absence of explicit group labels [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. This literature has also
proposed interpretability tools such as LIME [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and SHAP
[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], and counterfactual approaches [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], each with strengths
and limitations. For attribution methods, causal
interpretation often requires additional assumptions [
        <xref ref-type="bibr" rid="ref10 ref11 ref12">10, 11, 12</xref>
        ]. Our
approach contributes to this methodological literature by
employing a full factorial experimental design to enable
causal measurement of how LLMs process candidate
characteristics in recruitment settings. We construct synthetic
profiles closely following [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] while incorporating attributes specific to
freelancers. Our approach produces robust,
interpretable, and causal estimates of the implicit weights given
to candidate characteristics. By systematically
manipulating candidate attributes, we eliminate potential confounding
variables and enable direct causal inference regarding the
model’s decision-making process. This methodology draws
heavily from correspondence studies, an established method
also used in economics for eliciting human recruiter
preferences.
      </p>
      <p>
        Comparative studies have begun to examine the
similarities and differences between human recruiters and LLMs in
candidate screening. Some works report that AI-based tools
can outperform humans in identifying relevant candidates,
while others find misalignments in the weighting of
credentials or the treatment of demographic variables [
        <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
        ]. Yet,
these comparisons often rely on observational data or
different settings, limiting causal inference and comparisons.
To address these limitations, our experimental framework
can be readily extended to human recruiters, by presenting
them with the same synthetic candidate profiles and project
descriptions used for LLM evaluation. Such methodology
lays the groundwork for rigorous alignment assessments
and deeper understanding of both human and AI-driven
recruitment logic.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>This section presents our methodology for analyzing how
LLMs evaluate candidate-project matches. We begin by
introducing the platform context and occupational focus of
our analysis, then outline the design principles underlying
our full factorial approach. We describe the generation of
synthetic project briefs and freelancer profiles and conclude
with details of the LLM-based scoring procedure.</p>
      <sec id="sec-3-2">
        <title>3.1. Platform Context and Occupational Focus</title>
        <p>Our study is situated in the context of a large European
online freelance marketplace that connects clients with
freelancers across diverse occupations and geographies. The
platform relies on recommender systems to help clients
identify suitable candidates, using freelancer profiles and
platform histories, client characteristics and project
descriptions, to generate ranked freelancer recommendations.</p>
        <p>To limit dataset size while maintaining relevance, we
focus our analysis on full-stack developers in France. This
occupation-country combination forms the largest market
segment on this platform, with high supply and demand,
and exhibits a strong gender skew, with approximately 90%
of freelancers have male-identified first names. This makes
it a particularly suitable setting for exploring potential
gender bias in model evaluations. To assess generalizability
beyond the tech sector, we also analyze search engine
optimization (SEO) content writing, which has similar platform
prominence but reversed gender distribution. Results are
presented in Appendix D. While our analysis focuses on
this specific platform and occupations, our methodology
generalizes to any recruitment setting where evaluators
assess candidates based on unstructured text inputs such as
résumés and job descriptions. The following sections detail
our experimental design.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.2. Design Principles and Objectives</title>
        <p>To analyze how LLMs weigh different freelancer attributes,
we construct a synthetic dataset using a full factorial design.
For each profile and project characteristic, we define a set
of realistic attribute values based on data from the online
freelancing platform. We then systematically generate all
possible combinations of these attributes to create complete
freelancer profiles and project briefs, yielding a balanced
dataset where attributes vary independently across all
combinations. Further construction details are provided in the
following subsections.</p>
        <p>This controlled design breaks natural correlations found
in real-world data, allowing us to isolate the marginal
effect of each attribute on the model’s output and enable
causal identification. Compared to simpler randomization
approaches, full factorial design ensures complete coverage
of the attribute space and balanced representation of each
combination, maximizing statistical power to estimate both
main and interaction effects. While this approach limits
external validity since our dataset is not representative of
real-world distributions, this trade-off is intentional. Our
objective is not representativeness but rather the creation
of optimal conditions for isolating and analyzing the LLM’s
underlying decision logic. Observational analyses of
platform data, though externally valid, suffer from feature
correlations and confounding variables that preclude causal
identification.</p>
        <p>
          Our approach draws inspiration from correspondence
studies, a well-established methodology in labor economics
for eliciting recruiter preferences [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] and detecting
discrimination in hiring [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. These studies submit fictitious
résumés to real job postings, systematically varying
characteristics like names or qualifications to reveal hiring
preferences. However, they are typically constrained by the costs
of human evaluation. LLM evaluation significantly reduces
these costs, making it feasible to employ a full factorial
design, which maximizes statistical power.
        </p>
      </sec>
      <sec id="sec-3-4">
        <title>3.3. Synthetic Recruiter Briefs</title>
        <p>We construct synthetic project briefs that replicate the
information recruiters provide when seeking freelancers on
the platform and that recommender systems use to evaluate
profile fit. On the platform, project briefs typically include
a description, required skills and experience, expected
duration, work arrangements, and compensation. To replicate
this structure, we first define a set of plausible values for key
demand-side attributes, based on real platform data, then
generate all possible combinations, ensuring that each
attribute varies independently.</p>
        <p>To limit the size of our dataset, while ensuring
comparability across briefs, we vary some dimensions while holding
others constant. The complete specification is provided in
Table 1b in the Appendix.</p>
        <p>The following dimensions are varied:</p>
        <list list-type="bullet">
          <list-item>
            <p>Recruiter identity: Recruiter first names signal gender using common European male or European female names.</p>
          </list-item>
          <list-item>
            <p>Firm size: Projects originate either from small businesses (SMEs) or large corporations (as signaled in the brief text).</p>
          </list-item>
          <list-item>
            <p>Work conditions: Briefs specify preferences for remote vs. on-site work and full-time vs. part-time engagement.</p>
          </list-item>
        </list>
        <p>The following dimensions are held constant: 6-month project
duration, €400 daily rate, JavaScript/TypeScript with Node.js
as the required technology stack, and a minimum of 5 years
of experience. This factorial design results in 16 unique
briefs, covering all possible combinations of the variable
dimensions.</p>
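        <p>As an illustration, the brief construction described above can be sketched in a few lines of Python. This is a minimal sketch and not the code used in the study: the dictionary keys, the level labels, and the helper build_briefs are hypothetical names introduced here, while the varied and constant dimensions follow the text.</p>
        <preformat>
```python
from itertools import product

# Varied dimensions of the synthetic briefs, as listed in Section 3.3.
# Keys and level labels are illustrative placeholders (assumptions).
VARIED = {
    "recruiter_gender": ["male", "female"],
    "firm_size": ["SME", "large corporation"],
    "location": ["remote", "on-site"],
    "workload": ["full-time", "part-time"],
}

# Dimensions held constant across all briefs.
CONSTANT = {
    "duration_months": 6,
    "daily_rate_eur": 400,
    "stack": "JavaScript/TypeScript with Node.js",
    "min_experience_years": 5,
}


def build_briefs(varied, constant):
    """Return one brief (a dict) per combination of the varied levels."""
    names = list(varied)
    briefs = []
    for levels in product(*(varied[name] for name in names)):
        brief = dict(constant)
        brief.update(zip(names, levels))
        briefs.append(brief)
    return briefs


briefs = build_briefs(VARIED, CONSTANT)
print(len(briefs))  # 2 * 2 * 2 * 2 = 16 briefs
```
        </preformat>
        <p>Because every level of one dimension is crossed with every level of the others, each level appears in exactly half of the briefs, which is what makes the dimensions vary independently.</p>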
      </sec>
      <sec id="sec-3-5">
        <title>3.4. Synthetic Freelancer Profiles</title>
        <p>We construct synthetic freelancer profiles to reflect the
information that recommendation models use to evaluate
candidate suitability for recruiter briefs. Each profile consists
of typical sections found on the platform, such as skills,
professional experience, work preferences, platform
reputation, etc. To build these profiles, we first define a list of
realistic values for each attribute based on real freelancer
data. We then apply a full factorial design to generate all
possible combinations, ensuring that each attribute varies
independently. The complete list of profile attributes and
values is provided in Table 1a in the Appendix.</p>
        <p>Each profile varies along the following dimensions:</p>
        <list list-type="bullet">
          <list-item>
            <p>Skills: Technical stacks are set to exactly match, be a close substitute for, or differ substantially from the brief’s requirement (JavaScript/TypeScript with Node.js, NestJS, or Angular, respectively).</p>
          </list-item>
          <list-item>
            <p>Experience level: Years of experience are fixed at 1, 5, or 9 years, representing junior, mid-level, and senior profiles.</p>
          </list-item>
          <list-item>
            <p>Work arrangements: Preferences for full-time vs. part-time and remote vs. on-site work are varied.</p>
          </list-item>
          <list-item>
            <p>Reputation<sup>2</sup>: Reputation is captured through combinations of the number of completed projects (0, 1, or 5), average recruiter rating (0 or 5 stars), and badge presence (yes or no)<sup>3</sup>.</p>
          </list-item>
          <list-item>
            <p>Daily rates: Five daily rate levels ranging from €300 to €500.</p>
          </list-item>
          <list-item>
            <p>Work history: Prior experience varies by employer size (SME vs. large company) and industry (e-commerce vs. banking/insurance).</p>
          </list-item>
          <list-item>
            <p>Socio-demographic attributes: First names signal perceived gender and ethnicity, with three categories: European male, European female, and Arabic male<sup>4</sup>. Education levels are set to bachelor’s or master’s degrees, the most common among freelancers in this category.</p>
          </list-item>
        </list>
        <p>This procedure yields 10,800 unique profiles that cover
all possible attribute combinations.</p>
      </sec>
      <sec id="sec-3-6">
        <title>3.5. LLM Scoring Procedure</title>
        <p>Procedure. Each generated profile is systematically matched
with each generated project brief, creating 172,800 unique
profile-brief combinations. For every pair, we prompt
Gemini 2.0 Flash<sup>5</sup> to evaluate candidate-project fit by assigning a
hiring probability on a 10-point scale (e.g., 2.5 meaning a 25%
probability). The model receives both the freelancer profile
and project brief as input (see Appendices A.1 and A.2), with
all interactions conducted in French.</p>
        <p>
          Using the hiring probability as a target allows us to assess
how the LLM synthesizes and weighs recruitment
criteria beyond technical matching. The 10-point probability
scale reflects current practices in LLM-based résumé
evaluation [
          <xref ref-type="bibr" rid="ref14 ref17 ref18 ref6">14, 17, 6, 18</xref>
          ]. The prompt also asks the model to
produce a brief explanation of its reasoning, which has been
shown to enhance performance on complex evaluation tasks
[
          <xref ref-type="bibr" rid="ref19">19</xref>
          ].
        </p>
        <p>Moreover, the model receives no instructions to correct
for potential biases or to adopt a fairness-aware approach.
This is deliberate, as our goal is to observe the model’s default
behavior and document implicit biases when evaluating
equivalent candidates. To account for potential
stochasticity in LLM outputs, we repeat the scoring three times
per profile-brief pair and use the mean score for all
subsequent analyses<sup>6</sup>. The complete prompt is provided in
Appendix A.3.</p>
        <p><sup>2</sup> Reputation signals are commonly used on freelancing platforms, where
trust signals significantly influence recruiter decision-making.</p>
        <p><sup>3</sup> We define five distinct reputation profiles based on combinations of
these signals. Note that freelancers with 0 completed projects cannot
hold a badge or receive ratings.</p>
        <p><sup>4</sup> We exclude Arabic female names due to their extremely low prevalence
in the platform’s full-stack developer segment.</p>
        <p><sup>5</sup> We selected this model based on three criteria: cost-efficiency for
large-scale experiments, multilingual capabilities for French interactions,
and its actual usage on the studied freelancing platform.</p>
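        <p>The aggregation over repeated runs can be illustrated as follows. This is a hedged sketch under stated assumptions: the function aggregate_runs, the pair identifiers, and the toy scores are hypothetical and only show how a mean score per pair and the share of pairs with run-to-run variation could be computed; the LLM call itself is not shown.</p>
        <preformat>
```python
from statistics import mean


def aggregate_runs(runs_by_pair):
    """Average repeated scores per profile-brief pair.

    runs_by_pair maps a (profile_id, brief_id) tuple to the list of scores
    returned by the repeated LLM calls (three per pair in the paper).
    Returns the mean score per pair and the share of pairs whose runs differ.
    """
    mean_scores = {pair: mean(runs) for pair, runs in runs_by_pair.items()}
    unstable = sum(1 for runs in runs_by_pair.values() if len(set(runs)) != 1)
    share_unstable = unstable / len(runs_by_pair)
    return mean_scores, share_unstable


# Toy input: hypothetical ids and scores, not data from the study.
toy_runs = {
    ("profile_1", "brief_1"): [7, 7, 7],
    ("profile_2", "brief_1"): [6, 7, 6],
    ("profile_3", "brief_1"): [4, 4, 4],
    ("profile_4", "brief_1"): [8, 8, 8],
}
mean_scores, share_unstable = aggregate_runs(toy_runs)
print(mean_scores[("profile_1", "brief_1")], share_unstable)
```
        </preformat>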
        <p>Discussion. Our results may be influenced by several
methodological choices in prompt design and profile
formatting. First, prompting for hiring probability may prime
the model toward specific evaluation criteria compared to
alternative framings (e.g., “quality score” or “fit assessment”)
and could lead the LLM to incorporate factors beyond
recruiters’ preferences, such as the freelancer’s likelihood of
accepting the project. Second, while our synthetic profile
format replicates the platform’s interface design, certain
features receive more detailed descriptions than others, and
profiles appear more structured than typical real-world
résumés; both factors may shape evaluation patterns. Third,
alternative evaluation procedures such as binary decisions
or ranking tasks may activate different reasoning paths and
represent important avenues for future research. We present
a preliminary analysis of ranking prompts in Appendix C.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>We present the main findings from the evaluation
methodology and dataset introduced in Section 3. First, we quantify
the implicit weights the LLM uses when scoring profile-brief
matches, providing a global perspective on its evaluation
logic. Next, we explore heterogeneity in scoring across
subgroups on the supply side, focusing on gender, ethnicity,
and educational background. We then investigate how the
LLM’s scoring varies with the demand-side context, such as
recruiter’s company size (SME vs. large corporation) and
project work arrangements (onsite vs. remote, full-time vs.
part-time). Finally, we show how our experimental
framework can be easily extended to human recruiter evaluations,
enabling systematic comparisons of LLM and human
decision logic.</p>
      <sec id="sec-4-2">
        <title>4.1. Overall Effects of Attributes on LLM Scoring</title>
        <p>
          Estimation Strategy. We estimate the LLM’s implicit
weights using Ordinary Least Squares (OLS) as a
descriptive analytical tool. Specifically, we regress the average
score assigned to each profile-brief pair on freelancer
characteristics, clustering standard errors at the brief level to
account for within-brief correlations. Importantly, this
approach does not assume that the LLM itself follows a linear
model; instead, we use OLS as a convenient tool to
summarize average effects, estimate standard errors, and report
interpretable coefficients. We follow established practices
in the model attribution literature, where interpretable linear
approximations are used to interpret complex non-linear
systems [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Given our fully randomized factorial design,
OLS coefficients can be interpreted in our context as the
marginal causal effect of each attribute on the LLM’s
scoring decision<sup>7</sup>. Further details on model specification and
covariate construction are provided in Appendix B. The
distribution of scores is centered at 6.5 with a standard
deviation of 1.3, and scores range from 3 to 9.<sup>8</sup>
        </p>
        <p><sup>6</sup> Only 2.48% of pairs show variation across runs, with maximum changes
of 1 point.</p>
        <p><sup>7</sup> Similarly to differences in conditional means.</p>
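        <p>To make the link between the factorial design and the interpretation of the coefficients concrete, the sketch below simulates scores on a small, balanced full factorial grid and recovers each attribute's effect as a difference in conditional means relative to a reference level. In a balanced design this difference coincides with the main-effects OLS coefficient. Everything in the example is an assumption made for illustration: the attribute names, the levels, the baseline, and the penalty values are made up and are not the paper's estimates, and clustered standard errors are not computed.</p>
        <preformat>
```python
from itertools import product
from statistics import mean

# Hypothetical attribute levels; the first level of each attribute is the reference.
LEVELS = {
    "experience": ["matched", "below"],
    "skills": ["exact", "close", "distant"],
    "reputation": ["high", "none"],
}

# Made-up additive penalties used only to simulate scores for this sketch.
TRUE_EFFECTS = {
    ("experience", "below"): -2.0,
    ("skills", "close"): -0.25,
    ("skills", "distant"): -0.75,
    ("reputation", "none"): -1.0,
}


def simulate_scores(levels, effects, baseline=9.0):
    """Score every combination of levels with a purely additive rule."""
    names = list(levels)
    rows = []
    for combo in product(*(levels[name] for name in names)):
        profile = dict(zip(names, combo))
        score = baseline + sum(effects.get(item, 0.0) for item in profile.items())
        rows.append((profile, score))
    return rows


def marginal_effects(rows, levels):
    """Difference in mean score between each level and its reference level."""
    effects = {}
    for name, values in levels.items():
        ref_mean = mean(s for p, s in rows if p[name] == values[0])
        for value in values[1:]:
            level_mean = mean(s for p, s in rows if p[name] == value)
            effects[(name, value)] = level_mean - ref_mean
    return effects


rows = simulate_scores(LEVELS, TRUE_EFFECTS)
estimates = marginal_effects(rows, LEVELS)
for key, value in sorted(estimates.items()):
    print(key, round(value, 3))
```
        </preformat>
        <p>Because the grid is balanced, the other attributes average out identically within every level, so each difference recovers the simulated penalty. With correlated observational data this would generally not hold.</p>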
        <p>Figure 1 presents the estimated OLS coefficients, showing
how the LLM’s scores vary with each attribute, relative to
a reference profile. The reference profile represents a
(presumably) strong candidate from the majority demographic:
a European male with a master’s degree, perfectly matched
technical skills and experience level, an aligned daily rate,
high platform reputation, relevant industry background,
aligned work-arrangement preferences, and prior
experience in a large company. Each coefficient represents the
score adjustment when a candidate attribute deviates from
this reference candidate.</p>
        <p>Table 1 summarizes the relative importance of each
attribute in the LLM’s decision logic. For each attribute, we
report the largest coefficient observed across all its possible
values (for example, the maximum coefficient among all
daily rate levels). These coefficients represent the maximum
causal effect of varying an attribute from the reference
profile. The reported values and their rankings are presented
in decreasing order of absolute magnitude.</p>
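        <table-wrap>
          <table>
            <thead>
              <tr><th>Group</th><th>Skills</th><th>Exp.</th><th>Remote</th><th>P-time</th><th>Rep.</th><th>Rate</th><th>Firm</th><th>Industry</th><th>Educ.</th><th>Female</th><th>Arabic</th></tr>
            </thead>
            <tbody>
              <tr><td>Max effect</td><td>-0.639</td><td>-2.215</td><td>-0.550</td><td>-0.069</td><td>-1.129</td><td>-0.542</td><td>-0.085</td><td>-0.094</td><td>-0.076</td><td>+0.01002</td><td>+0.01011</td></tr>
              <tr><td>Rank (max effect)</td><td/><td/><td/><td/><td/><td/><td/><td/><td/><td/><td/></tr>
            </tbody>
          </table>
        </table-wrap>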
        <p>While our approach does not assume any specific
functional form, the high explanatory power of the OLS
regression (<italic>R</italic><sup>2</sup> = 0.90) suggests that the LLM behaves, in this
context, as if it applies a weighted sum of profile attributes.<sup>9</sup></p>
        <p><sup>8</sup> The distribution of scores is consistent with our synthetic data design:
all profiles are designed to be at least minimally relevant to the briefs,
which limits the occurrence of very low scores, and the lack of perfect
scores (10) likely reflects conservative scoring tendencies of the LLM.</p>
        <p><sup>9</sup> We do not interpret the <italic>R</italic><sup>2</sup> as a measure of predictive accuracy, but
rather as an indication of how well an additive linear model
approximates the LLM’s scoring behavior. A Random Forest trained on the
same inputs achieves an in-sample <italic>R</italic><sup>2</sup> of 0.93, only slightly higher than
the linear model. This small gain suggests that, while some nonlinear
interactions may exist, the LLM’s evaluation logic is largely additive
in practice.</p>
        <p>Interpretation of Results. This section interprets the
main patterns from Figure 1.</p>
        <p>Skill and experience level are important attributes. The
LLM penalizes skill mismatches, with larger penalties for
distant substitutes (-0.7 points) than for close ones (-0.3 points),
suggesting that the LLM understands these nuances. The
LLM provides a modest reward (+0.07 points) for profiles
with a higher experience level than required, but strongly
penalizes those with an experience level below requirements
(-2.25 points). The penalty on profiles with a lower
experience level than required is the strongest across all attributes,
and the penalty on distant skills mismatch is the third largest.</p>
        <p>Work conditions, such as remote or part-time
availability, have different impacts on the LLM’s scoring. Profiles
indicating a preference for remote work are heavily
penalized when the brief requires on-site presence (-0.5 points), a
magnitude comparable to that for skills mismatch.
Interestingly, preferences for part-time work have only a small and
statistically insignificant effect (-0.09 points). The LLM’s
explanations (e.g., “schedule can be adjusted during
negotiations”) suggest that such arrangements are considered
negotiable.</p>
        <p>
          Platform reputation is an important attribute.
Candidates with no visible reputation incur the second largest
penalty (-1.2 points), while even minimal signals (e.g., one
past project) roughly halve this penalty (-0.56 points).
Interestingly, the badge appears as a particularly strong signal,
on par with multiple completed projects. These findings are
consistent with the literature on reputation in online labor
markets, which shows that recruiters rely heavily on trust
signals [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
        <p>
          Daily rates are also important for the LLM, and deviations
from the recruiter’s target are penalized with magnitudes
comparable to those applied for skills mismatch (between
-0.3 and -0.6 points). Interestingly, weights exhibit an
inverted U-shaped relationship with daily rates, penalizing
daily rates both lower and higher than the target. The model
seems to interpret low rates as signs of inexperience or low
confidence (e.g., “rate is below the offer, which may indicate
limited experience”), consistent with economic signaling
theory [
          <xref ref-type="bibr" rid="ref20 ref21">20, 21</xref>
          ].
        </p>
        <p>Firm size and industry in the freelancer’s work
history have a modest impact on the LLM’s
evaluations. Profiles with experience in the recruiter’s industry
or at large established firms receive small positive rewards
(+0.05 and +0.04 points, respectively), with a magnitude
about ten times lower than for skills mismatch. This
suggests that the LLM values these attributes in a freelancer’s
work history, but only to a modest extent.</p>
        <sec id="sec-4-2-1">
          <p>Sociodemographic features have the lowest effect on
the LLM’s scoring. Profiles with a bachelor’s degree are
slightly penalized compared to master’s holders (-0.06 points).
Names perceived as female or Arabic-sounding have
positive but small and statistically insignificant effects on scores
(+0.002 and +0.001, respectively).</p>
          <p>Overall, the LLM’s scoring behavior appears consistent
with standard economic theory: it rewards close alignment
between freelancer profiles and project requirements, and it
values clear signals of trust (such as platform reputation) and
competence (including higher education, relevant industry
background, and experience at large or well-known firms).</p>
          <p>In quantitative terms, the LLM strongly relies on skills,
daily rates, and workplace preferences. In contrast, it
assigns minimal importance to the characteristics of firms
from the freelancer’s working history (size and industry),
part-time versus full-time availability, and socio-demographic
characteristics such as gender, ethnicity, or education level.
The LLM most strongly penalizes freelancers with
insufficient work experience and those with no prior activity on
the platform.</p>
          <p>Some patterns raise deeper questions about how the LLM
interprets certain signals. For instance, all deviations in
daily rates are heavily penalized. While it is intuitive that,
all else equal, higher daily rates are undesirable for recruiters
because they induce higher costs, the rationale for
penalizing freelancers with lower daily rates is not straightforward.
Low daily rates may signal lower competence or shorter
experience as a freelancer and can thus justify that some
recruiters would prefer to avoid these candidates. But they
could also be explained by strategic positioning or financial
urgency, in which case disregarding these profiles may be
inefficient for both recruiters and freelancers. Given the
magnitude of the penalties, future work should explore how
LLMs infer such signals.</p>
          <p>Finally, assessing whether the magnitudes of the weights
align with human preferences and societal values requires
a comparison with human recruiters. We will provide
preliminary insights on this in Section 4.4.</p>
          <p>Robustness Checks. To test the robustness of our
findings across different evaluation contexts, we replicate the
analysis in a ranking task where the model directly
compares profiles side by side rather than scoring them
individually. While the overall structure remains similar, we
observe notable shifts in feature importance, particularly a
stronger reliance on skill matching and socio-demographic
characteristics, which suggests that some biases may emerge
only in relative evaluation settings (see Appendix C for
detailed results). We also extend our analysis to the
communication sector using SEO copywriting briefs and adapted
candidate profiles (Appendix D). The results exhibit similar
patterns, with stronger penalties for skill mismatch and the
emergence of a slight average gender bias.</p>
        </sec>
      </sec>
      <sec id="sec-4-4">
        <title>4.2. Subgroup Analysis of LLM Scores by Freelancer Characteristics</title>
        <p>To extend the analysis of sociodemographic features, we
focus here on three key axes of heterogeneity: perceived
gender, perceived ethnic origin, and education level.</p>
        <p>Estimation Strategy. In the previous section, we
estimated the overall contribution of candidate characteristics
(e.g., skills, experience, rates) to LLM scoring by running
an OLS regression across all profiles. We concluded that
demographic indicators do not have, on average, a significant
impact on scoring. We now turn to more subtle forms of
bias, asking the following question: Does the LLM apply
different evaluation rules depending on the candidate’s
demographic group? In other words, we investigate whether
each candidate characteristic (e.g., experience, skills,
pricing) has the same marginal impact on LLM scoring across
different demographic subgroups.</p>
        <p>[Figure 2 panels: (a) Female European vs. Male European; (b) Arabic vs. European Male; (c) Bachelor vs. Master Degree]</p>
        <p>To detect these patterns, we estimate a pooled OLS model
with interaction terms between candidate characteristics
and demographic indicators (gender, ethnic origin, and
education level). The coefficients on these interaction terms
reveal whether the importance of a given feature changes
depending on group membership (see Appendix B for
methodological details). Figure 2 presents the results. Each
coefficient indicates how much more strictly or leniently a specific
characteristic is weighted for a given group, relative to the
reference group. Our analysis reveals that the LLM does not
apply uniform evaluation standards across all candidates.</p>
        <p>Interpretation of Results. This section interprets the key
patterns revealed in Figure 2.</p>
        <p>Female European profiles receive no baseline scoring
penalty (coefficient close to 0), yet the model applies
different evaluation criteria to them, which creates a
complex pattern of differential treatment. Women face stricter standards
on industry alignment and are also penalized more for
underpricing. Conversely, the model shows greater leniency
toward educational gaps (a bachelor’s degree is less penalized)
and lack of experience. Interestingly, overpricing is
penalized less for female candidates, possibly reflecting
assumptions about greater recruiter bargaining power or
perceived underconfidence.</p>
        <p>Arabic male profiles face a baseline penalty (–0.026
pts or –0.31%) when all characteristics match the brief
perfectly. Yet, the model shows more tolerance on several key
dimensions. Experience gaps are less penalized (+0.071 pts
or +3.16%), which more than compensates for the difference in
baseline. Skill mismatches receive greater tolerance (+0.019 pts
for distant skills or +2.75%), and work arrangement
mismatches are treated more leniently (+0.029 pts or +5.77%).
Conversely, the model imposes stronger penalties for
weak platform reputation than it does for other groups: for
instance, –0.039 pts (or –12.51%) for “5 projects, no badge”,
and –0.068 pts (or –5.73%) for “0 project, 0 badge”. In
contrast, lower daily rates are treated more favorably: the
effect of a –100€ rate gap is +0.033 pts (or +6.89%) compared
to the reference.</p>
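        <p>As a worked illustration of how the baseline and interaction terms combine, assume, as in the interacted specification described in Appendix B, that the coefficients are additive. The net score gap between an Arabic male profile and a European male profile that both show the same experience gap, with all other attributes at their reference levels, would then be the sum of the baseline penalty and the experience-gap interaction:
          <disp-formula>
            <tex-math><![CDATA[-0.026 + 0.071 = +0.045 \text{ pts}]]></tex-math>
          </disp-formula>
        </p>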
        <p>Bachelor degree holders. Profiles holding only a
bachelor’s degree are penalized by nearly –0.10 points, a
substantial and significant reduction relative to otherwise
identical master-level candidates at baseline. On certain
criteria, bachelor-level candidates appear to be treated more
leniently. For instance, being less experienced is less
penalized than for master profiles (+0.066), and a +100€ daily
rate is less harshly sanctioned. Similarly, the remote
mismatch penalty is softer (+0.043). These findings suggest
that the model may lower its strictness for candidates with
lower formal education, potentially compensating for the
educational gap. However, this flexibility does not extend
to all aspects. Bachelor-level profiles are more penalized
for skill mismatches (–0.016 for a close match, –0.014 for a distant one).
Finally, a striking asymmetry appears when intersecting
education with demographic characteristics: bachelor profiles
with Arabic-sounding names face an additional penalty of
–0.011, while bachelor-level women receive a slight positive
adjustment of +0.014.</p>
        <p>These patterns reveal that the LLM
does not simply apply uniform criteria; it constructs different
evaluation schemes for different demographic groups, each
with distinct standards, tolerances, and expectations. This
systematic differentiation may perpetuate and amplify
existing labor market inequalities. While these interaction effects
are smaller in magnitude than the main effects of
professional characteristics (our baseline model already explains
90% of variance), their systematic nature demonstrates that
the LLM applies different evaluation criteria
depending on demographic group membership, creating
distinct pathways to achieving high scores across different
populations.</p>
      </sec>
      <sec id="sec-4-6">
        <title>4.3. Subgroup Analysis of LLM Scores by Recruiter and Brief Characteristics</title>
        <p>We analyze how brief characteristics systematically alter the
LLM’s evaluation patterns, examining differences across (a)
large firms vs. SMEs, (b) remote vs. on-site contracts, and (c)
full-time vs. part-time engagements (Figure 3), as well as the
recruiter’s perceived gender.</p>
        <p>Large Firm Briefs apply stricter standards. They impose
significantly higher penalties for experience gaps and show
greater sensitivity to pricing deviations than SMEs. The
latter is surprising, as large firms typically have greater
budget flexibility, suggesting that corporate environments may
prioritize market-rate alignment as a signal of candidate
sophistication rather than pure cost considerations. They
are also less flexible on work arrangements and less ready
to accept mismatches in remote or onsite preferences.</p>
        <p>[Figure 3 panels: (a) Large vs. SME; (b) Remote vs. Onsite Contract; (c) Full vs. Part Time Contract]</p>
        <p>Remote Contracts place significantly higher weights on
skill matching, experience, and platform reputation. In
remote settings, recruiters might demand stronger guarantees
of quality and reliability, as they cannot rely on direct
monitoring or face-to-face interactions. Interestingly, female
and Arabic profiles receive slightly lower ratings when the
contract is remote, suggesting that distance may exacerbate
uncertainty and amplify taste-based biases.</p>
        <p>Full Time Contracts penalize low-experience candidates
more heavily, likely because full-time engagements
represent greater commitment and require stronger credentials.
They show higher sensitivity to work frequency mismatches
but greater tolerance for remote work mismatches,
suggesting that location flexibility becomes more acceptable for
longer-term commitments.</p>
        <sec id="sec-4-6-1">
          <title>Perceived Recruiter’s Gender does not significantly affect evaluation patterns</title>
          <p>The weighting of key candidate
characteristics remains largely similar, suggesting limited
influence of perceived recruiter gender on scoring logic.</p>
        </sec>
      </sec>
      <sec id="sec-4-8">
        <title>4.4. Estimating Human Recruiter Evaluation Patterns</title>
        <p>
          To enable direct comparisons between LLM and human
decision-making, our methodology can be extended to
human recruiters. The experimental design for humans closely
mirrors that used for LLMs: recruiters are presented with
the same synthetic profiles and asked to evaluate fit with
their project. However, a key methodological challenge is
to elicit truthful responses from human evaluators in an
experimental context. To address this, the economic
literature advocates the use of an Incentivized Resume Rating
(IRR) experiment [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] in which recruiters are incentivized
to reveal their genuine preferences. In practice, recruiters
are informed that the experiment uses synthetic data and
that the recommendations they receive will be based on the
preferences they express during the evaluation. This
incentive structure encourages honest and thoughtful responses,
even in an artificial setting.
        </p>
        <p>Once responses are collected, the analysis of human
recruiters’ implicit weights follows the same procedure as for
the LLM, enabling direct and systematic comparison of
decision logic across the two settings. This approach provides a
robust and transparent framework for benchmarking
LLM-based recruitment decisions against human preferences, and
for informing alignment or fairness interventions as these
systems are deployed in practice.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>This paper introduced a methodological framework to audit
the implicit decision logic of large language models in hiring
scenarios. Using a full factorial design on synthetic
freelancer profiles and project briefs, we analyzed how Gemini
2.0 Flash scores candidates and how this scoring varies across
candidate subgroups and briefs. Our findings lead to three
key takeaways:
1. LLMs apply structured, grounded logic, but some
interpretations raise questions. The model
assigns strong and consistent weights to core
productivity signals such as skills, experience, and platform
reputation. However, features such as the daily rate
are interpreted beyond their explicit
matching value: both underpricing and overpricing are
penalized, potentially reflecting assumptions about
confidence or negotiation behavior. These
inferences, though potentially rational, are not always
grounded in explicit input and may introduce
unintended disparities.
2. Disparities in gender emerge through
interactions. We find little to no evidence of discrimination
against women or Arabic-sounding names in
average scores. Yet, intersectional effects reveal
concerning patterns: penalties are more severe for
minority candidates when combined with weaker signals
(e.g., lower reputation or partial skill match). This
highlights the need for fairness audits to consider
interaction effects, not just average score gaps.
3. Our methodology is readily applicable to
human decision-makers to assess alignment. The
experimental framework we introduce can be
directly used to analyze human recruiter decision
patterns, by presenting them with the same controlled
profiles and project descriptions as those shown to
the LLM. This enables systematic comparison of
how sensitive attributes and productivity signals are
weighted by both humans and models, facilitating
robust assessment of potential biases and alignment.</p>
      <sec id="sec-5-1">
        <title>5.1. Limitations &amp; Avenues for Future Work</title>
        <p>This study has several limitations that provide avenues for
future research.</p>
        <p>• Model generalizability. This paper focuses on
a single model (Gemini 2.0 Flash). As such, our
findings cannot be generalized to other
architectures, model sizes, or training paradigms.
Applying the same experimental framework to a broader
set of LLMs, especially comparing reasoning and
lightweight models, would help identify whether
observed behaviors are model-specific or structural.
• From scoring to ranking. While our primary
focus is on absolute scoring, hiring often involves
comparing multiple candidates. Preliminary evidence
suggests that in such ranking settings, the model
places stronger emphasis on certain implicit signals,
but a more formal analysis is needed to evaluate
how LLMs perform in ranking tasks. Future work
could test prompting strategies adapted to this
setting (pairwise, listwise, or setwise) to better capture
the model’s positional sensitivity and unintended
signal amplification.
• Prompt sensitivity. LLM outputs are known to
be highly sensitive to prompt formulation. While
we use a controlled and standardized prompt in this
study, future work could test the robustness of the
scoring logic to alternative phrasings. This could
include real prompts collected from users, helping
ground the evaluation in actual usage.
• Sectoral and linguistic generalization. Our
setting focuses on tech freelancing and uses French
prompts. While we test communication briefs in
the appendix, extending the framework to other
domains, especially those with different demographic
patterns and less structured tasks, would help assess
external validity. To broaden the scope, future work
should also explore how the model behaves with
profiles and briefs in other languages and cultural
settings, especially in languages where gender is
less explicitly encoded than in French, which may
influence how demographic signals are interpreted
by the model.
• Aligning LLMs without amplifying bias. A
critical open question is whether LLMs should be aligned
with human preferences and how to do so
without replicating human biases. Exploring alignment
strategies such as prompt calibration, filtered
examples, or fine-tuning on fairness-aware objectives
could offer ways to reconcile model performance
with ethical concerns.
• Ethical considerations. Beyond these
methodological considerations, we note that the deployment of
LLMs in recruitment raises broader ethical
concerns that extend beyond the scope of this study.
Even when accounting for fairness considerations,
as we attempt to do here, the use of algorithmic
systems in hiring decisions raises questions about
algorithmic accountability, the impact on employment in
recruitment, transparency, and candidate rights that
warrant careful consideration in real-world
applications.</p>
        <p>Beyond these directions, we believe this framework
contributes to a broader research agenda that seeks to
understand how LLMs internalize decision rules in high-stakes
applications. Our approach could be extended to other
domains where decisions involve trade-offs, fairness concerns,
and implicit assumptions.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>We would like to thank Guillaume Bied for his valuable
comments and feedback on earlier versions of this manuscript.
His suggestions greatly contributed to improving the quality
of this paper.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>The authors used Gemini-2.0-flash to: Generate datasets.
During the preparation of this work, the authors used
Writefull in order to: Grammar and spelling check. After using
these tools, the authors reviewed and edited the content
as needed and take full responsibility for the publication’s
content.</p>
    </sec>
    <sec id="sec-9">
      <title>A. Synthetic Data Generation Procedure</title>
      <sec id="sec-9-1">
        <title>A.1. Profile Generation</title>
        <sec id="sec-9-1-1">
          <title>Example of Generated Freelancer Profile</title>
          <p>Name: Marie
Headline: Full Stack Developer (female)</p>
        </sec>
        <sec id="sec-9-1-2">
          <title>Platform Activity:</title>
          <p>- Projects signed: 1
- Client reviews: 1
- Average client rating: 5/5
- Super Malter badge: No
General Information:
- Indicative rate: €450/day
- Experience: 0–2 years
- Availability: Confirmed
- Workload: Full-time
- Location: Île-de-France
- Onsite availability: Can work in client offices</p>
        </sec>
        <sec id="sec-9-1-3">
          <title>Languages:</title>
          <p>- French: Native or bilingual</p>
        </sec>
        <sec id="sec-9-1-4">
          <title>Skills:</title>
          <p>- JavaScript / TypeScript
- NestJS</p>
        </sec>
        <sec id="sec-9-1-5">
          <title>Professional Experience:</title>
          <p>The freelancer has 1 year of professional experience,
including the following example:
AXA (Banking and Insurance) — Full Stack Developer
- Developed new features for a payment engine.
- Migrated existing code to a more modern
architecture.
- Updated application interfaces.</p>
        </sec>
        <sec id="sec-9-1-6">
          <title>Education:</title>
          <p>- Master’s degree, INSA Lyon (Institut National des
Sciences Appliquées de Lyon)</p>
        </sec>
      </sec>
      <sec id="sec-9-2">
        <title>A.2. Brief Generation</title>
        <sec id="sec-9-2-1">
          <title>Example of Generated Project Brief</title>
          <p>Title: Full Stack Developer</p>
        </sec>
        <sec id="sec-9-2-2">
          <title>Project Description:</title>
          <p>We are a CAC40 company looking for a developer
to handle the backend of a web project and provide
support on the frontend. Tasks include receiving
files, interacting with third-party services,
assembling templates, sending emails, and managing
payments. 5 years of experience is preferred.</p>
          <p>The following prompt template was used to evaluate
freelancer-project matching using Gemini 2.0 Flash. Template variables
profile and brief were dynamically replaced with
profile and brief text.</p>
        </sec>
        <sec id="sec-9-2-3">
          <title>Profile/Brief Scoring Prompt</title>
          <p>Role: You are an expert in matching fullstack tech freelancers
with projects on the Malt freelance platform.</p>
          <p>Task: Evaluate hiring probability of a provided freelancer
profile for a project brief given below.</p>
          <p>Input Format:
• Freelancer Profile: profile
• Project Brief: brief</p>
          <p>Instructions:
1. Evaluate the probability of hiring (0-10 scale)
2. Provide concise justification (2 sentences maximum)</p>
          <p>[…] provided by Vertex AI: top_k = 40, top_p = 0.95. To
reduce output variance and encourage consistent scoring, we
set temperature = 0. The maximum number of output tokens was set to
256 (max tokens = 256).</p>
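          <p>To make the setup concrete, the following minimal sketch shows one way the template and decoding settings could be assembled before calling the model. The placeholder syntax ({profile}, {brief}), the function name build_scoring_prompt, and the dictionary keys are assumptions made for illustration; they are not taken from the authors' code or from any specific SDK, and the model call itself is deliberately left out.</p>
          <preformat>
```python
# Prompt wording copied from the scoring prompt above; the {profile} and
# {brief} placeholder syntax is an assumption (str.format style).
PROMPT_TEMPLATE = (
    "Role: You are an expert in matching fullstack tech freelancers "
    "with projects on the Malt freelance platform.\n"
    "Task: Evaluate hiring probability of a provided freelancer "
    "profile for a project brief given below.\n"
    "Input Format:\n"
    "- Freelancer Profile: {profile}\n"
    "- Project Brief: {brief}\n"
    "Instructions:\n"
    "1. Evaluate the probability of hiring (0-10 scale)\n"
    "2. Provide concise justification (2 sentences maximum)\n"
)

# Decoding settings reported in the text; the dictionary keys are
# hypothetical names, not a specific SDK's parameter names.
GENERATION_SETTINGS = {
    "temperature": 0.0,
    "top_k": 40,
    "top_p": 0.95,
    "max_output_tokens": 256,
}


def build_scoring_prompt(profile_text, brief_text):
    """Fill the template with one profile and one brief."""
    return PROMPT_TEMPLATE.format(profile=profile_text, brief=brief_text)


prompt = build_scoring_prompt(
    "Name: Marie\nHeadline: Full Stack Developer",
    "Title: Full Stack Developer",
)
```
          </preformat>
          <p>Keeping the template and the settings as plain data makes it straightforward to log exactly what was sent for each profile–brief pair.</p>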
        </sec>
      </sec>
    </sec>
    <sec id="sec-10">
      <title>B. Estimation Strategy</title>
      <sec id="sec-10-1">
        <title>B.1. Average Implicit Weights Analysis</title>
        <p>To uncover the implicit decision logic used by the LLM when
evaluating profile–brief pairs, we estimate a linear regression of
the score assigned to each profile–brief pair on a set of
profile characteristics and computed profile–brief matching features.
Since our synthetic profiles are generated through a full
factorial design where all attributes are systematically
varied, the regression coefficients directly capture the average
treatment effects without confounding. Therefore, in this
experimental setting, OLS coefficients can be interpreted as
simple conditional mean differences. We choose a linear
model for its transparency and interpretability; given the
factorial design, its coefficients can be read causally as
marginal contributions to the LLM’s scoring function.</p>
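        <p>The link between a full factorial design and conditional mean differences can be sketched as follows. The attribute levels, the effect sizes, and the simulate_score function are hypothetical stand-ins (the real scores come from the LLM); the sketch only illustrates that, in a balanced design with a main-effects model, the difference between the mean score at a level and the mean score at the reference level recovers that level's coefficient.</p>
        <preformat>
```python
import itertools
import random
import statistics

# Hypothetical attribute levels; the first level of each factor is the reference.
FACTORS = {
    "skill": ["exact", "close", "distant"],
    "experience": ["match", "lower"],
    "gender": ["male", "female"],
}

# Hypothetical "true" penalties, used only to simulate scores.
TRUE_EFFECT = {
    ("skill", "close"): -0.5,
    ("skill", "distant"): -1.5,
    ("experience", "lower"): -1.0,
    ("gender", "female"): 0.0,
}


def full_factorial(factors):
    """Enumerate every combination of attribute levels exactly once."""
    names = list(factors)
    return [dict(zip(names, combo))
            for combo in itertools.product(*factors.values())]


def simulate_score(profile, rng):
    """Additive score with small noise, standing in for the LLM's output."""
    score = 8.0 + sum(TRUE_EFFECT.get(item, 0.0) for item in profile.items())
    return score + rng.gauss(0.0, 0.05)


def mean_differences(rows, scores, factors):
    """Mean score at each level minus mean score at the reference level."""
    effects = {}
    for name, levels in factors.items():
        def level_mean(level):
            return statistics.fmean(
                s for r, s in zip(rows, scores) if r[name] == level)
        reference = level_mean(levels[0])
        for level in levels[1:]:
            effects[(name, level)] = level_mean(level) - reference
    return effects


rng = random.Random(0)
profiles = full_factorial(FACTORS)
scores = [simulate_score(p, rng) for p in profiles]
estimated = mean_differences(profiles, scores, FACTORS)
```
        </preformat>
        <p>Because every level of one factor appears equally often with every level of the others, the other attributes average out in each level mean, which is why no confounding adjustment is needed in this toy example.</p>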
        <p>Each characteristic is encoded as a categorical or binary
variable with a designated reference category representing
the ideal candidate from majority groups (matching
experience and daily rate, Master’s degree, male European, etc.).
The coefficients quantify how each characteristic influences
the LLM’s scoring decisions, measuring the marginal impact
of moving from the reference to each alternative level (e.g.,
from male European to female European, or from 3 to 1
years of experience).</p>
        <p>
          <disp-formula>
            <label>(1)</label>
            <tex-math><![CDATA[\mathrm{score}_{ij} = \alpha + \sum_{k} \beta_{k} X_{ik} + \sum_{m} \gamma_{m} M_{ijm} + \varepsilon_{ij}]]></tex-math>
          </disp-formula>
        </p>
        <p>where:
• score<sub>ij</sub> is the average score assigned by the LLM to profile i and brief j
• α is the intercept, representing the expected score for a profile with all reference characteristics (we choose as a reference a perfect match on all dimensions and the majority group)
• X<sub>ik</sub> represents the k-th characteristic of profile i (e.g., education level, gender)
• β<sub>k</sub> captures the marginal contribution of each profile feature to the LLM’s evaluation
• M<sub>ijm</sub> represents the m-th matching feature between brief j and profile i (e.g., skill alignment)
• γ<sub>m</sub> captures the penalty or bonus applied by the scoring logic for specific matching features
• ε<sub>ij</sub> is the error term</p>
        <p>Each profile–brief pair is submitted three times to the
LLM scoring prompt to account for potential variance in the
model’s outputs. We compute the mean score across three
independent runs and use it as the dependent variable in
the regression.</p>
        <p>Because each brief is matched with multiple profiles, the
resulting scores may have intra-brief correlation. To account
for within-group error correlation we cluster standard errors
at the brief level when possible.</p>
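        <p>As an illustration of what clustering at the brief level does, the sketch below computes cluster-robust standard errors for a regression with an intercept and a single regressor, using the plain sandwich estimator without any small-sample correction. It is a didactic, pure-Python version written under those assumptions, not the authors' implementation; the function name and the toy data are hypothetical, and in practice a statistics library would normally be used.</p>
        <preformat>
```python
import math
from collections import defaultdict


def ols_with_clustered_se(x, y, clusters):
    """OLS of y on an intercept and one regressor x, with cluster-robust SEs.

    Uses the plain sandwich estimator (no small-sample correction).
    Returns ((intercept, slope), (se_intercept, se_slope)).
    """
    n = len(y)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    sxy = sum(a * b for a, b in zip(x, y))

    # Bread: inverse of X'X for the design matrix [1, x].
    det = n * sxx - sx * sx
    inv = [[sxx / det, -sx / det], [-sx / det, n / det]]

    # Coefficients: (X'X)^-1 X'y.
    intercept = inv[0][0] * sy + inv[0][1] * sxy
    slope = inv[1][0] * sy + inv[1][1] * sxy

    # Meat: sum over clusters of the outer product of X_g' u_g.
    sums = defaultdict(lambda: [0.0, 0.0])
    for xi, yi, g in zip(x, y, clusters):
        resid = yi - intercept - slope * xi
        sums[g][0] += resid
        sums[g][1] += xi * resid
    meat = [[0.0, 0.0], [0.0, 0.0]]
    for s0, s1 in sums.values():
        meat[0][0] += s0 * s0
        meat[0][1] += s0 * s1
        meat[1][0] += s1 * s0
        meat[1][1] += s1 * s1

    # Sandwich: inv @ meat @ inv.
    half = [[sum(inv[i][k] * meat[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]
    cov = [[sum(half[i][k] * inv[k][j] for k in range(2)) for j in range(2)]
           for i in range(2)]
    ses = tuple(math.sqrt(max(cov[i][i], 0.0)) for i in range(2))
    return (intercept, slope), ses


# Toy data: x flags a skill mismatch, clusters are hypothetical brief ids.
x = [0, 1, 0, 1, 0, 1, 0, 1]
y = [8.1, 6.0, 7.9, 6.2, 7.0, 5.1, 7.2, 4.9]
briefs = ["b1", "b1", "b1", "b1", "b2", "b2", "b2", "b2"]
coefs, ses = ols_with_clustered_se(x, y, briefs)
```
        </preformat>
        <p>In the toy data, all scores for brief b1 sit above the fitted line and all scores for b2 sit below it, which is exactly the kind of within-brief correlation that clustering is meant to account for.</p>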
        <p>B.1.1. Heterogeneity in LLM Scoring Logic Across Groups</p>
        <p>To assess whether the LLM applies different evaluation
logics across subgroups, we investigate the heterogeneity of
estimated coefficients by profile type. In particular, we are
interested in understanding whether certain profile
characteristics or matching dimensions are valued differently
depending on observable attributes of the profile.</p>
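        <p>To formally test whether coefficients differ significantly
across groups, we estimate the following interaction model:
          <disp-formula>
            <label>(2)</label>
            <tex-math><![CDATA[\mathrm{score}_{ij} = \alpha + \theta \cdot \mathrm{Group}_{i} + \sum_{k} \beta_{k} X_{ik} + \sum_{m} \gamma_{m} M_{ijm} + \sum_{k} \delta_{k} \cdot (X_{ik} \cdot \mathrm{Group}_{i}) + \sum_{m} \lambda_{m} \cdot (M_{ijm} \cdot \mathrm{Group}_{i}) + \varepsilon_{ij}]]></tex-math>
          </disp-formula>
        </p>
        <p>where:
• Group<sub>i</sub> is a binary or categorical indicator for the group to which profile i belongs (e.g., Female, Master’s degree).
• X<sub>ik</sub> represents the k-th static characteristic of profile i (e.g., education).
• M<sub>ijm</sub> represents the m-th matching feature between profile i and brief j (e.g., skill alignment, experience gap).
• β<sub>k</sub> measures the effect of characteristic X<sub>ik</sub> for the reference group.
• γ<sub>m</sub> measures the effect of matching variable M<sub>ijm</sub> for the reference group.
• θ captures the baseline difference in score for the group when all other variables are at their reference levels.
• δ<sub>k</sub> measures how the effect of X<sub>ik</sub> differs for the group compared to the reference group.
• λ<sub>m</sub> measures how the effect of M<sub>ijm</sub> differs for the group compared to the reference group.</p>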
        <p>This interacted specification is formally equivalent to
estimating separate regressions by group. In other words,
estimating the model separately by group and then taking
the difference in coefficients across groups gives values
equivalent to the interaction terms in the pooled model.
However, the pooled regression allows us to test statistically for
differences in coefficients.</p>
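        <p>This equivalence can be checked on a toy example with one binary group indicator and one binary mismatch indicator. In that saturated case, the pooled coefficients are simple contrasts of cell means, so the interaction term equals the difference between the two within-group mismatch effects. The data and the variable names below are hypothetical and serve only to illustrate the identity.</p>
        <preformat>
```python
import statistics

# Toy observations: (group, mismatch, score). group = 1 for the group of
# interest, mismatch = 1 when the required skill is missing. All values
# are hypothetical.
DATA = [
    (0, 0, 8.0), (0, 0, 8.2), (0, 1, 6.1), (0, 1, 5.9),
    (1, 0, 8.1), (1, 0, 7.9), (1, 1, 5.4), (1, 1, 5.6),
]


def cell_mean(group, mismatch):
    """Average score in one group-by-mismatch cell."""
    return statistics.fmean(
        s for g, m, s in DATA if g == group and m == mismatch)


def slope(group):
    """Mismatch effect within one group (a separate regression on a dummy)."""
    return cell_mean(group, 1) - cell_mean(group, 0)


# Saturated pooled model: score = a + theta*G + gamma*M + lam*(M*G).
a = cell_mean(0, 0)
theta = cell_mean(1, 0) - a
gamma = slope(0)
lam = cell_mean(1, 1) - a - theta - gamma

# The interaction coefficient matches the difference in group-wise slopes.
difference_in_slopes = slope(1) - slope(0)
```
        </preformat>
        <p>Here the interaction is negative, meaning the mismatch penalty is larger for the group of interest than for the reference group; what the separate regressions do not provide directly is a standard error for that difference.</p>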
        <p>As an illustrative example, consider a binary indicator
Group equal to 1 for female profiles, and a single matching
variable M equal to 1 if the profile lacks the required
skill listed in the brief. In this case, a significant and
negative interaction coefficient λ (the coefficient on M ⋅ Group)
implies that women are more
heavily penalized than men when the required skill is
missing. Conversely, a positive value of λ indicates that women
are less penalized than men for the same mismatch. If λ is
not significantly different from zero, the penalty associated
with skill mismatch is applied equally across subgroups.</p>
        <p>B.1.2. Heterogeneity in Scoring Logic Across Brief Characteristics</p>
        <p>In addition to testing whether profile groups are treated
differently by the LLM, we also investigate whether the
scoring function differs depending on the characteristics
of the brief. This analysis helps us understand if the LLM
adapts its evaluation logic to the context provided by the
brief.</p>
        <p>To test this, we interact profile or matching variables with
one brief-level characteristic.</p>
        <p>
          <disp-formula>
            <label>(3)</label>
            <tex-math><![CDATA[\mathrm{score}_{ij} = \alpha + \sum_{k} \beta_{k} X_{ik} + \sum_{m} \gamma_{m} M_{ijm} + \theta \cdot b_{j} + \sum_{m} \lambda_{m} (M_{ijm} \cdot b_{j}) + \varepsilon_{ij}]]></tex-math>
          </disp-formula>
        </p>
        <p>where:
• b<sub>j</sub> is a binary indicator for a specific brief attribute (e.g., remote brief, large firm, full time),
• λ<sub>m</sub> captures whether the effect of matching feature M<sub>ijm</sub> differs depending on the value of b<sub>j</sub>.</p>
        <p>As an illustrative example, consider a binary indicator
b equal to 1 if the brief is written by a large firm (vs. an SME)
and a single matching variable M equal to 1 if the profile
lacks the required skill listed in the brief. In this case,
a significant and negative interaction coefficient
λ implies
that large firms penalize skill mismatches more strongly
than SMEs.</p>
      </sec>
    </sec>
    <sec id="sec-11">
      <title>C. Scoring vs Ranking</title>
      <sec id="sec-11-1">
        <title>C.1. Procedure</title>
        <p>While the main analysis is based on a scoring task where
the model evaluates profiles individually, many real-life
recruitment decisions involve ranking candidates relative
to each other. To assess whether the model’s decision logic
shifts in such contexts, we replicate our analysis using a
ranking task, where the LLM is prompted to rank three
profiles for the same brief. We sample random groups of
three profiles (without replacement) in our synthetic dataset
and ask the LLM to rank them according to the probability
of hiring for a given project brief. The following prompt
template was used to evaluate freelancer-project ranking
using Gemini 2.0 Flash. Template variables profile and
brief were dynamically replaced with profile and brief
text.</p>
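        <p>The grouping step can be sketched as below. The sketch reads "without replacement" as meaning that each profile appears in at most one group of three, which is an assumption about the procedure; the function name and the integer profile identifiers are hypothetical.</p>
        <preformat>
```python
import random


def sample_triples(profile_ids, rng):
    """Shuffle the ids and cut them into disjoint groups of three.

    Leftover ids (when the count is not a multiple of 3) are dropped.
    """
    shuffled = list(profile_ids)
    rng.shuffle(shuffled)
    usable = len(shuffled) - len(shuffled) % 3
    return [shuffled[i:i + 3] for i in range(0, usable, 3)]


rng = random.Random(42)  # fixed seed so the grouping is reproducible
triples = sample_triples(range(10), rng)
```
        </preformat>
        <p>Each resulting triple would then be inserted into the ranking prompt together with the brief it is evaluated against.</p>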
        <sec id="sec-11-1-1">
          <title>Profile/Brief Ranking Prompt</title>
          <p>Role: You are an expert in matching fullstack tech freelancers
with projects on the Malt freelance platform.</p>
          <p>Task: Evaluate hiring probability of a provided freelancer
profile for a project brief given below.</p>
          <p>Input Format:
• Freelancer Profile: profile
• Project Brief: brief</p>
          <p>Instructions:
1. Evaluate the probability of hiring (0-10 scale)
2. Provide concise justification (2 sentences maximum)</p>
          <p>Figure 4 compares the normalized weights of key features
across the scoring and ranking settings. In the ranking
task, skill matching becomes by far the most influential
factor, while the weights of experience and platform
reputation drop significantly. At the same time, demographic
characteristics, such as Arabic or feminine-sounding names,
education, but also firm prestige and industry matching, gain
more weight than in the scoring setup. This suggests that
some features may play a limited role in absolute scoring,
but become more prominent when the model must make a
relative judgment between similar profiles. In scoring mode,
these variables may have little impact because they do not
affect the likelihood of success in isolation. Yet in ranking
mode, when the model must refine its decision, they
provide discriminating signals that help differentiate otherwise
similar candidates. These results highlight a key concern:
biases that appear negligible in scoring may be amplified in
ranking, an important issue that we leave to future work.</p>
          <p>Procedure. We replicate the same construction logic used
for full-stack developers to generate synthetic profiles in
the communication sector. The main adaptations concern
the skill dimensions, where we vary whether the candidate
possesses the brief’s required skill (SEO content writing), a
closely related skill (editorial content writing), or a more
distant skill (proofreading). We also adapt the project briefs to
reflect typical SEO requirements and adjust the main sector
of activity to e-commerce, with tourism as the alternative
sector. Finally, we modify the experience descriptions to
reflect relevant field experience and update company names
to represent both large corporations and SMEs. The remaining
variables are identical.</p>
          <p>Results. We replicate our regression analysis on a second
occupation, SEO content writing in the communication
sector, to assess whether the model’s decision logic generalizes
beyond the technological domain. The model continues to
explain a substantial share of the variation in evaluation
scores (R² = 0.865), suggesting a similarly additive structure
in its decision-making process. The overall weighting of
attributes remains consistent with the patterns observed in
the technological sector.</p>
          <p>The main difference lies in the much greater importance
assigned to skill matching, though this may partly reflect
the specific skill configurations we selected. We also
observe a similar inverted U-shaped relationship with daily
rates, although less pronounced. Importantly, in this setting,
profiles perceived as feminine are significantly penalized
on average, while profiles perceived as Arabic are slightly
advantaged on average compared to European men.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Spence</surname>
          </string-name>
          , Job market signaling,
          <source>The Quarterly Journal of Economics</source>
          <volume>87</volume>
          (
          <year>1973</year>
          )
          <fpage>355</fpage>
          -
          <lpage>374</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Horton</surname>
          </string-name>
          ,
          <article-title>Do digital skill certificates help new workers enter the market? evidence from a voluntary certification scheme in an online freelancing labor market</article-title>
          ,
          <source>OECD iLibrary Working Paper</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Pallais</surname>
          </string-name>
          ,
          <article-title>Inefficient hiring in entry-level labor markets</article-title>
          ,
          <source>American Economic Review</source>
          <volume>104</volume>
          (
          <year>2014</year>
          )
          <fpage>3565</fpage>
          -
          <lpage>3599</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Bertrand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Duflo</surname>
          </string-name>
          ,
          <article-title>Field experiments on discrimination</article-title>
          ,
          <source>Working Paper 22014, National Bureau of Economic Research</source>
          ,
          <year>2016</year>
          . URL: http://www.nber.org/papers/w22014. doi: 10.3386/w22014.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>V.</given-names>
            <surname>Bartoš</surname>
          </string-name>
          ,
          <article-title>Attention discrimination: Theory and field experiments with monitoring information acquisition</article-title>
          ,
          <source>American Economic Review</source>
          <volume>106</volume>
          (
          <year>2016</year>
          )
          <fpage>1644</fpage>
          -
          <lpage>1675</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>L.</given-names>
            <surname>Armstrong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>MacNeil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Metaxa</surname>
          </string-name>
          ,
          <article-title>The silicon ceiling: Auditing GPT's race and gender biases in hiring</article-title>
          ,
          <source>in: Proceedings of the 4th ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>18</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Ribeiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Guestrin</surname>
          </string-name>
          ,
          <article-title>“Why should I trust you?” Explaining the predictions of any classifier</article-title>
          ,
          <source>in: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>1135</fpage>
          -
          <lpage>1144</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Lundberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-I.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>A unified approach to interpreting model predictions</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>30</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Calderon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Feder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chapanin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Reichart</surname>
          </string-name>
          ,
          <article-title>Faithful explanations of black-box NLP models using LLM-generated counterfactuals</article-title>
          ,
          <source>arXiv preprint arXiv:2310.00603</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R. K.</given-names>
            <surname>Sinha</surname>
          </string-name>
          ,
          <article-title>Book review: Christoph Molnar. 2020. Interpretable machine learning: A guide for making black box models explainable</article-title>
          ,
          <source>Metamorphosis</source>
          <volume>23</volume>
          (
          <year>2024</year>
          )
          <fpage>92</fpage>
          -
          <lpage>93</lpage>
          . URL: https://doi.org/10.1177/09726225241252009. doi: 10.1177/09726225241252009.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>D.</given-names>
            <surname>Janzing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Minorics</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Blöbaum</surname>
          </string-name>
          ,
          <article-title>Feature relevance quantification in explainable AI: A causal problem</article-title>
          ,
          <source>in: International Conference on Artificial Intelligence and Statistics</source>
          , PMLR,
          <year>2020</year>
          , pp.
          <fpage>2907</fpage>
          -
          <lpage>2916</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>D.</given-names>
            <surname>Slack</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hilgard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lakkaraju</surname>
          </string-name>
          ,
          <article-title>Fooling LIME and SHAP: Adversarial attacks on post hoc explanation methods</article-title>
          ,
          <source>in: Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>180</fpage>
          -
          <lpage>186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J. B.</given-names>
            <surname>Kessler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Low</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Sullivan</surname>
          </string-name>
          ,
          <article-title>Incentivized resume rating: Eliciting employer preferences without deception</article-title>
          ,
          <source>American Economic Review</source>
          <volume>109</volume>
          (
          <year>2019</year>
          )
          <fpage>3713</fpage>
          -
          <lpage>3744</lpage>
          . URL: https://www.aeaweb.org/articles?id=10.1257/aer.20181714. doi: 10.1257/aer.20181714.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Vaishampayan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Leary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. B.</given-names>
            <surname>Alebachew</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hickman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. A.</given-names>
            <surname>Stevenor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Beck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <article-title>Human and LLM-based resume matching: An observational study</article-title>
          ,
          <source>in: Findings of the Association for Computational Linguistics: NAACL 2025</source>
          ,
          <year>2025</year>
          , pp.
          <fpage>4808</fpage>
          -
          <lpage>4823</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>F. P.-W.</given-names>
            <surname>Lo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lo</surname>
          </string-name>
          ,
          <article-title>AI hiring with LLMs: A context-aware and explainable multi-agent framework for resume screening</article-title>
          ,
          <source>in: Proceedings of the Computer Vision and Pattern Recognition Conference</source>
          ,
          <year>2025</year>
          , pp.
          <fpage>4184</fpage>
          -
          <lpage>4193</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>M.</given-names>
            <surname>Bertrand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mullainathan</surname>
          </string-name>
          ,
          <article-title>Are Emily and Greg more employable than Lakisha and Jamal? a field experiment on labor market discrimination</article-title>
          ,
          <source>American Economic Review</source>
          <volume>94</volume>
          (
          <year>2004</year>
          )
          <fpage>991</fpage>
          -
          <lpage>1013</lpage>
          . doi: 10.1257/0002828042002561.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>P.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sadaphal</surname>
          </string-name>
          ,
          <article-title>JobRecoGPT - explainable job recommendations using LLMs</article-title>
          ,
          <source>arXiv preprint arXiv:2309.11805</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Gaebler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Goel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Huq</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Tambe</surname>
          </string-name>
          ,
          <article-title>Auditing the use of language models to guide hiring decisions</article-title>
          ,
          <source>arXiv preprint arXiv:2404.03086</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schuurmans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , et al.,
          <article-title>Chain-of-thought prompting elicits reasoning in large language models</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>35</volume>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>P.</given-names>
            <surname>Milgrom</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <article-title>Price and advertising signals of product quality</article-title>
          ,
          <source>Journal of Political Economy</source>
          <volume>94</volume>
          (
          <year>1986</year>
          )
          <fpage>796</fpage>
          -
          <lpage>821</lpage>
          . URL: https://ideas.repec.org/a/ucp/jpolec/v94y1986i4p796-821.html. doi: 10.1086/261408.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>C.</given-names>
            <surname>Shapiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Stiglitz</surname>
          </string-name>
          ,
          <article-title>Equilibrium unemployment as a worker discipline device</article-title>
          ,
          <source>The American Economic Review</source>
          <volume>74</volume>
          (
          <year>1984</year>
          )
          <fpage>433</fpage>
          -
          <lpage>444</lpage>
          . URL: http://www.jstor.org/stable/1804018.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>