=Paper=
{{Paper
|id=Vol-3744/paper2
|storemode=property
|title=CO*IR: A Greedy and Individually Fair Re-ranker
|pdfUrl=https://ceur-ws.org/Vol-3744/paper2.pdf
|volume=Vol-3744
|authors=Seán Healy
|dblpUrl=https://dblp.org/rec/conf/aimmes/Healy24
}}
==CO*IR: A Greedy and Individually Fair Re-ranker==
Seán Healy
ADAPT Centre (Dublin City University), Ireland
Abstract
An open-source and greedy approach to individually fair re-ranking is presented, evaluated and tested.
Previous literature on individually fair search suggested greedy-style heuristics, but no such designs or
implementations were introduced before this writing. We release our re-ranker as a zero-dependency
utility compatible with all major operating systems.
We publish our re-ranker under a free and open-source software license (GPLv3). By explicitly
considering individual fairness a post-processing (re-ranking) task, we implement the notion of
individual fairness as a microservice, decoupled from specific IR systems. Our software package
works on commodity hardware, and finds application in small-to-medium-sized businesses,
federated social networks, and under the broader open web search initiative.
Keywords
fairness, ranking, individual fairness, open source
1. Introduction
Individual fairness is a distributive justice consideration within information retrieval, first dis-
cussed in the context of sociotechnical systems by Dwork et al. [1] (“similar individuals treated
similarly”). The idea is as follows: rather than collecting group labels for items, and optimising
for equity with respect to these group labels, individual fairness aims to allocate exposure [2]
(or perhaps impact/outcomes) equitably in proportion to relevance or utility, on the level of the
individual. Biega notes that positional bias, stemming from limited screen space and user attention,
means the only way to do this must involve multi-user repetition of equivalent information needs [3,
p. 33]. Each user is served ‘perturbed’ rankings, $\tilde{\rho}_j$, as opposed to ‘canonical’ rankings ($\rho$),
with a goal of equalising the exposure-to-relevance ratio among items (Figure 1). This method has been
applied in both individual and group-fair scenarios. However, the methods so far have often
assumed significant computational power, or, alternatively, reliable access to group labels, leading
to another injustice in the form of barriers to entry and centralisation. The original proposal for
the world wide web emphasised non-centralisation:
“Information systems start small and grow. They also start isolated and then
merge. A new system must allow existing systems to be linked together without
requiring any central control or coordination.” [4]
In assuming an abundance of accurate group label data, or an abundance of centralised com-
puting power, we inherently support a winner-takes-all view of the WWW [5], where isolated
systems have ultimately merged into monopolistic ones with access to highly centralised com-
puting resources and accurate group labels for subjects and searchers. Within fairness research,
we often assume this state of affairs, in an effort to reduce winner-takes-all effects among search
subjects.
Ignoring computational constraints and data access issues limits scope, and focuses research
effort on the development and validation of theoretical methods. However, we argue that it is
possible to re-imagine these fairness processes for a non-centralised web, with only a negligible
AIMMES’24: Workshop on AI bias: Measurements, Mitigation, Explanation Strategies, March 20, 2024, Amsterdam
£ sean.healy@adaptcentre.ie (S. Healy)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
impact on fairness outcomes. We demonstrate this through the lens of individual fairness, leav-
ing group fairness for future work or additional re-ranking steps placed beforehand or afterwards
within a ranking pipeline.
A major issue facing fairness-aware search adoption on the open web relates to compatibil-
ity and required expertise. With the exception of FairSearch [6], most fairness solutions are
published in the form of Python experiments, without implementations that work across var-
ied computer systems. Additionally, researching greedy fairness methods has been proposed as
future work [2, p. 409], but as of this writing, the greedy strand of fair ranking research has
received little explicit attention, despite efficient (greedy) solutions being vital for widespread
feasibility of fair search.
Figure 1: Individual amortized fairness operating in a ranking system across duplicate queries (“Dublin + restaurant”): three repetitions serve perturbed orderings of the same five restaurants, and the yellow bars indicate each document’s relative share of exposure over time.
2. Background
Though some computationally efficient fairness techniques have been provided [7, 8], most others
rely on intense optimisation strategies, necessitating a GPU at the minimum. This creates a
high barrier to entry, thus excluding the open web as a consideration. There are also ecological
downsides to this approach. In the case of Biega et al.’s technique [2, p. 411], integer linear
programming (ILP) was used via a proprietary cloud-based system (Gurobi, https://www.gurobi.com). Singh and Joachims [9]
concurrently suggested an approach equivalent to amortized fairness; their method uses Birkhoff-
von-Neumann decomposition (BvN decomp.), and they applied it to groups as opposed to
individuals, noting their method could apply to individuals by reducing groups to size one.
Again, BvN is a computationally expensive procedure. To our knowledge, Morik et al. [7] were
the first to publish work addressing such bias with what is known as a greedy solution.
Definition 1. Greedy algorithm. “An algorithm that always takes the best immediate, or local,
solution while finding an answer.” [10]
Greedy algorithms can efficiently produce approximate solutions to NP-complete problems, in-
cluding NP-complete fair ranking problems (Section 3). Though Morik et al. don’t directly
mention greedy algorithms, their proposed method is undoubtedly greedy (Definition 1), par-
ticularly given that it is observed to produce oscillating behaviour when a certain 𝜆 parameter
is not tuned correctly [7, p. 434]. Though this oscillation decreases with 𝜆-tuning, results re-
main approximate, not optimal. The authors apply their P-controller method [7, 11] to group
fairness, and don’t include a quality threshold in their scope (e.g. Biega et al.’s 𝜃 threshold [2, p. 409]).

Table 1
Comparing some existing fairness techniques in IR: computational costs, quality thresholds, and implementation status.

Method | Comp. Cost | Quality threshold | Production-ready
FA*IR [8] | Low | ✗ | ✓ [6]
P-Controller [7] | Low | ✗ | ✗
DELTR [12] | Low-medium | ✗ | ✓ [6]
ILP [2] | Medium | ✓ | ✗
LP (BvN decomp.) [9] | High | ✗ | ✗

We later present a similarly greedy method that accounts for quality threshold, 𝜃,
while retaining low computational cost. Like the method of Morik et al., ours is not optimal.
We define optimality in terms of the ILP solution [2], which is optimal by definition, given that
the ILP solver finds the permutation of valid rankings that minimises the difference between
attention [$A_i$] and relevance [$r_i$] (scaled by the ranking count $m$) (1), given a quality constraint
($\theta$) on result pages of length $k$ (2). A major caveat is that this approach only performs within a
reasonable duration when lists of length ≈ 100 are being re-ranked [2]. Queries to a web search
engine, however, might instead return tens of thousands of similarly ranked results. The ILP
problem is defined as follows:

$$\text{minimise} \quad \sum_{i=1}^{n} |A_i - m \cdot r_i| \qquad (1)$$

$$\text{subject to} \quad \mathrm{NDCG}_k(\rho, \tilde{\rho}_j) \ge \theta, \quad j \in [1, m] \qquad (2)$$
Biega et al. propose (in addition) an online variant of (1), which optimises for a solution ($\tilde{\rho}$) at
each ranking as opposed to assuming knowledge of how many rankings ($m$) are expected.
Some previous fairness work has reached production-ready implementation, notably the work
of Zehlike and her colleagues [8, 12, 13], who provide Python/Java libraries, and Elasticsearch
plugins for both an in-processing technique (DELTR [12]) and a post-processing technique
(FA*IR [8]). We categorise the DELTR approach as low-medium cost, given its reliance on an
LTR method (in-processing fairness). Organisations may already have complex LTR pipelines,
and integrating this approach might be similarly complex, involving specialised expertise. For
companies without an LTR pipeline, relying on traditional data-free ranking functions alone
(e.g. BM25), this approach would not apply.
The FA*IR approach [8], a post-processing technique, ensures that ranking prefixes carry pre-
defined proportions across protected groups. The implementation [6] is efficient and requires
little additional expertise for deployment. However, it has seen limited adoption in industry (7
‘GitHub stars’ from industry practitioners, 13 in total as of January 2024). This could be due
to a lack of group labels available to businesses (FA*IR is designed as a group-fair method), or
perhaps the absence of a quality threshold on rankings produced by FA*IR. We summarise our
analysis of these prior works in Table 1.
3. Method
We now propose an individual fairness method that is low-cost and incorporates a quality
threshold; we later introduce a reference C++ implementation (Section 3.2). Our approach is
inspired by the combination of characteristics reviewed in Table 1, with an aim to introduce a fair
re-ranking approach that can be immediately deployed in many varied computing environments,
focusing mainly on open web search and scenarios of group membership ambiguity.
Quality control. Central to the field of information retrieval is quality. Al-Maskari et al. [14]
state, “User satisfaction is a subjective variable and is influenced by several factors (e.g. system
effectiveness, user effectiveness, user effort, and user characteristics).” Two popular metrics
for ensuring quality or system effectiveness are expected reciprocal rank [ERR] (4) [15] and
discounted cumulative gain [DCG]. We adopt a definition of the latter from Burges et al. [16,
p. 95] (3).
$$\mathrm{DCG}_k = \sum_{i=1}^{k} \frac{2^{\mathrm{rel}(i)} - 1}{\log_2(i+1)} \qquad (3)$$

$$\mathrm{ERR}_k = \sum_{i=1}^{k} \frac{\mathrm{rel}(i)}{i} \prod_{j=1}^{i-1} \left[1 - \mathrm{rel}(j)\right] \qquad (4)$$
In the above formulae, $\mathrm{rel}(i)$ denotes the relevance of the document at position $i$ in a ranking
$\rho$ [or re-ranking $\tilde{\rho}$]. NDCG is most often used as a quality metric during the development of
relevance estimation strategies (including LTR). However, within fair re-ranking the formula is
instead used in order to retain a certain level of quality; i.e. we define NDCG based on relevance
probabilities as revealed to the fair re-ranker from an earlier point in the ranking pipeline. In fair
re-ranking, these relevance scores do not arise directly from annotated data, and are generally
the output from a ranking function itself. This new usage of the formula prompted Biega et al.
to use differing notation for NDCG: NDCG-quality@$k(\rho, \tilde{\rho}) = \mathrm{DCG}@k(\tilde{\rho}) \, / \, \mathrm{DCG}@k(\rho)$ [2], where
$\rho$ is a ranking without any amortized fairness applied, and $\tilde{\rho}$ is a ranking with amortized fairness
applied. For simplicity, we instead refer to this as $\mathrm{NDCG}_k$ in the remainder of this paper.
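To make these metrics concrete, the following minimal C++ sketch (our own illustration, not part of the CO*IR codebase; all identifiers are ours) computes DCG@$k$ (3), ERR@$k$ (4), and the NDCG$_k$ quality ratio used in the constraint:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// DCG@k (3): graded gain (2^rel - 1) discounted by log2(rank + 1).
double dcg_at_k(const std::vector<double>& rel, std::size_t k) {
    double dcg = 0.0;
    for (std::size_t i = 0; i < std::min(k, rel.size()); ++i)
        dcg += (std::exp2(rel[i]) - 1.0) / std::log2(i + 2.0);  // rank is i + 1
    return dcg;
}

// ERR@k (4): each rank's contribution is weighted by the probability that
// no earlier document already satisfied the user (relevances in [0, 1]).
double err_at_k(const std::vector<double>& rel, std::size_t k) {
    double err = 0.0, not_satisfied = 1.0;
    for (std::size_t i = 0; i < std::min(k, rel.size()); ++i) {
        err += not_satisfied * rel[i] / static_cast<double>(i + 1);
        not_satisfied *= 1.0 - rel[i];
    }
    return err;
}

// NDCG_k quality [2]: DCG@k(reranked) / DCG@k(canonical); the quality
// constraint is that this ratio stays >= theta.
double ndcg_quality(const std::vector<double>& canonical,
                    const std::vector<double>& reranked, std::size_t k) {
    return dcg_at_k(reranked, k) / dcg_at_k(canonical, k);
}
```

The quality constraint of [2] then amounts to checking `ndcg_quality(canonical, reranked, k) >= theta`.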
Our method begins with quality control. Therein we bypass the more difficult question:
“Subject to 𝑁 𝐷𝐶𝐺𝑘 ≥ 𝜃, what is the best re-ranking (𝜌)̃ to distribute attention 𝐴?” . Instead,
we ask a series of easier questions: “What is the fairest result to return first, subject to 𝑁 𝐷𝐶𝐺𝑘 ≥
𝜃?” Then, “What is fairest to return next?” etc. This is a greedy approach (Definition 1).
Before we go further, we must introduce what is meant by exposure, alternately referred to as
attention by some authors. These are intuitive terms, especially to those familiar with search
engine optimisation. The terms have numeric definitions in fair ranking research. The simplest
practical model defines attention as the marginal probability of examination [7], in reference to
earlier position-based models [17] which found that rank alone can predict exposure with high
enough levels of accuracy for practical purposes [7, 18, 19]. In this sense attention would be
defined similarly to one of the summand denominators in DCG: $A_i = \log_2(i+1)^{-1}$ [9]. Another
model for attention is geometric [2], i.e. a special case of the cascade model using equal click
probability $p$ for each item: $A_i = (1-p)^{i-1}\,p$. We suggest a possible amendment to this model:
removing the final $p$ multiplicand, given that attention maps to examination of items (not clicks),
whereas the $p$ multiplicand in the cascade model specifically models a click. In addition to the
geometric model, the more general cascade model can be used to predict attention per-item in
a ranking. This would differ from geometric in that each item receives varying click probability.
The choice of attention model is important, though not the central scope of this work. We
assume a geometric model with $p = 0.4$ in our experiments (Section 4). Returning to the greedy
approach, our task is as follows: (a) choose an item at each rank which adds the most individual
fairness; (b) ensure chosen items are good enough to prevent the overall ranking from dipping
below the quality threshold ($\theta$). For the purposes of their ILP-based approach, Biega et al. use
the following definition:

$$\text{unfairness}(\rho_1, \ldots, \rho_m) = \sum_{i=1}^{n} |A_i - m \cdot r_i| = \sum_{i=1}^{n} \Bigl|\sum_{j=1}^{m} a_{ji} - m \cdot r_i\Bigr|$$

$a_{ji}$ is the attention obtained by document $i$ in ranking $j$, and $r_i$ is the relevance of the document
to the outer information need. We use an equivalent minimisation target (5) which aligns more
closely with the method we will later outline.

$$\text{unfairness}(\rho_1, \ldots, \rho_m) = \max_{i \in [1,n]} \left[\frac{A_i}{\mathrm{rel}(i)}\right] - \min_{i \in [1,n]} \left[\frac{A_i}{\mathrm{rel}(i)}\right] \qquad (5)$$
Proof of equivalence. When Biega et al.’s unfairness is minimised: $A_i - m \cdot r_i = 0,\ \forall i$. Put
differently: $\frac{A_i}{\mathrm{rel}(i)} = m,\ \forall i$. Therefore, unfairness$(\rho_1, \ldots, \rho_m)$ would be minimised too (5), given
that $\max(m) = \min(m)$, $\forall$ constants $m$.
It is useful to label $\frac{A_i}{\mathrm{rel}(i)}$ itself as the relative compensation received by document $i$ (relative to
merit or utility). In this sense, we define unfairness as a large gap between the highest relative
compensation and the lowest.
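As an illustration (a sketch under our own naming, assuming the geometric attention model with $p = 0.4$ used in Section 4), cumulative attention and unfairness (5) can be computed as follows:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Geometric attention model [2]: the item at 1-based rank i is examined
// with probability A_i = (1 - p)^(i - 1) * p. We assume p = 0.4 (Section 4).
double geometric_attention(std::size_t rank, double p = 0.4) {
    return std::pow(1.0 - p, static_cast<double>(rank - 1)) * p;
}

// Unfairness (5): the gap between the highest and lowest relative
// compensation A_i / rel(i) accumulated over the rankings served so far.
double unfairness(const std::vector<double>& attention,    // cumulative A_i
                  const std::vector<double>& relevance) {  // rel(i)
    double lo = std::numeric_limits<double>::infinity();
    double hi = -std::numeric_limits<double>::infinity();
    for (std::size_t i = 0; i < attention.size(); ++i) {
        const double compensation = attention[i] / relevance[i];
        lo = std::min(lo, compensation);
        hi = std::max(hi, compensation);
    }
    return hi - lo;
}
```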
3.1. Deriving minimal relevance (a-posteriori) at each rank via an optimistic strategy.
We use an optimistically greedy strategy, meaning we select the next result which minimises
unfairness (5) so long as this selection is feasible under the assumption that strongly relevant
items will fill the remainder of the 𝑘 − 1 results (the result page suffix). We refer to this as
the deus ex machina strategy (DEM); regardless of how low the running NDCG𝑗 becomes at
each choice 𝑗, highly relevant results might arrive in the tail end, abruptly making up for the
predicament.
Figure 2: DEM selection; A. depicts the first rank being chosen; B. is the fifth rank being chosen; C. is the ninth.
DEM selection is illustrated in three stages during the preparation of a single re-ranking (Figure
2). The left side of the diagram depicts canonical relevancies of items, as passed into the re-
ranking system from an earlier ranking process. The red line represents the first page cutoff
point (𝑘 = 10). The right side is the ongoing re-ranking of these items, with the sole coloured bar
representing the current rank whose item is being decided. While choosing the first item (part
A), the missing part of the ranking (dashed blocks) is assumed to be a prefix of the canonically
ordered results, ⟨z, a, d, t, o, h, p, b⟩. Part B occurs after the prior selection of four items using
DEM strategy. Notably, one of the items previously selected (d) was in the canonical prefix.
We illustrate this by colouring its bar grey, meaning it can no longer be used during optimistic
predictions about the remaining results. This is further illustrated in the dashed blocks (far right
of part B), where the optimistic prefix excludes d: ⟨z, a, t, o, h⟩. This selection process continues
(part C), and ultimately a re-ranking is produced within a given quality threshold. The final
re-ranking suffix (right) might not resemble a canonical prefix (left), but our overall
strategy ensures we always remain within the quality threshold at each item choice. We now
validate this strategy by considering the quality threshold as introduced earlier (NDCG𝑘 ≥ 𝜃).
This can be rewritten as follows.
$$\mathrm{DCG}_k(\tilde{\rho}) \ge \theta \cdot \mathrm{DCG}_k(\rho) \qquad (6)$$
Using the notation $\mathrm{rel}(i)$ to represent the relevance of the document at position $i$ in re-ranking
$\tilde{\rho}$, and $\mathrm{Rel}(i)$ for relevance at $i$ in the more canonical ranking, $\rho$ (e.g. left in Figure 2), we have:
$$\sum_{i=1}^{k} \frac{2^{\mathrm{rel}(i)} - 1}{\log_2(i+1)} \;\ge\; \theta \sum_{i=1}^{k} \frac{2^{\mathrm{Rel}(i)} - 1}{\log_2(i+1)} \qquad (7)$$
Under DEM strategy, remaining item choices are assumed to have ideal relevance, and the
item under the present rank (𝑗) is picked accordingly. We represent this mathematically by
substituting an ideal suffix into the above (7), while also accounting for the relevance of previous
item choices:
$$\underbrace{\left(\sum_{i=1}^{j-1} \frac{2^{\mathrm{rel}(i)} - 1}{\log_2(i+1)}\right)}_{\text{past}} + \underbrace{\left(\frac{2^{\mathrm{rel}(j)} - 1}{\log_2(j+1)}\right)}_{\text{present}} + \underbrace{\left(\sum_{i=j+1}^{k} \frac{2^{\check{\mathrm{Rel}}(i-j)} - 1}{\log_2(i+1)}\right)}_{\text{future (ideal)}} \;\ge\; \theta \underbrace{\sum_{i=1}^{k} \frac{2^{\mathrm{Rel}(i)} - 1}{\log_2(i+1)}}_{\text{ultimate ideal}} \qquad (8)$$
The above can be re-organised to find a minimum relevance required at the present rank $j$.
Below this relevance [$\mathrm{rel}(j)$], meeting the quality threshold would be impossible.

$$\mathrm{rel}(j) \;\ge\; \log_2\left(1 + \theta \left[\sum_{i=1}^{k} \frac{2^{\mathrm{Rel}(i)} - 1}{\log_{j+1}(i+1)}\right] - \left[\sum_{i=1}^{j-1} \frac{2^{\mathrm{rel}(i)} - 1}{\log_{j+1}(i+1)}\right] - \left[\sum_{i=j+1}^{k} \frac{2^{\check{\mathrm{Rel}}(i-j)} - 1}{\log_{j+1}(i+1)}\right]\right) \qquad (9)$$

$$\mathrel{=:} \texttt{min\_rel}^{\mathrm{DCG}}_j \quad \ldots \text{the label given for this minimum relevance at each rank}$$

In the above equations, a distinction is made between canonical relevance [$\mathrm{Rel}(i)$], and canonical
relevance with the previously chosen $j-1$ items removed [$\check{\mathrm{Rel}}(i)$]. This relates to the grey bars
in Figure 2; $\mathrm{Rel}(i)$ counts from the beginning, including these grey bars. On the other hand,
$\check{\mathrm{Rel}}(i)$ excludes these bars while counting $i$.
Applying the maths. Our proposed method for greedy individual fairness involves sorting re-
sults by relative compensation. We then scan over this list (lowest relative compensation to
highest), stopping at the first result with high enough relevance [$\mathrm{rel}(j) \ge \texttt{min\_rel}^{\mathrm{DCG}}_j$] (9). The
scan could be lengthy, depending on the aggressiveness of the quality threshold (high $\theta$), or de-
pending on result corpus quantity and size. Calculating $\texttt{min\_rel}^{\mathrm{DCG}}_j$ values in advance minimises
computer operations during the scan.
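As a sketch (our own, not verbatim CO*IR source), the a-posteriori minimum (9) can be computed directly from the base-2 DCG terms by multiplying the quality deficit through by $\log_2(j+1)$:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// A-posteriori minimum relevance at rank j under DEM selection (9).
// `canonical` holds Rel(1..k); `chosen` holds rel(1..j-1) for the items
// already placed; `remaining` holds the check-marked Rel values, i.e. the
// canonical list with already-chosen items removed.
double min_rel_dcg(const std::vector<double>& canonical,
                   const std::vector<double>& chosen,
                   const std::vector<double>& remaining,
                   std::size_t j, std::size_t k, double theta) {
    auto gain = [](double rel) { return std::exp2(rel) - 1.0; };
    auto disc = [](std::size_t rank) { return std::log2(rank + 1.0); };
    double ideal = 0.0, past = 0.0, future = 0.0;
    for (std::size_t i = 1; i <= k; ++i) ideal += gain(canonical[i - 1]) / disc(i);
    for (std::size_t i = 1; i < j; ++i)  past  += gain(chosen[i - 1]) / disc(i);
    for (std::size_t i = j + 1; i <= k; ++i)
        future += gain(remaining[i - j - 1]) / disc(i);  // optimistic suffix
    // Solving (8) for rel(j); a non-positive result (or NaN, when the
    // argument drops to zero or below) means any item clears the bar.
    return std::log2(1.0 + disc(j) * (theta * ideal - past - future));
}
```

The linear scan then stops at the first compensation-sorted candidate whose relevance meets this bound.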
Applying DEM selection with ERR quality metrics. Though 𝜃 was first used with DCG [2],
it can also be used with any ranking quality metric, including ERR [introduced earlier (4)].
This is another advantage of the greedy approach over the ILP approach; ILP requires a highly
efficient quality metric, whereas the greedy method does not.
$$\sum_{i=1}^{k} \frac{\mathrm{rel}(i)}{i} \prod_{m=1}^{i-1} \left[1 - \mathrm{rel}(m)\right] \;\ge\; \theta \sum_{i=1}^{k} \frac{\mathrm{Rel}(i)}{i} \prod_{m=1}^{i-1} \left[1 - \mathrm{Rel}(m)\right] \qquad (10)$$

$$\ge \theta \cdot \mathrm{IERR}_k \quad \ldots \text{we use } \mathrm{IERR}_k \text{ (ideal ERR) as shorthand.}$$
Expressing $\mathrm{rel}(j)$ in terms of everything else (in the case of ERR), assuming a DEM strategy
(Figure 2), is more complicated than with DCG, the latter being a much more stateless metric.
For ERR, we have:
$$\underbrace{\left(\sum_{i=1}^{j-1} \frac{\mathrm{rel}(i)}{i} \prod_{m=1}^{i-1} \left[1 - \mathrm{rel}(m)\right]\right)}_{\text{past}} + \underbrace{\left(\frac{\mathrm{rel}(j)}{j} \prod_{m=1}^{j-1} \left[1 - \mathrm{rel}(m)\right]\right)}_{\text{present}} + \underbrace{\text{Future}}_{\text{future (ideal)}} \;\ge\; \theta \cdot \underbrace{\mathrm{IERR}_k}_{\text{ultimate ideal}} \qquad (11)$$

$$\text{with} \ldots \text{Future} = \sum_{i=j+1}^{k} \left[\frac{\check{\mathrm{Rel}}(i-j)}{i} \left(\prod_{m=1}^{j-1} \left[1 - \mathrm{rel}(m)\right]\right) \left[1 - \mathrm{rel}(j)\right] \prod_{n=j+1}^{i-1} \left[1 - \check{\mathrm{Rel}}(n-j)\right]\right]$$
The above can be rewritten in the following intermediate form:

$$\frac{\mathrm{rel}(j)}{j} + X \;\ge\; \frac{\theta \cdot \mathrm{IERR}_k - \sum_{i=1}^{j-1} \frac{\mathrm{rel}(i)}{i} \prod_{m=1}^{i-1} \left[1 - \mathrm{rel}(m)\right]}{\prod_{m=1}^{j-1} \left[1 - \mathrm{rel}(m)\right]} \qquad (12)$$

$$\text{with} \ldots X = \left[1 - \mathrm{rel}(j)\right] \sum_{i=j+1}^{k} \frac{\check{\mathrm{Rel}}(i-j)}{i} \prod_{n=j+1}^{i-1} \left[1 - \check{\mathrm{Rel}}(n-j)\right]$$
The completed rewrite, focusing on expressing $\mathrm{rel}(j)$ in terms of everything else, follows.

$$\mathrm{rel}(j) \;\ge\; \frac{\dfrac{\theta \cdot \mathrm{IERR}_k - \sum_{i=1}^{j-1} \frac{\mathrm{rel}(i)}{i} \prod_{m=1}^{i-1} \left[1 - \mathrm{rel}(m)\right]}{\prod_{m=1}^{j-1} \left[1 - \mathrm{rel}(m)\right]} \;-\; \sum_{i=j+1}^{k} \frac{\check{\mathrm{Rel}}(i-j)}{i} \prod_{n=j+1}^{i-1} \left[1 - \check{\mathrm{Rel}}(n-j)\right]}{\dfrac{1}{j} \;-\; \sum_{i=j+1}^{k} \frac{\check{\mathrm{Rel}}(i-j)}{i} \prod_{n=j+1}^{i-1} \left[1 - \check{\mathrm{Rel}}(n-j)\right]} \qquad (13)$$

$$\mathrel{=:} \texttt{min\_rel}^{\mathrm{ERR}}_j$$
Though the above formula involves a lot of summed products, previous work has provided
optimised O($k$) techniques for calculating these parts [15]. Furthermore, $k$ in our web search
use-case will generally be quite low (≪ 75), and $\texttt{min\_rel}^{*}_j$ need only be calculated once per item
returned to the user.
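A sketch of (13) with running products (our own illustration; relevances are treated as probabilities in [0, 1]) keeps the whole computation O($k$):

```cpp
#include <cstddef>
#include <vector>

// A-posteriori minimum relevance at rank j for the ERR threshold (13).
// `chosen` holds rel(1..j-1); `remaining` holds the check-marked Rel values
// (canonical minus already-chosen items); `ideal_err` is IERR_k.
double min_rel_err(const std::vector<double>& chosen,
                   const std::vector<double>& remaining,
                   std::size_t j, std::size_t k,
                   double theta, double ideal_err) {
    // Past: ERR mass already locked in by ranks 1..j-1.
    double past = 0.0, past_prod = 1.0;
    for (std::size_t i = 1; i < j; ++i) {
        past += past_prod * chosen[i - 1] / static_cast<double>(i);
        past_prod *= 1.0 - chosen[i - 1];
    }
    // S: the ideal suffix's contribution per unit of probability mass that
    // survives past rank j (the shared factor inside X of (12)).
    double s = 0.0, tail_prod = 1.0;
    for (std::size_t i = j + 1; i <= k; ++i) {
        s += tail_prod * remaining[i - j - 1] / static_cast<double>(i);
        tail_prod *= 1.0 - remaining[i - j - 1];
    }
    const double c = (theta * ideal_err - past) / past_prod;
    return (c - s) / (1.0 / static_cast<double>(j) - s);  // assumes 1/j > S
}
```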
Minimum relevance a-priori. Above, we presented means of reducing computer operations dur-
ing the linear scan of items ordered by relative compensation [$\frac{A_*}{\mathrm{rel}(*)}$]. Another optimisation
for reducing the cost of this linear scan is to pre-calculate a-priori minimum relevance (given
quality threshold $\theta$) at each rank $j$. Consider a scenario where almost all items in the re-ranking
$\tilde{\rho}$ are ideal (relative to $\rho$), just not the item at rank $j$. The minimum relevance necessary at $j$
in order to make (14) or (15) hold true is thus defined by substituting $\mathrm{Rel}(\cdot)$ for all instances of
$\check{\mathrm{Rel}}(\cdot)$ and $\mathrm{rel}(\cdot \ne j)$ in (9) and (13) respectively. Any items below this relevance can be filtered
out of a candidate list per-rank from the very start. This results in a shorter linear scan. We
denote this cutoff relevance as $\texttt{min\_rel\_priori}^{*}_j$, and present its derived expression below, first
for DCG and then for ERR.

$$\texttt{min\_rel\_priori}^{\mathrm{DCG}}_j = \log_2\left(1 + [\theta - 1] \left[\sum_{i=1}^{j-1} \frac{2^{\mathrm{Rel}(i)} - 1}{\log_{j+1}(i+1)}\right] + \theta \left[\sum_{i=j}^{k} \frac{2^{\mathrm{Rel}(i)} - 1}{\log_{j+1}(i+1)}\right] - \sum_{i=j}^{k-1} \frac{2^{\mathrm{Rel}(i)} - 1}{\log_{j+1}(i+2)}\right) \qquad (14)$$

$$\texttt{min\_rel\_priori}^{\mathrm{ERR}}_j = \frac{\dfrac{\theta \cdot \mathrm{IERR}_k - \sum_{i=1}^{j-1} \frac{\mathrm{Rel}(i)}{i} \prod_{m=1}^{i-1} \left[1 - \mathrm{Rel}(m)\right]}{\prod_{m=1}^{j-1} \left[1 - \mathrm{Rel}(m)\right]} \;-\; \sum_{i=j+1}^{k} \frac{\mathrm{Rel}(i-j)}{i} \prod_{n=j+1}^{i-1} \left[1 - \mathrm{Rel}(n-j)\right]}{\dfrac{1}{j} \;-\; \sum_{i=j+1}^{k} \frac{\mathrm{Rel}(i-j)}{i} \prod_{n=j+1}^{i-1} \left[1 - \mathrm{Rel}(n-j)\right]} \qquad (15)$$
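For instance, the DCG cutoff (14) might be sketched as follows (our own illustration; `canonical` holds Rel(1..k), and the base change $\log_{j+1}(x) = \log_2(x)/\log_2(j+1)$ is folded back into a single multiplication):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// A-priori relevance cutoff at rank j for DCG (14): everything except rank j
// is assumed ideal, so the cutoff depends only on Rel(i), theta, j, and k.
double min_rel_priori_dcg(const std::vector<double>& canonical,
                          std::size_t j, std::size_t k, double theta) {
    auto gain = [](double rel) { return std::exp2(rel) - 1.0; };
    double head = 0.0, rest = 0.0, shifted = 0.0;
    for (std::size_t i = 1; i < j; ++i)
        head += gain(canonical[i - 1]) / std::log2(i + 1.0);
    for (std::size_t i = j; i <= k; ++i)
        rest += gain(canonical[i - 1]) / std::log2(i + 1.0);
    for (std::size_t i = j; i + 1 <= k; ++i)   // remaining items shift down one
        shifted += gain(canonical[i - 1]) / std::log2(i + 2.0);
    const double deficit = (theta - 1.0) * head + theta * rest - shifted;
    return std::log2(1.0 + std::log2(j + 1.0) * deficit);
}
```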
Below, we outline at a high-level how our method works. Following this we discuss an edge-case.
$D$ streams into the re-ranking system ordered by relevance to the information need.
(A) Serve top $k$ items as the first ranking, $\tilde{\rho}_1 = \rho$, updating $A_*$ accordingly.
(B) Copy $(k-1)$ top items from $D$ into $I$ (left of red line, Figure 2).
(C) Calculate $\texttt{min\_rel\_priori}^{*}_j$ using $I$ and $\theta$ (14 & 15), $\forall j \in [1, k]$.
(D) Remove items $i$ from $D$ where $\mathrm{rel}(i) < \texttt{min\_rel\_priori}^{*}_k$ (14 & 15).
(E) Create candidate lists per-rank, $j \in [1, k]$, s.t. list item relevances $\ge \texttt{min\_rel\_priori}^{*}_j$.
(F) Sort candidate lists by relative compensation [$\frac{A_*}{\mathrm{rel}(*)}$], low → high.
(G) Initiate a new re-ranking, $\tilde{\rho}_*$.
(H) For $j$ between 1 and $k$:
(H)(1) Calculate min a-posteriori relevance at $j$ ($\texttt{min\_rel}^{*}_j$), using $I$, $\theta$ and $\tilde{\rho}_*$ (9 & 13).
(H)(2) Scan over candidates, until an item $i$ with $\mathrm{rel}(i) \ge \texttt{min\_rel}^{*}_j$ is found. Add it to $\tilde{\rho}_*$.
(H)(3) Mask said item in all candidate lists until the loop completes (no duplicate selections).
(I) Serve $\tilde{\rho}_*$ as the next ranking, update $A_*$, and return to Step (F).
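The following condensed C++ sketch (identifiers ours; it simplifies by taking the per-rank relevance floors as given, whereas Step (H)(1) recomputes them a-posteriori via (9) & (13), and it omits the edge-case handling discussed next) mirrors Steps (F)-(H):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Condensed sketch of Steps (F)-(H). `candidates[j]` lists feasible document
// ids for rank j+1 (Step (E)); `min_rel[j]` is the relevance floor for that
// rank, taken here as given rather than recomputed at each rank.
struct Doc { double attention = 0.0; double rel = 0.0; };

std::vector<int> build_reranking(const std::vector<Doc>& docs,
                                 std::vector<std::vector<int>> candidates,
                                 const std::vector<double>& min_rel,
                                 std::size_t k) {
    std::vector<int> rho;                          // the new re-ranking
    std::vector<bool> masked(docs.size(), false);  // Step (H)(3)
    for (std::size_t j = 0; j < k; ++j) {
        // (F) order candidates by relative compensation A/rel, low to high.
        std::sort(candidates[j].begin(), candidates[j].end(),
                  [&docs](int a, int b) {
                      return docs[a].attention / docs[a].rel
                           < docs[b].attention / docs[b].rel;
                  });
        // (H)(2) linear scan: take the least-compensated feasible item.
        for (int i : candidates[j]) {
            if (!masked[i] && docs[i].rel >= min_rel[j]) {
                rho.push_back(i);
                masked[i] = true;
                break;
            }
        }
    }
    return rho;  // (I) serve it, update attentions, and repeat from (F)
}
```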
An edge-case. One caveat excluded from the overview (for brevity) is that sometimes the item
selected at a given rank (coloured bars, right hand side, Figure 2) will be part of the optimistic
tail (dashed blocks, Figure 2). In many cases, this will not be an issue, given that the a-posteriori
past + present + future quality score (9 & 13) is likely to remain within threshold. However,
this does constitute double-counting, and there will be some edge-cases where this becomes an
issue. See Figure 3.
Figure 3: An edge-case during DEM selection; b is selected, and DEM counts it both in the present and future parts of (8): DCG⟨b, z, o, b⟩ = 0.8 with the double count, whereas the corrected DCG⟨b, z, o, v⟩ = 0.76 falls below the threshold 𝜃 · DCG⟨z, o, b, v⟩ = 0.78. Expected (ultimate) quality, on completion of the ranking, is therefore estimated inaccurately.
An item b is chosen at the first rank. In reality, only the first two items, z and o, would result
in a ranking that meets the quality threshold of 𝜃 ⋅ 𝐼𝐷𝐶𝐺4 . The presence of b in the optimistic
tail (dashed blocks) causes it to be double counted.
Rather than addressing this issue before it arises, we instead prioritise linear scan speed, and
initially let the error happen, only checking after a candidate item is found whether or not it is
part of the optimistic tail (dashed boxes in Figure 2), and furthermore, whether the corrected
quality of the ranking falls below the threshold.
To address this edge-case, we insert the below steps in the algorithm outline, displacing the
original Steps (H)(1)→(H)(3) to (H)(4)→(H)(6). The new Step (H)(3) might seem controversial,
given that it scans upwards through $I$, a much shorter list than $D$. However, it can be proven
that when the edge-case occurs, scanning anything outside this upper-$I$ part would be fruitless.
(H)(1) Check if previously selected document $i$ is in $I$ [skip to (H)(4) if not].
(H)(2) If so, mask it in $I$, re-calculate $\texttt{min\_rel}^{*}_{j-1}$; check $\theta$ was truly respected (8 & 11).
(H)(3) If it was not, unmask $i$ and scan up through $I$ until a document $i'$ fulfils the threshold.
Continue backwards (to the very beginning of $I$), finding $i''$ with the lowest $\frac{A_*}{\mathrm{rel}(*)}$.
Choose and subsequently mask $i''$ instead of $i$.
(H)(4) … business as usual …
Proof for sufficiency of upward scan [Step (H)(3)]. We further illustrate the edge-case double
count problem via Figure 4, where a document x is chosen, but this document happens to be in
the optimistic tail (dashed blocks). Given the left scenario in Figure 4 meets quality standards,
but the middle (accounting for the double count correctly) does not, we wish to know if this
indicates that the right side scenario will also not meet quality standards. In the right hand
side, rather than selecting x, another document beyond the end of the optimistic tail is chosen.
All items after x in the canonical ordering are assumed to have equal relevance, 𝑡 < rel(𝑗𝑥 ), and
the re-occurrence of x in the optimistic tail occurs at offset 𝑚 into that tail.
Figure 4: DEM edge-case optimisation (support diagram for proof). Left: ⟨A, x, H, x, T⟩ (with double x; 𝜃 respected); middle: ⟨A, x, H, T⟩ (without double x; 𝜃 broken); right: ⟨A, T, H, x, T⟩ (choosing from the RHS of x: is 𝜃 broken?).
The diagram illustrates the formulae below. (16) corresponds to the middle portion of Figure 4,
and (17) corresponds to the right portion. We use a symbol in (17) representing the unknown
operator, ⟦⊕⟧ ∈ {⟦<⟧, ⟦≥⟧}. We seek to prove that ⟦⊕⟧ = ⟦<⟧.
$$\underbrace{\sum_{i=1}^{j-1} \frac{2^{\mathrm{rel}(i)} - 1}{\log_2(i+1)}}_{\text{A}} + \underbrace{\frac{2^{\mathrm{rel}(j_x)} - 1}{\log_2(j+1)}}_{\text{x (middle)}} + \underbrace{\sum_{i=1}^{m-1} \frac{2^{\check{\mathrm{Rel}}(i)} - 1}{\log_2(j+i+1)}}_{\text{H (middle)}} + \underbrace{\sum_{i=m+1}^{k-j} \frac{2^{t} - 1}{\log_2(j+i)}}_{\text{T (middle)}} \;<\; Y \qquad (16)$$

$$\ldots \text{with } Y = \theta \cdot \mathrm{IDCG}_k$$

$$\underbrace{\sum_{i=1}^{j-1} \frac{2^{\mathrm{rel}(i)} - 1}{\log_2(i+1)}}_{\text{A}} + \underbrace{\frac{2^{t} - 1}{\log_2(j+1)}}_{\text{T (right; first)}} + \underbrace{\sum_{i=1}^{m-1} \frac{2^{\check{\mathrm{Rel}}(i)} - 1}{\log_2(j+i+1)}}_{\text{H (right)}} + \underbrace{\frac{2^{\mathrm{rel}(j_x)} - 1}{\log_2(j+m+1)}}_{\text{x (right)}} + \underbrace{\sum_{i=m+2}^{k-j} \frac{2^{t} - 1}{\log_2(j+i)}}_{\text{T (right; second)}} \;\oplus\; Y \qquad (17)$$
Shifting some common parts from (16 & 17) [i.e. A, H] to the RHS, ‘𝑌 − ShiftedParts’ can be
replaced with another constant, 𝐶, and we now have a simpler problem. If we can prove that
⟦⊕⟧ = ⟦<⟧ in (19), then we will have proven the same in (17).
$$\underbrace{\frac{2^{\mathrm{rel}(j_x)} - 1}{\log_2(j+1)}}_{\text{x (middle)}} + \underbrace{\frac{2^{t} - 1}{\log_2(j+m+1)} + \sum_{i=m+2}^{k-j} \frac{2^{t} - 1}{\log_2(j+i)}}_{\text{T (middle)}} \;<\; C \qquad (18)$$

$$\underbrace{\frac{2^{t} - 1}{\log_2(j+1)}}_{\text{T (right; first)}} + \underbrace{\frac{2^{\mathrm{rel}(j_x)} - 1}{\log_2(j+m+1)}}_{\text{x (right)}} + \underbrace{\sum_{i=m+2}^{k-j} \frac{2^{t} - 1}{\log_2(j+i)}}_{\text{T (right; second)}} \;\oplus\; C \qquad (19)$$
Given that the LHS of (19) is clearly smaller than the LHS of (18) (swapping x and the first T item
pairs the larger gain, $2^{\mathrm{rel}(j_x)} - 1$, with the heavier discount), by transitivity ⟦⊕⟧ = ⟦<⟧.
Ultimately, this is why we only need to scan upwards in Step (H)(3). With descending (non-uniform)
T (Figure 4), this observation would only be strengthened.
3.2. Implementation and Usage
In our implementation, we use a candidate list per rank (Figure 5), pre-populated with feasible
items. This reduces the length of the linear scans, and creates a strong basis for future hybrid
greedy/shallow graph-search approaches.

Figure 5: Candidate lists per-rank, as determined by the 𝜃 threshold (query “Dublin + restaurant”; rank #1 has 3 candidates, ranks #2 and #3 have 7, rank #4 has 10, and rank #5 has 12). Two item selections later, each chosen document’s relative compensation changes and it moves down the candidate list for all ranks. Candidate list length ascends with rank, and relative order is preserved in lesser-ranked candidate lists (no overlapping dashed lines).

The methods and optimisations described above can
be implemented in any programming language, but we provide a sample GPLv3-licensed imple-
mentation written in standard C++ (https://librecoir.com/). This choice focuses on the open web,
given the ubiquity of C++ compilers for all modern operating systems, and the performance
benefits associated with the low-level programming paradigm. The implementation offers a
built-in command-line interface (CLI), with further integrations to follow. The underlying
re-ranker runs as a daemon, reachable by different processes. With the optimisations described
in Section 3, particularly the DEM selection strategy, individually fair re-rankings are placed
within reach for any organisation limited to commodity hardware. Usage instructions can be
found on the project website.
4. Evaluation
RQ1 Can we address quality-constrained individual fairness via greedy item-by-item selection?
RQ1a Is the greedy method comparable to best-case scenarios (ILP, Graph search, BvN
decomp.) in terms of optimal fairness per-ranking?
RQ1b Is the greedy method fast enough for general-purpose usage?
RQ1c Is the installation process easy enough to support a non-centralised ecosystem?
The first research question (RQ1) is more so a mathematical one, with an answer sketched out
in Section 3, as well as in the subsequent implementation (Section 3.2). However, without
analysing how closely the fairness results of this approach align with the best case (RQ1a), it
would be unclear under what scenarios each method is better.
RQ1a Optimality We were unable to obtain funding for Gurobi’s cloud-based solution. How-
ever, we were able to ascertain some optimally fair re-rankings using exhaustive graph search in
SWI-Prolog. We then used these optimal solutions as a reference point to estimate fairness loss
when using our greedy method. The Prolog solution space search relies on an observation from
Figure 5: given a high enough quality threshold, low ranks will have small numbers of feasible
candidates, and higher ranks will have higher numbers of feasible candidates. This a-priori
observation considerably reduces the search space of feasible rankings. With a small enough
corpus size (≈ 30), and proper re-calculation of feasible items (DEM strategy, Section 3.1), the
search space becomes small enough to iterate over each valid ranking. It is thereby possible to
determine how many excess rankings our method requires, as compared to a re-ranker that finds
the optimally fair re-ranking at each query instance. We investigate this for various 𝜃 values,
with the initial and trivial observation that when 𝜃 = 0, the greedy method produces optimal
rankings. Likewise, when 𝜃 = 1, greedy re-rankings are optimal.
Figure 6: Three simulated queries (relevance distributions over documents), for comparing the performance of greedy fairness against optimal fairness.
Using an aggregated metric for unfairness (20), we conducted experiments with the three
query types from Figure 6. This metric counts instances of negative discrimination (under-
representation as opposed to over-representation). The attention afforded to each document is
normalised by dividing by the sum of all attention. This value is then compared to the
item’s relative utility, $\mathrm{rel}(i)/\sum_j \mathrm{rel}(j)$.
$$\text{unfairness} = \sum_{i \in [1,n]} \max\left[0, \; \frac{\mathrm{rel}(i)}{\sum_j \mathrm{rel}(j)} - \frac{A_i}{\sum_j A_j}\right] \qquad (20)$$
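For reference, metric (20) is straightforward to compute (a sketch in the same vein as the earlier ones; identifiers ours):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Aggregated unfairness (20): each document's share of total utility minus
// its share of total attention, summed only where the document is
// under-represented (negative discrimination).
double aggregated_unfairness(const std::vector<double>& relevance,  // rel(i)
                             const std::vector<double>& attention)  // cumulative A_i
{
    double rel_sum = 0.0, attn_sum = 0.0;
    for (double r : relevance) rel_sum += r;
    for (double a : attention) attn_sum += a;
    double total = 0.0;
    for (std::size_t i = 0; i < relevance.size(); ++i)
        total += std::max(0.0, relevance[i] / rel_sum - attention[i] / attn_sum);
    return total;
}
```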
The results from this experiment indicated that fairness is much more attainable with some
query types versus others. Query 2 in particular (the green peak in Figure 6) was able to
approach fairness relatively quickly, even when 𝜃 was set high (𝜃 ≈ 0.88). However, with
excessively high 𝜃, results are inevitably unfair. For example, when a quality threshold of
𝜃 = 0.98 is used, barely any change in ordering is allowed near the top of the result page, and
this results in a slowly rising unfairness metric. On the other hand, when quality threshold
is low enough (𝜃 ≈ 0.78), fairness is achieved in Query 2 very early on. Even with more
difficult queries (Queries 1 and 3), a low enough 𝜃 threshold (𝜃 < 0.5) while using the greedy
method leads to reasonable individual fairness. Though the comparison with optimal rankings
was restricted to small levels of duplication, and thereby not statistically significant, there is
inevitably a trade-off between algorithm efficiency and optimality. Exploring it is future research.
RQ1c Ease-of-use Evaluating a system’s ease-of-use is a qualitative and highly subjective task,
especially in the case of WWW software; a software suite or package built for the web could be
easy to install one month and rendered entirely unusable the next due to dependency
versioning or backwards compatibility issues. In the case of ML systems, the problem is made
worse by widespread practices of publishing pre-trained models without background training
sets. At times this is done for copyright reasons, other times because the dataset is too large. We
aimed to bypass most compatibility issues by writing the CO*IR program in C++, purposely
avoiding external dependencies. Additionally, we bypass issues around datasets by producing a
fairness method that doesn’t rely on data. Future work will amend CO*IR to allow for gradually
acquired training data for tuning 𝜃 automatically per-query. Evaluation of the system’s ease-of-
use is ongoing, and we welcome feedback from all stakeholders, especially those in positions to
implement fairness in product settings (software engineers, system administrators, etc.).
Figure 7: Results from experiments conducted with the distinct simulated queries from Figure 6: unfairness (20) over up to 200,000 query repetitions, for quality thresholds 𝜃 ∈ {0.28, 0.48, 0.78, 0.88, 0.98}.
Table 2
Computing time taken for bulk re-ranking jobs on a large corpus (10,000 items).
Load | Total computing time (i5 core)
500 re-rankings | 0.221 seconds
2000 re-rankings | 0.730 seconds
8000 re-rankings | 2.925 seconds
RQ1b Efficiency One of the primary purposes of this work was the attempted introduction
of individual fairness to web-scale search on commodity hardware. To evaluate this goal, the
CO*IR system was applied to the task of re-ranking a query with 10,000 competing documents,
and a gradually declining relevancy slope among documents (like Query 1 in Figure 6). CO*IR
was able to produce 8,000 re-rankings in under 3 seconds on an i5 processor (2.30GHz). Results
from this experiment are presented in Table 2.
5. Conclusion and Future Work
In this work we considered the problem of barriers to entry in fair re-ranking, particularly
individually fair re-ranking. We designed and analysed a fair re-ranker for commodity hardware.
Optimisations included: tackling the re-ranking problem in a greedy fashion, pre-eliminating
infeasible (irrelevant) items, and exploiting edge-cases to allow for early exit from a linear scan.
With these optimisations, applying individual fairness on a much higher quantity of results (as
compared to prior work; ≫ 100) is now within reach for smaller (as well as larger) organisations.
Future work might aim to make the package more usable in web search scenarios (the
preliminary version targets standalone rankings). Distributing the program’s execution across a
network of collaborating servers would also be advantageous for the open web. Integrating
group fairness as an additional feature, alongside individual fairness, would also be necessary
for a more holistically ethical ranking tool, as individual fairness would be inappropriate for
the purposes of positive action intervention in rankings [20]. For group fairness, however, there
already exists FairSearch [6], which we recommend. Optional multi-threading or GPU support
would also increase our tool’s efficiency in many scenarios. We didn’t target this initially;
version one was intended first and foremost to bring individually fair ranking into the range of
possibility for small organisations partaking in an open, non-centralised, web.
Finally, we aim to introduce a modified version of the CO*IR algorithm, which uses the
same heuristics as above (8 and 13), applying them in A* graph search [21]. With a pessimistic
heuristic, A* search produces admissible solutions [22], and would therefore be equivalent to the
ILP method used by Biega et al. [2]. Varying between pessimistic and optimistic heuristics, we
hypothesise it is possible to have a fairness solution that uses system resources dynamically, i.e.
optimality when the resources are there, sub-optimality otherwise, and many different grades of
optimality in-between the two extremes.
Acknowledgements
This paper was in part funded by the Science Foundation Ireland. I must thank Maaike Visser
for her ongoing support, sharing of ideas and guidance. Her Master’s thesis [23] provided me
with an introduction to fairness evaluation metrics, which will help inform future work.
References
[1] C. Dwork, M. Hardt, T. Pitassi, O. Reingold, R. Zemel, Fairness through awareness, in:
Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, ITCS
’12, Association for Computing Machinery, 2012, pp. 214–226.
[2] A. J. Biega, K. P. Gummadi, G. Weikum, Equity of attention: Amortizing individual
fairness in rankings, in: The 41st International ACM SIGIR Conference on Research &
Development in Information Retrieval, 2018, pp. 405–414.
[3] J. Biega, Enhancing privacy and fairness in search systems, Ph.D. thesis, 2018.
[4] T. Berners-Lee, Information management: A proposal,
https://web.archive.org/web/20100401051011/https:
//www.w3.org/History/1989/proposal.html, 1989. Accessed: March 24, 2024.
[5] C. Santesteban, S. Longpre, How big data confers market power to big tech: Leveraging
the perspective of data science, The Antitrust Bulletin 65 (2020) 459–485. URL:
https://doi.org/10.1177/0003603X20934212. doi:10.1177/0003603X20934212.
[6] M. Zehlike, T. Sühr, C. Castillo, I. Kitanovski, Fairsearch: A tool for fairness in ranked
search results, in: Companion Proceedings of the Web Conference 2020, WWW ’20,
Association for Computing Machinery, New York, NY, USA, 2020, p. 172–175. URL:
https://doi.org/10.1145/3366424.3383534. doi:10.1145/3366424.3383534.
[7] M. Morik, A. Singh, J. Hong, T. Joachims, Controlling fairness and bias in dynamic
learning-to-rank, in: Proceedings of the 43rd International ACM SIGIR Conference on
Research and Development in Information Retrieval, SIGIR ’20, Association for
Computing Machinery, New York, NY, USA, 2020, p. 429–438. URL:
https://doi.org/10.1145/3397271.3401100. doi:10.1145/3397271.3401100.
[8] M. Zehlike, F. Bonchi, C. Castillo, S. Hajian, M. Megahed, R. Baeza-Yates, FA*IR: A
fair top-k ranking algorithm, in: Proceedings of the 2017 ACM on Conference on
Information and Knowledge Management, CIKM ’17, Association for Computing
Machinery, 2017, pp. 1569–1578.
[9] A. Singh, T. Joachims, Fairness of exposure in rankings, in: Proceedings of the 24th
ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD
’18, Association for Computing Machinery, New York, NY, USA, 2018, pp. 2219–2228.
[10] P. E. Black, Greedy algorithm, https://www.nist.gov/dads/HTML/greedyalgo.html,
2005. Accessed: 2024-02-12.
[11] B. W. Bequette, Process Control: Modeling, Design and Simulation, Prentice Hall,
Rensselaer Polytechnic Institute, 2003.
[12] M. Zehlike, C. Castillo, Reducing disparate exposure in ranking: A learning to rank
approach, in: Proceedings of The Web Conference 2020, WWW ’20, Association for
Computing Machinery, New York, NY, USA, 2020, p. 2849–2855. URL:
https://doi.org/10.1145/3366424.3380048. doi:10.1145/3366424.3380048.
[13] M. Zehlike, Fairness in Rankings, Ph.D. thesis, 2022.
[14] A. Al-Maskari, M. Sanderson, A review of factors influencing user satisfaction in
information retrieval, Journal of the American Society for Information Science and
Technology 61 (2010) 859–868. URL:
https://onlinelibrary.wiley.com/doi/abs/10.1002/asi.21300. doi:10.1002/asi.21300.
[15] O. Chapelle, D. Metlzer, Y. Zhang, P. Grinspan, Expected reciprocal rank for graded
relevance, in: Proceedings of the 18th ACM Conference on Information and Knowledge
Management, CIKM ’09, Association for Computing Machinery, New York, NY, USA,
2009, p. 621–630. URL: https://doi.org/10.1145/1645953.1646033.
doi:10.1145/1645953.1646033.
[16] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, G. Hullender,
Learning to rank using gradient descent, in: Proceedings of the 22nd International
Conference on Machine Learning, ICML ’05, Association for Computing Machinery, New
York, NY, USA, 2005, p. 89–96. URL: https://doi.org/10.1145/1102351.1102363.
doi:10.1145/1102351.1102363.
[17] N. Craswell, O. Zoeter, M. Taylor, B. Ramsey, An experimental comparison of click
position-bias models, in: Proceedings of the 2008 International Conference on Web
Search and Data Mining, WSDM ’08, Association for Computing Machinery, New York,
NY, USA, 2008, p. 87–94. URL: https://doi.org/10.1145/1341531.1341545.
doi:10.1145/1341531.1341545.
[18] A. Agarwal, I. Zaitsev, X. Wang, C. Li, M. Najork, T. Joachims, Estimating position bias
without intrusive interventions, in: Proceedings of the Twelfth ACM International
Conference on Web Search and Data Mining, WSDM ’19, Association for Computing
Machinery, New York, NY, USA, 2019, p. 474–482. URL:
https://doi.org/10.1145/3289600.3291017. doi:10.1145/3289600.3291017.
[19] X. Wang, N. Golbandi, M. Bendersky, D. Metzler, M. Najork, Position bias estimation
for unbiased learning to rank in personal search, in: Proceedings of the Eleventh ACM
International Conference on Web Search and Data Mining, WSDM ’18, Association for
Computing Machinery, New York, NY, USA, 2018, p. 610–618. URL:
https://doi.org/10.1145/3159652.3159732. doi:10.1145/3159652.3159732.
[20] M. Zehlike, A. Loosley, H. Jonsson, E. Wiedemann, P. Hacker, Beyond incompatibility:
Trade-offs between mutually exclusive fairness criteria in machine learning and law, 2022.
arXiv:2212.00469.
[21] P. E. Hart, N. J. Nilsson, B. Raphael, A formal basis for the heuristic determination of
minimum cost paths, IEEE Transactions on Systems Science and Cybernetics 4 (1968)
100–107. doi:10.1109/TSSC.1968.300136.
[22] S. J. Russell, P. Norvig, Artificial Intelligence: A Modern Approach, 4th ed., Pearson,
Boston, 2018.
[23] M. Visser, Investigating Fair Rankers Under the Expected Exposure Framework, Master’s
thesis, 2022.