<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>European
Journal of Investigation in Health</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1145/1148170.1148181</article-id>
      <title-group>
        <article-title>On the Biased Assessment of Expert Finding Systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jens-Joris Decor t</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jeroen Van Haut</string-name>
          <email>roen@techwolf.ai</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>e Chris Develde</string-name>
          <email>ris.develder@ugent</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>rand Thomas Demeester</string-name>
          <email>thomas.demeester@ugent.b</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Information Retrieval, Expert Retrieval, Knowledge Management</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Ghent University - imec</institution>
          ,
          <addr-line>9052 Gent</addr-line>
          ,
          <country country="BE">Belgium</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>TechWolf</institution>
          ,
          <addr-line>9000 Gent</addr-line>
          ,
          <country country="BE">Belgium</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Workshop Proce dings</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>15</volume>
      <fpage>14</fpage>
      <lpage>18</lpage>
      <abstract>
        <p>In large organisations, identifying experts on a given topic is crucial in leveraging the internal knowledge spread across teams and departments. So-called enterprise expert retrieval systems automatically discover and structure employees' expertise based on the vast amount of heterogeneous data available about them and the work they perform. Evaluating these systems requires comprehensive ground truth expert annotations, which are hard to obtain. Therefore, the annotation process typically relies on automated recommendations of knowledge areas to validate. This case study provides an analysis of how these recommendations can impact the evaluation of expert finding systems. We demonstrate on a popular benchmark that system-validated annotations lead to overestimated performance of traditional term-based retrieval models and even invalidate comparisons with more recent neural methods. We also augment knowledge areas with synonyms to uncover a strong bias towards literal mentions of their constituent words. Finally, we propose constraints to the annotation process to prevent these biased evaluations, and show that this still allows annotation suggestions of high utility. These findings should inform benchmark creation or selection for expert finding, to guarantee meaningful comparison of methods.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>expert profiling and expert finding are strongly related,
there is a trend of using expert profiling benchmarks for
As organisations grow in size, efectively leveraging tinh-e task of expert finding as we7ll,8[, 9]. If a complete
ternal expertise becomes harder, as it is harder to loacnadtaeccurate gold standard of expertise profiles is
availexperts on specific matters. This need for eficiently locata-ble, it can be inverted to identify relevant experts for
ing expertise in large organisations has been recognaizesdpecific topic. However, achieving such
comprehensive and accurate profiles is often unrealistic. The sheer
pert lists, limited to a mere seven topics ove6r].aAlls[a
TIC Expert that automatically identify experts basednuomnber of topics typically precludes exhaustive
considinformation in an organisation’s intr1a]n.eTth[is task, eration during annotation. Secondly, self-selected topics
known as expert finding, is a specialized form of inforre-ly on the expert’s recollection and understanding of the
mation retrieval (IR) where the focus is on identifsyyisntgem’s taxonomy, often resulting in sparse profiles
subindividuals with relevant expertise rather than on rjeetcrtietvo- cognitive biases, such as recency bias. To address
ing documents. Another related task, expert profilitnhgi,s, the annotation process often includes an automated
focuses on retrieving all areas of expertise for a gsyivsetnem recommending additional, likely topics for each
individual2[]. The evaluation of these tasks, groupeexdpert [8, 9]. These system-validated topics yield more
underexpertise retrieval [3], requires developing a com-comprehensive and varied expert profile annotations.
prehensive gold standard of expertise annotations. NAote that these personalized recommendations can
setup where a list of experts is annotated for a givengtroepaictly influence which topics each expert might
conproves dificult, resulting in either just one or two expesritdser during annotation. We argue that this can, under
linked per topic4[,5], or in the case of more extensive exc-ertain circumstances, preclude meaningful comparisons
of annotations across experts, and therefore the use of
result, annotation eforts have shifted towards asking tthheese benchmarks for expert finding systems.
Specifiexperts themselves to list their areas of expe7r]t.iAses [ cally, this work addresses these research questions:
nEvelop-O
LGOBE
RecSys in HR’24: The 4th Workshop on Recommender Systems for
Human Resources, in conjunction with the 18th ACM Conference on
CEUR
htp:/ceur-ws.org
ISN1613-073</p>
      <p>Attribution 4.0 International (CC BY 4.0).</p>
      <p>CEUR</p>
      <p>Workshop ProceedingsC(EUR-WS.org)
• RQ1: Do system-validated and self-selected
annotations exhibit significantly diferent characteristics that
could bias the evaluation of expert finding systems?
• RQ2: How do system-validated annotations impact
perceived performance of term-based versus neural
retrieval systems in expert finding tasks?
• RQ3: Can we establish constraints for a new
annota</p>
      <p>systems remains representative and unbiased?
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons Licensteion setup that ensures the evaluation of expert finding</p>
    </sec>
    <sec id="sec-3">
      <title>2. Related Work</title>
      <p>We address these questions based on an analysisaonfexpert search task consisting of 50 knowledge areas,
anthe popular TU Expert Collecti8o]n,w[ hich makes avail- notated with experts from a total of 1,092 cand4id].ates [
able multiple sets of ground truth that nicely facIinli2t0a0t7e,the CERC dataset was introduc5e].dS[imilar to
our analysis. The nature of self-selected and systheme-W3C dataset, it included 50 topics, which were
devalidated expertise profiles is compared, specifically exv-eloped by nine science communicators at CSIRO. These
amining the properties of system-validated annotatciomnsmunicators were then tasked with identifying one
and their impact on the validity of the expert finding toarsktwo CSIRO staf members as experts on each topic,
in section3. We implement both traditional term-basceodntributing to a robust dataset for expert finding
reretrieval systems and more recent neural IR methodsseianrch. Another notable dataset is derived from DBLP
section4. Additionally, the section covers a procedbuiroegraphical data, augmented with abstracts from Google
to augment all test queries in the TU Expert CollecStcihoonlar6[]. This dataset contains 953,774 papers in total
with synonyms, allowing to further analyze the efectanodf 574,369 valid authors, with 2,498 topics sourced from
any term-based biases in the annotations. We also parroe-search events website. However, only seven topics
pose constraints for system-validated annotationhsaavnedbeen annotated with expert lists, each containing
demonstrate the potential of a new annotation subgegtewse-en 20 and 45 experts6][. The dataset development
tion system in this section. Finally, sec5tdioisncusses all process, both for TREC and the DBLP-based dataset,
highresults and provides answers to the research questiloinghs.ts the dificulty of identifying experts on a certain
topic within large organisations, as annotators often lack
detailed knowledge of the topics themselves.</p>
    </sec>
    <sec id="sec-4">
      <title>Expert finding systems The development of expertExpert profiling benchmarks A more distributed</title>
      <p>ifnding systems has a rich history, starting with the inatprpor-oach to gathering annotations involves asking
duction oPf@NOPTIC Expert, an early systems designedemployees to fill in their own expertise profiles, as seen
to automatically identify experts based on textuailndtohc-e UvT dataset7].[ This new annotation scheme
became more prominent shortly after the introduction
uments available within an organisation’s int1r]a.net [
To this day, expert finding remains an important anodf the task of experptrofiling (rather thafinnding ),
challenging topic for large organisations across difewrehnetre the goal is to retrieve all areas of expertise for
niches like the medical domai1n0][. Most expert finding a given individual2][. The UvT dataset was the first
methods are formalized as one of two prominent modlealsr:ge-scale benchmark for expertise profiling relying
the candidate model and the document model, referreodntoself-reported expertise. However, this approach
as the query-independent and query-dependent modeolftesn, results in sparse profiles due to the dificulty of
or Model 1 and Model 2, respective1l1y].[ The candi- recalling all areas of expertise. To address this sparsity,
date model regards each candidate expert as the sesetmoif-automated annotation procedures have been
their linked documents, to directly retrieve relevapnrtopeox-sed. Berendsen et a8l].[extended the self-selected
profiles from the UvT dataset by presenting up to
perts given a query. The document model operates on
two steps, first determining the relevance of individ1u0a0l high-probability topics for further annotation,
documents towards a query, and afterwards aggregaret--releasing the new annotation sets under the name
ing this document ranking into a candidate rankingo.fOtfhe TU Expert Collection. Similarly, Mangaravite et
al. [9] employed a content-based tag recommendation
these, the document model (Model 2) has generally been
shown to be more efective1[1]. Subsequent researchsystem to suggest annotations for approval. These
has largely built upon these models, with some studeniersiched annotations, while originally intended for
exploring expert finding as a voting problem, utilizinexgpert profiling, have been increasingly used to evaluate
data fusion techniques in a metasearch framew12o]r.k e[xpert finding tasks as well15[, 16].</p>
      <p>Other works have an extended input data scope to the
expert finding system, such as by incorporating prior topTiche use of personalized annotation suggestions raises
distributions13[] or by leveraging document structucroencerns about the validity of these annotations for
evalto enhance retrieval performa1n4c]e. O[ur work relies uating expert finding systems, which this study aims to
on the presence of a textual corpus linked to employaededsress. Specifically, because these suggestions are
diferwithout further properties like document structureen.t for each employee, the impact of this mechanism may
introduce properties in the annotations that forego their
Expert finding benchmarks The first large-scale comparability across employees. Our work is closely
related to that of Berendsen et8]a,wl. h[ o conducted an
benchmarks for expert finding were developed as part of</p>
      <p>extensive study on the impact of diferent expert profile
the TREC Enterprise track, which ran from 2005 to 20a0n8n. otation schemes on the evaluation of expert profiling
It introduced the W3C expert corpus in 2005, alongside</p>
      <p>tasks. However, our focus diverges in that we specifically
investigate the impact of these benchmarks on exp3e.r2t. Distribution of system-validated
ifnding, uncovering significant challenges in their usabil- annotations
ity under system-validated setups.</p>
      <sec id="sec-4-1">
        <title>As reported in8[], the average self-selected profile in GT1</title>
        <p>contains 6.4 knowledge areas. The average size of the
ex3. Analysis of Annotation Schemes tended profiles in GT5 expands to 8.6 areas. Notably, we
ifnd that the percentage of employees with three or fewer
We analyze the diferences of self-selected versus systeemxp-ertise areas decreases from 19.7% in GT1 to 10.5% in
validated expertise annotations, and how they may iGnfluT-5. Additionally, system-generated profiles capture
ence the perceived performance of expert finding system8s1% of the final knowledge areas compared to 65% for
when used as a benchmark. Secti3o.n1introduces the TUself-selected profiles, and unique topics in the
annotaExpert Collection, which is the dataset used in thistainoanls-grows from 937 in GT1 to 1,266 in GT5. This shows
ysis, as introduced in8][. We perform initial analysis otfhe sparsity of self-selected expertise profiles, and how
the annotation suggestions in sec3t.2io,nshowing their the situation improves through personalized system
recutility for expanding profile annotations, but also tohmemirendations for annotation. However, because these
high false negative rate. Finally, sec3.t3ioanalyzes the recommendations are personal to each expert, the
unmechanism behind these false negatives, exposing a lardgeerlying recommendation method may compromise the
positive bias towards literal mentions of the knowlceodmgpearability of the annotated knowledge areas across
topics’ constituent words in the corpus. experts. To explore this, we focused on niche topics,
present in more than one but no more than three
self3.1. TU Expert Collection selected profiles in GT2, identifying 290 such topics.
Examples aresub-saharan africa, policy evaluation,
The TU expert collection is an expertise retrieval
bench</p>
        <p>cognitive linguistics, nonprofit organisations
mark focused on a knowledge-intensive organisataionnd,extreme value theory. By contrasting GT2 with
namely the Tilburg Universit8y]. [It is an updated ver-GT3, we know whether a self-selected topic was also part
sion of the earlier UvT data7s]e.tT[he dataset containsof the system annotation suggestions, allowing us to
estia variety of documents, being academic publications,msaut-e the recall of this system. We find that only 125 out of
pervised student dissertations, course descriptions,tahned290 niche topics – around 43% – were recommended
research summaries. These documents are primarily finor annotation to all experts who had self-selected it.
Dutch and English, and are explicitly linked to expIetritssthis low recall that can compromise the
comparain the university’s Webwijs system, indexes over 2,00b0ility of annotations across experts: if it is caused by
unique knowledge areas and 761 employees. The TUa certain weakness of the annotation recommendation
dataset provides several ground truth (GT) sets of grsaydsteedm, expert finding systems with a similar weakness
expert profile annotations, labeled GT1 through GT5.</p>
        <p>will produce the same recall patterns and therefore may
These annotations are the result of experts indicaaptpineagr stronger than they are.
their expertise areas on a scale of 1 (lowest) to 5 (highest).</p>
        <p>Note that in this work, we consider all annotations as
binary and ignore the attached grades, due to inco3n.s3i.s-Term-bias in system-validated
tencies in how diferent annotators may interpret and annotations
apply these grades, compromising the comparabilityToofconstruct GT5, up to 100 knowledge areas were
suggrades across expert8s][. We leave the analysis of gradegdested for further annotation to each expert, produced by
relevance annotations for future work. GT1 containasntehnesemble of eight expert profiling system8s].[These
self-selected knowledge areas of 761 TU employees. To</p>
        <p>systems vary in retrieval models (Model 1 or Model 2),
further expand these annotations, a system was detvheelq-uery language (English or Dutch), and whether they
oped to recommend up to 100 highly likely knowledcgoensider relationships between topics in the Webwijs
sysareas to each expert, which they could easily validatteemo.rAll these systems have in common that they model
discard through a user interface. The extended expertthieseprobability of a topic for a document or expert based
profiles of all 239 participating employees is bundled oans the literal textual occurrences of their constituent
GT5, accompanied by GT4 which simply binarizes thewords. This approach is prone to false negatives due to
annotations by dropping the graded relevance sc oitrsesin.ability to account for synonyms and other semantic
For ease of use, GT2 is provided as the subset of GT1</p>
        <p>nuances, leading to a low recall and a strong bias towards
for those 239 employees. Finally, GT3 is an alterationlitoefral mentions of the topic’s constituent words.
self-selected profiles in GT2, reduced to only those topW-e aim to quantify the presence of this bias towards
ics that were also present in the personalized suggelsitteedral mentions in system-validated topics or their
conannotations. stituent words. To this end, we construct a corpus with
one long document per expert, being the concatenataiornesult, there are likely false negatives in the
systemof all original documents linked to the expert. Wevtahleindated annotations that systematically favor systems
calculate tf-idf scores of queries with respect towtihtehsesimilar term frequency-based biases. This suggests
concatenated expert documents to express the degretehtaot evaluations of expert finding systems using these
anwhich their constituent words are being literallynmoteant-ions may overestimate the performance of systems
tioned. Whenever both an English and Dutch name tahraet share these biases.
available for a topic, we consider the largest of the tf-idf
scores of both versions. By contrasting GT2 and GT3, we
are able to determine for each topic in the self-sel4ec.teAd ssessment of Expert Finding
profiles whether it was also included in the suggested
annotation list for the corresponding expert. 1Fig4u.r1e. Expert Finding Models
shows the distribution of tf-idf scores for both gWroeuipmsplement two diferent expert finding systems and
of self-selected topics throuh a boxplot, with commeovnallyuate them against diferent ground truths. Both
sysused whiskers drawn to the farthest datapoint withtienm1s.5operate under the query-dependent Model 2, first
times the interquartile range from the nearest hindgeet. eIrtmining the relevance of each document in the
corshows a large shift in scores for the subset of self-selecptuesdto the query and then aggregating document ranks
topics that are also included in the annotation reco minmteona- final ranking of experts. Two diferent information
dations versus those that are not included. retrieval models are implemented to rank the documents
against a query. We implement a popular term-based
retrieval model, as well as a more modern neural IR
technique, aiming to maximally expose potential term-based
biases in the annotations.</p>
        <p>Given this observation, we expect this bias to beLeaxt-e-interaction neural retrieval: We opt for the</p>
        <p>ColBERT retrieval techniq1u8e],[due to its unique
comtended into the additional topics that are added to the
profiles through system-validation. Figu2rsehows the bination of eficiency and strong IR capabilities. To
fadistribution of tf-idf scores of the topics in the coirliitgai-te the multilingual nature of the corpus, we use the
nal self-selected expertise profiles, compared to the tmoupl-tilingual ColBERT-XM mode1l9][. The corpus is split
ics added after system-validation. As expected, a cleainrto chunks of up to 256 tokens, with an overlap of 64
topositive bias in tf-idf scores is observed for the systkeemns-. We make use of thReAGatouille1 implementation
validated topics. These findings validate our concertnoschunk and index the whole corpus, which took close to
about system-driven annotation biases. eight hours on one NVIDIA P100 GPU. As with the BM25
model, we simply concatenate the English and Dutch
translations of the topic when present. The documents
are ranked according to the average retrieval score of all
their constituent chunks.</p>
        <p>Term-based retrieval: we use BM25 [17] to rank all
documents against the query topic because it is a
commonly used term-based retrieval method that accounts
for document lengths. In order to support the
multilinguality of the dataset, we simply concatenate the
English and Dutch translations of the topic when both are
present.</p>
      </sec>
      <sec id="sec-4-2">
        <title>In conclusion, our analysis reveals a strong term frequency bias in the annotation recommendations. 1Ahtstps://github.com/AnswerDotAI/RAGatouille</title>
      </sec>
      <sec id="sec-4-3">
        <title>Following1[6], for both IR methods, we use the same</title>
        <p>rr function to aggregate the document ranking into an
expert ranking, as defined by:</p>
        <p>| , |
rr() = ∑ 1
=1 rank(  )</p>
        <p>In this equation,, is the subset of documents linke4d.3. Alternative Annotation Suggestions
to candida tethat are retrieved for q ue.rTyhe rank of
the retrieved documents is indicaterdabnyk(  ), starting We analyze an alternative approach to suggest additional
at1 for the highest ranked document. knowledge areas for annotation. Specifically, we
introduce the constraint that the suggestion mechanism has
no access to the document corpus, and may only
recom4.2. Query Augmentation mend additional knowledge areas based on the available
With the goal of further studying the impact of steelrfm-s-elected topics. This setup prevents any systematic
based biases in the annotations, we extend the Web wbiiajss stemming from the annotation procedure with
reknowledge areas with synonyms. Qualitative expert finsdpe-ct to the textual corpus. The downside of this setup is
ing system should surface relevant experts on a totphiact, the rich information in the corpus cannot be utilised,
even if provided by a synonym of the query instead of tahned it instead relies on at least a small number of
manuoriginal query. Because of this, query synonyms proviadlely annotated expertise areas per expert. We argue that
an opportunity to study term-based annotation bieaxspeesr.ts should be suficiently engaged in the annotation
We manually annota1t0e9 randomly selected knowledgeprocess such that this is a reasonable requirement.
Adareas with both English and Dutch synonyms. To facdiiltii-onally, there is an opportunity for these annotation
tate the annotation process, the topic up for annotsautgigoenstions to dynamically adapt throughout the
validais contextualized by providing the annotator with atliolnitpsrocedure, although we leave this out of the current
scope.
relations to other topics in the Webwijs inventory. For
further contextualization, we gather the correspondWinegdevelop two item-to-item recommendation
systems (where the Webwijs topics serve as items). The first
wikidata and wikipedia page on the topic if they exist.</p>
        <p>system learns from the full set of self-selected profiles in
Whenever available, good synonyms are selected from
the“Also known as” table in wikidata. If not, we scaGnT1, and recommends topics that have high pointwise
the first paragraph of the wikipedia page for synonymmsu. tual information (PMI) with the given topic. PMI is
Finally, we manually provide a synonym if none are avaaisli-mple but efective measure of association between
able in the wiki pages. An example result is shown heirtee:ms. Topics that occur three times or less are excluded
from this system to ensure a minimum level of
robustTopic EN: auction theory ness. The second system operates on all Webwijs topics,
Topic NL: veilingstheorie and recommends topics that are semantically similar to
Webwijs links: a given topic, as measured by the cosine similarity of
seMakes use of: auctions / veilingen mantic neural embeddings of their respective names. We
use a generic multilingual sentence-transformer model
Annotation: (paraphrase-multilingual-mpnet-base-v2)3 that was
prewwiikkiipdeadtiaa::hhttttpsp:s/://w/ewn.ww.iwkiikpeiddaiat.ao.rogr/gw/ wikiik/Ai/Quc7t71io3n34_theory trained on over 1B English sentence pa2i2r]sa[nd then
adapted to over 50 languages through a knowledge
distilSynonym EN: Bidding Theory lation process23[]. We embed English and Dutch names
Synonym NL: Biedingstheorie separately and always consider the highest similarity
between either versions. For a given self-selected profile,
Based on the initial set of synonym annotations, wewaeuc-ompile a list of up to 100 annotation
recommendatomatically generate synonyms for all remaining quetriioenss by pooling the top recommendation of both
sysusing OpenAI’s GPT4o model, selected for its high atce-ms, looping over the self-selected expertise topics in a
curacy on common benchmarks. We randomly selecrtound-robin fashion.
35 annotated queries for training and the remaining 74
for validation, and automatically optimize a
chain-ofthought prompt20[] using DSPy’sBootstrapFewShot 5. Results and Discussion
prompt optimization techniqu2e1][. We release the full
set of synonym annotatio2nsL.imited manual qualityImpact of the annotation on expert finding
evaluachecks were performed on these synonyms. We notteions We evaluate system performance using precision
that this process can be more qualitatively perforamte5d(P@5), mean average precision (MAP), normalized
in future studies, however the main reasoning of usdiinscgounted cumulative gain (nDCG) and mean reciprocal
topic synonyms to indicate a bias towards literal mraennk- (MRR). These metrics provide a comprehensive view
tions of their constituent words still holds, irrespoefctthiveeretrieval system’s ability to rank relevant experts at
of suboptimal quality of the synonyms. the top of the list. All evaluations are conducted using the</p>
      </sec>
      <sec id="sec-4-4">
        <title>2https://huggingface.co/datasets/jensjorisdecorte/</title>
        <p>TU-Expert-Collection-Topic-Synonyms</p>
      </sec>
      <sec id="sec-4-5">
        <title>3https://huggingface.co/sentence-transformers/</title>
        <p>paraphrase-multilingual-mpnet-base-v2
BM25
ColBERT
MAP
37.81
39.78
nDCG
MAP
56.56
46.46
nDCG
oficial TREC evaluation software4, ensuring standard-the organisation, we do not expect strong overlap, and
ized and comparable results. We report the performawnececonsider 48% to be relatively high. Apart from this
of the BM25 and the ColBERT system on both the seolvfe-rlap measurement, it is dificult to assess the true
preselected profiles (GT2) as well as their larger counterpcairstiosn of these recommendations because we did not have
extended with system-validated topics (GT5) in 1t.ablaeccess to the candidate experts to facilitate manual
validaWe find that the ColBERT-based expert finding systemtion. However, examples of the topic recommendations
outperforms its BM25-based counterpart on all metirsipcrsovided in appendixA.
when using the self-selected profiles (GT2) as ground
truth. The performance of both systems increases conInsidc-onclusion, our analysis of the TU Expert Collection
erably when assessing them against the system-validaatleldows us to answyeers to all three research questions.
profile (GT5). However, the increase in performance iSsection3 provides an answer tRoQ1, showing that the
much more drastic for the BM25-based system, leadiunngderlying mechanism used for the system-validated
it to strongly exceed ColBERT. We hypothesize that titopisics is subject to a high false negative rate, and that
the term-frequency bias in the annotation procedurite ionftroduces a significant bias towards literal mentions
GT5 that leads to a strongly overestimated performoafnkcneowledge areas’ constituent words. With respect to
measurement for the term-based BM25 approach. WheRnQ2, as shown above, the perceived performance of
exswapping the test queries in GT5 for their synonymsp,eart finding systems is indeed strongly impacted by these
significant drop in performance is observed, especiallsyystem-validated topics, and they even lead to significant
for the BM25 system which drops over 20 %-points diniferences in the ranking of these systems. Finally, we
MAP, nDCG and MRR. We also observe that system ranhka-ve proposed an annotation suggestion procedure that
ing in this scenario corresponds to that under theisGiTn2dependent of the document corpus, and we have
evaluation. developed such a system accordingly. It exhibits strong
utility for the annotation process while significantly
reStatistics of the alternative annotation suggestions ducing the false negative rates observed in the original
We perform the same analysis as in sect3iownith re- benchmark, leading us to answReQr3 positively as well.
spect to the recall of 290 specific topics in the annotatOiuorn analysis should help future work on expert finding
recommendations. This requires the specific topic fo–ror evaluation thereof – make more informed decisions
which recall is measured to first be removed from twheith respect to the selection or creation of these
benchself-selected profile. The results show that 235 out ofmtahreks.
290 specific topics (around 81%) is being recommended to
etvoetrhyee4x3p%erotftthhaetchonatdesnetl-fb-asesleedcatnednotthaettioonpirce. cCoommmpeAanrdceadk- nowledgments
tions in the original study, this is a considerable increase</p>
        <p>We thank the anonymous reviewers for their valuable
in recall, which should further improve the comparability</p>
        <p>feedback. This project was funded by the Flemish
Governof annotations across experts. For completeness, we also</p>
        <p>ment, through Flanders Innovation &amp; Entrepreneurship
report how well these new recommendations cover the</p>
        <p>(VLAIO, project HBC.2020.2893).
topics that where added in GT5 compared to GT2. We
ifnd that a total of 1,059 topic additions are made, and
505 (around 48%) of these topics are also present in our
proposed annotation recommendation method. Because
this proposed method has no access to the documents in</p>
      </sec>
      <sec id="sec-4-6">
        <title>Note that we use the Dutch topic name if no English name is</title>
        <p>available for the topic in Webwijs. Now consider a self-selected
expertise profile consisting of the following three topics:
19. municipal law
20. maatschappelijke organisaties (NL)</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>A. Example annotation recommendations</title>
      <sec id="sec-5-1">
        <title>Given the knowledge arceoamputer linguistics, the</title>
        <p>top five item-to-item recommendations according to the
PMI-based system are:</p>
      </sec>
      <sec id="sec-5-2">
        <title>1. talking computer</title>
      </sec>
      <sec id="sec-5-3">
        <title>2. automatic language analysis</title>
      </sec>
      <sec id="sec-5-4">
        <title>3. man-machine interaction</title>
      </sec>
      <sec id="sec-5-5">
        <title>4. algemene taalwetenschap (nl)</title>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <article-title>tion over bert</article-title>
          ,
          <source>in: Proceedings of the 43rd Inte5r</source>
          .
          <article-title>nasp-eech technology</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <article-title>velopment in Information Retrieval, SIGIR '20, Abse-dding similarity method are:</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <article-title>sociation for Computing Machinery</article-title>
          , New York,
          <year>N1Y</year>
          ., language and computers
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>USA</surname>
          </string-name>
          ,
          <year>2020</year>
          , p.
          <fpage>39</fpage>
          -
          <lpage>48</lpage>
          . URL: https://doi.org/10.1145/
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          3397271.3401075. doi:
          <volume>10</volume>
          .1145/3397271.3401075.
          <article-title>2. taalproductie door computers (nl</article-title>
          ) [19]
          <string-name>
            <given-names>A.</given-names>
            <surname>Louis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Saxena</surname>
          </string-name>
          ,
          <string-name>
            <surname>G. van Dijck</surname>
          </string-name>
          ,
          <string-name>
            <surname>G.</surname>
          </string-name>
          <article-title>Spanakis,3. taaltechnologie en computers (nl)</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <article-title>Colbert-xm: A modular multi-vector represen4t</article-title>
          .
          <article-title>al-anguage technology</article-title>
          and computers
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <article-title>tion model for zero-shot multilingual informa5t.iocnomputer and grammar</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          retrieval,
          <source>CoRR abs/2402</source>
          .15059 (
          <year>2024</year>
          ). URhLt:tps:
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>//arxiv.org/abs/2402.1505.9 doi:10.48550/arXiv.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          2402.15059. arXiv:
          <volume>2402</volume>
          .
          <fpage>15059</fpage>
          . [20]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schuurmans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <source>systems 35</source>
          (
          <year>2022</year>
          )
          <fpage>24824</fpage>
          -
          <lpage>24837</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>