On the Assessment of Expertise Profiles (Abstract)

Richard Berendsen, University of Amsterdam, The Netherlands, r.w.berendsen@uva.nl
Krisztian Balog, University of Stavanger, Norway, krisztian.balog@uis.no
Toine Bogers, Royal School of Library and Information Science, Denmark, tb@iva.dk
Antal van den Bosch, Radboud University Nijmegen, The Netherlands, a.vandenbosch@let.ru.nl
Maarten de Rijke, University of Amsterdam, The Netherlands, derijke@uva.nl

DIR 2013, April 26, 2013, Delft, The Netherlands. Copyright remains with the authors and/or original copyright holders.

1. INTRODUCTION

We summarize findings from [3]. At the TREC Enterprise Track [2], the need to study and understand expertise retrieval has been recognized through the introduction of the expert finding task. The goal of expert finding is to identify a list of people who are knowledgeable about a given topic. An alternative task, building on the same underlying principle of computing people-topic associations, is expert profiling, where systems have to return a list of topics that a person is knowledgeable about [1].

We focus on benchmarking systems performing the topical expert profiling task. We define this task as a ranking task, where knowledge areas from a thesaurus have to be ranked for an expert. We release an updated version of the UvT (Universiteit van Tilburg) expert collection [1]: the TU (Tilburg University) expert collection (http://ilps.science.uva.nl/tu-expert-collection). The TU expert collection is based on the Webwijs ("Webwise") system (http://www.tilburguniversity.edu/webwijs/): a publicly accessible database of TU employees who are involved in research or teaching. In a back-end for this database, experts can indicate their skills by selecting knowledge areas from an alphabetical list. Prior work has used these self-selected knowledge areas as ground truth for both expert finding and expert profiling tasks [1].

One problem with self-selected knowledge areas is that they may be sparse, since experts have to select them from an alphabetically ordered list of well over 2,000 knowledge areas. Using these self-selected knowledge areas as ground truth for assessing automatic profiling systems may therefore not reflect the true predictive power of these systems. To find out more about how well these systems perform in real-world circumstances, we have asked TU employees to judge and comment on profiles that have been automatically generated for them. We refer to this process as the assessment experiment. In § 2 we answer the broad research question "How well are we doing at the expert profiling task?" We do this through an error analysis and through a content analysis of the free-text comments that experts could give. During the assessment experiment, experts judge areas in the system-generated profiles on a five-point scale. This yields a new set of graded relevance assessments, which we call the judged system-generated knowledge areas. In § 3 our research question is: "Does benchmarking a set of expertise retrieval systems with the judged system-generated profiles lead to different conclusions, compared to benchmarking with the self-selected profiles?" We benchmark eight state-of-the-art expertise retrieval systems with both sets of ground truth and investigate differences in completeness, system ranking, and the number of significant differences detected between systems.

2. THE ASSESSMENT EXPERIMENT

Generating profiles. We use eight expert profiling models. Each of them uses either Model 1 or Model 2 [1], either uses Dutch or English representations of knowledge areas, and either uses relations between knowledge areas extracted from the thesaurus or not. Because experts have limited time and participate in the experiment on a voluntary basis, we rank areas by their estimated probability of being part of the expert's profile. The more traditional pooling approach would require experts to exhaustively judge the pool. We linearly combine output scores of the eight systems, giving each system equal weight. We boost the top three of each system by adding a sufficiently large constant to the top three scores, to make sure they are judged. System-generated knowledge areas that were in the original self-selected profile of the expert are ticked by default in the interface, but the expert may deselect them, thereby judging them non-relevant.
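As an illustration only, the following minimal Python sketch shows one way to implement the score combination just described: an equal-weight sum of the eight systems' scores, with a large constant added to each system's top three areas so they are guaranteed to be judged. All names and the boost value are our own assumptions, not code from [3].

    BOOST = 1e6  # "sufficiently large" constant; the exact value is an assumption

    def combine_system_scores(system_scores):
        """system_scores: one dict per system, mapping knowledge-area id -> score.
        Returns knowledge areas ranked by the equal-weight combined score."""
        combined = {}
        for scores in system_scores:
            top3 = sorted(scores, key=scores.get, reverse=True)[:3]
            for area, score in scores.items():
                boosted = score + (BOOST if area in top3 else 0.0)
                combined[area] = combined.get(area, 0.0) + boosted  # equal weights
        return sorted(combined, key=combined.get, reverse=True)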
The assessment interface. Using the assessment interface, each expert can judge retrieved knowledge areas relevant by ticking them. Immediately below the top twenty knowledge areas listed by default, the expert has the option to view and assess additional knowledge areas. For the ticked knowledge areas, experts have the option to indicate a level of expertise. If they do not do this, we still include these knowledge areas in the judged system-generated profiles, with a level of expertise of three ("somewhere in the middle"). At the bottom of the interface, experts can leave any comments they might have on the generated profile.

Error analysis of system-generated profiles. Here, we aim to find properties of experts that can explain some of the variance in nDCG@100 performance. We use the self-selected profiles of all 761 experts we generated a profile for, allowing us to incorporate self-selected knowledge areas that were missing from the system-generated profiles in our analysis. Based on visual inspection, we find no correlation between the number of relevant knowledge areas selected and nDCG@100, and no correlation between the number of documents associated with an expert and nDCG@100 either. Intuitively, the relationship between the ratio of relevant knowledge areas to the number of documents associated with the expert is also interesting; however, this ratio does not correlate with nDCG@100 either. Looking a bit deeper into the different kinds of documents that can be associated with an expert, we find that it matters whether or not an expert has a research description. For the 282 experts without a research description we achieve significantly lower average nDCG@100 performance than for the remaining 479 experts (Welch two-sample t-test, p < 0.001). The difference is also substantial: 0.39 vs. 0.30 for experts with and without a research description, respectively. It is not surprising that these research descriptions are important; they constitute a concise summary of a person's qualifications and expertise, written by the expert himself/herself.
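The analysis above relies on per-expert nDCG@100 scores. As a reference point, a minimal sketch of one common nDCG@k variant is given below; the function name and data layout are ours, and the exact discounting used in [3] may differ.

    import math

    def ndcg_at_k(ranked_areas, relevance, k=100):
        """ranked_areas: knowledge areas in the order a system ranks them.
        relevance: dict mapping knowledge area -> graded relevance (absent = 0)."""
        def dcg(gains):
            return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))
        gains = [relevance.get(area, 0) for area in ranked_areas[:k]]
        ideal = sorted(relevance.values(), reverse=True)[:k]
        return dcg(gains) / dcg(ideal) if any(g > 0 for g in ideal) else 0.0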
Content analysis of expert feedback. 239 experts participated in the self-assessment experiment, providing graded relevance judgments. 91 of them also left free-text comments. We study which aspects are important in expert feedback by means of a content analysis. In our analysis, expert comments were coded by two of the authors, based on a coding scheme developed in a first pass over the data. A statement could be assigned multiple aspects. After all aspect types were identified, the participants' comments were coded in a second pass over the data. Upon completion, the two coders resolved differences through discussion. Micro-averaged inter-annotator agreement (the number of times a comment was coded with the same aspect divided by the total number of codings) was 0.97. The main aspects in the feedback of experts are (i) missing a key knowledge area in the generated profile (36%); (ii) only irrelevant knowledge areas in the profile (16.9%); (iii) redundancy in the generated profiles (11.2%); and (iv) knowledge areas being too general (11.2%). Based on these results, there is still room for improvement in the performance of expert profiling systems. Interesting directions for future work are to address the redundancy in generated profiles and to take the specificity of knowledge areas into account.

3. BENCHMARKING DIFFERENCES

Completeness. To assess completeness, we estimate the set of all relevant knowledge areas for an expert with the union of the self-selected profile and the judged system-generated profile. Doing this, we find that the judged system-generated profiles are more complete: on average, a judged system-generated profile contains 81% of all relevant knowledge areas, while a self-selected profile contains only 65%.

Changes in system ranking. To better understand the differences in evaluation outcomes between using the self-selected profiles (we call this ground truth set GT1) and the judged system-generated profiles (we call this set GT5), we construct three intermediate sets of ground truth (GT2-4). Each intermediate set differs from the previous set in only one aspect; in this way we can isolate the contribution each difference makes to differences in evaluation outcomes. The intermediate sets of ground truth are: GT2, the 239 self-selected profiles of participants in the assessment experiment; GT3, where for each self-selected profile of an assessor we only use knowledge areas that were in the system-generated profile, so that knowledge areas not in the system-generated profile are treated as irrelevant; and GT4, the knowledge areas judged relevant during the assessment experiment, considering only binary relevance: if a knowledge area was selected it is considered relevant, otherwise it is taken to be irrelevant. We report Kendall's τ correlation between system rankings using consecutive sets of ground truth. We rank the eight systems that contributed to the generated profile, but leave out the algorithm that combined them. In this abstract, we focus on system rankings computed with nDCG@100. With eight systems, Kendall's τ correlations of 0.79 or higher are significant at the α = 0.01 level. Correlating GT1-GT2, we find that evaluating on a subset of experts does not change the system ranking much: τ = 0.86. Correlating GT2-GT3, we find that regarding non-pooled knowledge areas as irrelevant does not rank our eight systems very differently: τ = 0.86. Correlating GT3-GT4, we find that the new knowledge areas judged relevant during the assessment do change the system ranking: τ = 0.56. Contrasting GT4-GT5, we find that considering the grade of relevance does not change the system ranking: τ = 1.00.
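As an aside, the rank correlation reported here can be computed with an off-the-shelf routine. The helper below is a sketch under the assumption that SciPy is available; it is not code from [3].

    from scipy.stats import kendalltau

    def ranking_correlation(scores_a, scores_b):
        """scores_a, scores_b: dicts mapping system name -> mean nDCG@100 under
        two ground-truth sets (e.g. GT1 and GT2). Returns Kendall's tau and the
        p-value over the two induced system rankings."""
        systems = sorted(scores_a)  # fixed system order for both score lists
        tau, p_value = kendalltau([scores_a[s] for s in systems],
                                  [scores_b[s] for s in systems])
        return tau, p_value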
Pairwise significant differences. The final analysis we conduct takes a high-level perspective: the sensitivity of our evaluation methodology. The measurement that serves as a rough estimate here is the average number of systems each system differs significantly from; we compute this for each of the five sets of assessments GT1-5, and focus here on nDCG@100. We use Fisher's pairwise randomization test (α = 0.001). For GT1 we get 4.75. For GT2 we observe 3.00; the decrease is not surprising, as GT2 has far fewer experts. Regarding non-pooled knowledge areas as irrelevant does not affect sensitivity much (GT3: 2.75). The sensitivity increases again when we evaluate with the more complete judged system-generated knowledge areas (GT4: 3.50). Taking into account the level of expertise indicated, we see another small increase (GT5: 4.00).

4. CONCLUSION

We released, described, and analyzed the TU expert collection for assessing automatic expert profiling systems. In an error analysis of system-generated profiles, we found that it is easier to generate profiles for experts who have a research description. A content analysis of expert feedback revealed that there is room for improvement in the expert profiling task, and that an interesting direction for future work is to consider diversity in profiles. Contrasting the use of the self-selected profiles with the use of the judged system-generated profiles for evaluation, we find that the latter profiles are more complete. The two sets of ground truth rank systems somewhat differently.

Acknowledgments. This research was partially supported by the European Union's ICT Policy Support Programme as part of the Competitiveness and Innovation Framework Programme, CIP ICT-PSP under grant agreement nr 250430, the PROMISE Network of Excellence co-funded by the 7th Framework Programme of the European Commission, grant agreement no. 258191, the DuOMAn project carried out within the STEVIN programme which is funded by the Dutch and Flemish Governments under project nr STE-09-12, the Netherlands Organisation for Scientific Research (NWO) under project nrs 612.061.814, 612.061.815, 640.004.802, 380-70-011, the Center for Creation, Content and Technology (CCCT), the Hyperlocal Service Platform project funded by the Service Innovation & ICT program, the WAHSP project funded by the CLARIN-nl program, and under COMMIT project Infiniti.

References

[1] K. Balog, T. Bogers, L. Azzopardi, M. de Rijke, and A. van den Bosch. Broad expertise retrieval in sparse data environments. In SIGIR '07, pages 551–558. ACM, 2007.
[2] K. Balog, I. Soboroff, P. Thomas, N. Craswell, A. P. de Vries, and P. Bailey. Overview of the TREC 2008 Enterprise Track. In TREC 2008 Proceedings. NIST, 2009. Special Publication.
[3] R. Berendsen, K. Balog, T. Bogers, A. van den Bosch, and M. de Rijke. On the assessment of expertise profiles. JASIST, to appear.