On the Assessment of Expertise Profiles (Abstract)

Richard Berendsen, University of Amsterdam, The Netherlands, r.w.berendsen@uva.nl
Krisztian Balog, University of Stavanger, Norway, krisztian.balog@uis.no
Toine Bogers, Royal School of Library and Information Science, Denmark, tb@iva.dk
Antal van den Bosch, Radboud University Nijmegen, The Netherlands, a.vandenbosch@let.ru.nl
Maarten de Rijke, University of Amsterdam, The Netherlands, derijke@uva.nl

DIR 2013, April 26, 2013, Delft, The Netherlands. Copyright remains with the authors and/or original copyright holders.

1. INTRODUCTION

We summarize findings from [3]. At the TREC Enterprise Track [2], the need to study and understand expertise retrieval has been recognized through the introduction of the expert finding task. The goal of expert finding is to identify a list of people who are knowledgeable about a given topic. An alternative task, building on the same underlying principle of computing people-topic associations, is expert profiling, where systems have to return a list of topics that a person is knowledgeable about [1].

We focus on benchmarking systems performing the topical expert profiling task. We define this task as a ranking task, where knowledge areas from a thesaurus have to be ranked for an expert. We release an updated version of the UvT (Universiteit van Tilburg) expert collection [1]: the TU (Tilburg University) expert collection (http://ilps.science.uva.nl/tu-expert-collection). The TU expert collection is based on the Webwijs ("Webwise") system (http://www.tilburguniversity.edu/webwijs/): a publicly accessible database of TU employees who are involved in research or teaching. In a back-end for this database, experts can indicate their skills by selecting knowledge areas from an alphabetical list. Prior work has used these self-selected knowledge areas as ground truth for both expert finding and expert profiling tasks [1].

One problem with self-selected knowledge areas is that they may be sparse, since experts have to select them from an alphabetically ordered list of well over 2,000 knowledge areas. Using these self-selected knowledge areas as ground truth for assessing automatic profiling systems may therefore not reflect the true predictive power of these systems. To find out more about how well these systems perform in real-world circumstances, we have asked TU employees to judge and comment on profiles that have been automatically generated for them. We refer to this process as the assessment experiment. In § 2 we answer the broad research question "How well are we doing at the expert profiling task?" We do this through an error analysis and through a content analysis of the free-text comments that experts could give. During the assessment experiment, experts judge areas in the system-generated profiles on a five-point scale. This yields a new set of graded relevance assessments, which we call the judged system-generated knowledge areas. In § 3 our research question is: "Does benchmarking a set of expertise retrieval systems with the judged system-generated profiles lead to different conclusions, compared to benchmarking with the self-selected profiles?" We benchmark eight state-of-the-art expertise retrieval systems with both sets of ground truth and investigate differences in completeness, system ranking, and the number of significant differences detected between systems.

2. THE ASSESSMENT EXPERIMENT

Generating profiles. We use eight expert profiling models. Each of them uses either Model 1 or Model 2 [1], either uses Dutch or English representations of knowledge areas, and either uses relations between knowledge areas extracted from the thesaurus or not. Because experts have limited time and participate in the experiment on a voluntary basis, we rank areas by their estimated probability of being part of the expert's profile. The more traditional pooling approach would require experts to exhaustively judge the pool. We linearly combine output scores of the eight systems, giving each system equal weight. We boost the top three of each system by adding a sufficiently large constant to the top three scores, to make sure they are judged. System-generated knowledge areas that were in the original self-selected profile of the expert are ticked by default in the interface, but the expert may deselect them, thereby judging them non-relevant.
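As an illustration only, the following minimal Python sketch shows one way to implement the score combination just described: an equal-weight sum of the eight systems' scores, with a large constant added to each system's top three areas so they are guaranteed to be judged. All names and the boost value are our own assumptions, not code from [3].

    BOOST = 1e6  # "sufficiently large" constant; the exact value is an assumption

    def combine_system_scores(system_scores):
        """system_scores: one dict per system, mapping knowledge-area id -> score.
        Returns knowledge areas ranked by the equal-weight combined score."""
        combined = {}
        for scores in system_scores:
            top3 = sorted(scores, key=scores.get, reverse=True)[:3]
            for area, score in scores.items():
                boosted = score + (BOOST if area in top3 else 0.0)
                combined[area] = combined.get(area, 0.0) + boosted  # equal weights
        return sorted(combined, key=combined.get, reverse=True)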
The assessment interface. Using the assessment interface, each expert can judge retrieved knowledge areas relevant by ticking them. Immediately below the top twenty knowledge areas listed by default, the expert has the option to view and assess additional knowledge areas. For the ticked knowledge areas, experts have the option to indicate a level of expertise. If they do not do this, we still include these knowledge areas in the judged system-generated profiles, with a level of expertise of three ("somewhere in the middle"). At the bottom of the interface, experts can leave any comments they might have on the generated profile.

Error analysis of system-generated profiles. Here, we aim to find properties of experts that can explain some of the variance in nDCG@100 performance. We use the self-selected profiles of all 761 experts we generated a profile for, allowing us to incorporate self-selected knowledge areas that were missing from the system-generated profiles in our analysis. Based on visual inspection, we find no correlation between the number of relevant knowledge areas selected and nDCG@100, and no correlation between the number of documents associated with an expert and nDCG@100 either. Intuitively, the relationship between the ratio of relevant knowledge areas to the number of documents associated with the expert is also interesting; however, this ratio does not correlate with nDCG@100 either. Looking a bit deeper into the different kinds of documents that can be associated with an expert, we find that it matters whether or not an expert has a research description. For the 282 experts without a research description we achieve significantly lower average nDCG@100 performance than for the remaining 479 experts (Welch two-sample t-test, p < 0.001). The difference is also substantial: 0.39 vs. 0.30 for experts with and without a research description, respectively. It is not surprising that these research descriptions are important; they constitute a concise summary of a person's qualifications and expertise, written by the expert himself/herself.
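The analysis above relies on per-expert nDCG@100 scores. As a reference point, a minimal sketch of one common nDCG@k variant is given below; the function name and data layout are ours, and the exact discounting used in [3] may differ.

    import math

    def ndcg_at_k(ranked_areas, relevance, k=100):
        """ranked_areas: knowledge areas in the order a system ranks them.
        relevance: dict mapping knowledge area -> graded relevance (absent = 0)."""
        def dcg(gains):
            return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))
        gains = [relevance.get(area, 0) for area in ranked_areas[:k]]
        ideal = sorted(relevance.values(), reverse=True)[:k]
        return dcg(gains) / dcg(ideal) if any(g > 0 for g in ideal) else 0.0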
Content analysis of expert feedback. 239 experts participated in the self-assessment experiment, providing graded relevance judgments. 91 of them also left free-text comments. We study which aspects are important in expert feedback by means of a content analysis. In our analysis, expert comments were coded by two of the authors, based on a coding scheme developed in a first pass over the data. A statement could be assigned multiple aspects. After all aspect types were identified, the participants' comments were coded in a second pass over the data. Upon completion, the two coders resolved differences through discussion. Micro-averaged inter-annotator agreement (the number of times a comment was coded with the same aspect divided by the total number of codings) was 0.97. The main aspects in the feedback of experts are (i) missing a key knowledge area in the generated profile (36%); (ii) only irrelevant knowledge areas in the profile (16.9%); (iii) redundancy in the generated profiles (11.2%); and (iv) knowledge areas being too general (11.2%). Based on these results, there is still room for improvement in the performance of expert profiling systems. Interesting directions for future work are to address the redundancy in generated profiles and to take the specificity of knowledge areas into account.

3. BENCHMARKING DIFFERENCES

Completeness. To assess completeness, we estimate the set of all relevant knowledge areas for an expert with the union of the self-selected profile and the judged system-generated profile. Doing this, we find that the judged system-generated profiles are more complete: on average, a judged system-generated profile contains 81% of all relevant knowledge areas, while a self-selected profile contains only 65%.

Changes in system ranking. To better understand the differences in evaluation outcomes between using the self-selected profiles (we call this ground truth set GT1) and the judged system-generated profiles (we call this set GT5), we construct three intermediate sets of ground truth (GT2-4). Each intermediate set differs from the previous set in only one aspect; in this way we can isolate the contribution each difference makes to differences in evaluation outcomes. The intermediate sets of ground truth are: GT2, the 239 self-selected profiles of participants in the assessment experiment; GT3, where for each self-selected profile of an assessor we only use knowledge areas that were in the system-generated profile, so that knowledge areas not in the system-generated profile are treated as irrelevant; and GT4, the knowledge areas judged relevant during the assessment experiment, considering only binary relevance: if a knowledge area was selected it is considered relevant, otherwise it is taken to be irrelevant. We report Kendall's τ correlation between system rankings using consecutive sets of ground truth. We rank the eight systems that contributed to the generated profile, but leave out the algorithm that combined them. In this abstract, we focus on system rankings computed with nDCG@100. With eight systems, Kendall's τ correlations of 0.79 or higher are significant at the α = 0.01 level. Correlating GT1-GT2, we find that evaluating on a subset of experts does not change the system ranking much: τ = 0.86. Correlating GT2-GT3, we find that regarding non-pooled knowledge areas as irrelevant does not rank our eight systems very differently: τ = 0.86. Correlating GT3-GT4, we find that the new knowledge areas judged relevant during the assessment do change the system ranking: τ = 0.56. Contrasting GT4-GT5, we find that considering the grade of relevance does not change the system ranking: τ = 1.00.
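As an aside, the rank correlation reported here can be computed with an off-the-shelf routine. The helper below is a sketch under the assumption that SciPy is available; it is not code from [3].

    from scipy.stats import kendalltau

    def ranking_correlation(scores_a, scores_b):
        """scores_a, scores_b: dicts mapping system name -> mean nDCG@100 under
        two ground-truth sets (e.g. GT1 and GT2). Returns Kendall's tau and the
        p-value over the two induced system rankings."""
        systems = sorted(scores_a)  # fixed system order for both score lists
        tau, p_value = kendalltau([scores_a[s] for s in systems],
                                  [scores_b[s] for s in systems])
        return tau, p_value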
Pairwise significant differences. The final analysis we conduct takes a high-level perspective: the sensitivity of our evaluation methodology. The measurement that serves as a rough estimate here is the average number of systems each system differs significantly from; we compute this for each of the five sets of assessments GT1-5, and focus here on nDCG@100. We use Fisher's pairwise randomization test (α = 0.001). For GT1 we get 4.75. For GT2 we observe 3.00; the decrease is not surprising, as GT2 has far fewer experts. Regarding non-pooled knowledge areas as irrelevant does not affect sensitivity much (GT3: 2.75). The sensitivity increases again when we evaluate with the more complete judged system-generated knowledge areas (GT4: 3.50). Taking into account the level of expertise indicated, we see another small increase (GT5: 4.00).

4. CONCLUSION

We released, described, and analyzed the TU expert collection for assessing automatic expert profiling systems. In an error analysis of system-generated profiles, we found that it is easier to generate profiles for experts who have a research description. A content analysis of expert feedback revealed that there is room for improvement in the expert profiling task, and that an interesting direction for future work is to consider diversity in profiles. Contrasting the use of the self-selected profiles with the use of the judged system-generated profiles for evaluation, we find that the latter profiles are more complete. The two sets of ground truth rank systems somewhat differently.

Acknowledgments. This research was partially supported by the European Union's ICT Policy Support Programme as part of the Competitiveness and Innovation Framework Programme, CIP ICT-PSP under grant agreement nr 250430, the PROMISE Network of Excellence co-funded by the 7th Framework Programme of the European Commission, grant agreement no. 258191, the DuOMAn project carried out within the STEVIN programme which is funded by the Dutch and Flemish Governments under project nr STE-09-12, the Netherlands Organisation for Scientific Research (NWO) under project nrs 612.061.814, 612.061.815, 640.004.802, 380-70-011, the Center for Creation, Content and Technology (CCCT), the Hyperlocal Service Platform project funded by the Service Innovation & ICT program, the WAHSP project funded by the CLARIN-nl program, and under COMMIT project Infiniti.

References

[1] K. Balog, T. Bogers, L. Azzopardi, M. de Rijke, and A. van den Bosch. Broad expertise retrieval in sparse data environments. In SIGIR '07, pages 551–558. ACM, 2007.
[2] K. Balog, I. Soboroff, P. Thomas, N. Craswell, A. P. de Vries, and P. Bailey. Overview of the TREC 2008 Enterprise Track. In TREC 2008 Proceedings. NIST, 2009. Special Publication.
[3] R. Berendsen, K. Balog, T. Bogers, A. van den Bosch, and M. de Rijke. On the assessment of expertise profiles. JASIST, to appear.