
Impact of the Query Set on the Evaluation of Expert Finding Systems

Robin Brochier

0 1

Adrien Guille

adrien.guilleg@univ-lyon2.fr 1

Benjamin Rothan

benjaming@peer.us 0

Julien Velcin

1

0 Digital Scientific Research Technology, Lyon, France
1 Université de Lyon, Lyon 2, ERIC EA 3083, France

Expertise is a loosely defined concept that is hard to formalize. Much research has focused on designing efficient algorithms for expert finding in large databases across various application domains. The evaluation of such recommender systems relies most of the time on human-annotated sets of experts associated with topics. The evaluation protocol consists in using the names or short descriptions of these topics as raw queries in order to rank the available set of candidates. Several measures taken from the field of information retrieval are then applied to rate the rankings of candidates against the ground truth set of experts. In this paper, we apply this topic-query evaluation methodology to the AMiner data and explore a new document-query methodology that evaluates expert retrieval from a set of queries sampled directly from the experts' documents. Specifically, we describe two datasets extracted from AMiner and three baseline algorithms from the literature based on several document representations, and we provide experimental results showing that using a wide range of more realistic queries yields different evaluation results from the usual topic-queries.

Keywords: expert finding, recommender system, evaluation

1 Introduction

It is common to consider expertise as implicit knowledge about a domain that someone carries and shares in different ways. Expertise retrieval aims at identifying this knowledge through explicit artifacts such as communications, actions or interactions between people. When someone calls for an expert, she expects to find a candidate able to understand a specific query. Whereas most evaluations of expertise retrieval consist in directly querying the names or descriptions of the ground truth topics of a given dataset, we claim that these queries are of limited interest for a real-case scenario, for two reasons:

- The textual content of the topic names is very limited in terms of language. Using richer (hence noisier) descriptions might better test the robustness of the evaluated algorithms. For example, it is better to query a retrieval algorithm multiple times with several texts relevant to the field of "data mining" than only once with the name of the field itself. In real-case scenarios, users exhibit a wide range of behaviors and seldom use the same queries when looking for the same thing.
- No one really seeks experts in such broad subjects. Most of the time, someone looks for an expert with a very specific application in mind. Indeed, if a recruiter from a company is looking for a researcher to work on a specific subject, it is more likely that she will use the detailed description of the project instead of a generic name of the job to find the right person.

In this paper, we first provide in Section 3 a formal definition of the expert finding task applied to the data extracted from AMiner³. In particular, we describe two protocols: the topic-query evaluation and the document-query evaluation. We then describe in Section 4 three baseline algorithms from the literature that we reimplemented and tested using several document representations. Finally, we present and analyze in Section 5 the results of our experiments, demonstrating the impact of the type of query on the behavior of the algorithms and document representations.

Precisely, our contribution is fourfold:

1. we propose two different procedures for generating queries and study their impact on the evaluation results;
2. we describe two ways of using AMiner's data for expert finding and detail the preprocessing needed;
3. we reimplement and evaluate three algorithms from the literature based on several document representations;
4. the corresponding Python code is made publicly available⁴, which makes it easy to reproduce the experiments or even extend the proposed pipeline.

2 Related Works

The automation of expert finding emerged as a research field along with the creation of large databases, when libraries and the communication tools of big companies started to be digitized. P@noptic Expert [2] is one of the first published works on expertise retrieval. The proposed model transforms the expert finding task into a text similarity task by building a meta-document for each candidate, aggregating all documents in which the name of this candidate appears. In 2005, research on expert finding received a boost with the Expert Search task of the TREC 2005 Enterprise Track. The track provided a dataset extracted from the World Wide Web Consortium (W3C), together with an evaluation toolkit that allows researchers to compare their algorithms. As a result, a formal definition of the problem emerged [3]. Following the generative document model of Balog et al. [1], we denote by $q$ a query, $d$ a document and $e$ a candidate. The expert finding task consists in estimating the probability that a candidate is an expert given a query:

$$P(e \mid q) = \frac{P(q \mid e)\,P(e)}{P(q)}$$

Voting models as in [7] relax the probabilistic view of this equation. As an example, the score of a candidate can be computed by ranking all documents against the query with a document representation such as the bag-of-words term frequency model. Each candidate is then given a score based on the ranks of the documents she is associated with. In [14] and [10], the authors propose to propagate the affinity between the query and the documents across the collaboration graph in a manner similar to PageRank [8]. More recently, [12] adapted a word embedding technique to embed words and candidates in the same vector space. Many algorithms recently proposed in the field of representation learning, such as TADW [13] and metapath2vec [4], can be adapted to the task of expert finding, but their authors did not evaluate them on this specific task. Much work has also been done on expert finding in community-based question answering, as shown in [17] with a ranking metric network learning framework and in [16], which addresses the cold-start expert finding problem.

3 http://AMiner.org/
4 https://github.com/brochier/impact query expert finding

3 Framework for Expert Finding Evaluation

In this section, after formally describing the expert finding task, we present two methodologies to generate queries. The first, commonly used in the literature, directly uses topic labels as queries, whereas the second, which we introduce in this paper, samples documents from the experts of each topic. Finally, we detail how we used the data from AMiner to generate two datasets for the expert finding task.

3.1 Formal Description

Let $G = (V, E)$ be a bipartite graph with nodes $V = V_C \cup V_D$ corresponding to a set of candidates $C$ and a set of documents $D$, where the links are undirected candidate-document associations. Let $X \in \mathbb{R}^{|D| \times N}$ be the textual features of the $|D|$ documents. Given such a dataset $(G, X)$ (see Figure 1), the expert finding task consists in scoring the set of candidates given a textual query $q \in \mathbb{R}^N$, in order to answer the question "which candidates are most likely to be experts in the topics present in the query?". Given a set of queries $Q = (q_1, \dots, q_i, \dots, q_M)$, each associated with an identified set of experts $E_i \subset E \subset C$, $E$ being here the global set of known experts among the candidates (not the edge set of $G$), we want to optimize the ranking of the ground truth experts $E_i$ among the global set of experts $E$.
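To make the notation concrete, the following is a minimal sketch of how such a dataset could be held in memory. It is an illustration only: the class `ExpertFindingDataset` and its field names are hypothetical and are not taken from the authors' code.

```python
from dataclasses import dataclass, field


@dataclass
class ExpertFindingDataset:
    """Toy container for a (G, X) dataset; all names here are illustrative."""
    # Bipartite links of G: candidate id -> set of linked document ids.
    candidate_docs: dict = field(default_factory=dict)
    # Textual features X: document id -> feature vector of length N.
    doc_features: dict = field(default_factory=dict)
    # Ground truth: topic (query) -> set of expert candidate ids E_i.
    topic_experts: dict = field(default_factory=dict)

    def add_link(self, candidate, document):
        """Add an undirected candidate-document association."""
        self.candidate_docs.setdefault(candidate, set()).add(document)

    def doc_candidates(self):
        """Invert the links: document id -> set of candidate ids."""
        inverted = {}
        for candidate, docs in self.candidate_docs.items():
            for doc in docs:
                inverted.setdefault(doc, set()).add(candidate)
        return inverted

    def experts(self):
        """Global set E of known experts, i.e. the union of all E_i."""
        all_experts = set()
        for group in self.topic_experts.values():
            all_experts |= group
        return all_experts


data = ExpertFindingDataset()
data.add_link("alice", "d1")
data.add_link("bob", "d1")
data.add_link("bob", "d2")
data.topic_experts = {"data mining": {"alice"}, "information retrieval": {"bob"}}
```

Storing the links as adjacency sets keeps both directions of the bipartite graph cheap to traverse, which is what the voting and propagation baselines need.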

3.2 Evaluation

To evaluate the ranking of experts produced by an algorithm given a query, we use several common metrics from information retrieval such as Precision at rank K (P@K), Average Precision (AP) and Reciprocal Rank (RR). Moreover, to better understand the behavior of the tested algorithms, we construct the Receiver Operating Characteristic (ROC) curve and compute its Area Under the Curve (AUC). For each of these metrics, we also compute the standard deviation across queries, which indicates the robustness of the tested algorithms to variations in the data. Moreover, when we have multiple queries per topic, we compute the standard deviation across topics. We now present two ways of generating queries and their corresponding ground truth experts.

[Figure 1 (a): Bipartite graph linking candidates and documents.]

Topic-query evaluation. This approach is straightforward and is commonly adopted in the expert finding community. For a specific topic, its name or description is directly used as a query, and its associated experts form the ground truth list of candidates to be retrieved. Consequently, if the dataset is composed of 10 topics, the evaluation protocol consists of 10 queries. Algorithm 1 shows the complete evaluation procedure. We call this approach the topic-query evaluation. For each measure described above, we are interested in its mean (Mean) and standard deviation (STD) across the queries.
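The ranking metrics named in this section can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, a ranking is assumed to be a list of candidate ids without ties, and AP is normalized by the total number of ground truth experts.

```python
def precision_at_k(ranking, experts, k):
    """P@K: fraction of the top-k ranked candidates that are experts."""
    return sum(1 for c in ranking[:k] if c in experts) / k


def average_precision(ranking, experts):
    """AP: mean of the precision values measured at each expert's rank."""
    hits, total = 0, 0.0
    for rank, candidate in enumerate(ranking, start=1):
        if candidate in experts:
            hits += 1
            total += hits / rank
    return total / len(experts) if experts else 0.0


def reciprocal_rank(ranking, experts):
    """RR: inverse of the rank of the first expert found."""
    for rank, candidate in enumerate(ranking, start=1):
        if candidate in experts:
            return 1.0 / rank
    return 0.0


def roc_auc(ranking, experts):
    """ROC AUC: share of (expert, non-expert) pairs ranked in the right order."""
    n_pos = sum(1 for c in ranking if c in experts)
    n_neg = len(ranking) - n_pos
    if n_pos == 0 or n_neg == 0:
        return 0.0
    correct_pairs, non_experts_below = 0, 0
    # Walk from the bottom of the ranking: each expert beats every
    # non-expert that has already been seen (i.e. that is ranked lower).
    for candidate in reversed(ranking):
        if candidate in experts:
            correct_pairs += non_experts_below
        else:
            non_experts_below += 1
    return correct_pairs / (n_pos * n_neg)


ranking = ["a", "b", "c", "d", "e"]
experts = {"a", "c"}
p_at_2 = precision_at_k(ranking, experts, 2)   # 0.5
ap = average_precision(ranking, experts)       # (1/1 + 2/3) / 2
rr = reciprocal_rank(ranking, experts)         # 1.0
auc = roc_auc(ranking, experts)                # 5 correct pairs out of 6
```

P@K and RR only look at the head of the ranking, whereas the AUC counts every expert and non-expert pair, which is why it reflects the quality of the whole ranking.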

Algorithm 1 Topic-query evaluation procedure. The function Evaluate generates metrics such as P@10 and the ROC AUC based on the produced ranking and the ground truth expert set of a given topic.

Require: Ranking_Algorithm
scores ← [ ]
for all topics do
    candidates_ranking ← Ranking_Algorithm(current topic textual expression)
    current_score ← Evaluate(candidates_ranking, ground truth experts set)
    scores.append(current_score)
end for
return Mean(scores), STD(scores)

Document-query evaluation. We propose to sample the documents linked with the experts of a given topic and use them as queries. Instead of using the topic description, we use the set of documents associated with the ground truth experts of a given topic. Precisely, we create a set of queries and their associated experts by selecting each document of the dataset linked with a ground truth expert. The evaluated algorithm thus produces a ranked list of candidates for each document-query, and its performance is measured by comparing the ranking with the experts of the same topic as the expert who produced the document-query. Since several document-queries are sampled for each topic, we also compute the means and standard deviations across topics, from the per-topic averages of each measure. To avoid any bias in the metrics, when evaluating an algorithm on a sampled document, we leave that document out of the data. We call this approach the document-query evaluation. Algorithm 2 shows the complete evaluation procedure.

Algorithm 2 Document-query evaluation procedure. Note that the computed metrics are also averaged for each topic in order to compute the inter-topic standard deviation.

Require: Ranking_Algorithm
scores ← [ ]
topical_scores ← { }
for all topics do
    topical_scores[current_topic] ← [ ]
    for all experts of current_topic do
        for all documents of current_expert do
            candidates_ranking ← Ranking_Algorithm(current document textual expression, leave_out = current_document)
            current_score ← Evaluate(candidates_ranking, ground truth experts set)
            scores.append(current_score)
            topical_scores[current_topic].append(current_score)
        end for
    end for
    topical_scores[current_topic] ← Mean(topical_scores[current_topic])
end for
return Mean(scores), STD(scores), STD(topical_scores)
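The two procedures can be sketched in Python as follows. This is a hedged illustration rather than the released code: the function names, the toy data and the word-overlap ranker `toy_rank` are all hypothetical, and the population standard deviation (`pstdev`) is an assumption, since the text does not say which estimator is used.

```python
from statistics import mean, pstdev


def topic_query_evaluation(rank, evaluate, topics):
    """One query per topic: the topic name itself."""
    scores = [evaluate(rank(name), experts) for name, experts in topics.items()]
    return mean(scores), pstdev(scores)


def document_query_evaluation(rank, evaluate, topics, expert_docs, doc_text):
    """One query per document written by a ground truth expert."""
    scores, topical_means = [], []
    for experts in topics.values():
        topic_scores = []
        for expert in sorted(experts):
            for doc in expert_docs.get(expert, []):
                # The query document is left out of the data while ranking.
                ranking = rank(doc_text[doc], leave_out=doc)
                topic_scores.append(evaluate(ranking, experts))
        scores.extend(topic_scores)
        if topic_scores:
            topical_means.append(mean(topic_scores))
    return mean(scores), pstdev(scores), pstdev(topical_means)


# Toy data (hypothetical) to exercise both procedures.
doc_text = {"d1": "data mining patterns", "d2": "mining frequent patterns",
            "d3": "query retrieval ranking", "d4": "retrieval ranking models"}
expert_docs = {"alice": ["d1", "d2"], "bob": ["d3", "d4"]}
topics = {"data mining": {"alice"}, "information retrieval": {"bob"}}


def toy_rank(query, leave_out=None):
    """Stand-in ranker: score candidates by word overlap with the query."""
    words = set(query.split())

    def score(candidate):
        return sum(len(words & set(doc_text[d].split()))
                   for d in expert_docs[candidate] if d != leave_out)

    return sorted(expert_docs, key=score, reverse=True)


def reciprocal_rank(ranking, experts):
    for position, candidate in enumerate(ranking, start=1):
        if candidate in experts:
            return 1.0 / position
    return 0.0


topic_result = topic_query_evaluation(toy_rank, reciprocal_rank, topics)
doc_result = document_query_evaluation(
    toy_rank, reciprocal_rank, topics, expert_docs, doc_text)
```

The third value returned by `document_query_evaluation` is the inter-topic standard deviation, computed on the per-topic means and therefore distinct from the deviation across individual queries.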

3.3 AMiner Data

The AMiner project aims to provide tools for mining researchers' social networks. Its authors provide several datasets⁵ [11] collecting papers, authors, co-authorship and citation links extracted from DBLP [6], ACM (Association for Computing Machinery) and other sources in the field of computer science. For the task of expert finding, they provide two lists of experts⁶. The first, the machine-annotated list, is composed of 13 topics and has been built from topical web search. The second, the human-annotated list, is composed of 7 topics built with the method of pooled relevance judgments together with human judgments, as described in [15]. We used the machine-annotated list with the citation dataset V2 and the human-annotated list with the citation dataset V1, both available on the AMiner website⁷.

5 https://AMiner.org/data

We preprocessed the two datasets based on the distribution of links between candidates and documents. We also took into account the string length of the documents (number of characters). First, we kept only authors with at least one and fewer than 100 document links. This reduces author name ambiguity by discarding authors who were originally connected to tens of thousands of documents. Then we composed the textual content of the documents by concatenating their titles and abstracts, keeping only those with a string length greater than 50. As a result, we ended up with two datasets:

- AMiner expert dataset 1: built with the machine-annotated list of experts, it is composed of 996,110 candidates, 1,125,082 documents and 1,269 experts in 13 topics. The distribution of the experts across topics is given in Table 1a (one expert can be linked to several topics), together with the total number of documents linked to those experts.
- AMiner expert dataset 2: built with the human-annotated list of experts, it is composed of 532,968 candidates, 480,630 documents and 210 experts in 7 topics. The distribution is given in Table 1b.
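The filtering steps described above can be sketched as follows. The function `preprocess` and its argument names are hypothetical. Two details are assumptions not stated in the text: titles and abstracts are joined with a single space, and authors left without any remaining document after the length filter are dropped.

```python
def preprocess(author_docs, titles, abstracts, max_links=100, min_length=50):
    """Filter authors by link count, then documents by text length."""
    # Step 1: keep authors with at least one and fewer than max_links links.
    authors = {author: set(docs) for author, docs in author_docs.items()
               if 1 <= len(docs) < max_links}

    # Step 2: build each document text from title and abstract, and keep
    # only texts whose string length is greater than min_length.
    texts = {}
    for doc, title in titles.items():
        text = f"{title} {abstracts.get(doc, '')}".strip()
        if len(text) > min_length:
            texts[doc] = text

    # Step 3 (assumption): drop links to removed documents, and authors
    # that end up with no document at all.
    links = {}
    for author, docs in authors.items():
        remaining = {doc for doc in docs if doc in texts}
        if remaining:
            links[author] = remaining
    return links, texts


titles = {"d1": "Mining frequent patterns in large transaction databases",
          "d2": "Short"}
abstracts = {"d1": "We study pattern mining."}
author_docs = {
    "alice": ["d1", "d2"],
    "bob": ["d2"],                           # only linked to a short document
    "hub": [f"x{i}" for i in range(100)],    # too many links, likely ambiguous
    "nobody": [],                            # no link at all
}
links, texts = preprocess(author_docs, titles, abstracts)
```

In this toy run only `alice` survives, linked to `d1`: `hub` and `nobody` fail the link-count filter, and `bob` loses his only document to the length filter.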

4 Baseline Algorithms

After a short description of the document representations, we describe three baseline algorithms taken from the literature. We reimplemented them since their original code was either unavailable or hard to reuse. Moreover, this allowed us to easily extend them to work with any kind of document representation.

4.1 Document Representation

Our three baseline algorithms rely on a measure of semantic similarity between the queries and the corpus of documents. We chose to try several document representations: term frequency (TF), term frequency-inverse document frequency (TF-IDF) and latent semantic indexing (LSI) [9]. We tokenized the text of the documents by lowercasing the characters, removing stop words and concatenating tokens based on their co-occurrence counts to form 2-grams and 3-grams. Then, words appearing fewer than 3 times in the corpus or in more than 50% of the documents were discarded to reduce the computational cost without affecting the retrieval performance. The number of dimensions of the singular value decomposition for LSI is 300. This number was chosen to ensure that components above noise level are retained, as proposed in [5].

[Table 1: (a) AMiner dataset 1. (b) AMiner dataset 2.]

6 https://AMiner.org/lab-datasets/expertfinding/#expert-list
7 https://AMiner.org/citation

4.2 Text-based Approach 1: P@noptic Expert

P@noptic Expert [2] is a simple algorithm which creates a meta-document for each author. Our implementation first concatenates the contents (title and abstract) of all documents linked with each candidate, then vectorizes these meta-documents using the pretrained document representation models. Finally, it computes the cosine similarities between a query and the meta-documents and ranks the candidates in descending order of their scores.
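The meta-document approach can be illustrated with a small TF-IDF and cosine similarity sketch in plain Python. Everything here is a simplification under stated assumptions: the function names are hypothetical, tokenization is a bare lowercase split (no stop-word removal, n-grams or frequency cut-offs), and the IDF variant `log(n / df) + 1` is one common choice, not necessarily the one used in the experiments.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def fit_idf(texts):
    """Fit IDF weights on the individual documents of the corpus."""
    n_docs = len(texts)
    doc_freq = Counter(term for text in texts for term in set(tokenize(text)))
    return {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}


def vectorize(text, idf):
    """Sparse TF-IDF vector (term -> weight); unknown terms are ignored."""
    counts = Counter(tokenize(text))
    return {term: tf * idf[term] for term, tf in counts.items() if term in idf}


def cosine(u, v):
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def panoptic_rank(query, candidate_docs, doc_text):
    """Rank candidates by similarity between the query and their meta-document."""
    idf = fit_idf(list(doc_text.values()))
    query_vec = vectorize(query, idf)
    scores = {}
    for candidate, docs in candidate_docs.items():
        # Meta-document: concatenation of all texts linked to the candidate.
        meta_document = " ".join(doc_text[d] for d in docs)
        scores[candidate] = cosine(query_vec, vectorize(meta_document, idf))
    return sorted(scores, key=scores.get, reverse=True)


doc_text = {"d1": "data mining patterns", "d2": "mining frequent patterns",
            "d3": "query retrieval ranking", "d4": "retrieval ranking models"}
candidate_docs = {"alice": ["d1", "d2"], "bob": ["d3", "d4"]}
ranking_dm = panoptic_rank("data mining", candidate_docs, doc_text)
ranking_ir = panoptic_rank("retrieval models", candidate_docs, doc_text)
```

Fitting the weights on single documents and only then vectorizing the concatenated texts mirrors the description above, in which the representation model is trained before the meta-documents are built.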

4.3 Text-based Approach 2: Voting Model

Our voting model, based on [7], first computes the cosine similarities between the query and the documents of the dataset and then ranks all documents in descending order of their scores. The algorithm then sums the inverse of the rank (Reciprocal Rank, RR) of each document a candidate is linked with. If a candidate is linked with the 2nd, 3rd and 7th closest documents to the query, her score will be $\frac{1}{2} + \frac{1}{3} + \frac{1}{7} \approx 0.976$. This algorithm gives a large boost to candidates who have at least one well-ranked document and tends to favor candidates with more documents than others. We also tried fusion techniques other than RR, such as CombSUM and CombMNZ, described in [7], but they gave weaker results.
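The reciprocal-rank fusion step can be sketched as below, reproducing the worked example of a candidate linked with the 2nd, 3rd and 7th closest documents. The function name `reciprocal_rank_votes` and the toy identifiers are hypothetical. The sketch assumes the documents have already been ranked by similarity to the query.

```python
def reciprocal_rank_votes(ranked_docs, doc_candidates):
    """Sum 1/rank over every ranked document a candidate is linked with."""
    scores = {}
    for rank, doc in enumerate(ranked_docs, start=1):
        for candidate in doc_candidates.get(doc, ()):
            scores[candidate] = scores.get(candidate, 0.0) + 1.0 / rank
    return scores


# Documents sorted by decreasing similarity to the query (d1 is the closest).
ranked_docs = ["d1", "d2", "d3", "d4", "d5", "d6", "d7"]
doc_candidates = {"d1": ["bob"], "d2": ["alice"], "d3": ["alice"], "d7": ["alice"]}

votes = reciprocal_rank_votes(ranked_docs, doc_candidates)
ranking = sorted(votes, key=votes.get, reverse=True)
# votes["alice"] is 1/2 + 1/3 + 1/7, about 0.976; votes["bob"] is 1.0
```

In this toy case the candidate with a single top-ranked document still outranks the one with three lower-ranked documents, which illustrates the boost given to a well-ranked document.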

4.4 Graph and Text-based Approach: Propagation Model

The propagation model we implemented is a simpler version of those described in [10]. The algorithm first computes the cosine similarities between the query and the documents, and it initializes a score vector $S_0$ of length $|C| + |D|$ with zeros for candidates and the document-query similarity scores for documents. It then performs several two-step random walks with restart until the score vector converges (until the L2 norm of the difference between its previous and current values is below $10^{-6}$). These random walks are computed iteratively:

$$S_{i+1} = (1 - \lambda)\, A (A S_i) + \lambda R$$

where $\lambda$ is the jumping factor, a scalar between 0 and 1 which controls the restart, $R = S_0$ is the restart vector that represents the global probability of a random walk to restart from its original node, and $A$ is the column-wise L1-normalized adjacency matrix of the bipartite graph, also known as the PageRank transition matrix [8]. At each step, scores jump from documents to candidates, then from candidates to documents. A last step is finally performed to propagate the scores back to the candidates. These scores are then ranked in descending order.
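The two-step random walk with restart can be sketched with adjacency lists instead of an explicit matrix. This is an illustrative sketch under assumptions: the names `propagate`, `restart` (the jumping factor) and `max_iter` are hypothetical, and every linked document is assumed to have an initial score in `doc_scores`. Multiplying by the column-normalized adjacency matrix amounts to each node splitting its score equally among its neighbors.

```python
import math


def propagate(candidate_docs, doc_scores, restart=0.1, tol=1e-6, max_iter=1000):
    """Return candidate scores after a converged two-step walk with restart."""
    doc_candidates = {}
    for candidate, docs in candidate_docs.items():
        for doc in docs:
            doc_candidates.setdefault(doc, []).append(candidate)

    def to_candidates(scores):
        # Each document splits its score equally among its candidates.
        out = {candidate: 0.0 for candidate in candidate_docs}
        for doc, score in scores.items():
            linked = doc_candidates.get(doc, [])
            for candidate in linked:
                out[candidate] += score / len(linked)
        return out

    def to_documents(scores):
        # Each candidate splits its score equally among its documents.
        out = {doc: 0.0 for doc in doc_scores}
        for candidate, score in scores.items():
            docs = candidate_docs[candidate]
            for doc in docs:
                out[doc] = out.get(doc, 0.0) + score / len(docs)
        return out

    current = dict(doc_scores)
    for _ in range(max_iter):
        walked = to_documents(to_candidates(current))
        updated = {doc: (1 - restart) * walked.get(doc, 0.0)
                   + restart * doc_scores.get(doc, 0.0) for doc in walked}
        diff = math.sqrt(sum((updated[d] - current.get(d, 0.0)) ** 2
                             for d in updated))
        current = updated
        if diff < tol:  # L2 norm of the change is below the threshold
            break
    # Final step: propagate the document scores back to the candidates.
    return to_candidates(current)


candidate_docs = {"alice": ["d1", "d2"], "bob": ["d2", "d3"]}
doc_scores = {"d1": 1.0, "d2": 0.0, "d3": 0.0}  # only d1 matches the query
scores = propagate(candidate_docs, doc_scores, restart=0.1)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Here `bob` receives a non-zero score although none of his documents matches the query, because the score of `d1` flows through `alice` and their shared document `d2`.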

5 Experiments

In this section, we present the experiments we conducted with both the topic-query and document-query evaluations. We first show some general results, then analyze the effect of the type of query, and finally focus on the variations of ranking across queries and topics.

5.1 Settings

We evaluated our baseline models with the topic-query and the document-query methodologies. We ran two evaluations of the propagation model: one with a jumping factor of 0.1, where the restart is weak and the propagation is therefore wide, and one with a jumping factor of 0.5, where the scores stay close to their initial values. Moreover, for each model, the semantic similarity was computed with the TF, TF-IDF and LSI document representations. Table 2 shows the results on the AMiner expert dataset 1 and Table 3 shows the results on the AMiner expert dataset 2.

5.2 General Results

For both datasets, the TF-IDF document representation generally performs better, except for the AUC score, where LSI performs best, especially on topic queries. Taking a closer look at the ROC curve, we can see that LSI is better at ranking the worst-ranked experts. It smooths the curve in the top right corner and hence increases the area under the curve. Most metrics (P@10 and RR for example) focus on the quality of the very first ranked experts, whereas the ROC AUC allows us to analyze the behavior of a ranking algorithm over the entire ranking. It is also important to note that the results are more stable across the choice of document representation for the second dataset. This behavior is expected since the ground truth experts have been human curated.

5.3 Effect of the Type of Query

We observe different rankings of the baseline algorithms depending on the type of evaluation performed. For the topic-query procedure, the propagation model performs best (with a jumping factor of 0.5 for the first dataset and 0.1 for the second), whereas the voting model is the best for the document-query evaluation. Our explanation is that voting models perform well when queries and documents are of the same type, since only one of a candidate's documents needs to be similar to the query to push her to the top of the ranking. When the query is as short as "data mining", the chance of finding such a similar document is low, since few documents about data mining contain the words "data mining". Indeed, scientific articles rarely deal with data mining in general but rather focus on a particular aspect of this field.

In contrast, the propagation model can give a good score to a candidate if the query is similar to some documents in her neighborhood. Even if this candidate is an expert in "data mining" without ever actually using the expression, there is a fair chance that some other candidates in her close social network used these two words.

Conversely, the voting model might perform better than the propagation model on document queries because the latter tends to mistake an information retrieval expert who worked closely with data mining experts for a data mining expert. This situation is less likely to happen when the query is a short and very specific description than with paper contents, which share many similar terms across topics.

Moreover, this difference in results between query types is weaker when using LSI, which is due to the ability of this document representation to capture a similarity between two texts that do not share any word. The effect of short queries is thus greatly reduced compared to TF and TF-IDF.

5.4 Standard Deviation along Queries and Topics

One important aspect is the amount of dispersion of the sets of scores around their means. We computed the standard deviations for each evaluation to gain insight into the robustness of the algorithms to queries and to topics. Interestingly, for the document-query evaluation, the standard deviations along topics are lower than the deviations along queries. This shows that the robustness of the algorithms is affected less by the variation of topics, as the standard deviations for the topic-query evaluation could have suggested, than by the variety of queries within topics. As a result, using only a few topic queries is statistically biased, since some topic names might be less likely to appear in their related documents. Finally, in the second dataset, the deviations along topics for the voting model are significantly lower than for the other models. This is valuable information for anyone who wants to favor stability over the searched topics of expertise, and it cannot be revealed by a topic-query evaluation.

5.5 Pros and Cons of the Document-Query Evaluation

Besides the fact that the document-query evaluation seems to better represent a real-case application of expert finding, we showed that it provides deeper insight into the robustness of an algorithm. The different rankings of algorithms for the two evaluations and their corresponding inter-topic standard deviations show that using only the names of the topics is not a satisfactory protocol to compare expert finding systems. However, in general, measures are much better with the topic-query evaluation. This is due to two aspects:

- Document-queries are semantically fine-grained, and it is more difficult to separate two queries of different topics. This makes the expert finding task harder to solve, but it is not a bad thing for the evaluation.
- In our current configuration, document-queries do not rely on an annotated dataset. As a consequence, some sampled documents might not actually belong to the topic their authors are associated with. This motivates the construction of a ground truth set of documents associated with at least one of the human-annotated expert topics.

6 Summary and Future Work

We compared two evaluation protocols for scientific expert finding that rely on two types of query generation. Evaluating our baseline models with this framework, we showed that using the documents written by the ground truth experts yields different results from the usual topic queries. Specifically, short queries can benefit from a propagation model, whereas longer queries are better handled by a simpler voting model. Moreover, the lower standard deviations along topics for the document-query evaluation show that there is a bias in using only one topic name as a query, since the document representations do not handle the similarity between such short queries and the documents well.

To improve the document-query evaluation with the AMiner data, we would like to filter the set of sampled documents by human annotation in order to keep only those that match the expertise of their authors. This would then justify a deeper analysis of the significance of the measurements, to account for the variations in the rankings of the evaluated algorithms across queries. Another interesting direction would be to perform an online evaluation of the same expert finding algorithms in a reviewer assignment application, in order to compare the results with our framework.

[Tables 2 and 3, panel (b): Baseline mean scores and their query standard deviations for the document-query evaluation. Rows: Vote, Prop (jumping factor 0.1), Prop (jumping factor 0.5).]

References

1. Balog, K., Azzopardi, L., De Rijke, M.: Formal models for expert finding in enterprise corpora. In: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval. pp. 43-50. ACM (2006)

2. Craswell, N., Hawking, D., Vercoustre, A.M., Wilkins, P.: P@noptic expert: Searching for experts not just for documents. In: Ausweb Poster Proceedings, Queensland, Australia. vol. 15, p. 17 (2001)

3. Craswell, N., de Vries, A.P., Soboroff, I.: Overview of the TREC 2005 enterprise track. In: TREC. vol. 5, pp. 199-205 (2005)

4. Dong, Y., Chawla, N.V., Swami, A.: metapath2vec: Scalable representation learning for heterogeneous networks. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 135-144. ACM (2017)

5. Kaiser, H.F.: The application of electronic computers to factor analysis. Educational and Psychological Measurement 20(1), 141-151 (1960)

6. Ley, M.: The DBLP computer science bibliography: Evolution, research issues, perspectives. In: International symposium on string processing and information retrieval. pp. 1-10. Springer (2002)

7. Macdonald, C., Ounis, I.: Voting for candidates: adapting data fusion techniques for an expert search task. In: Proceedings of the 15th ACM international conference on Information and knowledge management. pp. 387-396. ACM (2006)

8. Page, L., Brin, S., Motwani, R., Winograd, T.: The PageRank citation ranking: Bringing order to the web. Tech. rep., Stanford InfoLab (1999)

9. Papadimitriou, C.H., Tamaki, H., Raghavan, P., Vempala, S.: Latent semantic indexing: A probabilistic analysis. In: Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems. pp. 159-168. ACM (1998)

10. Serdyukov, P., Rode, H., Hiemstra, D.: Modeling multi-step relevance propagation for expert finding. In: Proceedings of the 17th ACM conference on Information and knowledge management. pp. 1133-1142. ACM (2008)

11. Tang, J., Zhang, J., Yao, L., Li, J., Zhang, L., Su, Z.: ArnetMiner: extraction and mining of academic social networks. In: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. pp. 990-998. ACM (2008)

12. Van Gysel, C., de Rijke, M., Worring, M.: Unsupervised, efficient and semantic expertise retrieval. In: Proceedings of the 25th International Conference on World Wide Web. pp. 1069-1079. International World Wide Web Conferences Steering Committee (2016)

13. Yang, C., Liu, Z., Zhao, D., Sun, M., Chang, E.Y.: Network representation learning with rich text information. In: IJCAI. pp. 2111-2117 (2015)

14. Zhang, J., Tang, J., Li, J.: Expert finding in a social network. In: International Conference on Database Systems for Advanced Applications. pp. 1066-1069. Springer (2007)

15. Zhang, J., Tang, J., Liu, L., Li, J.: A mixture model for expert finding. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining. pp. 466-478. Springer (2008)

16. Zhao, Z., Wei, F., Zhou, M., Ng, W.: Cold-start expert finding in community question answering via graph regularization. In: International Conference on Database Systems for Advanced Applications. pp. 21-38. Springer (2015)

17. Zhao, Z., Yang, Q., Cai, D., He, X., Zhuang, Y.: Expert finding for community-based question answering via ranking metric network learning. In: IJCAI. pp. 3000-3006 (2016)