=Paper=
{{Paper
|id=None
|storemode=property
|title=Term Weighting in Expert Search Task: Analyzing Communication Patterns
|pdfUrl=https://ceur-ws.org/Vol-871/paper_5.pdf
|volume=Vol-871
}}
==Term Weighting in Expert Search Task: Analyzing Communication Patterns==
Term Weighting in Expert Search Task: Analyzing Communication Patterns

Anna Kravchenko and Dmitry Romanov

Higher School of Economics, Research and Educational Center of Information Management Technologies, Kirpichnaya ul. 33/4, 105679 Moscow, Russia

Abstract. The goal of the expert search task is to find knowledgeable persons within the enterprise. In this paper we focus on its distinctions from other information retrieval tasks. We review the existing approaches and propose a new term weighting scheme based on the analysis of communication patterns between people. The effectiveness of the proposed approach is evaluated on a collection of e-mails from an organization of approximately 1500 people. Results show that it is possible to take the communication structure into account in the process of term weighting, effectively combining communication-based and document-based approaches to expert finding.

Keywords: term weighting, expert finding, expert search task, graph analysis, referral systems

1 Introduction

The problem of locating desired information or a source of knowledge has been faced by almost everyone, and sometimes finding the right person may be even more valuable than finding the right documents. In large enterprises such as companies and government agencies it may become a task of major importance. For example, according to Yimam-Seid & Kobsa [14], people may seek an expert as a source of information for the following reasons:

– Access to non-documented information. Not all information in organizations can possibly be explicitly documented, or information may be deliberately not made publicly available for economic, social and political reasons.
– Specification need. Users may be unable to formulate a plan to solve a problem, or to pick a direction of research, and resort to seeking experts.
(This work was conducted with financial support from the Government of the Russian Federation (Russian Ministry of Science and Education) under contract 13.G25.31.0096 on "Creating high-tech production of cross-platform systems for processing unstructured information based on open source software to improve management innovation in companies in modern Russia".)

– Leveraging others' expertise (group efficiency), e.g. finding a piece of information that a relevant expert would know or find with less effort than the seeker, or filtering reliable information from the huge mass of information available.
– Interpretation need. Users may need help understanding the information even if they manage to find it in documents.
– Socialization need. Users may prefer the human dimension that is involved in asking an expert, as opposed to interacting with documents and computers.

It may also be important to identify users or groups of users who are more knowledgeable than others. In a big enterprise this may help discover new ideas, the most valuable employees and sources of innovation. An expert search system is designed to help with these tasks. In contrast with classical information retrieval, where documents are retrieved given a query, the system has to return a ranked list of person names in response.

Most approaches to expert finding tend to copy standard information retrieval methods, such as the Vector Space Model [15], representing user profiles as documents. We argue that this may not be the best way, since the need to search for experts mostly emerges in business environments (corporate mail, for example), which implies a completely different data organization. Authors are usually connected to each other, and the data mostly consists of messages, not isolated documents.
Existing methods can be divided into communication-based methods, which take into account the communication patterns between users, and content-based methods, which focus strictly on document content. An essential notion in content-based methods is term significance. It is used to estimate the generality or specificity of a term and to evaluate how important a word is to a document, collection or corpus. The process of calculating term significance is called term weighting, and the most common approach to it is the tf-idf measure and its variations [8].

The problem with existing term weighting schemes, as well as other content-based methods, is that they do not take into account communication patterns between users. They may be combined with communication-based approaches sequentially, but there is no existing method that allows the communication structure to affect a term's weight. Developing such a method could be a significant contribution to the domain.

In this paper we analyse the specifics of the task, review the existing approaches to expert finding and term weighting, and propose a new weighting scheme developed specifically for expertise evaluation.

2 Task specifics

As mentioned earlier, the expert search task has different properties from the standard information retrieval task. The standard IR task was developed for analysing web pages and text collections such as online libraries. An expert search task needs to deal with forums, social networks, blogs and corporate e-mails.

We consider the following properties of the expert search task the most important:

– In a business environment the size of the available corpora is limited by the company lifespan, therefore the number of documents is relatively small. The same holds for social networks.
– Corpora may contain information that normal text collections do not, like authorship or communication patterns.
– Communications between authors are not the same as hyperlinks.
The possibility to relate every document to its author creates a hierarchical structure, which is absent in the standard IR task.

Naturally, a different data organization demands alternative approaches to its processing, approaches that take into account the additional features and information it provides. Features that are rare in standard IR, such as time marks, may also prove useful. Further investigation of the impact of this information on extracting data from business corpora is required.

3 Existing Approaches

3.1 Content-based schemes

The most direct approach to expert finding is adapting the Vector Space Model to the task. VSM, one of the most successful solutions in IR, has proved to be highly effective for many types of documents. VSM represents queries and documents as vectors in an n-dimensional information space. Then it systematically compares each document vector with the query vector to find the documents nearest to the query. To adapt VSM for referral systems, each person is represented by a vector based on a "document" consisting of all messages they sent to the user. A person vector contains weights for each term in the query based on how frequently the term was used by the sender in his or her e-mail messages. An example of this approach is described in [15].

Let P_i be the person vector for colleague p_i. This vector contains the computed weights for all terms in the sender's e-mail that also occur in the query vector. Adaptations of the tf-idf metric and its variations are the most common for calculating the weights:

w_{ik} = \frac{tf_{ik} \log(N/n_k)}{\sqrt{\sum_{t=1}^{n} [tf_{it} \log(N/n_t)]^2}} . (1)

Here N is the number of colleagues of the user, n_k the number of colleagues who have used term k, and tf_{ik} the term frequency, i.e. the number of times colleague p_i has used term k.

The weight can be separated into the term frequency factor (tf) and the inverse document frequency factor (idf, collection factor).
The first gives a measure of the importance of the term within a particular context, document or user profile. The simplest case is the occurrence count of a term in a document: the more often a term is encountered in a certain context, the more it contributes to the meaning of that context. The inverse document frequency factor diminishes the weight of terms that occur very frequently in the collection and increases the weight of terms that occur rarely. It is a measure of the general importance of the term, obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient. Variations of this scheme include the term relevance weight (i.e. probabilistic inverse term frequency) and the inverse term frequency; a good overview can be found in [8].

The calculation of the query vector Q_j can be taken directly from information retrieval methods, with computed weights for each term in the query vector <w_{j1}, w_{j2}, ..., w_{jn}>. Several different query vectors have been proposed, for example by Salton and Buckley [12]:

w_{ik} = \left[0.5 + \frac{0.5\, tf_{ik}}{tf_{max}}\right] \log(N/n_k) . (2)

Here tf_{max} is the maximum frequency over all terms in query j.

After person vectors have been calculated for each sender, the similarity between the query vector and a person vector can be calculated as the cosine of the angle between those vectors. Using this sort of similarity measure, the agent returns to the user a list of potential experts ranked in decreasing order of the computed similarity values for a query. The term weights can be normalized, so the similarity is between 0 and 1. It is convenient to think of this value as the likelihood, or confidence, that a person will be able to answer the user's query.

3.2 Communication-based schemes

The ranking scheme described above and its variations do not take into account link structure or communication patterns.
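As a concrete illustration of the content-based referral scheme above, the following sketch builds person vectors with the tf-idf weighting of Eq. (1) and ranks colleagues by cosine similarity to the query. The toy messages and the simplification of taking all query-term weights as 1 (which only rescales every score equally, leaving the ranking unchanged) are illustrative assumptions, not part of the paper:

```python
import math
from collections import Counter

def rank_experts(messages_by_person, query_terms):
    """Rank colleagues by cosine similarity between tf-idf person
    vectors (Eq. 1) and the query vector."""
    N = len(messages_by_person)  # number of colleagues
    # tf_ik: how often colleague p_i used term k
    tf = {p: Counter(t for msg in msgs for t in msg.split())
          for p, msgs in messages_by_person.items()}
    # n_k: number of colleagues who have used term k
    n = Counter(t for counts in tf.values() for t in counts)

    def weight(counts, t):
        return counts[t] * math.log(N / n[t]) if n[t] else 0.0

    scores = {}
    for p, counts in tf.items():
        dot = sum(weight(counts, t) for t in query_terms)
        # Eq. (1) denominator: norm of the full person vector
        norm = math.sqrt(sum(weight(counts, t) ** 2 for t in counts))
        scores[p] = dot / norm if norm else 0.0
    return sorted(scores, key=scores.get, reverse=True)
```

A colleague who uses the query terms often, and whose overall vocabulary is concentrated on them, rises to the top of the returned list.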
A more popular approach to expert finding is schemes that focus specifically on those properties, which have proven to be more effective [2]. An overview of graph-based algorithms is given in [4]. One of the most famous is PageRank. In this approach the obtained ranking is the principal eigenvector of the adjacency matrix of the digraph in which edges correspond to links (messages) between users. This approach turns out to be virtually identical to the well-known PageRank algorithm for ranking web pages, the only difference being that in PageRank low-weight edges are added between all nodes in both directions.

The simplest is the successor model, where messages are viewed as directed edges pointing from greater to lesser expertise. All the people "downstream" from a given node are considered of lesser expertise, and one measure of expertise is simply the count of such people (nodes). Another is the positional power function (PPF), where the ranks (r_i | i = 1..n) satisfy the following system of equations:

r_i = \frac{1}{n} \sum_{j \in S_i} (r_j + 1) . (3)

Here S_i is the set of successors of node i. An adaptation of the HITS algorithm, a precursor of PageRank, was also proposed in [2].

None of these pays attention to term weighting, focusing only on communication patterns.

4 Proposed approach

The goal of our work was to investigate the impact of communication pattern analysis on expertise evaluation in business corpora, and the possibility of using this information during term weighting. In the case of expert finding, the main task of term weighting is to separate generic terms from professional vocabulary. Selecting rare terms does not always work, nor does the tf-idf scheme, since rare terms may be generic as well. However, the significance of a term may correlate with the patterns of communication between the people who use it.
Generic terms tend to come up randomly in documents of any domain, created by authors of all kinds of occupations and levels of professional expertise, while professional terminology is mostly used between experts and people asking for a consultation; experts are normally few and tend to interact with each other. Those people should form a tight cluster, while the users of a generic term would be scattered around the enterprise with a relatively small number of connections.

The main idea of our approach is to calculate, for every term in the corpus, the number of clusters that its users form, and to compare it to the number of clusters that would be formed if the links between users were distributed randomly.

The proposed method was analyzed on a collection of e-mails from an organization of approximately 1500 people, consisting of 1270000 messages.

5 Algorithm realisation

For every term, information about the number of occurrences, the number of people using the term and the message IDs was included in the index. For every message the senders and receivers were known. For every n = 3..50, all terms that were used by n people were selected. Then for every term the number of clusters k its users were forming was computed, and finally for every pair (n, k) the number of terms was measured.

Then a random distribution of messages was modelled. For every n = 3..50, 200 terms were randomly selected from the table; then, according to the number of term users n1 and the number of term encounters n2, n1 people were randomly selected from the list of users and linked ("sending messages") to n2 other users on the list. A uniform distribution was used for the probability of creating a link with another user. As a result an alternative index was created, and the numbers of all (n, k) pairs from this index were computed according to the algorithm described above.

6 Results

Fig. 1. Distribution pattern for the factual communication structure (on the left) and the randomly generated one (on the right).

Fig. 2. Distribution for n = 10. The graph on the right shows randomly generated communications.

Fig. 1 and Fig. 2 depict the results of the experiment. As can be seen, the random distribution differs prominently from the factual one. The random distribution pictures show that the number of terms whose users form a single cluster is vanishingly small: it is highly improbable for such terms to occur by chance. At the same time, the pictures of the factual distribution contain a much higher number of "single cluster" terms, and the distribution graph itself is flatter. Therefore it can be concluded that those terms do not appear at random and may share some common properties. It also appeared that the probability of a term being professional rather than generic corresponds to the number of clusters: terms with smaller k values tend to be more significant.

Taking all this into account, we find it reasonable to assume that the number of clusters can be used as an important indicator of term significance. This leads to the following weighting scheme:

w_{ij} = \frac{p}{p_r} . (4)

Here p = N_{nk}/N_n is the probability of n users forming k clusters in the factual distribution, and p_r = N_{nk,r}/N_{n,r} is the same probability for the random distribution; N_{nk} and N_{nk,r} are the numbers of terms whose users form k clusters in the factual and random distributions respectively, and N_n and N_{n,r} are the total numbers of terms used by n people. If w_{ij} > 1 the term is considered professional, and if w_{ij} < 1 the term is considered general.

A simpler, though less accurate, scheme is

w_{ij} = \frac{k}{k_{\text{most probable}}} . (5)

Here, if the number of users is n and they form k clusters, k_{most probable} is the value of k with the highest number of (n, k) pairs in the random distribution. These schemes can be used on their own or in combination with tf-idf as an additional factor.
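The two building blocks of this scheme can be sketched as follows: counting the clusters a term's users form (connected components of the user graph restricted to those users) and computing the weight of Eq. (4) from the factual and random (n, k) counts. The union-find helper, the data layout, and the toy values below are illustrative assumptions, not the paper's actual implementation:

```python
def count_clusters(users, links):
    """Number of connected components (clusters, k) formed by a
    term's users, given undirected message links between people."""
    parent = {u: u for u in users}

    def find(u):  # union-find with path halving
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    for a, b in links:
        if a in parent and b in parent:  # keep only links among the term's users
            parent[find(a)] = find(b)
    return len({find(u) for u in users})

def term_weight(n, k, factual_nk, random_nk):
    """Eq. (4): w = p / p_r, where p = N_nk / N_n is the factual
    probability of n users forming k clusters and p_r the same
    probability under random linking. The *_nk arguments map
    (n, k) pairs to term counts."""
    p = factual_nk[(n, k)] / sum(v for (m, _), v in factual_nk.items() if m == n)
    p_r = random_nk[(n, k)] / sum(v for (m, _), v in random_nk.items() if m == n)
    return p / p_r
```

A weight above 1 means single-tight-cluster behaviour is over-represented compared to chance, marking the term as professional under the scheme's convention.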
Precise evaluation is currently unachievable due to the small amount of annotated data available: there are no corpora annotated at the term level. The direction of our future work is building a full expert search system based on this scheme. It will allow us to evaluate the approach using standard expert search corpora annotated at the text level, such as, for example, the TREC 2005 task.

7 Conclusion

In this paper we have shown that it is possible to make the analysis of communication patterns part of a term weighting scheme, and we have suggested an example of such a scheme. We have also focused on the distinctions between expert finding and the standard IR task.

The proposed approach makes it possible to take into account the specifics of the expert search task. It also allows the existing approaches (such as vector space models) to be adapted for expert finding more effectively. For further investigation, we are focusing on proper evaluation of the scheme, combining it with other term weighting schemes, and improving it by exploring the impact of other properties such as graph density.

References

1. Balog, K., De Rijke, M.: Determining Expert Profiles (with an Application to Expert Finding). In: IJCAI'07: Proc. 20th Intern. Joint Conf. on Artificial Intelligence, pp. 2657–2662 (2007)
2. Campbell, C.S., Maglio, P.P., Cozzi, A., Dom, B.: Expertise Identification Using Email Communications. In: Proceedings of the Twelfth International Conference on Information and Knowledge Management, pp. 528–531 (2003)
3. Chen, J., Geyer, W., Dugan, C., Muller, M., Guy, I.: Make New Friends, but Keep the Old: Recommending People on Social Networking Sites. In: Proceedings of the 27th International Conference on Human Factors in Computing Systems, pp. 201–210 (2009)
4. Dom, B., Eiron, I., Cozzi, A., Zhang, Y.: Graph-Based Ranking Algorithms for E-mail Expertise Analysis.
In: Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (2003)
5. Feldman, R., Fresko, M., Kinar, Y., Lindell, Y., Liphstat, O., Rajman, M., Schler, Y., Zamir, O.: Text Mining at the Term Level. In: Principles of Data Mining and Knowledge Discovery, pp. 65–73 (1998)
6. Golbeck, J., Hendler, J.: Reputation Network Analysis for Email Filtering. In: Proceedings of the First Conference on Email and Anti-Spam (2004)
7. Kolari, P., Finin, T., Lyons, K., Yesha, Y.: Expert Search Using Internal Corporate Blogs. In: Workshop on Future Challenges in Expertise Retrieval, SIGIR 2008, pp. 2–5 (2008)
8. Lan, M., Sung, S.Y., Low, H.B., Tan, C.L.: A Comparative Study on Term Weighting Schemes for Text Categorization (2005)
9. Macdonald, C., Ounis, I.: Voting for Candidates: Adapting Data Fusion Techniques for an Expert Search Task. In: Proceedings of the 15th ACM International Conference on Information and Knowledge Management, pp. 387–396 (2006)
10. Mesnage, C.S., Carman, M.: Piloted Search and Recommendation with Social Tag Cloud-based Navigation. In: 1st Workshop on Music Recommendation and Discovery (WOMRAD), ACM RecSys 2010, Barcelona, Spain (2010)
11. Petkova, D., Croft, W.B.: Hierarchical Language Models for Expert Finding in Enterprise Corpora. In: 18th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'06), pp. 599–608 (2006)
12. Salton, G., Buckley, C.: Term Weighting Approaches in Automatic Text Retrieval. Information Processing & Management 24(5), 513–523 (1988)
13. Serdyukov, P., Rode, H., Hiemstra, D.: Modeling Relevance Propagation for the Expert Search Task. Technical report, TREC 2007 Enterprise Track (2007)
14. Yimam, S., Kobsa, A.: Expert Finding Systems for Organizations: Problem and Domain Analysis and the DEMOIR Approach (2003)
15. Yu, B., Singh, M.P.: A Multiagent Referral System for Expertise Location.
In: Working Notes of the AAAI Workshop on Intelligent Information Systems, pp. 66–69 (1999)