Development and evaluation of clustering techniques for finding people M. D. Dunlop1 Centre for Human-Machine Interaction Systems Analysis Department., Risø National Laboratory, P.O. Box 49 4000 Roskilde, Denmark mailto:mark.dunlop@risoe.dk http://www.chmi.dk/people/mdd/ Abstract  the work of the people. Two possible scenarios where this form of matching would be helpful were used as Typically in a large organisation much the main motivation behind this work: 1 expertise and knowledge is held informally within employees’ own memories. When In any organisation a large amount of information employees leave an organisation many concerning large projects is typically not documented documented links that go through that person but held only in the staff’s memories. Such are broken and no mechanism is usually undocumented information typically includes details of available to overcome these broken links. This who was involved in a project in consultation roles, matchmaking problem is related to the what the problems of the project were and how these problem of finding potential work partners in a were solved (final project documentation often only large and distributed organisation. This paper records the final result and not the process, which is reports a comparative investigation into using arguably the most important information for reuse). In standard information retrieval techniques to the engineering domain, Hertzum & Pejtersen [1999] group employees together based on their web have identified that people typically interleave pages. This information can, hopefully, be searching for people with searching for documents: “we subsequently used to redirect broken links to find that engineers search for documents to find people, people who worked closely with a departed search for people to get documents, and interact employee or used to highlight people, say in socially to get information without engaging in explicit different departments, who work on similar searches”. Furthermore, they identified that “design topics. The paper reports the design and documentation seems to be biased toward technical positive results of an experiment conducted at aspects of the chosen solution, while information about Risø National Laboratory comparing four the context of the design process is typically not different IR searching and clustering available. Hence, people become a critical source of approaches using real users’ web pages. information because they can explain and argue about why specific decisions were made and what purpose is served by individual parts of the design”. A 1 Motivation problematic implication of this observation is that considerable knowledge about the context of a project This paper addresses the problem of automatically is lost when staff leave an organisation. One of the matching people to other people in an organisation main aims of this work is to support finding of based on already existing written documents describing colleagues retrospectively, based on automatically keeping records of who work together. Thus, for example, when a key document is written by S Jones The copyright of this paper belongs to the paper’s authors. Per- who has since left the company, a record will be mission to copy without fee all or part of this material is granted available of whose work was closest to Jones at the provided that the copies are not made or distributed for direct time the document was written. This colleague should commercial advantage. Proc. of the Third Int. Conf. on Practical Aspects of 1 Now at: Computer Science Department, Univerity of Strathclyde, Knowledge Management (PAKM2000) Glasgow, G1 1XH, Scotland Basel, Switzerland, 30-31 Oct. 2000, (U. Reimer, ed.) mailto: mark.dunlop@cs.strath.ac.uk http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-34 http://www.cs.strath.ac.uk/~mdd/ M.D. Dunlop 9-1 then, hopefully, be able to act as a contact point for developed over many years to support searching for finding other people and documents concerning the documents [e.g. Van Rijsbergen 1979, Baeza-Yates & project. Ribeiro-Neto 1999]. In large and distributed organisations, it is highly possible that two people will have similar interests or 2.1 Searching be working on similar projects without begin aware of each other. As an example, many Universities have The first technique used here to match people is based several locations in which related work may be carried on indexing staff web pages then performing searches out in different departments and from different based on finding the most similar web pages for each backgrounds. It is hoped that the models presented in user. For example, the content of user Ui’s web page is this paper will help to bring together people who have used as the query for a search of all other staff home similar research interests or work areas. This has been pages within an organisation, resulting in a ranked list, identified as one of the major social implications of a L, of those users who are closest to user Ui (i.e. L1 is move from physical libraries to digital ones: one no the closest person to Ui, L2 the next closest, etc.) as longer fortuitously meets fellow researchers at the defined by the content of their home pages. shelves. In a preface to their work on supporting interaction with and awareness of others in digital The experiments were run using a baseline ranked IR libraries, Robertson and Reese [1999] state that engine developed in house using standard IR “libraries are hubs for social and intellectual techniques [e.g. Salton & McGill 1983, Frakes & interactions in communities and organisations. Virtual Baeza-Yates 1992, Sparck Jones & Willett 1997] and libraries should serve the same purpose, yet virtual implemented in Java 1.1. Documents were indexed libraries often focus simply on making their holdings using: available”. Twidale and Nichols [1998] give a short • tf/idf weighting, which weights terms proportional overview of, so called, matchmaking systems and link to how often they occur in the current document these with other work on computer supported co- but inversely to how often they occur in the operative work in information retrieval (expanded collection as a whole [Sparck Jones 1972]; description of their work on matchmaking in digital libraries can be found in [Twidale and Nichols 1997]). • a simple stop-word list based on the collection This paper reports an investigation into the use of itself, the 30 most common words in the collection information retrieval (IR) techniques to automatically were not indexed; match people based on their web pages. The resulting • Porter’s stemming algorithm, an algorithmic matches could be used to highlight people who are stemmer that conflates variants of a words into the working closely together and this information could be same base form, e.g. walking, walks etc all used to highlight potential collaborations now, or used conflate to walk [Porter 1980]; historically to point searchers to colleagues of staff who have, say, left the company. The paper starts by • and the cosine matching function, an IR standard describing the IR and clustering techniques used in the that takes into account term weights and document experiments, then it describes the experimental lengths [Salton and McGill 1983]. framework, the results of the experiment and, finally, presents a discussion of potential extensions to the The IR engine was designed to index web pages: it only model. indexes content baring sections, omitting HTML tags, and gives greater weight to words in the title of the 2 The techniques page. With the advent of large search engines over the World 2.2 Clustering Wide Web, searching techniques are increasingly used to find people based on their name and, less frequently, Clustering techniques have long been used in IR to based on similar research portfolios. The experiments improve the performance of search engines, both in reported here target finding people by similar topic and terms of timing and quality or results [e.g. Jardine and are based around testing four different solutions to the Van Rijsbergen 1971, Van Rijsbergen and Croft 1979 problem. These solutions can be classified into two and Griffiths, Luckhurst and Willett 1986]. This work categories: simple searching (one approach) and three follows from the observation, known as the cluster approaches to cluster based matching. All of the hypothesis, that relevant documents are more like one approaches are based on indexing users’ home pages another than they are to non-relevant documents [Van using standard IR techniques. IR techniques have been Rijsbergen & Sparck Jones 1973]. The work in this M.D. Dunlop 9-2 paper investigated the use of clustering techniques to • documents are clustered in a ‘stable’ manner - improve the performance of people matching. Three clusters of three documents are permitted when clustering techniques were used: balanced clustering, both documents in a lower pair on the ranked list single link clustering and group average clustering. All have stronger links with already paired documents; these clustering algorithms are hierarchic agglomerative algorithms, meaning a hierarchical structure of clusters • the resulting hierarchy is tight and fairly balanced and sub-clusters is created by starting with small with a maximum outage at any node of 3 and a clusters and adding documents and merging clusters normal minimum of 2 (occasionally single node until a single cluster remains. The clustering techniques clusters are formed) were used to produce a hierarchical clustering of the users, H, that, hopefully, has similar users grouped 2.2.2 Single Link Clustering together on the lower levels of the hierarchy. Single link and group average were chosen for showing Single link clustering is based on creating a hierarchical significantly different performance in comparative tree by continually inserting an additional node that experiments for document retrieval [Griffiths, satisfies the following criteria: Luckhurst and Willett 1986] with balanced clustering ƒ the new node is currently outside the hierarchy; being added following their observation that small clusters appear to be a strong factor in performance of ƒ of all similarities between nodes inside and outside clustering algorithms. the hierarchy, the new node is selected that has the strongest similarity. It is then added to the 2.2.1 Balanced Clustering hierarchy at a level based on how strong the similarity is. In the balanced clustering approach each user Ui is grouped with another one or two users based on the This approach is fairly fast and results in hierarchies similarity between the nodes (as defined by the same IR where the closest nearest neighbours are at lower levels engine and indexing approaches used in the baseline of the hierarchy. However, it leads to non-balanced searching method). These pairs or triples are then clusters and does not yield a binary hierarchy – many grouped together with the most similar other group node-node comparisons can have the same strength of based on the average vector for the groupings (this is similarity thus many documents can be linked at the essentially a balanced variation of group average same level in the hierarchy. clustering discussed later). Again these grouping are further grouped into pairs or triples until the second top The implementation was based on the pseudo code level where a pair is forced. The process is more clearly shown below. The pseudo code based closely on that described by the following pseudo code: from [Voorhees 1996], where the reader is directed for a more complete description. repeat take current set of documents or most recent set of clusters // initialise hierarchy and insert document one into it calculate all descriptor-descriptor comparisons for (i=2 to collectionSize) insert descriptor-descriptor pairs into a list sorted by weight info[i].sim = 0; for each pair working down list info[i].inHierarchy = false; if neither element has been assigned then info[i].nn = UNDEF; record this pair as a new cluster currentID = 1; else if both elements are assigned & both would prefer a higher cluster // place the document having maximum similarity with & those higher clusters are currently pairs // a document in the hierarchy into the hierarchy until & the higher clusters are different // all documents are in the hierarchy add both as a 3rd to appropriate higher clusters while (currentID ≠ UNDEF) else info[currentID].inHierarchy = TRUE; ignore this pairing // they can’t be assigned just now ComputeSims(currentID); until only one cluster remains maxSim = 0; nextID = UNDEF; // update nearest neighbour for docs outside hierarchy Although a relatively slow algorithm, this approach for (i=1 to collectionSize) if (not info[i].inHierarchy) gives the following characteristics: if (sims[i] > info[i].sim) info[i].sim = sims[i]; info[i].nn = currentID; • the closest two documents are grouped together if (info[i].sim > maxSim) first, then the next available pair, and so on so that maxSim = info[i].sim; nextID = i; the strongest matches are honoured; if (nextID ≠ UNDEF) currentID = nextID; M.D. Dunlop 9-3 2.2.3 Group Average Clustering counting how many intermediate internal nodes there are in the hierarchy on the path through the hierarchy Group average link clustering is based on creating a from one leaf node to the other, see figure 1 for an hierarchical tree by initially creating a singleton cluster example). The list L was then based on these distances, for each document and marking these as “active”. The smallest highest in ranking. clustering then repeats the following until only one cluster remains active: ƒ merge the two clusters with most similar cluster representatives. Where the cluster representative is the mean vector of all document vectors in the cluster (with singleton clusters being self representing); ƒ make the new pairing active and the two clusters 3 3 1 Ui 4 4 which formed the pair non-active. Again, pseudo code and implementation were based on Figure 1: Distances from user Ui that extracted from [Voorhees 1996]: The following four subsections describe in more detail //initialise the four approaches taken: searching, balanced maxSim = 0; clustering, single link clustering and group average for (i = 1 to collectionSize) clustering. //create singleton clusters info[i].representative = document[i].representative computeSim(i, nn, sim) 3 Experimental setting info[i].nn = nn, info[i].sim = sim; info[i].size = 1; if (sim > maxSim) To evaluate the performance of the different people id1 = i; id2 = nn; maxSim = sim; finding algorithms, an experiment was conducted based numActive = collectionSize; on the web pages for the Systems Analysis Department for (i = 1 to numActive) active [i] = i; at Risø National Laboratory. The Risø S.A.D. web contains home pages for 60 staff within the department. //merge clusters until only 1 left or remaining sims are zero while (maxSim > 0 & numActive >1) To complete the test collection, each member of the smaller = min(id1,id2); larger = max(id1,id2); department was given a form in which they were asked info[smaller].centroid = mergeCentroids(smaller,larger); to assess, on a four point scale, how closely they info[smaller].size = info[smaller].size + info[larger].size; worked with each other person in the department. They a = index of larger in active; were given the instructions to tick: active[a] = active[numActive]; numActive--; mergeClusters(smaller, larger, maxSim) ƒ “3 boxes for those you work very closely with; maxSim = 0; for (each cluster a in active) ƒ 2 boxes for those you work with; if (info[a].nn = larger | info[a].nn = smaller) findMaxSim(a, nn, sim); ƒ 1 box for those whose work is related mildly; info[a].nn = nn; info[a].sim = sim; if (info[a].sim>maxSim) ƒ 0 boxes if your work is unconnected (or you don't id1 = a; id2 = info[a].nn; maxSim = info[a].sim; know who the person is!).” Group average clustering is slower than single-link To prevent biasing the staff towards defining these clustering, but is known to produce better clustering for terms closer to definitions that would match the document retrieval, guarantees to produce binary trees clustering algorithms, no explanation of the terms in the and keep closely related documents together in the instructions were given (e.g. the terms “work with”, initial pairs. Group average is essentially a non- “closely” and “unconnected” were left undefined). balanced, purely binary, version of balanced clustering. A total of 27 forms were returned with a mean of 11.8 people marked per form (minimum 1, maximum 33, 2.2.4 Evaluation mean 11.76, standard deviation 6.74) and 21.6 ticks per For consistent comparison with simple retrieval, a list form (min 2, max 53, mean 21.60, standard deviation of matching documents was required for each user. For 13.80). each user Ui, each node Uj in the hierarchy H was Following standard IR practice, precision and recall scored based on how far Uj was from Ui (based on figures were calculated. However, these were based on M.D. Dunlop 9-4 relevance weight rather than the more common approach of simply counting how many relevant 1 Group Average Balanced documents were found (following definition in [Reid 0.9 Single Link 2000] for non-binary test collections). For these 0.8 Searching experiments, relevance weight is defined as how many 0.7 ticks were marked divided by the maximum number of 0.6 ticks (e.g. 0.333 for 1 tick and 1 for 3 ticks). This 0.5 allows the results to highlight how well the system is at 0.4 finding those people who users work closely with over 0.3 those they simply work with. For a ranked list L with 0.2 the best match at position 1, second at position 2 etc., 0.1 precision and recall at position p were defined as follows: 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Relevance weight in L1...L p for user Ui Recall recalli,p = Total relevance weight for user Ui Figure 1: Results from experiments with full lists Relevance weight in L1...L p for user Ui precisioni,p = p Figure 2 shows the same results plotted by ranked position – showing how many ticks were accumulated, For consistency of evaluation, each algorithm produced on average, by each rank position up to the tenth rank a full ranking of all users (i.e. L contains an entry for position. This shows that for the best approach, group every user in the collection bar Ui). Individual recall average clustering, an average of 1.72 ticks were found precision graphs were then combined using standard at rank position one (57% perfect) whereas the worst macro-evaluation as defined in Van Rijsbergen [1979 approach, straight searching, was only achieving 0.88 pp 152-153]. ticks (29% accuracy). For group average clustering, the success rate rose to an average of 3.08 ticks by rank 4 Results position 2 – which could be considered as “one close colleague equivalent” within the first two suggestions Figure 1 shows the results for the four algorithms. It of the system. clearly shows that for this collection clustering-based approaches are more effective at matching people than simple searching and that overall, for basic IR 8 techniques, the performance of the system is good. Of 7 the clustering algorithms, group average performs best 6 for low recall (0..0.27 approximately). This is the region in which people matching programmes are most 5 likely to have their main impact – usually looking for 4 one or two substitute names rather than, say, 70% of colleagues. However, balanced retrieval is only slightly 3 poorer and, probably because of the more balanced 2 group average clustering hierarchy leading to more normalised distances within balanced clustering 1 the tree, is better than Group Average Clustering as the single link clustering straight searching recall levels are increased. 0 1 2 3 4 5 6 7 8 9 10 Rank position Figure 2: Mean number of ticks by rank position To examine how resilient each approach was to storing limited information on people who each user works closely with, the lists L created by each of the four matching techniques were limited to nine elements each and the evaluation repeated (this simulates, say, a monthly recording of the nine closes people for each member of staff so involves storing 9u records as M.D. Dunlop 9-5 opposed to u2 records, where u is the number of users). such as documents written by the staff (possibly Figure 3 shows that this results in a considerable drop including weekly diaries, which are common in many in performance for all methods at higher recall levels, engineering settings) and pages for but relatively little drop in the 0...0.2 precision range. In organisations/activities the user is involved in. particular the performance of group average clustering is almost unaffected in this range by storing limited One problem highlighted in the experiment reported information. Considering that the 0.2 recall point here is that of assessing the different interpretations of implies finding 20% of the colleagues and, in many “works with”. For example, staff completing the data settings, we are only likely to be looking for one/two - capture form were varied in how they reacted to this is a promising result for reduced storage. secretaries being on the list of staff as well as other research colleagues and department managers. It would be worth investigation classification methods so that 1 searches can be restricted, or at least rank positions 0.9 affected by, matching users who are at roughly the 0.8 same level in an organisation. This is likely to take 0.7 place somewhat automatically as those users home 0.6 pages are likely to contain more in common than users Precision 0.5 who are at drastically different levels in the 0.4 organisation but these claims need further investigation. 0.3 As the motivational scenarios for this work did not 0.2 require fast and frequent clustering of staff, only high 0.1 quality clustering algorithms were considered. It may 0 be worth investigating the use of faster and lighter 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 methods, such as scatter/gather, to compare their Recall performance with the tested algorithms. Searching Balanced Single Link Group Average 6 Conclusions Figure 3: Results from experiments with lists limited to The results of this experiment show that IR techniques 10 entries can be used to match users home pages with those of other users to find colleagues who work in similar areas 5 Extensions with a fair level of success. In the case of the experiment reported here, precision approximately 0.6 As documents rarely exist in isolation, approaches to can be achieved for low levels of recall where the document retrieval that make use of the hypertext approach is most likely to be used. At rank position 1 network in which documents are found [e.g. Dunlop an average of 1.72 ticks were found, where 2 ticks 1991, Dunlop and Van Rijsbergen 1993, Frei, and represents a staff declaration of someone that they work Stieger 1992] could be used to augment home pages with (but not closely) – this rises to a total of 3.01 ticks with connected documents (such as linked pages and by rank position two, equivalent to finding one close subpages). These hypertext-IR techniques have been colleague in the first two suggestions from the system. shown successful in improving retrieval for standard Furthermore, use of limited length lists shows that searching [Savoy 1996] and for accessing documents storing only the nine closest predictions has little effect that cannot be indexed directly [Harmandas et al 1997]. on the performance of the system at low recall while As well as providing the usual benefits of hypertext-IR drastically reducing storage requirements for historical indexing, if used for indexing staff home pages, the recording. The experiment compared straight IR approaches will also solve many of the problems of searching to match users with potential colleagues with these pages. The Risø home pages used in this three different clustering approaches (balanced, single experiment were fairly consistent corporate style mini- link and group average), all clustering approaches CVs. Less consistent pages, such as those typically performed better with group average being the best found in universities, have many problems such as overall (and noticeably more stable at low recall when frame based front pages (which don’t actually have any using reduced length lists). This indicates that it is content on the base home page URL), very simple front better to connect people based on the best overall pages with linked pages and drastically different styles arrangement (clustering approaches create a single best and quality and quantity of information provided. The cluster hierarchy for the whole department then rank for use of hypertext-IR techniques could also bring in individual users) rather than the best arrangement for many other sources of evidence as to users’ activities, M.D. Dunlop 9-6 each individual person as performed in straight and Visual Information Research (ELVIRA 4), searching. Milton Keynes, Aslib: London, UK, pp 31-38, May 1997. Further work is planned to investigate the results here in a corporate setting using different sources of Porter, M. F., “An algorithm for suffix stripping”, documentation, over a longer timescale and using more Program, v 14(3), pp130-137, July 1980 users as the test base. Hypertext-IR approaches will (reproduced in Sparck Jones and Willett 1997) also be investigated to see if they improve the Reid, J., “A task oriented non-interactive evaluation effectiveness of the clustering approaches and make methodology for information retrieval systems”, them more amenable to highly variable styles of home Information Retrieval, v 2(1), 2000 to appear. pages visible in some organisations. Robertson, S., and Reese, K., “A virtual library for building community and sharing knowledge”, Acknowledgements International Journal of Human-Computer Studies, Many thanks are due to the staff of Risø's Systems vol 51, pp 663-685, 1999. Analysis Department who completed questionnaires as Salton, G., and McGill, M.J., Introduction to modern part of the evaluation work. information retrieval, McGraw-Hill, 1983. The project was supported by the Danish Research Savoy, J., "An Extended Vector-Processing Scheme for Foundation's Centre for Human Machine Interaction. Searching Information in Hypertext Systems", Information Processing and Management, v32(2), 155-170, March 1996. References Sparck Jones, K., “A statistical interpretation of term Baeza-Yates, R., & Ribeiro-Neto, B., Modern specificity and its application in retrieval”, Journal Information Retrieval, ACM Press 1999. of Documentation, v28, pp 11-21, 1972. Dunlop, MD, Multimedia Information Retrieval, PhD Sparck Jones, K., and Willett, P. (Eds), Readings in Thesis, Glasgow University Computing Science Information Retrieval, Morgan Kaufmann, 1997. Research Report 1991/ R21, October 1991. Twidale, M.B., and Nichols, D.M., “Computer Dunlop, M.D., and Van Rijsbergen, C.J., "Hypermedia supported cooperative work in information search and free text retrieval", Information Processing and and retrieval”, in Williams, E.M. (Ed), Annual Management, vol 29(3), May 1993. Review of Information Science and Technology Frakes, W.B., and Baeza-Yates, R. (Eds), Information (ARIST),v33, pp259-319,Association of Information Retrieval: Data Structures and Algorithms, Prentice Science, 1999. Hall, 1992. Van Rijsbergen, C.J. Information Retrieval (second Frei H.P., and Stieger D., “Making use of hypertext edition). Butterworths, 1979. links when retrieving information”, Proceedings Van Rijsbergen, C.J., and Sparck Jones, K., “A test for ACM-ECHT’92, Milan, Italy, pp. 102-111, 1992. the separation of relevant and non-relevant Griffiths, A., Luckhurst, C., and Willett, P., “Using documents in experimental retrieval collections”, inderdocument similarity information in document Journal of Documentation, v29, pp251-7, 1973. retrieval systems”, Journal of American Society for Voorhees, E.M., Implementing Agglomerative Information Science, v37, p3-11, 1986 (reproduced Hierarchic Clustering Algorithms for Use in in Sparck Jones and Willett 1997). Document Retrieval, Technical report TR 86-765, Harmandas, V., Sanderson, M., and M. D. Dunlop, Department of Computer Science, Cornell M.D:, "Image retrieval by hypertext links", University, July 1986. Proceedings of SIGIR-97, 1997. Hertzum, M. and Pejtersen, A.M., “The information- seeking practices of engineers: Searching for documents as well as for people”. To appear in Information Processing and Management. Nichols, D.M., and Twidale, M.B., “Matchmaking and privacy in the digital library: striking the right balance”, In Proceedings of the 4th UK/International Conference on Electronic Library M.D. Dunlop 9-7