=Paper=
{{Paper
|id=Vol-34/paper-10
|storemode=property
|title=Development and Evaluation of Clustering Techniques for Finding People
|pdfUrl=https://ceur-ws.org/Vol-34/dunlop.pdf
|volume=Vol-34
}}
==Development and Evaluation of Clustering Techniques for Finding People==
Development and evaluation of clustering techniques
for finding people
M. D. Dunlop1
Centre for Human-Machine Interaction
Systems Analysis Department., Risø National Laboratory, P.O. Box 49
4000 Roskilde, Denmark
mailto:mark.dunlop@risoe.dk http://www.chmi.dk/people/mdd/
Abstract the work of the people. Two possible scenarios where
this form of matching would be helpful were used as
Typically in a large organisation much the main motivation behind this work: 1
expertise and knowledge is held informally
within employees’ own memories. When In any organisation a large amount of information
employees leave an organisation many concerning large projects is typically not documented
documented links that go through that person but held only in the staff’s memories. Such
are broken and no mechanism is usually undocumented information typically includes details of
available to overcome these broken links. This who was involved in a project in consultation roles,
matchmaking problem is related to the what the problems of the project were and how these
problem of finding potential work partners in a were solved (final project documentation often only
large and distributed organisation. This paper records the final result and not the process, which is
reports a comparative investigation into using arguably the most important information for reuse). In
standard information retrieval techniques to the engineering domain, Hertzum & Pejtersen [1999]
group employees together based on their web have identified that people typically interleave
pages. This information can, hopefully, be searching for people with searching for documents: “we
subsequently used to redirect broken links to find that engineers search for documents to find people,
people who worked closely with a departed search for people to get documents, and interact
employee or used to highlight people, say in socially to get information without engaging in explicit
different departments, who work on similar searches”. Furthermore, they identified that “design
topics. The paper reports the design and documentation seems to be biased toward technical
positive results of an experiment conducted at aspects of the chosen solution, while information about
Risø National Laboratory comparing four the context of the design process is typically not
different IR searching and clustering available. Hence, people become a critical source of
approaches using real users’ web pages. information because they can explain and argue about
why specific decisions were made and what purpose is
served by individual parts of the design”. A
1 Motivation problematic implication of this observation is that
considerable knowledge about the context of a project
This paper addresses the problem of automatically is lost when staff leave an organisation. One of the
matching people to other people in an organisation main aims of this work is to support finding of
based on already existing written documents describing colleagues retrospectively, based on automatically
keeping records of who work together. Thus, for
example, when a key document is written by S Jones
The copyright of this paper belongs to the paper’s authors. Per- who has since left the company, a record will be
mission to copy without fee all or part of this material is granted available of whose work was closest to Jones at the
provided that the copies are not made or distributed for direct time the document was written. This colleague should
commercial advantage.
Proc. of the Third Int. Conf. on Practical Aspects of
1 Now at: Computer Science Department, Univerity of Strathclyde,
Knowledge Management (PAKM2000)
Glasgow, G1 1XH, Scotland
Basel, Switzerland, 30-31 Oct. 2000, (U. Reimer, ed.) mailto: mark.dunlop@cs.strath.ac.uk
http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-34 http://www.cs.strath.ac.uk/~mdd/
M.D. Dunlop 9-1
then, hopefully, be able to act as a contact point for developed over many years to support searching for
finding other people and documents concerning the documents [e.g. Van Rijsbergen 1979, Baeza-Yates &
project. Ribeiro-Neto 1999].
In large and distributed organisations, it is highly
possible that two people will have similar interests or 2.1 Searching
be working on similar projects without begin aware of
each other. As an example, many Universities have The first technique used here to match people is based
several locations in which related work may be carried on indexing staff web pages then performing searches
out in different departments and from different based on finding the most similar web pages for each
backgrounds. It is hoped that the models presented in user. For example, the content of user Ui’s web page is
this paper will help to bring together people who have used as the query for a search of all other staff home
similar research interests or work areas. This has been pages within an organisation, resulting in a ranked list,
identified as one of the major social implications of a L, of those users who are closest to user Ui (i.e. L1 is
move from physical libraries to digital ones: one no the closest person to Ui, L2 the next closest, etc.) as
longer fortuitously meets fellow researchers at the defined by the content of their home pages.
shelves. In a preface to their work on supporting
interaction with and awareness of others in digital The experiments were run using a baseline ranked IR
libraries, Robertson and Reese [1999] state that engine developed in house using standard IR
“libraries are hubs for social and intellectual techniques [e.g. Salton & McGill 1983, Frakes &
interactions in communities and organisations. Virtual Baeza-Yates 1992, Sparck Jones & Willett 1997] and
libraries should serve the same purpose, yet virtual implemented in Java 1.1. Documents were indexed
libraries often focus simply on making their holdings using:
available”. Twidale and Nichols [1998] give a short
• tf/idf weighting, which weights terms proportional
overview of, so called, matchmaking systems and link
to how often they occur in the current document
these with other work on computer supported co-
but inversely to how often they occur in the
operative work in information retrieval (expanded
collection as a whole [Sparck Jones 1972];
description of their work on matchmaking in digital
libraries can be found in [Twidale and Nichols 1997]). • a simple stop-word list based on the collection
This paper reports an investigation into the use of itself, the 30 most common words in the collection
information retrieval (IR) techniques to automatically were not indexed;
match people based on their web pages. The resulting • Porter’s stemming algorithm, an algorithmic
matches could be used to highlight people who are stemmer that conflates variants of a words into the
working closely together and this information could be
same base form, e.g. walking, walks etc all
used to highlight potential collaborations now, or used
conflate to walk [Porter 1980];
historically to point searchers to colleagues of staff who
have, say, left the company. The paper starts by • and the cosine matching function, an IR standard
describing the IR and clustering techniques used in the that takes into account term weights and document
experiments, then it describes the experimental lengths [Salton and McGill 1983].
framework, the results of the experiment and, finally,
presents a discussion of potential extensions to the The IR engine was designed to index web pages: it only
model. indexes content baring sections, omitting HTML tags,
and gives greater weight to words in the title of the
2 The techniques page.
With the advent of large search engines over the World 2.2 Clustering
Wide Web, searching techniques are increasingly used
to find people based on their name and, less frequently, Clustering techniques have long been used in IR to
based on similar research portfolios. The experiments improve the performance of search engines, both in
reported here target finding people by similar topic and terms of timing and quality or results [e.g. Jardine and
are based around testing four different solutions to the Van Rijsbergen 1971, Van Rijsbergen and Croft 1979
problem. These solutions can be classified into two and Griffiths, Luckhurst and Willett 1986]. This work
categories: simple searching (one approach) and three follows from the observation, known as the cluster
approaches to cluster based matching. All of the hypothesis, that relevant documents are more like one
approaches are based on indexing users’ home pages another than they are to non-relevant documents [Van
using standard IR techniques. IR techniques have been Rijsbergen & Sparck Jones 1973]. The work in this
M.D. Dunlop 9-2
paper investigated the use of clustering techniques to • documents are clustered in a ‘stable’ manner -
improve the performance of people matching. Three clusters of three documents are permitted when
clustering techniques were used: balanced clustering, both documents in a lower pair on the ranked list
single link clustering and group average clustering. All have stronger links with already paired documents;
these clustering algorithms are hierarchic agglomerative
algorithms, meaning a hierarchical structure of clusters • the resulting hierarchy is tight and fairly balanced
and sub-clusters is created by starting with small with a maximum outage at any node of 3 and a
clusters and adding documents and merging clusters normal minimum of 2 (occasionally single node
until a single cluster remains. The clustering techniques clusters are formed)
were used to produce a hierarchical clustering of the
users, H, that, hopefully, has similar users grouped 2.2.2 Single Link Clustering
together on the lower levels of the hierarchy. Single
link and group average were chosen for showing Single link clustering is based on creating a hierarchical
significantly different performance in comparative tree by continually inserting an additional node that
experiments for document retrieval [Griffiths, satisfies the following criteria:
Luckhurst and Willett 1986] with balanced clustering the new node is currently outside the hierarchy;
being added following their observation that small
clusters appear to be a strong factor in performance of of all similarities between nodes inside and outside
clustering algorithms. the hierarchy, the new node is selected that has the
strongest similarity. It is then added to the
2.2.1 Balanced Clustering hierarchy at a level based on how strong the
similarity is.
In the balanced clustering approach each user Ui is
grouped with another one or two users based on the This approach is fairly fast and results in hierarchies
similarity between the nodes (as defined by the same IR where the closest nearest neighbours are at lower levels
engine and indexing approaches used in the baseline of the hierarchy. However, it leads to non-balanced
searching method). These pairs or triples are then clusters and does not yield a binary hierarchy – many
grouped together with the most similar other group node-node comparisons can have the same strength of
based on the average vector for the groupings (this is similarity thus many documents can be linked at the
essentially a balanced variation of group average same level in the hierarchy.
clustering discussed later). Again these grouping are
further grouped into pairs or triples until the second top The implementation was based on the pseudo code
level where a pair is forced. The process is more clearly shown below. The pseudo code based closely on that
described by the following pseudo code: from [Voorhees 1996], where the reader is directed for
a more complete description.
repeat
take current set of documents or most recent set of clusters // initialise hierarchy and insert document one into it
calculate all descriptor-descriptor comparisons for (i=2 to collectionSize)
insert descriptor-descriptor pairs into a list sorted by weight info[i].sim = 0;
for each pair working down list info[i].inHierarchy = false;
if neither element has been assigned then info[i].nn = UNDEF;
record this pair as a new cluster currentID = 1;
else if both elements are assigned
& both would prefer a higher cluster // place the document having maximum similarity with
& those higher clusters are currently pairs // a document in the hierarchy into the hierarchy until
& the higher clusters are different // all documents are in the hierarchy
add both as a 3rd to appropriate higher clusters while (currentID ≠ UNDEF)
else info[currentID].inHierarchy = TRUE;
ignore this pairing // they can’t be assigned just now ComputeSims(currentID);
until only one cluster remains maxSim = 0; nextID = UNDEF;
// update nearest neighbour for docs outside hierarchy
Although a relatively slow algorithm, this approach for (i=1 to collectionSize)
if (not info[i].inHierarchy)
gives the following characteristics: if (sims[i] > info[i].sim)
info[i].sim = sims[i]; info[i].nn = currentID;
• the closest two documents are grouped together if (info[i].sim > maxSim)
first, then the next available pair, and so on so that maxSim = info[i].sim; nextID = i;
the strongest matches are honoured; if (nextID ≠ UNDEF)
currentID = nextID;
M.D. Dunlop 9-3
2.2.3 Group Average Clustering counting how many intermediate internal nodes there
are in the hierarchy on the path through the hierarchy
Group average link clustering is based on creating a from one leaf node to the other, see figure 1 for an
hierarchical tree by initially creating a singleton cluster example). The list L was then based on these distances,
for each document and marking these as “active”. The smallest highest in ranking.
clustering then repeats the following until only one
cluster remains active:
merge the two clusters with most similar cluster
representatives. Where the cluster representative is
the mean vector of all document vectors in the
cluster (with singleton clusters being self
representing);
make the new pairing active and the two clusters
3 3 1 Ui 4 4
which formed the pair non-active.
Again, pseudo code and implementation were based on Figure 1: Distances from user Ui
that extracted from [Voorhees 1996]:
The following four subsections describe in more detail
//initialise the four approaches taken: searching, balanced
maxSim = 0; clustering, single link clustering and group average
for (i = 1 to collectionSize)
clustering.
//create singleton clusters
info[i].representative = document[i].representative
computeSim(i, nn, sim) 3 Experimental setting
info[i].nn = nn, info[i].sim = sim; info[i].size = 1;
if (sim > maxSim) To evaluate the performance of the different people
id1 = i; id2 = nn; maxSim = sim; finding algorithms, an experiment was conducted based
numActive = collectionSize; on the web pages for the Systems Analysis Department
for (i = 1 to numActive) active [i] = i; at Risø National Laboratory. The Risø S.A.D. web
contains home pages for 60 staff within the department.
//merge clusters until only 1 left or remaining sims are zero
while (maxSim > 0 & numActive >1)
To complete the test collection, each member of the
smaller = min(id1,id2); larger = max(id1,id2); department was given a form in which they were asked
info[smaller].centroid = mergeCentroids(smaller,larger); to assess, on a four point scale, how closely they
info[smaller].size = info[smaller].size + info[larger].size; worked with each other person in the department. They
a = index of larger in active; were given the instructions to tick:
active[a] = active[numActive]; numActive--;
mergeClusters(smaller, larger, maxSim) “3 boxes for those you work very closely with;
maxSim = 0;
for (each cluster a in active) 2 boxes for those you work with;
if (info[a].nn = larger | info[a].nn = smaller)
findMaxSim(a, nn, sim); 1 box for those whose work is related mildly;
info[a].nn = nn; info[a].sim = sim;
if (info[a].sim>maxSim) 0 boxes if your work is unconnected (or you don't
id1 = a; id2 = info[a].nn; maxSim = info[a].sim; know who the person is!).”
Group average clustering is slower than single-link To prevent biasing the staff towards defining these
clustering, but is known to produce better clustering for terms closer to definitions that would match the
document retrieval, guarantees to produce binary trees clustering algorithms, no explanation of the terms in the
and keep closely related documents together in the instructions were given (e.g. the terms “work with”,
initial pairs. Group average is essentially a non- “closely” and “unconnected” were left undefined).
balanced, purely binary, version of balanced clustering. A total of 27 forms were returned with a mean of 11.8
people marked per form (minimum 1, maximum 33,
2.2.4 Evaluation mean 11.76, standard deviation 6.74) and 21.6 ticks per
For consistent comparison with simple retrieval, a list form (min 2, max 53, mean 21.60, standard deviation
of matching documents was required for each user. For 13.80).
each user Ui, each node Uj in the hierarchy H was Following standard IR practice, precision and recall
scored based on how far Uj was from Ui (based on figures were calculated. However, these were based on
M.D. Dunlop 9-4
relevance weight rather than the more common
approach of simply counting how many relevant 1
Group Average
Balanced
documents were found (following definition in [Reid 0.9
Single Link
2000] for non-binary test collections). For these 0.8 Searching
experiments, relevance weight is defined as how many 0.7
ticks were marked divided by the maximum number of
0.6
ticks (e.g. 0.333 for 1 tick and 1 for 3 ticks). This
0.5
allows the results to highlight how well the system is at
0.4
finding those people who users work closely with over
0.3
those they simply work with. For a ranked list L with
0.2
the best match at position 1, second at position 2 etc.,
0.1
precision and recall at position p were defined as
follows: 0
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
Relevance weight in L1...L p for user Ui Recall
recalli,p =
Total relevance weight for user Ui
Figure 1: Results from experiments with full lists
Relevance weight in L1...L p for user Ui
precisioni,p =
p Figure 2 shows the same results plotted by ranked
position – showing how many ticks were accumulated,
For consistency of evaluation, each algorithm produced on average, by each rank position up to the tenth rank
a full ranking of all users (i.e. L contains an entry for position. This shows that for the best approach, group
every user in the collection bar Ui). Individual recall average clustering, an average of 1.72 ticks were found
precision graphs were then combined using standard at rank position one (57% perfect) whereas the worst
macro-evaluation as defined in Van Rijsbergen [1979 approach, straight searching, was only achieving 0.88
pp 152-153]. ticks (29% accuracy). For group average clustering, the
success rate rose to an average of 3.08 ticks by rank
4 Results position 2 – which could be considered as “one close
colleague equivalent” within the first two suggestions
Figure 1 shows the results for the four algorithms. It of the system.
clearly shows that for this collection clustering-based
approaches are more effective at matching people than
simple searching and that overall, for basic IR 8
techniques, the performance of the system is good. Of 7
the clustering algorithms, group average performs best
6
for low recall (0..0.27 approximately). This is the
region in which people matching programmes are most 5
likely to have their main impact – usually looking for
4
one or two substitute names rather than, say, 70% of
colleagues. However, balanced retrieval is only slightly 3
poorer and, probably because of the more balanced 2 group average clustering
hierarchy leading to more normalised distances within balanced clustering
1
the tree, is better than Group Average Clustering as the single link clustering
straight searching
recall levels are increased. 0
1 2 3 4 5 6 7 8 9 10
Rank position
Figure 2: Mean number of ticks by rank position
To examine how resilient each approach was to storing
limited information on people who each user works
closely with, the lists L created by each of the four
matching techniques were limited to nine elements each
and the evaluation repeated (this simulates, say, a
monthly recording of the nine closes people for each
member of staff so involves storing 9u records as
M.D. Dunlop 9-5
opposed to u2 records, where u is the number of users). such as documents written by the staff (possibly
Figure 3 shows that this results in a considerable drop including weekly diaries, which are common in many
in performance for all methods at higher recall levels, engineering settings) and pages for
but relatively little drop in the 0...0.2 precision range. In organisations/activities the user is involved in.
particular the performance of group average clustering
is almost unaffected in this range by storing limited One problem highlighted in the experiment reported
information. Considering that the 0.2 recall point here is that of assessing the different interpretations of
implies finding 20% of the colleagues and, in many “works with”. For example, staff completing the data
settings, we are only likely to be looking for one/two - capture form were varied in how they reacted to
this is a promising result for reduced storage. secretaries being on the list of staff as well as other
research colleagues and department managers. It would
be worth investigation classification methods so that
1 searches can be restricted, or at least rank positions
0.9 affected by, matching users who are at roughly the
0.8
same level in an organisation. This is likely to take
0.7
place somewhat automatically as those users home
0.6
pages are likely to contain more in common than users
Precision
0.5
who are at drastically different levels in the
0.4
organisation but these claims need further investigation.
0.3
As the motivational scenarios for this work did not
0.2
require fast and frequent clustering of staff, only high
0.1 quality clustering algorithms were considered. It may
0 be worth investigating the use of faster and lighter
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
methods, such as scatter/gather, to compare their
Recall
performance with the tested algorithms.
Searching Balanced Single Link Group Average
6 Conclusions
Figure 3: Results from experiments with lists limited to
The results of this experiment show that IR techniques
10 entries
can be used to match users home pages with those of
other users to find colleagues who work in similar areas
5 Extensions with a fair level of success. In the case of the
experiment reported here, precision approximately 0.6
As documents rarely exist in isolation, approaches to can be achieved for low levels of recall where the
document retrieval that make use of the hypertext approach is most likely to be used. At rank position 1
network in which documents are found [e.g. Dunlop an average of 1.72 ticks were found, where 2 ticks
1991, Dunlop and Van Rijsbergen 1993, Frei, and represents a staff declaration of someone that they work
Stieger 1992] could be used to augment home pages with (but not closely) – this rises to a total of 3.01 ticks
with connected documents (such as linked pages and by rank position two, equivalent to finding one close
subpages). These hypertext-IR techniques have been colleague in the first two suggestions from the system.
shown successful in improving retrieval for standard Furthermore, use of limited length lists shows that
searching [Savoy 1996] and for accessing documents storing only the nine closest predictions has little effect
that cannot be indexed directly [Harmandas et al 1997]. on the performance of the system at low recall while
As well as providing the usual benefits of hypertext-IR drastically reducing storage requirements for historical
indexing, if used for indexing staff home pages, the recording. The experiment compared straight IR
approaches will also solve many of the problems of searching to match users with potential colleagues with
these pages. The Risø home pages used in this three different clustering approaches (balanced, single
experiment were fairly consistent corporate style mini- link and group average), all clustering approaches
CVs. Less consistent pages, such as those typically performed better with group average being the best
found in universities, have many problems such as overall (and noticeably more stable at low recall when
frame based front pages (which don’t actually have any using reduced length lists). This indicates that it is
content on the base home page URL), very simple front better to connect people based on the best overall
pages with linked pages and drastically different styles arrangement (clustering approaches create a single best
and quality and quantity of information provided. The cluster hierarchy for the whole department then rank for
use of hypertext-IR techniques could also bring in individual users) rather than the best arrangement for
many other sources of evidence as to users’ activities,
M.D. Dunlop 9-6
each individual person as performed in straight and Visual Information Research (ELVIRA 4),
searching. Milton Keynes, Aslib: London, UK, pp 31-38, May
1997.
Further work is planned to investigate the results here
in a corporate setting using different sources of Porter, M. F., “An algorithm for suffix stripping”,
documentation, over a longer timescale and using more Program, v 14(3), pp130-137, July 1980
users as the test base. Hypertext-IR approaches will (reproduced in Sparck Jones and Willett 1997)
also be investigated to see if they improve the Reid, J., “A task oriented non-interactive evaluation
effectiveness of the clustering approaches and make methodology for information retrieval systems”,
them more amenable to highly variable styles of home Information Retrieval, v 2(1), 2000 to appear.
pages visible in some organisations.
Robertson, S., and Reese, K., “A virtual library for
building community and sharing knowledge”,
Acknowledgements International Journal of Human-Computer Studies,
Many thanks are due to the staff of Risø's Systems vol 51, pp 663-685, 1999.
Analysis Department who completed questionnaires as Salton, G., and McGill, M.J., Introduction to modern
part of the evaluation work. information retrieval, McGraw-Hill, 1983.
The project was supported by the Danish Research Savoy, J., "An Extended Vector-Processing Scheme for
Foundation's Centre for Human Machine Interaction. Searching Information in Hypertext Systems",
Information Processing and Management, v32(2),
155-170, March 1996.
References
Sparck Jones, K., “A statistical interpretation of term
Baeza-Yates, R., & Ribeiro-Neto, B., Modern specificity and its application in retrieval”, Journal
Information Retrieval, ACM Press 1999. of Documentation, v28, pp 11-21, 1972.
Dunlop, MD, Multimedia Information Retrieval, PhD Sparck Jones, K., and Willett, P. (Eds), Readings in
Thesis, Glasgow University Computing Science Information Retrieval, Morgan Kaufmann, 1997.
Research Report 1991/ R21, October 1991.
Twidale, M.B., and Nichols, D.M., “Computer
Dunlop, M.D., and Van Rijsbergen, C.J., "Hypermedia supported cooperative work in information search
and free text retrieval", Information Processing and and retrieval”, in Williams, E.M. (Ed), Annual
Management, vol 29(3), May 1993. Review of Information Science and Technology
Frakes, W.B., and Baeza-Yates, R. (Eds), Information (ARIST),v33, pp259-319,Association of Information
Retrieval: Data Structures and Algorithms, Prentice Science, 1999.
Hall, 1992. Van Rijsbergen, C.J. Information Retrieval (second
Frei H.P., and Stieger D., “Making use of hypertext edition). Butterworths, 1979.
links when retrieving information”, Proceedings Van Rijsbergen, C.J., and Sparck Jones, K., “A test for
ACM-ECHT’92, Milan, Italy, pp. 102-111, 1992. the separation of relevant and non-relevant
Griffiths, A., Luckhurst, C., and Willett, P., “Using documents in experimental retrieval collections”,
inderdocument similarity information in document Journal of Documentation, v29, pp251-7, 1973.
retrieval systems”, Journal of American Society for Voorhees, E.M., Implementing Agglomerative
Information Science, v37, p3-11, 1986 (reproduced Hierarchic Clustering Algorithms for Use in
in Sparck Jones and Willett 1997). Document Retrieval, Technical report TR 86-765,
Harmandas, V., Sanderson, M., and M. D. Dunlop, Department of Computer Science, Cornell
M.D:, "Image retrieval by hypertext links", University, July 1986.
Proceedings of SIGIR-97, 1997.
Hertzum, M. and Pejtersen, A.M., “The information-
seeking practices of engineers: Searching for
documents as well as for people”. To appear in
Information Processing and Management.
Nichols, D.M., and Twidale, M.B., “Matchmaking and
privacy in the digital library: striking the right
balance”, In Proceedings of the 4th
UK/International Conference on Electronic Library
M.D. Dunlop 9-7