Development and evaluation of clustering techniques
                                for finding people

                                              M. D. Dunlop1
                                  Centre for Human-Machine Interaction
                    Systems Analysis Department., Risø National Laboratory, P.O. Box 49
                                         4000 Roskilde, Denmark
                     mailto:mark.dunlop@risoe.dk     http://www.chmi.dk/people/mdd/


                             Abstract                                 the work of the people. Two possible scenarios where
                                                                       this form of matching would be helpful were used as
      Typically in a large organisation much                           the main motivation behind this work:               1


      expertise and knowledge is held informally
      within employees’ own memories. When                             In any organisation a large amount of information
      employees leave an organisation many                             concerning large projects is typically not documented
      documented links that go through that person                     but held only in the staff’s memories. Such
      are broken and no mechanism is usually                           undocumented information typically includes details of
      available to overcome these broken links. This                   who was involved in a project in consultation roles,
      matchmaking problem is related to the                            what the problems of the project were and how these
      problem of finding potential work partners in a                  were solved (final project documentation often only
      large and distributed organisation. This paper                   records the final result and not the process, which is
      reports a comparative investigation into using                   arguably the most important information for reuse). In
      standard information retrieval techniques to                     the engineering domain, Hertzum & Pejtersen [1999]
      group employees together based on their web                      have identified that people typically interleave
      pages. This information can, hopefully, be                       searching for people with searching for documents: “we
      subsequently used to redirect broken links to                    find that engineers search for documents to find people,
      people who worked closely with a departed                        search for people to get documents, and interact
      employee or used to highlight people, say in                     socially to get information without engaging in explicit
      different departments, who work on similar                       searches”. Furthermore, they identified that “design
      topics. The paper reports the design and                         documentation seems to be biased toward technical
      positive results of an experiment conducted at                   aspects of the chosen solution, while information about
      Risø National Laboratory comparing four                          the context of the design process is typically not
      different IR searching and clustering                            available. Hence, people become a critical source of
      approaches using real users’ web pages.                          information because they can explain and argue about
                                                                       why specific decisions were made and what purpose is
                                                                       served by individual parts of the design”. A
1 Motivation                                                           problematic implication of this observation is that
                                                                       considerable knowledge about the context of a project
This paper addresses the problem of automatically                      is lost when staff leave an organisation. One of the
matching people to other people in an organisation                     main aims of this work is to support finding of
based on already existing written documents describing                 colleagues retrospectively, based on automatically
                                                                       keeping records of who work together. Thus, for
                                                                       example, when a key document is written by S Jones
The copyright of this paper belongs to the paper’s authors. Per-       who has since left the company, a record will be
mission to copy without fee all or part of this material is granted    available of whose work was closest to Jones at the
provided that the copies are not made or distributed for direct        time the document was written. This colleague should
commercial advantage.

Proc. of the Third Int. Conf. on Practical Aspects of
                                                                       1 Now at: Computer Science Department, Univerity of Strathclyde,
Knowledge Management (PAKM2000)
                                                                                 Glasgow, G1 1XH, Scotland
Basel, Switzerland, 30-31 Oct. 2000, (U. Reimer, ed.)                            mailto: mark.dunlop@cs.strath.ac.uk
http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-34             http://www.cs.strath.ac.uk/~mdd/


M.D. Dunlop                                                                                                                          9-1
then, hopefully, be able to act as a contact point for       developed over many years to support searching for
finding other people and documents concerning the            documents [e.g. Van Rijsbergen 1979, Baeza-Yates &
project.                                                     Ribeiro-Neto 1999].
In large and distributed organisations, it is highly
possible that two people will have similar interests or      2.1 Searching
be working on similar projects without begin aware of
each other. As an example, many Universities have            The first technique used here to match people is based
several locations in which related work may be carried       on indexing staff web pages then performing searches
out in different departments and from different              based on finding the most similar web pages for each
backgrounds. It is hoped that the models presented in        user. For example, the content of user Ui’s web page is
this paper will help to bring together people who have       used as the query for a search of all other staff home
similar research interests or work areas. This has been      pages within an organisation, resulting in a ranked list,
identified as one of the major social implications of a      L, of those users who are closest to user Ui (i.e. L1 is
move from physical libraries to digital ones: one no         the closest person to Ui, L2 the next closest, etc.) as
longer fortuitously meets fellow researchers at the          defined by the content of their home pages.
shelves. In a preface to their work on supporting
interaction with and awareness of others in digital          The experiments were run using a baseline ranked IR
libraries, Robertson and Reese [1999] state that             engine developed in house using standard IR
“libraries are hubs for social and intellectual              techniques [e.g. Salton & McGill 1983, Frakes &
interactions in communities and organisations. Virtual       Baeza-Yates 1992, Sparck Jones & Willett 1997] and
libraries should serve the same purpose, yet virtual         implemented in Java 1.1. Documents were indexed
libraries often focus simply on making their holdings        using:
available”. Twidale and Nichols [1998] give a short
                                                             •    tf/idf weighting, which weights terms proportional
overview of, so called, matchmaking systems and link
                                                                  to how often they occur in the current document
these with other work on computer supported co-
                                                                  but inversely to how often they occur in the
operative work in information retrieval (expanded
                                                                  collection as a whole [Sparck Jones 1972];
description of their work on matchmaking in digital
libraries can be found in [Twidale and Nichols 1997]).       •    a simple stop-word list based on the collection
This paper reports an investigation into the use of               itself, the 30 most common words in the collection
information retrieval (IR) techniques to automatically            were not indexed;
match people based on their web pages. The resulting         •    Porter’s stemming algorithm, an algorithmic
matches could be used to highlight people who are                 stemmer that conflates variants of a words into the
working closely together and this information could be
                                                                  same base form, e.g. walking, walks etc all
used to highlight potential collaborations now, or used
                                                                  conflate to walk [Porter 1980];
historically to point searchers to colleagues of staff who
have, say, left the company. The paper starts by             •    and the cosine matching function, an IR standard
describing the IR and clustering techniques used in the           that takes into account term weights and document
experiments, then it describes the experimental                   lengths [Salton and McGill 1983].
framework, the results of the experiment and, finally,
presents a discussion of potential extensions to the         The IR engine was designed to index web pages: it only
model.                                                       indexes content baring sections, omitting HTML tags,
                                                             and gives greater weight to words in the title of the
2 The techniques                                             page.

With the advent of large search engines over the World       2.2 Clustering
Wide Web, searching techniques are increasingly used
to find people based on their name and, less frequently,     Clustering techniques have long been used in IR to
based on similar research portfolios. The experiments        improve the performance of search engines, both in
reported here target finding people by similar topic and     terms of timing and quality or results [e.g. Jardine and
are based around testing four different solutions to the     Van Rijsbergen 1971, Van Rijsbergen and Croft 1979
problem. These solutions can be classified into two          and Griffiths, Luckhurst and Willett 1986]. This work
categories: simple searching (one approach) and three        follows from the observation, known as the cluster
approaches to cluster based matching. All of the             hypothesis, that relevant documents are more like one
approaches are based on indexing users’ home pages           another than they are to non-relevant documents [Van
using standard IR techniques. IR techniques have been        Rijsbergen & Sparck Jones 1973]. The work in this


M.D. Dunlop                                                                                                       9-2
paper investigated the use of clustering techniques to                  •      documents are clustered in a ‘stable’ manner -
improve the performance of people matching. Three                              clusters of three documents are permitted when
clustering techniques were used: balanced clustering,                          both documents in a lower pair on the ranked list
single link clustering and group average clustering. All                       have stronger links with already paired documents;
these clustering algorithms are hierarchic agglomerative
algorithms, meaning a hierarchical structure of clusters                •      the resulting hierarchy is tight and fairly balanced
and sub-clusters is created by starting with small                             with a maximum outage at any node of 3 and a
clusters and adding documents and merging clusters                             normal minimum of 2 (occasionally single node
until a single cluster remains. The clustering techniques                      clusters are formed)
were used to produce a hierarchical clustering of the
users, H, that, hopefully, has similar users grouped                    2.2.2       Single Link Clustering
together on the lower levels of the hierarchy. Single
link and group average were chosen for showing                          Single link clustering is based on creating a hierarchical
significantly different performance in comparative                      tree by continually inserting an additional node that
experiments for document retrieval [Griffiths,                          satisfies the following criteria:
Luckhurst and Willett 1986] with balanced clustering                         the new node is currently outside the hierarchy;
being added following their observation that small
clusters appear to be a strong factor in performance of                      of all similarities between nodes inside and outside
clustering algorithms.                                                         the hierarchy, the new node is selected that has the
                                                                               strongest similarity. It is then added to the
2.2.1       Balanced Clustering                                                hierarchy at a level based on how strong the
                                                                               similarity is.
In the balanced clustering approach each user Ui is
grouped with another one or two users based on the                      This approach is fairly fast and results in hierarchies
similarity between the nodes (as defined by the same IR                 where the closest nearest neighbours are at lower levels
engine and indexing approaches used in the baseline                     of the hierarchy. However, it leads to non-balanced
searching method). These pairs or triples are then                      clusters and does not yield a binary hierarchy – many
grouped together with the most similar other group                      node-node comparisons can have the same strength of
based on the average vector for the groupings (this is                  similarity thus many documents can be linked at the
essentially a balanced variation of group average                       same level in the hierarchy.
clustering discussed later). Again these grouping are
further grouped into pairs or triples until the second top              The implementation was based on the pseudo code
level where a pair is forced. The process is more clearly               shown below. The pseudo code based closely on that
described by the following pseudo code:                                 from [Voorhees 1996], where the reader is directed for
                                                                        a more complete description.
    repeat
      take current set of documents or most recent set of clusters          // initialise hierarchy and insert document one into it
      calculate all descriptor-descriptor comparisons                       for (i=2 to collectionSize)
      insert descriptor-descriptor pairs into a list sorted by weight          info[i].sim = 0;
      for each pair working down list                                          info[i].inHierarchy = false;
         if neither element has been assigned then                             info[i].nn = UNDEF;
            record this pair as a new cluster                               currentID = 1;
         else if both elements are assigned
             & both would prefer a higher cluster                           // place the document having maximum similarity with
             & those higher clusters are currently pairs                    // a document in the hierarchy into the hierarchy until
             & the higher clusters are different                            // all documents are in the hierarchy
            add both as a 3rd to appropriate higher clusters                while (currentID ≠ UNDEF)
         else                                                                  info[currentID].inHierarchy = TRUE;
            ignore this pairing // they can’t be assigned just now             ComputeSims(currentID);
    until only one cluster remains                                             maxSim = 0; nextID = UNDEF;
                                                                               // update nearest neighbour for docs outside hierarchy
Although a relatively slow algorithm, this approach                            for (i=1 to collectionSize)
                                                                                  if (not info[i].inHierarchy)
gives the following characteristics:                                                 if (sims[i] > info[i].sim)
                                                                                        info[i].sim = sims[i]; info[i].nn = currentID;
•      the closest two documents are grouped together                                if (info[i].sim > maxSim)
       first, then the next available pair, and so on so that                           maxSim = info[i].sim; nextID = i;
       the strongest matches are honoured;                                     if (nextID ≠ UNDEF)
                                                                                  currentID = nextID;


M.D. Dunlop                                                                                                                              9-3
2.2.3    Group Average Clustering                                 counting how many intermediate internal nodes there
                                                                  are in the hierarchy on the path through the hierarchy
Group average link clustering is based on creating a              from one leaf node to the other, see figure 1 for an
hierarchical tree by initially creating a singleton cluster       example). The list L was then based on these distances,
for each document and marking these as “active”. The              smallest highest in ranking.
clustering then repeats the following until only one
cluster remains active:
   merge the two clusters with most similar cluster
     representatives. Where the cluster representative is
     the mean vector of all document vectors in the
     cluster (with singleton clusters being self
     representing);
   make the new pairing active and the two clusters
                                                                          3       3       1      Ui       4      4
     which formed the pair non-active.
Again, pseudo code and implementation were based on                           Figure 1: Distances from user Ui
that extracted from [Voorhees 1996]:
                                                                  The following four subsections describe in more detail
 //initialise                                                     the four approaches taken: searching, balanced
 maxSim = 0;                                                      clustering, single link clustering and group average
 for (i = 1 to collectionSize)
                                                                  clustering.
    //create singleton clusters
    info[i].representative = document[i].representative
    computeSim(i, nn, sim)                                        3 Experimental setting
    info[i].nn = nn, info[i].sim = sim; info[i].size = 1;
    if (sim > maxSim)                                             To evaluate the performance of the different people
       id1 = i; id2 = nn; maxSim = sim;                           finding algorithms, an experiment was conducted based
 numActive = collectionSize;                                      on the web pages for the Systems Analysis Department
 for (i = 1 to numActive) active [i] = i;                         at Risø National Laboratory. The Risø S.A.D. web
                                                                  contains home pages for 60 staff within the department.
 //merge clusters until only 1 left or remaining sims are zero
 while (maxSim > 0 & numActive >1)
                                                                  To complete the test collection, each member of the
   smaller = min(id1,id2); larger = max(id1,id2);                 department was given a form in which they were asked
   info[smaller].centroid = mergeCentroids(smaller,larger);       to assess, on a four point scale, how closely they
   info[smaller].size = info[smaller].size + info[larger].size;   worked with each other person in the department. They
   a = index of larger in active;                                 were given the instructions to tick:
   active[a] = active[numActive]; numActive--;
   mergeClusters(smaller, larger, maxSim)                            “3 boxes for those you work very closely with;
   maxSim = 0;
   for (each cluster a in active)                                    2 boxes for those you work with;
     if (info[a].nn = larger | info[a].nn = smaller)
        findMaxSim(a, nn, sim);                                      1 box for those whose work is related mildly;
        info[a].nn = nn; info[a].sim = sim;
     if (info[a].sim>maxSim)                                         0 boxes if your work is unconnected (or you don't
        id1 = a; id2 = info[a].nn; maxSim = info[a].sim;               know who the person is!).”

Group average clustering is slower than single-link               To prevent biasing the staff towards defining these
clustering, but is known to produce better clustering for         terms closer to definitions that would match the
document retrieval, guarantees to produce binary trees            clustering algorithms, no explanation of the terms in the
and keep closely related documents together in the                instructions were given (e.g. the terms “work with”,
initial pairs. Group average is essentially a non-                “closely” and “unconnected” were left undefined).
balanced, purely binary, version of balanced clustering.          A total of 27 forms were returned with a mean of 11.8
                                                                  people marked per form (minimum 1, maximum 33,
2.2.4    Evaluation                                               mean 11.76, standard deviation 6.74) and 21.6 ticks per
For consistent comparison with simple retrieval, a list           form (min 2, max 53, mean 21.60, standard deviation
of matching documents was required for each user. For             13.80).
each user Ui, each node Uj in the hierarchy H was                 Following standard IR practice, precision and recall
scored based on how far Uj was from Ui (based on                  figures were calculated. However, these were based on


M.D. Dunlop                                                                                                             9-4
relevance weight rather than the more common
approach of simply counting how many relevant                           1
                                                                                                                                    Group Average
                                                                                                                                    Balanced
documents were found (following definition in [Reid                    0.9
                                                                                                                                    Single Link
2000] for non-binary test collections). For these                      0.8                                                          Searching
experiments, relevance weight is defined as how many                   0.7
ticks were marked divided by the maximum number of
                                                                       0.6
ticks (e.g. 0.333 for 1 tick and 1 for 3 ticks). This
                                                                       0.5
allows the results to highlight how well the system is at
                                                                       0.4
finding those people who users work closely with over
                                                                       0.3
those they simply work with. For a ranked list L with
                                                                       0.2
the best match at position 1, second at position 2 etc.,
                                                                       0.1
precision and recall at position p were defined as
follows:                                                                0
                                                                             0   0.1   0.2   0.3   0.4    0.5      0.6    0.7       0.8       0.9        1

                  Relevance weight in L1...L p for user Ui                                               Recall
    recalli,p =
                    Total relevance weight for user Ui
                                                                  Figure 1: Results from experiments with full lists
                    Relevance weight in L1...L p for user Ui
  precisioni,p =
                                       p                       Figure 2 shows the same results plotted by ranked
                                                               position – showing how many ticks were accumulated,
For consistency of evaluation, each algorithm produced         on average, by each rank position up to the tenth rank
a full ranking of all users (i.e. L contains an entry for      position. This shows that for the best approach, group
every user in the collection bar Ui). Individual recall        average clustering, an average of 1.72 ticks were found
precision graphs were then combined using standard             at rank position one (57% perfect) whereas the worst
macro-evaluation as defined in Van Rijsbergen [1979            approach, straight searching, was only achieving 0.88
pp 152-153].                                                   ticks (29% accuracy). For group average clustering, the
                                                               success rate rose to an average of 3.08 ticks by rank
4 Results                                                      position 2 – which could be considered as “one close
                                                               colleague equivalent” within the first two suggestions
Figure 1 shows the results for the four algorithms. It         of the system.
clearly shows that for this collection clustering-based
approaches are more effective at matching people than
simple searching and that overall, for basic IR                    8
techniques, the performance of the system is good. Of              7
the clustering algorithms, group average performs best
                                                                   6
for low recall (0..0.27 approximately). This is the
region in which people matching programmes are most                5
likely to have their main impact – usually looking for
                                                                   4
one or two substitute names rather than, say, 70% of
colleagues. However, balanced retrieval is only slightly           3

poorer and, probably because of the more balanced                  2                                                     group average clustering
hierarchy leading to more normalised distances within                                                                    balanced clustering
                                                                   1
the tree, is better than Group Average Clustering as the                                                                 single link clustering
                                                                                                                         straight searching
recall levels are increased.                                       0
                                                                             1   2      3     4      5      6        7          8         9         10
                                                                                                   Rank position


                                                                  Figure 2: Mean number of ticks by rank position

                                                               To examine how resilient each approach was to storing
                                                               limited information on people who each user works
                                                               closely with, the lists L created by each of the four
                                                               matching techniques were limited to nine elements each
                                                               and the evaluation repeated (this simulates, say, a
                                                               monthly recording of the nine closes people for each
                                                               member of staff so involves storing 9u records as


M.D. Dunlop                                                                                                                                              9-5
opposed to u2 records, where u is the number of users).                                       such as documents written by the staff (possibly
Figure 3 shows that this results in a considerable drop                                       including weekly diaries, which are common in many
in performance for all methods at higher recall levels,                                       engineering       settings)       and       pages for
but relatively little drop in the 0...0.2 precision range. In                                 organisations/activities the user is involved in.
particular the performance of group average clustering
is almost unaffected in this range by storing limited                                         One problem highlighted in the experiment reported
information. Considering that the 0.2 recall point                                            here is that of assessing the different interpretations of
implies finding 20% of the colleagues and, in many                                            “works with”. For example, staff completing the data
settings, we are only likely to be looking for one/two -                                      capture form were varied in how they reacted to
this is a promising result for reduced storage.                                               secretaries being on the list of staff as well as other
                                                                                              research colleagues and department managers. It would
                                                                                              be worth investigation classification methods so that
                1                                                                             searches can be restricted, or at least rank positions
               0.9                                                                            affected by, matching users who are at roughly the
               0.8
                                                                                              same level in an organisation. This is likely to take
               0.7
                                                                                              place somewhat automatically as those users home
               0.6
                                                                                              pages are likely to contain more in common than users
   Precision


               0.5
                                                                                              who are at drastically different levels in the
               0.4
                                                                                              organisation but these claims need further investigation.
               0.3
                                                                                              As the motivational scenarios for this work did not
               0.2
                                                                                              require fast and frequent clustering of staff, only high
               0.1                                                                            quality clustering algorithms were considered. It may
                0                                                                             be worth investigating the use of faster and lighter
                     0    0.1    0.2    0.3   0.4    0.5     0.6    0.7    0.8   0.9      1
                                                                                              methods, such as scatter/gather, to compare their
                                                    Recall
                                                                                              performance with the tested algorithms.
                     Searching         Balanced       Single Link         Group Average

                                                                                              6 Conclusions
Figure 3: Results from experiments with lists limited to
                                                                                              The results of this experiment show that IR techniques
                      10 entries
                                                                                              can be used to match users home pages with those of
                                                                                              other users to find colleagues who work in similar areas
5 Extensions                                                                                  with a fair level of success. In the case of the
                                                                                              experiment reported here, precision approximately 0.6
As documents rarely exist in isolation, approaches to                                         can be achieved for low levels of recall where the
document retrieval that make use of the hypertext                                             approach is most likely to be used. At rank position 1
network in which documents are found [e.g. Dunlop                                             an average of 1.72 ticks were found, where 2 ticks
1991, Dunlop and Van Rijsbergen 1993, Frei, and                                               represents a staff declaration of someone that they work
Stieger 1992] could be used to augment home pages                                             with (but not closely) – this rises to a total of 3.01 ticks
with connected documents (such as linked pages and                                            by rank position two, equivalent to finding one close
subpages). These hypertext-IR techniques have been                                            colleague in the first two suggestions from the system.
shown successful in improving retrieval for standard                                          Furthermore, use of limited length lists shows that
searching [Savoy 1996] and for accessing documents                                            storing only the nine closest predictions has little effect
that cannot be indexed directly [Harmandas et al 1997].                                       on the performance of the system at low recall while
As well as providing the usual benefits of hypertext-IR                                       drastically reducing storage requirements for historical
indexing, if used for indexing staff home pages, the                                          recording. The experiment compared straight IR
approaches will also solve many of the problems of                                            searching to match users with potential colleagues with
these pages. The Risø home pages used in this                                                 three different clustering approaches (balanced, single
experiment were fairly consistent corporate style mini-                                       link and group average), all clustering approaches
CVs. Less consistent pages, such as those typically                                           performed better with group average being the best
found in universities, have many problems such as                                             overall (and noticeably more stable at low recall when
frame based front pages (which don’t actually have any                                        using reduced length lists). This indicates that it is
content on the base home page URL), very simple front                                         better to connect people based on the best overall
pages with linked pages and drastically different styles                                      arrangement (clustering approaches create a single best
and quality and quantity of information provided. The                                         cluster hierarchy for the whole department then rank for
use of hypertext-IR techniques could also bring in                                            individual users) rather than the best arrangement for
many other sources of evidence as to users’ activities,


M.D. Dunlop                                                                                                                                           9-6
each individual person as performed in straight              and Visual Information Research (ELVIRA 4),
searching.                                                   Milton Keynes, Aslib: London, UK, pp 31-38, May
                                                             1997.
Further work is planned to investigate the results here
in a corporate setting using different sources of         Porter, M. F., “An algorithm for suffix stripping”,
documentation, over a longer timescale and using more        Program, v 14(3), pp130-137, July 1980
users as the test base. Hypertext-IR approaches will         (reproduced in Sparck Jones and Willett 1997)
also be investigated to see if they improve the           Reid, J., “A task oriented non-interactive evaluation
effectiveness of the clustering approaches and make          methodology for information retrieval systems”,
them more amenable to highly variable styles of home         Information Retrieval, v 2(1), 2000 to appear.
pages visible in some organisations.
                                                          Robertson, S., and Reese, K., “A virtual library for
                                                            building community and sharing knowledge”,
Acknowledgements                                            International Journal of Human-Computer Studies,
Many thanks are due to the staff of Risø's Systems          vol 51, pp 663-685, 1999.
Analysis Department who completed questionnaires as       Salton, G., and McGill, M.J., Introduction to modern
part of the evaluation work.                                 information retrieval, McGraw-Hill, 1983.
The project was supported by the Danish Research          Savoy, J., "An Extended Vector-Processing Scheme for
Foundation's Centre for Human Machine Interaction.           Searching Information in Hypertext Systems",
                                                             Information Processing and Management, v32(2),
                                                             155-170, March 1996.
References
                                                          Sparck Jones, K., “A statistical interpretation of term
Baeza-Yates, R., & Ribeiro-Neto, B., Modern                  specificity and its application in retrieval”, Journal
   Information Retrieval, ACM Press 1999.                    of Documentation, v28, pp 11-21, 1972.
Dunlop, MD, Multimedia Information Retrieval, PhD         Sparck Jones, K., and Willett, P. (Eds), Readings in
  Thesis, Glasgow University Computing Science               Information Retrieval, Morgan Kaufmann, 1997.
  Research Report 1991/ R21, October 1991.
                                                          Twidale, M.B., and Nichols, D.M., “Computer
Dunlop, M.D., and Van Rijsbergen, C.J., "Hypermedia         supported cooperative work in information search
  and free text retrieval", Information Processing and      and retrieval”, in Williams, E.M. (Ed), Annual
  Management, vol 29(3), May 1993.                          Review of Information Science and Technology
Frakes, W.B., and Baeza-Yates, R. (Eds), Information        (ARIST),v33, pp259-319,Association of Information
   Retrieval: Data Structures and Algorithms, Prentice      Science, 1999.
   Hall, 1992.                                            Van Rijsbergen, C.J. Information Retrieval (second
Frei H.P., and Stieger D., “Making use of hypertext         edition). Butterworths, 1979.
   links when retrieving information”, Proceedings        Van Rijsbergen, C.J., and Sparck Jones, K., “A test for
   ACM-ECHT’92, Milan, Italy, pp. 102-111, 1992.            the separation of relevant and non-relevant
Griffiths, A., Luckhurst, C., and Willett, P., “Using       documents in experimental retrieval collections”,
   inderdocument similarity information in document         Journal of Documentation, v29, pp251-7, 1973.
   retrieval systems”, Journal of American Society for    Voorhees, E.M., Implementing Agglomerative
   Information Science, v37, p3-11, 1986 (reproduced        Hierarchic Clustering Algorithms for Use in
   in Sparck Jones and Willett 1997).                       Document Retrieval, Technical report TR 86-765,
Harmandas, V., Sanderson, M., and M. D. Dunlop,             Department of Computer Science, Cornell
   M.D:, "Image retrieval by hypertext links",              University, July 1986.
   Proceedings of SIGIR-97, 1997.
Hertzum, M. and Pejtersen, A.M., “The information-
   seeking practices of engineers: Searching for
   documents as well as for people”. To appear in
   Information Processing and Management.
Nichols, D.M., and Twidale, M.B., “Matchmaking and
   privacy in the digital library: striking the right
   balance”, In Proceedings of the 4th
   UK/International Conference on Electronic Library


M.D. Dunlop                                                                                                       9-7