Data mining, interactive semantic structuring, and collaboration: A diversity-aware method for sense-making in search

Mathias Verbeke, Bettina Berendt and Siegfried Nijssen
Department of Computer Science, K.U. Leuven, B-3001 Heverlee, Belgium; http://www.cs.kuleuven.be/~berendt

Abstract. We present the Damilicious method and tool, which help users in sense-making of the results of their literature searches on the Web: on an individual level, by supporting the construction of semantics of the domain described by their search term, and on the collective level, by encouraging users to explore and selectively re-use other users’ semantics. We use a combination of clustering, classification and interactivity to obtain and apply diverse semantics of thematic areas, and to identify diversity between users. This tool can help users take different perspectives on search results and thereby reflect more deeply about resources on the Web and their meaning. In addition, the method can help to develop quantitative measures of diversity; we propose diversity of resource groupings and diversity of users as two examples.

Keywords: Diversity-aware classification and clustering technology; diversity-driven information aggregation technologies; semantic clustering; diversity-aware search and semantic search

1 Introduction

Search is one of the main applications on today’s Web and other repositories, and it is supported by more and more advanced techniques. However, these techniques largely ignore an important component of search: the further processing of search-result sets that humans invariably undertake when dealing seriously with a list of documents provided by a search engine – and the diversity in which this is done by different people.
An important form of such further processing is the grouping of result sets into subsets, which can be seen as investing an undifferentiated list of results with “semantics”: a structuring into sets of documents, each instantiating a concept (which is a subconcept of the overall query). On the Web, several search engines provide an automatic clustering of search results. However, regardless of how good the employed clustering algorithm is, a “globally optimal” clustering solution is generally impossible: There is no single best way of structuring a list of items; instead, the “optimal” grouping will depend on the context, tasks, previous knowledge, etc. of the searcher(s).

On a truly Social Web, users should be empowered to see and work with a diversity of possible structurings and their associated sense-makings. A prerequisite for such diversity-aware tools is that users are able to perform individual sense-making, structuring search result sets according to their needs and knowledge. The problem that we study is thus how to make users aware of alternative ways to group a set of documents. The approach that we take is to provide users with a tool to cluster documents resulting from a query. In this tool, Damilicious (DAta MIning in LIterature Search engines), the user can, starting from an automatically generated clustering, group a search result document set into meaningful groups, and she can learn about alternative groupings determined by other users. To transfer a clustering of one set of documents to another set of documents, the tool learns a model for this clustering, which can be applied to cluster alternative sets of documents. We refer to this model as the clustering’s intension, as opposed to its extension, which is the original, unannotated grouping of the documents. This approach supports various measures of diversity.
Such measures can be used to make recommendations and to present new, possibly interesting viewpoints of structuring the result set. The domain of literature search was chosen to ensure a high level of user interest in reflection and meaningful results; the methods can easily be transferred to other domains.

The paper is structured as follows: Section 2 gives a brief overview of related work. In Section 3, we introduce the method of semantic clustering and classification. In Sections 4 and 5, we describe the individual and collaborative uses of the tool. Section 6 concludes with an outlook.

2 Background and related work

Our work extends theoretical and practical results from various areas. To present search results in a more meaningful way, search engines such as www.clusty.com, www.kartoo.com or search.carrot2.org cluster search results based on distance measures on documents. Subgroups are labelled based on top keywords or phrases. However, users have no possibility of improving on the quality or context-relatedness of the results. (Social-tagging applications such as www.delicio.us, www.citeulike.org or www.bibsonomy.org are in a sense the opposite: Users are completely free in grouping and labelling resources they have found on the Web, but in general the only help they obtain from machine intelligence consists of equality matches with other users’ tags or tag proposals.)

In knowledge engineering, these problems are addressed by techniques for semi-automatic ontology learning, where system-derived clusters can be modified and annotated by users, e.g. in the tool of [9]; see also [7]. In CiteSeerCluster [3], we built on these ideas to improve the search for scientific literature. This tool employs textual clustering methods and builds on established bibliometric analyses and clustering algorithms, cf. [11].
Users create, modify and annotate groupings of the documents returned by a query, starting from an automatic clustering, and they engage in sense-making of the domain. CiteSeerCluster has two limitations. First, each search episode stands alone: The results (document groups as extensions, and automatically derived keywords as well as manually assigned labels and annotations as intensions of the concepts created by a structuring activity) can be saved and loaded, but not re-used or built on in new search episodes. Second, support for collaboration is therefore also limited to the exchange of combined episode intensions and extensions, i.e. “commented literature lists”, but not of intensions separately.

To overcome these limitations, we turned to recent work on conceptual and predictive clustering [4, 16]. The key idea that we use from this work is to (a) form clusters of elements, then (b) learn classifiers that reconstruct these clusters, (c) validate these classifiers by investigating the reconstruction quality, and (d) apply the classifiers to further result sets.

Research on ontology re-use, in particular adaptive ontology re-use [14], investigates the modelling and mapping steps needed for re-using given ontologies, for whatever purpose. In contrast, we concentrate on re-use for grouping/classifying new objects, and on how to find the most suitable ontologies for re-use.

Our aim is to derive, from this procedure, measures of diversity and to use these to support search. We build on the notion of diversity of result sets as used in recommender systems, e.g. [17], and in information retrieval, e.g. [1], and on the notion of (cultural) diversity of people, cf. [10, 2].

3 Semantic clustering, classification, and interactivity

In this section, we describe the basic combination of clustering and classification methods for deriving and re-using a semantics of search that has been created by user and system together.
We describe the system’s general workflow and our measure of diversity between document groupings.

Clustering, classification and regrouping proceed through five steps. All steps, as well as the whole process, can be iterated:

1. Query: A user query is forwarded to a search engine or repository, and the list of results is shown in the Damilicious interface. We use the CiteSeerX repository (http://citeseerx.ist.psu.edu/) because of its broad coverage and rich structure, and also because it offers an OAI interface (an interface for harvesting metadata from separate repositories, see www.openarchives.org). The output is a set of document IDs, document details and their texts.

2. Automatic clustering: The system clusters the documents with the Lingo algorithm (http://project.carrot2.org). For each of the resulting clusters, an intensional definition is calculated; the set of intensional definitions of the clusters is the intensional definition of the clustering. These intensional definitions provide the criteria based on which the documents are clustered. They can thus be seen as a classifier, and can be used to classify new documents in the future.

3. Manual regrouping: Since the automatic clustering is only a suggestion for the structure, each user can move documents between the different clusters and delete documents, thereby changing the extensions.

4. Description of alternative ways of grouping: Different ways in which a result set can be grouped are compared and shown.

5. Transfer: A grouping solution can be used to group an enlarged or a different document set. This is done by applying the intension as a classifier. This re-use can be done using groupings produced by the user herself (“individual”, see Section 4), or produced by another user (“collaborative”, see Section 5).

Measures of grouping similarity and diversity. Step 4 of the workflow requires a measure of the similarity/diversity of groupings.
We use normalized mutual information, because it has desirable properties: it has a value between 0 and 1, and it behaves well when calculated for clusterings with different numbers of clusters [13]. The normalized mutual information (specifically, NMI [15]) of two groupings F, G is defined as

NMI(F, G) = (H(F) + H(G) − H(F, G)) / sqrt(H(F) · H(G)),

where H(G) = − Σ_{C_i ∈ G} p(i) log2 p(i) is the entropy of grouping G (with p the distribution of documents over its clusters C_i) and H(F, G) = − Σ_{C_i ∈ F} Σ_{C_j ∈ G} p(i, j) log2 p(i, j) is the joint entropy of F and G together. For clustering algorithms like Lingo that can generate overlapping clusters, the probabilities used in the entropy calculations are normalized by division by the number of documents in each cluster. We treat NMI as a measure of the pairwise similarity of groupings, and (1 − NMI) as a measure of the (pairwise) diversity of groupings.

One use of this measure was to choose the best clustering and classification algorithms for step 2 of the workflow. Our aim was a choice that reconstructs a “ground-truth” clustering as well as possible, i.e. has the highest NMI with it. We investigated two settings experimentally. The first is a ‘sanity check’: cluster document set D; learn a classifier, apply this classifier to D, and compare the classes with the clustering. The second setting models the basic case of re-use, the issuing of the same query at a later time towards a grown database: cluster document sets D′ and D ⊂ D′ to obtain clusterings G′ and G; learn a classifier from G, apply it to D′, and compare the classes with G′. We compared k-means and Lingo for clustering, and top-10 TF.IDF words, Lingo phrases, and Ripper [6] rules for classification. The combination of Lingo with Lingo phrases gave the best results and is therefore implemented in the Damilicious user interface.

4 Individual sense-making: Interactive semantics

The use of Damilicious by one individual serves two goals: (1) to support individual sense-making and (2) to re-use one’s own prior search episodes, more specifically the semantics created in them.
Goal (1) is realized through steps 1–3 of Section 3; results can be saved and re-loaded. Goal (2) is realized in step 5 by applying a classifier obtained in a previous search episode of the user. (The resulting ‘stability’ of document groups was highly appreciated by the student participants of a small user study.) Here, we illustrate step 4: the use of interactive cluster graphs to show the current (machine-generated or manually regrouped) extensions and intensions.

Example: Towards an individualised semantics of “web mining”. Figure 1 illustrates how Damilicious supports conceptual thinking in search: As in other forms of semi-automatic ontology learning such as [9], the groups are viewed as concepts, with the extension of a concept given by the set of documents in the group. The intensions of the concepts are given here by the Lingo phrases; in [9, 3], they are given by a combination of top TF.IDF terms and manually assigned labels.

[Fig. 1. Damilicious: result screen]

The layout ensures that large, ‘important’ clusters that overlap with many others are shown in the centre, while small and ‘isolated’ clusters move to the margins. In the example, the system solution has identified “Web agents” as well as “Web personalization” as subfields of the field specified by the query, “Web mining”. Unavoidably, in this automatic solution the top content area is not split up into subgroups by the same criterion; thus, for example, “Clustering” (a method which could well be used for building agents or doing personalization) is another cluster. Users can improve on this structure: by moving and deleting documents between groups, they can modify the groupings to better reflect their intentions, background knowledge, interests, etc.
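The re-use of a grouping’s intension as a classifier (step 5) can be sketched in a few lines. This is a minimal illustration only: it assumes an intension is a list of Lingo-style phrases per group and scores documents by simple phrase containment; the function name and the scoring rule are our assumptions, not Damilicious’s actual implementation.

```python
def transfer_grouping(intensions, documents):
    """Sketch of step 5: apply a stored intension as a classifier.

    intensions: dict mapping group label -> list of descriptive phrases
                (e.g. Lingo cluster labels from an earlier search episode).
    documents:  dict mapping document ID -> document text.
    Each document is assigned to the group whose phrases it matches most;
    documents matching no phrase fall into a remainder group "Other topics".
    """
    grouping = {}
    for doc_id, text in documents.items():
        lower = text.lower()
        # count how many of each group's phrases occur in the document
        scores = {
            label: sum(phrase.lower() in lower for phrase in phrases)
            for label, phrases in intensions.items()
        }
        best = max(scores, key=scores.get)
        grouping[doc_id] = best if scores[best] > 0 else "Other topics"
    return grouping
```

Applied to an enlarged result set for the same query, such a classifier reproduces the earlier structure on the new documents, which is the ‘stability’ property noted above.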
5 Collaborative sense-making: Diversity

The collaborative use of Damilicious serves two goals: (1) to get an overview of user diversity and to localise oneself in this space, and (2) to re-use other users’ semantics. Goal (1) builds on steps 1–3 of Section 3. Goal (2) is realized in step 5, more specifically in the transfer of a foreign model to a new search episode. In this section, we illustrate step 4: the use of a measure and description of user diversity by a semantic space of users. First, the measure is explained; then one possible depiction is described; an alternative is to use a series of cluster graphs as in the previous section.

A measure of user diversity. Above, we proposed to use 1 − NMI(F, G) as a measure of the diversity between groupings F and G. In our setting, the instance set is a set of documents returned by a search engine operating on a given state of a database, in response to a query q. The grouping is either the result of a clustering/classification algorithm A (with parameters set by a user or optimised according to some strategy), or it is a manual regrouping done by a user A. To emphasize these two determinants, we replace F (or G) by gr(A, q). From this, we derive the measure gdiv(A, B, q) of the diversity of users A, B (with respect to the way in which they structure the search result of q) as

gdiv(A, B, q) = 1 − NMI(gr(A, q), gr(B, q)).   (1)

Based on this, a measure of the (pairwise) diversity of users gdiv(A, B) can be defined by an aggregation over the queries they both worked on. A simple example of an aggregation operator is the average:

gdiv(A, B) = avg_{q: ∃gr(A,q) ∧ ∃gr(B,q)} gdiv(A, B, q).   (2)

Example (contd.): Diversity in conceptualising “Web mining”. To demonstrate the use of the user-diversity measure, we simulated a small user study. Five users searched for “web mining” and restricted Damilicious to retrieving 50 documents. User U0 did not change the system clustering.
User U1 regrouped documents to produce a better fit of the document groups to the cluster intensions, with a total of five document regroupings. User U2 attempted to move everything that did not fit well into the remainder group “Other topics”, and in addition performed some regroupings to achieve a better fit; this resulted in ten regrouping operations. User U3 had the reverse goal: in a total of five regrouping operations, she distributed documents from “Other topics” into matching real groups. User U4 pursued a very different strategy, regrouping by author and institution; she also performed five regrouping operations.

For q = “web mining” and each pair of users A, B, NMI(gr(A, q), gr(B, q)) was computed. Applying equation (1), we obtain a 5 × 5 matrix of differences gdiv(A, B, q). Values ranged from 0.52 to 0.58. Assuming q to be the only commonly worked-on query and applying equation (2), these differences are equal to gdiv(A, B) and together form the user-diversity matrix GDIV. This matrix can be interpreted as giving rise to a “semantic space of users”, in which users who group in similar ways are close to one another, but far from users who group differently. Neighbourhood and distribution in this semantic space can be regarded as indicators of diversity: pairwise diversity (distances) between users, as well as the overall distribution of population diversity.

Multi-dimensional scaling (MDS) is a popular way of visualising such distance-based semantic spaces, e.g. [9]. Figure 2 (a) shows an MDS visualization of GDIV. At the bottom right-of-centre, user U0 is shown as “System” (the system solution). The locations of the other users in this space can be interpreted as follows. U2 produced the most different solution, possibly through her most active regrouping.
U1’s regrouping differed more from the System’s than U3’s, possibly because U1 took a high-level semantic view in the attempt to produce better-fitting groups, while U3 was less ambitious in her focus on re-distributing documents (only) from the “Other Topics” cluster.

[Fig. 2. Sample semantic spaces of users, showing their diversity: (a) web mining; (b) data mining; (c) RFID; (d) average of a, b, c. Coordinates generated with Talisman, http://talisman.sourceforge.jp/mds.]

Interestingly, U4’s attempt to regroup by bibliographical data on author and institution (based on different data than the System, which uses textual information from the document abstracts) resulted in the grouping most similar to the System’s. It is possible that this is because the same authors will use very similar wordings throughout their papers (cf. also the modelling, in CiteSeer(X), of specific forms of textual similarity in same-author papers: http://citeseerx.ist.psu.edu/help/glossary), such that the textual similarity is higher than that of other papers on the same subject (which was the focus of U1’s, U2’s and U3’s restructurings).

In addition to these pairwise observations, the results also show a simple form of population diversity: U1, U2 and U3, who had similar intentions when regrouping, appear in one “region” of this semantic space of user diversity (in the MDS solution of Fig. 2: to the left of System), clearly distinct from U4, who appears in a region of her own (to the right of System).

Figure 2 (b) and (c) show the diversities generated by the same user strategies on a semantically related query (b: data mining) and on an unrelated query (c: RFID). Interestingly, the more similar query (b) also appears to generate a more similar user-diversity space. Averaging over all three queries, (d) shows the commonalities of all three: the similarity of U1 and U3 vs. U2 and U4.
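The quantities behind this example can be reproduced with a short script: NMI between two groupings, the pairwise diversities of equation (1), and an MDS embedding of the resulting GDIV matrix. This is a hedged sketch under simplifying assumptions: groupings are treated as hard (non-overlapping), whereas Lingo can produce overlapping clusters that require the extra normalization described in Section 3, and classical MDS stands in for the Talisman tool used for Fig. 2. All function names are ours.

```python
import math
from collections import Counter

import numpy as np

def nmi(F, G):
    """NMI(F, G) = (H(F) + H(G) - H(F, G)) / sqrt(H(F) * H(G)) for two
    hard groupings, each given as a dict: document ID -> cluster label."""
    docs = sorted(set(F) & set(G))
    n = len(docs)

    def H(labels):
        # Shannon entropy (base 2) of the label distribution
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    hF = H([F[d] for d in docs])
    hG = H([G[d] for d in docs])
    hFG = H([(F[d], G[d]) for d in docs])
    if hF == 0.0 or hG == 0.0:  # degenerate single-cluster grouping
        return 1.0 if hFG == 0.0 else 0.0
    return (hF + hG - hFG) / math.sqrt(hF * hG)

def gdiv_matrix(groupings):
    """Pairwise diversities 1 - NMI for one query, as in equation (1);
    `groupings` is a list of per-user groupings of the same result set."""
    n = len(groupings)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = 1.0 - nmi(groupings[i], groupings[j])
    return D

def classical_mds(D, dims=2):
    """Embed a symmetric distance matrix (such as GDIV) in `dims` dimensions."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    B = -0.5 * J @ (D ** 2) @ J          # double-centred Gram matrix
    w, V = np.linalg.eigh(B)
    order = np.argsort(w)[::-1][:dims]   # keep the largest eigenvalues
    return V[:, order] * np.sqrt(np.clip(w[order], 0.0, None))
```

Identical partitions give NMI 1 and diversity 0; independent ones give NMI near 0 and diversity near 1. Distances between the rows returned by classical_mds approximate the entries of GDIV, which is what a plot like Fig. 2 displays.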
6 Conclusions and outlook

In this paper, we have presented the Damilicious tool, which helps users in sense-making of the results of their literature searches on the Web: on an individual level, by supporting the construction of semantics of the domain described by their search term, and on the collective level, by encouraging users to explore the diversity of other users and to re-use other users’ semantics.

Many open issues remain to be solved. They include (a) how to measure not only the diversity in clusterings of search results, but also the diversity in results for different queries, and in general of different result sets; (b) how to best present the diversity of groupings and users (should all users be shown, only the nearest neighbour, or a ranking of neighbours? should other groupings rather than other users be shown? should “aggregated groupings” be shown, and how can these be defined? how should this functionality be integrated into a comprehensive environment supporting user and community contexts [12]? how can appropriate interpretations of MDS be supported and others discouraged [5]?); (c) how to motivate users to take advantage of regrouping and diversity functionalities (specific literature-search tasks may be beneficial, as suggested by the evaluation results of [3]); (d) how to find the best balance between similarity (“re-use solutions from users like me”) and diversity (“re-use solutions from users unlike me so I can broaden my horizon”); (e) which measures of grouping diversity are most meaningful to users (our currently used measure, NMI, is purely extensional or instance-based; more intensional, structure-based or hybrid measures from the ontology matching field [8] could also be useful); and (f) which other sources of user diversity to leverage (for example, affiliation or research area).
In future work, we aim to explore these issues theoretically and in user studies.

Acknowledgments. S. Nijssen was supported by the Science Foundation – Flanders.

References

1. R. Agrawal, S. Gollapudi, A. Halverson, & S. Ieong. Diversifying search results. In Proc. WSDM ’09, 5–14, New York, NY, USA, 2009. ACM.
2. B. Berendt & A. Kralisch. From world-wide-web mining to worldwide webmining: Understanding people’s diversity for effective knowledge discovery. In From Web to Social Web, LNAI 4737, 102–121. Springer, 2007.
3. B. Berendt, B. Krause, & S. Kolbe-Nusser. Intelligent scientific authoring tools: Interactive data mining for constructive uses of citation networks. Information Processing & Management, in press.
4. H. Blockeel, L. De Raedt, & J. Ramon. Top-down induction of clustering trees. In Proc. of the 15th ICML, 55–63. Morgan Kaufmann, 1998.
5. I. Borg & J. Lingoes. Multidimensional Similarity Structure Analysis. Springer, New York, 1987.
6. W.W. Cohen. Fast effective rule induction. In ICML, 115–123, 1995.
7. D. R. Cutting, J. O. Pedersen, D. R. Karger, & J. W. Tukey. Scatter/Gather: A cluster-based approach to browsing large document collections. In Proc. SIGIR, 318–329. ACM, 1992.
8. J. Euzenat & P. Shvaiko. Ontology Matching. Springer, Berlin/Heidelberg, 2006.
9. B. Fortuna, D. Mladenic, & M. Grobelnik. Semi-automatic construction of topic ontologies. In Semantics, Web and Mining, LNCS 4289, 121–131. Springer, 2006.
10. G. Hofstede. Cultures and Organizations: Software of the Mind. Intercultural Cooperation and its Importance for Survival. HarperCollins, London, 1994.
11. F. Janssens, W. Glänzel, & B. De Moor. A hybrid mapping of information science. Scientometrics, 75(3):607–631, 2008.
12. C. Niederée, A. Stewart, C. Muscogiuri, M. Hemmje, & T. Risse. Understanding and Tailoring Your Scientific Information Environment: A Context-Oriented View on E-Science Support.
In From Integrated Publication and Information Systems to Virtual Information and Knowledge Environments, 289–298, 2005.
13. D. Pfitzner, R. Leibbrandt, & D. Powers. Characterization and evaluation of similarity measures for pairs of clusterings. Knowledge and Information Systems, 2009.
14. R. Stecher, C. Niederée, W. Nejdl, & P. Bouquet. Adaptive ontology re-use: finding and re-using sub-ontologies. IJWIS, 4(2):198–214, 2008.
15. A. Strehl & J. Ghosh. Cluster ensembles – a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3:583–617, 2003.
16. B. Zenko, S. Dzeroski, & J. Struyf. Learning predictive clustering rules. In Proc. 4th Int. Worksh. on Knowledge Discovery in Inductive Databases, LNCS 3933, 234–250. Springer, 2005.
17. M. Zhang & N. Hurley. Statistical modeling of diversity in top-n recommender systems. In Proc. of ACM/WIC/IEEE WI-IAT ’09, 490–497, 2009.