<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>New Research Directions in Search Results Clustering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Claudio Carpineto</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Bernardini</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Massimiliano D'Amico</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gianni Romano Fondazione Ugo Bordoni Rome</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Italy</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>carpinet</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>abernardini</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>romano}@fub.it mas.damico@gmail.com</string-name>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2010</year>
      </pub-date>
      <fpage>27</fpage>
      <lpage>28</lpage>
      <abstract>
        <p>We discuss which are the main research themes in the efild of search results clustering and report some recent results achieved by the Information Mining group at Fondazione Ugo Bordoni.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>SEARCH RESULTS CLUSTERING</title>
      <p>Search results clustering organizes search results by topic,
thus providing a complementary view to the aflt list returned
by document ranking systems. This approach is especially
useful when document ranking fails. Besides allowing direct
subtopic access, search results clustering reduces
information overlook, helps lfitering out irrelevant items, and favors
exploration of unknown or dynamic domains.</p>
      <p>Search results clustering is related to, but distinct from,
conventional document clustering. When clustering takes
place as a post-processing step on the set of results retrieved
by an information retrieval system on a query, it may be
both more ecffiient, because the input consists of few
hundred of snippets, and more eeffctive, because query-specific
text features are used. On the other hand, search results
clustering must fullfil a number of more stringent
requirements raised by the nature of the application in which it
is embedded; e.g., meaningful cluster labels, low response
times, short input data description, unknown number of
clusters, overlapping clusters.</p>
      <p>
        A comprehensive survey of search results clustering,
including issues, techniques, and systems is given in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. In
the remainder of this paper we point out interesting research
directions.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Description-centric clustering algorithms</title>
      <p>Given that search results clustering systems are primarily
intended for browsing retrieval, a critical part is the quality
of cluster labels, as opposed to optimizing only the clustering
structure. In fact, the algorithms for performing search
results clustering cover a spectrum ranging from data-centric
to description-centric techniques, depending on whether the
priority is given to cluster formation or cluster labeling.</p>
      <p>
        One of the most recent examples of the latter category is
KeySRC (Keyphrase-based Search Results Clustering),
described in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This system generates clusters labeled by
keyphrases. The keyphrases are extracted from the
generalized suxffi tree built from the search results and merged
through an improved hierarchical agglomerative clustering
procedure, representing each phrase as a weighted
document vector and making use of a variable dendrogram cut-off
value. KeySRC is available at http://keysrc.fub.it.
1.2
      </p>
    </sec>
    <sec id="sec-3">
      <title>Performance evaluation measures</title>
      <p>Internal validity measures and comparison with ground
truth results are two common ways of evaluating clustering
partitions, but they have the disadvantage that the
performance of the system in which the document partition is
encompassed is not explicitly taken into account. As the
intended use of search results clustering is to nfid documents
relevant to the single query’s subtopic, it may be more
convenient to evaluate the performance on a retrieval oriented
task. However, the classical measures related to subtopic
retrieval, such as subtopic recall, subtopic precision, and
subtopic MRR, assume that the system output consists of
a ranked list and thus they are not directly or easily
applicable to clustered results, Furthermore, they strictly focus
on subtopic coverage; i.e., retrieving at least one relevant
document per subtopic.</p>
      <p>
        To address these limitations, we presented a new
evaluation measure inspired by Cooper’s expected search length:
Subtopic Search Length under k document sufficiency (kSSL).
The idea is to consider the number of elements (cluster
labels or search results) that the user must examine to retrieve
a specified number ( k) of documents relevant to the single
subtopics of a query. The shorter the search length, the
better the system performance. It is assumed that both cluster
labels and search results are read sequentially from top to
bottom, and that only cluster with labels relevant to the
subtopic at hand are opened. The main advantages of kSSL
are that it is suitable for both ranked lists and clustered
results and that it allows evaluation of full subtopic retrieval
(i.e., retrieval of multiple documents relevant to a query’s
subtopic). A full description of kSSL is given in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
1.3
      </p>
    </sec>
    <sec id="sec-4">
      <title>Test collections</title>
      <p>There is almost a complete lack of test collections with
subtopic relevance judgments. Two exceptions are the
collections developed at the TREC Interactive track, which is
small and primarily focuses on the instances of a given
concept (e.g., ‘what tropical storms – hurricanes and typhoons
– have caused property damage and/or loss of life’), and
at Image CLEF, which is mainly about geographical
diversity of photos associated with a given topic (e.g., ‘images of
beaches in Brazil’).</p>
      <p>
        We created two new test collections for evaluating subtopic
retrieval, namely AMBIENT and ODP-239. AMBIENT
(AMBIguous ENTries) consists of 44 topics extracted from
the ambiguous Wikipedia entries, each with a set of subtopics
and a list of 100 ranked search results manually annotated
with subtopic relevance judgments. AMBIENT is fully
described in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and is available at http://credo.fub.it/ambient.
      </p>
      <p>ODP-239 consists of 239 topics, each with about 10 subtopics
and 100 documents associated with the subtopics. The
topics, subtopics, and their associated documents were selected
from the Open Directory Project (www.dmoz.org). The
distribution of documents across subtopics reeflcts the relative
importance of subtopics. ODP-239 can be downloaded from
http://credo.fub.it/odp239.
1.4</p>
    </sec>
    <sec id="sec-5">
      <title>Applications in mobile search</title>
      <p>The features of search results clustering appear very
suitable for mobile information retrieval, where a minimization
of user actions (such as scrolling and typing), device
resources, and amount of data to be downloaded are primary
concerns. Furthermore, such features seem to nicely comply
with the most recently observed usage patterns of mobile
searchers.</p>
      <p>
        We implemented two mobile clustering engines (for PDAs
and cellphones) and evaluated their retrieval performance
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. We found that mobile clustering engines can be faster
and more accurate than the corresponding mobile search
engines, especially for subtopic retrieval tasks. We also found
that although mobile retrieval becomes, in general, less
effective as the search device gets smaller, the adoption of
clustering may help expand the usage patterns beyond mere
informational search while mobile.
1.5
      </p>
    </sec>
    <sec id="sec-6">
      <title>Meta search results clustering</title>
      <p>Just as the results of several search engines can be
combined into a meta search engine, the outputs produced by
distinct clustering engines can be merged into a meta
clustering engine. Currently, there are many dieffrent web
clustering engines but no attempts has still been made to combine
them, to the best of our knowledge.</p>
      <p>
        We studied the problem of meta search results clustering,
that has unique features with respect to the relatively well
understood field of general meta clustering. After showing
that the combination of multiple search results clustering
algorithms is empirically justiefid, we developed a novel meta
clustering algorithm that maximizes the agreement between
the outputs produced by the input clustering algorithms [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
The novel meta clustering algorithm applied to web search
results is both ecffiient and eeffctive.
1.6
      </p>
    </sec>
    <sec id="sec-7">
      <title>Clustering versus diversification of search results</title>
      <p>Re-ranking search results to promote diversity of top
elements is another approach to subtopic retrieval that has
received much attention lately. Clustering and
diversicfiation of search results are thus dieffrent techniques with a
similar goal, i.e., addressing the limitations of the
probabilistic ranking principle when a topic has multiple aspects
of potential interest and the relevance criterion alone is not
sucffiient.</p>
      <p>
        These two techniques have not been compared so far. We
performed a systematic evaluation of several clustering and
diversification algorithms using multiple test collections and
evaluation measures [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. It turns out that diversicfiation
works well when one wants to get a quick overview of
documents relevant to distinct subtopics, whereas clustering is
more useful when one is interested in retrieving multiple
documents relevant to each subtopic.
1.7
      </p>
    </sec>
    <sec id="sec-8">
      <title>Other research directions</title>
      <p>There are further directions that have started to be
explored recently by other research groups. They mainly aim
to improve the quality and eeffctiveness of the search results
clustering process. A non-exhaustive list is given below.
– Personalized search results clustering
– Integrating external knowledge (e.g., thesauri,
metadata, folksonomies, past queries) with search results
clustering
– Semi-supervised search results clustering
– Temporal search results clustering
– Visualization of clustered search results
– Search results clustering and faceted hierarchies
2.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bernardini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Carpineto</surname>
          </string-name>
          , and
          <string-name>
            <surname>M. D'Amico</surname>
          </string-name>
          .
          <article-title>Full-Subtopic Retrieval with Keyphrase-Based Search Results Clustering</article-title>
          .
          <source>In Proceedings of 2009 IEEE/WIC/ACM International Conference on Web Intelligence</source>
          , Milan, Italy, pages
          <fpage>206</fpage>
          -
          <lpage>213</lpage>
          . IEEE Computer Society,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Carpineto</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. D'Amico</surname>
            , and
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Romano</surname>
          </string-name>
          .
          <article-title>Evaluating subtopic retrieval system performance: clustering versus diversicfiation</article-title>
          . Submitted.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Carpineto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mizzaro</surname>
          </string-name>
          , G. Romano, and
          <string-name>
            <given-names>M.</given-names>
            <surname>Snidero</surname>
          </string-name>
          .
          <article-title>Mobile Information Retrieval with Search Results Clustering: Prototypes and Evaluations</article-title>
          .
          <source>Journal of American Society for Information Science and Technology (JASIST)</source>
          ,
          <volume>60</volume>
          (
          <issue>5</issue>
          ):
          <fpage>877</fpage>
          -
          <lpage>895</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Carpineto</surname>
          </string-name>
          , S. Osni´ski, G. Romano, and
          <string-name>
            <given-names>D.</given-names>
            <surname>Weiss</surname>
          </string-name>
          .
          <article-title>A survey of Web clustering engines</article-title>
          .
          <source>ACM Computing Survey</source>
          ,
          <volume>41</volume>
          (
          <issue>3</issue>
          ),
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C.</given-names>
            <surname>Carpineto</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Romano. Optimal Meta Search Results Clustering</surname>
          </string-name>
          . Submitted.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>