<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>April</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Summarization and Expansion of Search Facets</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science Katholieke Universiteit Leuven Leuven</institution>
          ,
          <country country="BE">Belgium</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <volume>26</volume>
      <issue>2013</issue>
      <abstract>
        <p>We present a novel method for summarization and expansion of search facets. To dynamically extract key facets, the ranked list of search results generated from a keyword search is coupled with the spatial distribution of relevant documents in a hierarchical taxonomy of subject classes. An evaluation of the method based on the relevance and diversity of the produced facets indicates its e ectiveness for both summarization and expansion.</p>
      </abstract>
      <kwd-group>
        <kwd>Selection of search facets</kwd>
        <kwd>Expansion</kwd>
        <kwd>Summarization</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>The combination of a `keyword' and a `faceted' search has
the potential to enhance user experience by providing a
better arrangement of search results and aiding further search
exploration. However, such a framework poses two key
problems: 1) a given query may cover several facets, requiring
an aggregation or summarization of the most relevant ones;
and 2) a query may cover too few facets necessitating an
expansion to include additional facets. We exploit the spatial
distribution of topics relevant to a query in a hierarchy
together with the relevance ranking of the documents for the
query, in order to select search facets that optimize diversity
and relevance.</p>
    </sec>
    <sec id="sec-2">
      <title>SELECTING SEARCH FACETS</title>
      <p>We assume that the search results of a query are
annotated with subject classes (here facets or nodes) obtained
from a hierarchical taxonomy. In the experiments below,
the DMOZ hierarchy is used. For each query, we de ne: a
set of activated nodes that have documents relevant to the
query and a set of presentation nodes that will be presented
to the user as facets relevant for the query.</p>
      <p>When the user presents a query, the DMOZ facets
associated with the result of the query are rst extracted, i.e.,
the activated nodes are identi ed. Next, if the number of
activated facets associated with the query is larger than ky,
Full paper published at CORIA 2013 [3]
http://www.dmoz.org
yk is chosen based on the size of the interface medium and
the cognitive load acceptable for a user
the set of activated nodes or facets is summarized by
picking the best k candidates. If the number of activated facets
is less than k, then the set is expanded by adding related
facets. The summarization and expansion are carried out
using the `Subtree density' model (Section 3) which takes
as input a set of activated nodes and produces the
presentation nodes. For some queries, DMOZ activates not only
the lowest level facets, but also some of their ancestors. In
such a case to ensure presentation of as many distinct facets
as possible, the summarization uses only the descendants,
while the expansion uses only the ancestors.
3.</p>
    </sec>
    <sec id="sec-3">
      <title>SUBTREE DENSITY MODEL</title>
      <p>This model nds nodes which represent dense clusters of
facets, each having many search results important for the
query. First, the subtrees associated with the relevant set
of activated nodes are extracted. The subtree S for a node
v comprises the node and its descendants (children, grand
children etc. until the last level).</p>
      <p>Then, one possible candidate to represent a subtree is the
medoid identi ed as the node with the minimum average
distance to all the other nodes of the subtree. The
distances between nodes in the subtree are computed using a
distance metric that captures semantic distances between
topics in a hierarchy. Since the basic relations in the
taxonomy are the parent-child relations, distance between any two
nodes is represented using the connection weights between
the parent-child pairs associated. In taxonomy T with root
at level 0, the connection weight D between node vi at level
l and its child vj at level l + 1 is as follows:</p>
      <p>D(vi; vj ) = 2 l
Using this metric, the distance between any two nodes vm
and vn in T is de ned as the sum of connection weights
between all nodes vx spanning the path between vm and vn.</p>
      <p>
        Once the medoids of the subtree have been identi ed, we
must rank them to identify the best k medoids that will be
presented. This is done using a score computed in Eq. 2
density(S)
score(m) = (
        <xref ref-type="bibr" rid="ref2">2</xref>
        )
      </p>
      <p>distance(m; S)
density(S) =
where density(S) is given by</p>
      <p>
        Pv2S importance(v; R)
(
        <xref ref-type="bibr" rid="ref1">1</xref>
        )
(
        <xref ref-type="bibr" rid="ref3">3</xref>
        )
i=rank(d);i&gt;1;d2R
      </p>
      <p>reli
log2(i)
(4)
where R is a ranked list of documents retrieved for the query
obtained from a search engine, i is the position of the
retrieved document in the list, and reli = 1 if the ith document
belongs to facet v and 0 otherwise.</p>
      <p>The idea of this score is as follows:</p>
      <p>A node that has lesser distance from every other node
of the subtree is a better representative of the subtree;
A subtree that has a higher density is an important
one for the query.</p>
    </sec>
    <sec id="sec-4">
      <title>EXPERIMENTS AND RESULTS</title>
      <p>Two sets of queries have been used for evaluation. The
rst query set Q1 contains titles of English Wikipedia
articles. The second query set Q2 comprises real user queries
collected by Torres et al. [1]. The queries were submitted to
the Bing search engine, restricting the search results to the
Web pages from the DMOZ Kids and Teens subdirectory.
The subtree density model has been benchmarked against
two baseline (BL) models, one for summarization and the
other for expansion. The baseline model for
summarization uses the top k distinct activated nodes from the ranked
results from a search engine, while the baseline model for
expansion uses the siblings of the activated nodes for
presentation.</p>
      <p>The evaluation is based on two aspects- relevance and
diversity. First, the facets selected by the model for each query
of the two query sets were presented to ve Crowd owerz
evaluators, who were asked to judge whether the facets
produced were relevant to the query. Next, to evaluate
diversity of the summarization, we put together two clusters of
related facets (that were judged relevant by Crowd ower
evaluators)- one for each summarization model, per query
for the queries in Q1 and Q2. Then, Crowd ower
evalutors were asked to rank these clusters on a scale of 1 to 5
based on the diversity of the facets in the clusters, with rank
zhttp://crowd ower.com/
1 corresponding to `Very diverse'. For both relevance and
diversity evaluations, only queries for which the agreement
among Crowd ower evaluators was over 80% (as reported
by Crowd ower) were retained.</p>
      <p>The number of queries used for evaluation, the precision
and diversity of the model have been indicated in Table 1
and Figure 1. From Figure 1, it is evident that the subtree
density model performs better than the baselines in terms
of precision (measured by relevance), both in summarization
and expansion. Table 1b indicates that the subtree density
model also outperforms the baseline (based on ranked
results) in terms of diversity. These results are explained by
the fact that in our model, important facets come from dense
clusters of search results in the taxonomy.
5.</p>
    </sec>
    <sec id="sec-5">
      <title>CONCLUSIONS</title>
      <p>In this paper we have presented the subtree density model
for summarizing and expanding search results mapped to a
subject taxonomy. Evaluation of the method using human
evaluators indicates that it is e ective as it optimizes both
relevance and diversity. A next step in our research is to
develop navigation models for interactive browsing consisting
of the presented facets and their corresponding Web pages.
Acknowledgements This research was funded by the Puppy
IR project EU FP7 231507.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Duarte Torres</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hiemstra</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Serdyukov</surname>
          </string-name>
          .
          <article-title>An analysis of queries intended to search information for children</article-title>
          .
          <source>In Third Symposium on Information Interaction in Context. ACM</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>K.</given-names>
            <surname>Ja</surname>
          </string-name>
          <article-title>rvelin and</article-title>
          <string-name>
            <surname>J. Keka</surname>
          </string-name>
          <article-title>lainen. IR evaluation methods for retrieving highly relevant documents</article-title>
          .
          <source>In SIGIR Conference on Research and Development in Information Retrieval. ACM</source>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A. Nurani</given-names>
            <surname>Venkitasubramanian and M.-F. Moens</surname>
          </string-name>
          .
          <article-title>Selection of search facets</article-title>
          .
          <source>In CORIA 2013 - 10th Francophone Information Retrieval Conference</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>