Summarization and Expansion of Search Facets*

Aparna Nurani Venkitasubramanian, Marie-Francine Moens
Department of Computer Science, Katholieke Universiteit Leuven, Leuven, Belgium
{aparna.nuranivenkitasubramanian, sien.moens}@cs.kuleuven.be

* Full paper published at CORIA 2013 [3].

ABSTRACT
We present a novel method for summarization and expansion of search facets. To dynamically extract key facets, the ranked list of search results generated from a keyword search is coupled with the spatial distribution of relevant documents in a hierarchical taxonomy of subject classes. An evaluation of the method based on the relevance and diversity of the produced facets indicates its effectiveness for both summarization and expansion.

Keywords
Selection of search facets, Expansion, Summarization

1. INTRODUCTION
The combination of a 'keyword' and a 'faceted' search has the potential to enhance the user experience by providing a better arrangement of search results and aiding further search exploration. However, such a framework poses two key problems: 1) a given query may cover several facets, requiring an aggregation or summarization of the most relevant ones; and 2) a query may cover too few facets, necessitating an expansion to include additional facets. We exploit the spatial distribution of the topics relevant to a query in a hierarchy, together with the relevance ranking of the documents for the query, in order to select search facets that optimize diversity and relevance.

2. SELECTING SEARCH FACETS
We assume that the search results of a query are annotated with subject classes (here called facets or nodes) obtained from a hierarchical taxonomy. In the experiments below, the DMOZ hierarchy (http://www.dmoz.org) is used. For each query, we define a set of activated nodes, which have documents relevant to the query, and a set of presentation nodes, which will be presented to the user as facets relevant to the query.

When the user submits a query, the DMOZ facets associated with the result of the query are first extracted, i.e., the activated nodes are identified. Next, if the number of activated facets associated with the query is larger than k (k is chosen based on the size of the interface medium and the cognitive load acceptable for a user), the set of activated nodes or facets is summarized by picking the best k candidates. If the number of activated facets is less than k, the set is expanded by adding related facets. The summarization and expansion are carried out using the 'Subtree density' model (Section 3), which takes as input a set of activated nodes and produces the presentation nodes. For some queries, DMOZ activates not only the lowest-level facets but also some of their ancestors. In such a case, to ensure the presentation of as many distinct facets as possible, the summarization uses only the descendants, while the expansion uses only the ancestors.
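The overall selection step can be written as a short routine. The sketch below is only an illustration of the logic just described, under assumptions that are not part of the paper: `search`, `taxonomy.activated_nodes`, `taxonomy.is_ancestor`, `summarize`, and `expand` are hypothetical stand-ins (the last two for the subtree density model of Section 3), `activated_nodes` is assumed to return a set, and the handling of ancestor facets is our reading of the text rather than the authors' implementation.

```python
# Illustrative sketch of the facet selection pipeline of Section 2.
# All helper names (search, taxonomy.activated_nodes, taxonomy.is_ancestor,
# summarize, expand) are hypothetical stand-ins, not the authors' code.

def select_presentation_facets(query, taxonomy, search, summarize, expand, k):
    """Return at most k presentation facets for `query`."""
    results = search(query)                        # ranked list of retrieved documents
    activated = taxonomy.activated_nodes(results)  # set of facets with relevant documents

    # Activated nodes that are ancestors of other activated nodes, and the rest.
    ancestors = {v for v in activated
                 if any(taxonomy.is_ancestor(v, w) for w in activated if w != v)}
    descendants = activated - ancestors

    if len(activated) > k:
        # Too many facets: summarize down to the best k, using only the
        # lowest-level (descendant) activated nodes.
        return summarize(descendants, results, k)
    # Too few facets: expand with related facets, using only the ancestors.
    return expand(activated, ancestors, results, k)
```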
3. SUBTREE DENSITY MODEL
This model finds nodes that represent dense clusters of facets, each cluster having many search results that are important for the query. First, the subtrees associated with the relevant set of activated nodes are extracted. The subtree S of a node v comprises the node and its descendants (children, grandchildren, etc., down to the last level).

Then, one possible candidate to represent a subtree is the medoid, identified as the node with the minimum average distance to all the other nodes of the subtree. The distances between nodes in the subtree are computed using a distance metric that captures semantic distances between topics in a hierarchy. Since the basic relations in the taxonomy are parent-child relations, the distance between any two nodes is expressed using the connection weights of the parent-child pairs involved. In a taxonomy T with its root at level 0, the connection weight D between a node v_i at level l and its child v_j at level l + 1 is

    D(v_i, v_j) = 2^{-l}                                                      (1)

Using this metric, the distance between any two nodes v_m and v_n in T is defined as the sum of the connection weights between all nodes v_x spanning the path between v_m and v_n.

Once the medoids of the subtrees have been identified, we must rank them to identify the best k medoids that will be presented. This is done using the score in Eq. 2:

    score(m) = \frac{density(S)}{distance(m, S)}                              (2)

where density(S) is given by

    density(S) = \frac{\sum_{v \in S} importance(v, R)}{|S|}                  (3)

where |S| is the size of the subtree in terms of the number of nodes v \in S, and importance(v, R) is computed using the Discounted Cumulative Gain (DCG) [2] over the retrieved Web pages assigned to facet v:

    importance(v, R) = rel_1 + \sum_{i = rank(d),\, i > 1,\, d \in R} \frac{rel_i}{\log_2(i)}    (4)

where R is the ranked list of documents retrieved for the query from a search engine, i is the position of the retrieved document in the list, and rel_i = 1 if the i-th document belongs to facet v and 0 otherwise.

The idea of this score is as follows:
• A node that has a smaller distance to every other node of the subtree is a better representative of the subtree;
• A subtree that has a higher density is an important one for the query.
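To make the scoring concrete, the sketch below shows one way Eqs. 1-4 could be computed. It is a minimal illustration under assumptions that the paper does not specify: the taxonomy is represented by `parent` and `children` dictionaries, each retrieved document is mapped to a single facet through `facet_of`, and distance(m, S) in Eq. 2 is read as the total path distance from the medoid m to the other nodes of S.

```python
# Minimal sketch of the subtree density model (Eqs. 1-4).
# Assumptions (not from the paper): the taxonomy is given as `parent` and
# `children` dicts, each retrieved document maps to one facet via `facet_of`,
# and distance(m, S) is the total path distance from m to the other nodes of S.
from math import log2

def level(node, parent):
    """Depth of `node` in the taxonomy, with the root at level 0."""
    l = 0
    while parent.get(node) is not None:
        node, l = parent[node], l + 1
    return l

def edge_weight(child, parent):
    """Eq. 1: the edge between a node at level l and its child weighs 2^{-l}."""
    return 2.0 ** (-(level(child, parent) - 1))

def path_to_root(node, parent):
    path = [node]
    while parent.get(node) is not None:
        node = parent[node]
        path.append(node)
    return path

def distance(u, v, parent):
    """Sum of the connection weights along the path between u and v."""
    pu, pv = path_to_root(u, parent), path_to_root(v, parent)
    lca = next(x for x in pv if x in set(pu))  # lowest common ancestor
    d = 0.0
    for path in (pu, pv):
        for node in path:
            if node == lca:
                break
            d += edge_weight(node, parent)     # edge from `node` up to its parent
    return d

def subtree(root, children):
    """The subtree S of a node: the node itself plus all of its descendants."""
    nodes, stack = [], [root]
    while stack:
        v = stack.pop()
        nodes.append(v)
        stack.extend(children.get(v, []))
    return nodes

def importance(v, ranked_docs, facet_of):
    """Eq. 4: DCG of the retrieved documents assigned to facet v."""
    rels = [1.0 if facet_of.get(d) == v else 0.0 for d in ranked_docs]
    if not rels:
        return 0.0
    return rels[0] + sum(r / log2(i) for i, r in enumerate(rels[1:], start=2))

def density(S, ranked_docs, facet_of):
    """Eq. 3: average importance of the nodes in subtree S."""
    return sum(importance(v, ranked_docs, facet_of) for v in S) / len(S)

def medoid(S, parent):
    """Node of S with minimum average distance to the other nodes of S."""
    return min(S, key=lambda m: sum(distance(m, v, parent) for v in S if v != m))

def score(m, S, ranked_docs, facet_of, parent):
    """Eq. 2: score(m) = density(S) / distance(m, S)."""
    dist = sum(distance(m, v, parent) for v in S if v != m)
    return density(S, ranked_docs, facet_of) / dist if dist else float("inf")
```

Given the subtrees of the activated nodes, the presentation facets would then be the medoids with the k highest scores, as described above.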
4. EXPERIMENTS AND RESULTS
Two sets of queries have been used for evaluation. The first query set, Q1, contains titles of English Wikipedia articles. The second query set, Q2, comprises real user queries collected by Torres et al. [1]. The queries were submitted to the Bing search engine, restricting the search results to Web pages from the DMOZ Kids and Teens subdirectory.

The subtree density model has been benchmarked against two baseline (BL) models, one for summarization and the other for expansion. The baseline model for summarization uses the top k distinct activated nodes from the ranked results of a search engine, while the baseline model for expansion uses the siblings of the activated nodes for presentation.

The evaluation is based on two aspects: relevance and diversity. First, the facets selected by each model for each query of the two query sets were presented to five Crowdflower (http://crowdflower.com/) evaluators, who were asked to judge whether the facets produced were relevant to the query. Next, to evaluate the diversity of the summarization, we put together two clusters of related facets (that were judged relevant by the Crowdflower evaluators), one for each summarization model, per query for the queries in Q1 and Q2. Then, Crowdflower evaluators were asked to rank these clusters on a scale of 1 to 5 based on the diversity of the facets in the clusters, with rank 1 corresponding to 'Very diverse'. For both the relevance and the diversity evaluations, only queries for which the agreement among Crowdflower evaluators was over 80% (as reported by Crowdflower) were retained.

[Table 1: (a) Statistics of the query sets; (b) Diversity of facets produced by summarization (% of facet clusters at ranks 1..5).]

[Figure 1: Precision of the summarization and expansion for the nine highest ranked facets, for query sets Q1 and Q2.]

The number of queries used for evaluation and the precision and diversity of the models are reported in Table 1 and Figure 1. From Figure 1, it is evident that the subtree density model performs better than the baselines in terms of precision (measured by relevance), both in summarization and in expansion. Table 1b indicates that the subtree density model also outperforms the baseline (based on ranked results) in terms of diversity. These results are explained by the fact that, in our model, important facets come from dense clusters of search results in the taxonomy.

5. CONCLUSIONS
In this paper we have presented the subtree density model for summarizing and expanding search results mapped to a subject taxonomy. Evaluation of the method using human evaluators indicates that it is effective, as it optimizes both relevance and diversity. A next step in our research is to develop navigation models for interactive browsing, consisting of the presented facets and their corresponding Web pages.

Acknowledgements
This research was funded by the PuppyIR project (EU FP7 231507).

6. REFERENCES
[1] S. Duarte Torres, D. Hiemstra, and P. Serdyukov. An analysis of queries intended to search information for children. In Third Symposium on Information Interaction in Context. ACM, 2010.
[2] K. Järvelin and J. Kekäläinen. IR evaluation methods for retrieving highly relevant documents. In SIGIR Conference on Research and Development in Information Retrieval. ACM, 2000.
[3] A. Nurani Venkitasubramanian and M.-F. Moens. Selection of search facets. In CORIA 2013 - 10th Francophone Information Retrieval Conference, 2013.