Site Search Using Profile-Based Document Summarisation

Azhar Alhindi, University of Essex, Colchester, UK, ahalhi@essex.ac.uk
Udo Kruschwitz, University of Essex, Colchester, UK, udo@essex.ac.uk
Chris Fox, University of Essex, Colchester, UK, foxcj@essex.ac.uk


ABSTRACT
Text summarisation is the process of distilling the most important
information from a source to produce an abridged version for a particular
user or task. This demo presents the use of profile-based summarisation to
provide contextualisation and interactive support for site search and
enterprise search. We employ log analysis to acquire continuously updated
profiles, which we use to produce profile-based summaries of search results.
These profiles can capture an individual’s interests or those of a group of
users. Here we look at acquiring profiles for groups of users.

1. MOTIVATION
   Summarisation is a broad area of research [8]. The sort of information
contained in a summary differs according to the mechanism used in the
summarisation process: it may highlight the basic idea (generic
summarisation), or it may highlight the specific user’s individual area of
interest (personalised summarisation). One of the techniques used to achieve
personalisation is user profiling. User profiles may include the preferences
or interests of a single user or a group of users and may also include
demographic information [4]. Normally, a user profile contains topics of
interest to that single user. We are interested in capturing profiles not of
single users but of groups of users.
   We utilise query and click logs to acquire a profile reflecting the
population’s search patterns, and this profile is automatically updated in a
continuous learning cycle. We then apply the acquired profiles in the
summarisation process to support users searching a document collection. The
potential of personalised summarisation over generic summaries has already
been demonstrated, e.g. [3], but summarisation of Web documents is typically
based on the query rather than a full profile, e.g. [11, 9]. Our specific
interest lies in enterprise search, which is different from Web search and
has attracted less attention [5]. The benefit of this context is that we can
expect a more homogeneous population of searchers who are likely to share
interests and information needs. Our hypothesis is that profile-based
summarisation can help a user in this process and guide the user to the right
documents more easily (e.g. by presenting the summaries instead of or
alongside snippets).

2. METHODS AND EXAMPLES
   The demo presents an integrated Solr-based search system applying a number
of different methods for building summaries for search results. The first two
algorithms were designed for traditional (generic) summarisation, and they
represent widely used baselines, e.g. [12]. The other three are all
variations of an approach that has been proposed in the literature for
building an adaptive community profile/domain model, a “biologically inspired
model based on ant colony optimisation applied to query logs as an adaptive
learning process” [1]. The approach is simple to implement: query logs are
segmented into sessions and then turned into a graph structure. Figure 1
gives an example of part of the profile as it has been derived from our query
logs. We used the log files collected on the existing search engine over a
period of three years (see footnote 1) to bootstrap this ant colony
optimisation (ACO) model, i.e. our profile. The example illustrates the
domain-specific nature of the derived profiles, e.g. the University library
is named after Albert Sloman.
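   The step from query logs to a graph structure can be sketched roughly as
follows. This is a minimal illustration under stated assumptions, not the
implementation used in the demo: the record layout (user, timestamp, query),
the 30-minute inactivity threshold for delimiting sessions, and the function
names segment_sessions and build_profile_graph are all hypothetical. Edge
weights here are simple counts of consecutive query pairs; the weight update
scheme of the ant colony optimisation model in [1] is not reproduced.

```python
from collections import defaultdict

# Assumption: the paper does not say how sessions are delimited; a 30-minute
# inactivity gap is used here purely for illustration.
SESSION_GAP_SECONDS = 30 * 60


def segment_sessions(log, gap=SESSION_GAP_SECONDS):
    """Group (user, timestamp, query) records into lists of queries.

    A new session starts when the user changes or when the time since the
    user's previous query exceeds `gap` seconds.
    """
    sessions = []
    prev_user, prev_ts = None, None
    for user, ts, query in sorted(log, key=lambda r: (r[0], r[1])):
        if user != prev_user or ts - prev_ts > gap:
            sessions.append([])
        sessions[-1].append(query.strip().lower())
        prev_user, prev_ts = user, ts
    return sessions


def build_profile_graph(sessions, deposit=1.0):
    """Return {query: {follow_up_query: weight}} from consecutive query pairs.

    Each time one query is directly followed by a different query within a
    session, the weight of the corresponding directed edge grows by `deposit`
    (a hypothetical, simplified weighting).
    """
    graph = defaultdict(lambda: defaultdict(float))
    for session in sessions:
        for src, dst in zip(session, session[1:]):
            if src != dst:
                graph[src][dst] += deposit
    # Convert to plain dicts so that lookups never create nodes by accident.
    return {src: dict(dsts) for src, dsts in graph.items()}
```

   Given two sessions in which “library” is followed by “albert sloman
library” and by “library opening times” respectively, the resulting graph
links the node “library” to both follow-up queries, in the spirit of the
partial profile shown in Figure 1.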

Figure 1: Partial profile derived from query logs.

   A profile-based (extractive) summary of a document is then generated by
turning the profile into a flat list of terms (we use three different methods
to do this, as explained further down) and selecting those sentences from the
document that are most similar to the profile using cosine similarity.
Figure 2 shows an architectural diagram of our profile-based summarisation
system. Following the DUC 2002 convention, we select 100-word abstracts [7].
This gives us the following five methods:

     1. Random: Selects sentences from the document randomly [12].
     2. Centroid: A centroid-based approach to summarisation [10]. This
        algorithm takes into account first-sentence overlap and positional
        value, which are then used to generate a coherent summary.
     3. ACO: A query graph built by processing the log data according to [1].
        The entire model is turned into a flat list of terms for
        summarisation.
     4. ACO trimmed: Starting with ACO, we trim all those edges whose weights
        fall below the overall average edge weight. The remaining model is
        turned into a flat list of terms for summarisation.
     5. ACO query refinements: The terms used for summarisation are all those
        that are directly linked to the query node in the ACO model.

Figure 2: Architecture of profile-based single-document summariser.

   Note that ACO and ACO trimmed are query-independent as they use the entire
model to generate the summary, whereas ACO query refinements is
query-specific (for the frequently submitted query “library” the selected
terms are library, albert sloman library, library homepage, library opening
times and catalogue; see Figure 1).
   To illustrate the different summaries obtained using three of the
summarisation methods, we apply them to the University of Essex Library
homepage (see footnote 2) and get the following summaries:

     • Random: We provide you with first class library facilities to
       complement and assist your studies. The Library allows you to access
       more than one million books, periodicals and microfilms. Our
       collections covering Latin America, Russia and Eastern Europe are of
       national significance. Viewing facilities for DVDs and videos are also
       available.
     • Centroid: We provide you with first class library facilities to
       complement and assist your studies. You can find out more by accessing
       our University library services. Our Albert Sloman Library is just a
       few minutes’ walk from teaching buildings and student accommodation at
       our Essex Campus.
     • ACO query refinements: In addition, 110 networked PCs and terminals
       provide access to over 47,000 online journals, databases, e-books and
       library catalogues. Students at our Essex Campus can visit the Albert
       Sloman Library or borrow books from its collection via a daily
       dispatch service. The Albert Sloman Library has long opening hours, a
       total of 84 hours over seven days a week during term and 42.5 to 84
       hours in vacations.

   Obviously, the actual usefulness of such summaries can only be assessed in
a realistic search setting. In a pilot study we found that the ACO-based
summaries have the potential to outperform the different baselines [2]. A
task-based evaluation using TREC Interactive Track guidelines is currently
being conducted. As the immediate next step, we are interested in
investigating how the profile can be integrated into multi-document
summarisation.

Footnote 1: More than 1.5 million queries, described in more detail
elsewhere [6].
Footnote 2: http://www.essex.ac.uk/life/library/

3. REFERENCES
 [1] M-D. Albakour, U. Kruschwitz, N. Nanas, D. Song, M. Fasli, and
     A. De Roeck. Exploring ant colony optimisation for adaptive interactive
     search. In Proceedings of ICTIR, pages 213–224. Springer, 2011.
 [2] A. Alhindi, U. Kruschwitz, and C. Fox. A pilot study on using
     profile-based summarisation for interactive search assistance. In
     Proceedings of ECIR, pages 672–675, 2013.
 [3] A. Díaz and P. Gervás. User-model based personalized summarization.
     Information Processing & Management, 43(6):1715–1734, 2007.
 [4] S. Gauch, M. Speretta, A. Chandramouli, and A. Micarelli. User profiles
     for personalized information access. The Adaptive Web, pages 54–89,
     2007.
 [5] D. Hawking. Enterprise Search. In R. Baeza-Yates and B. Ribeiro-Neto,
     editors, Modern Information Retrieval, pages 641–683. Addison-Wesley,
     2nd edition, 2011.
 [6] U. Kruschwitz, D. Lungley, M-D. Albakour, and D. Song. Deriving Query
     Suggestions for Site Search. JASIST, 2013. Forthcoming.
 [7] C.Y. Lin and E. Hovy. Automatic evaluation of summaries using n-gram
     co-occurrence statistics. In Proceedings of HLT-NAACL, pages 71–78.
     ACL, 2003.
 [8] A. Nenkova and K. McKeown. Automatic summarization. Now Publishers,
     2011.
 [9] S. Park. Personalized summarization agent using non-negative matrix
     factorization. PRICAI 2008: Trends in Artificial Intelligence, pages
     1034–1038, 2008.
[10] D.R. Radev, H. Jing, M. Stys, and D. Tam. Centroid-based summarization
     of multiple documents. Information Processing & Management,
     40(6):919–938, 2004.
[11] C. Wang, F. Jing, L. Zhang, and H.J. Zhang. Learning query-biased web
     page summarization. In Proceedings of CIKM, 2007.
[12] R. Yan, J.Y. Nie, and X. Li. Summarize what you are interested in: An
     optimization framework for interactive personalized summarization. In
     Proceedings of EMNLP, pages 1342–1351, 2011.

Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are not
made or distributed for profit or commercial advantage and that copies bear
this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior
specific permission and/or a fee.
DIR 2013, April 26, 2013, Delft, The Netherlands.
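
   As a supplementary illustration of Section 2, the three ways of turning
the profile into a flat list of terms (ACO, ACO trimmed, ACO query
refinements) and the cosine-based sentence selection can be sketched as
below. This is a sketch under stated assumptions rather than the demo's
implementation: the graph layout {node: {neighbour: weight}}, the function
names, the use of raw term counts as vector weights, the regular-expression
sentence splitter, the greedy filling of the 100-word budget, and the reading
of "directly linked" as outgoing edges of the query node are all
hypothetical choices.

```python
import math
import re
from collections import Counter


def flatten_profile(graph, mode="aco", query=None):
    """Turn a {node: {neighbour: weight}} profile into a flat list of terms."""
    edges = [(s, d, w) for s, nbrs in graph.items() for d, w in nbrs.items()]
    terms = []
    if mode == "aco_trimmed":
        # Drop edges whose weight falls below the overall average edge weight.
        mean = sum(w for _, _, w in edges) / len(edges) if edges else 0.0
        edges = [e for e in edges if e[2] >= mean]
    elif mode == "aco_query_refinements":
        if query is None:
            raise ValueError("query is required for aco_query_refinements")
        # Assumption: "directly linked" means outgoing edges of the query node.
        edges = [e for e in edges if e[0] == query]
        terms.append(query)
    elif mode != "aco":
        raise ValueError(f"unknown mode: {mode}")
    for src, dst, _ in edges:
        terms.extend([src, dst])
    return list(dict.fromkeys(terms))  # de-duplicate, keep first-seen order


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two Counter term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def summarise(document, terms, max_words=100):
    """Pick the sentences most similar to the profile terms, up to max_words."""
    profile_vec = Counter(tok for term in terms for tok in tokenize(term))
    sentences = [
        s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()
    ]
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: cosine(Counter(tokenize(sentences[i])), profile_vec),
        reverse=True,
    )
    chosen, used = [], 0
    for i in ranked:
        length = len(sentences[i].split())
        if used + length <= max_words:  # greedy fill of the word budget
            chosen.append(i)
            used += length
    # Present the selected sentences in their original document order.
    return " ".join(sentences[i] for i in sorted(chosen))
```

   With mode="aco" or mode="aco_trimmed" the term list does not depend on the
query, whereas mode="aco_query_refinements" restricts it to the query node
and its neighbours, matching the query-independent versus query-specific
distinction drawn in Section 2.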