Site Search Using Profile-Based Document Summarisation

Azhar Alhindi, University of Essex, Colchester, UK, ahalhi@essex.ac.uk
Udo Kruschwitz, University of Essex, Colchester, UK, udo@essex.ac.uk
Chris Fox, University of Essex, Colchester, UK, foxcj@essex.ac.uk

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
DIR 2013, April 26, 2013, Delft, The Netherlands.

ABSTRACT
Text summarisation is the process of distilling the most important information from a source to produce an abridged version for a particular user or task. This demo presents the use of profile-based summarisation to provide contextualisation and interactive support for site search and enterprise search. We employ log analysis to acquire continuously updated profiles to provide profile-based summarisations of search results. These profiles could capture an individual's interests or those of a group of users. Here we look at acquiring profiles for groups of users.

1. MOTIVATION
Summarisation is a broad area of research [8]. The sort of information contained in a summary differs according to the mechanism used in the summarisation process: it may highlight the basic idea (generic summarisation), or it may highlight the specific user's individual area of interest (personalised summarisation). One of the techniques used to achieve personalisation is user profiling. User profiles may include the preferences or interests of a single user or a group of users and may also include demographic information [4]. Normally, a user profile contains topics of interest to that single user. We are interested in capturing profiles not of single users but of groups of users.

We utilise query and click logs to acquire a profile reflecting the population's search patterns, and this profile is automatically updated in a continuous learning cycle. We then apply the acquired profiles in the summarisation process to support users searching a document collection. The potential of personalised summarisation over generic summaries has already been demonstrated, e.g. [3], but summarisation of Web documents is typically based on the query rather than a full profile, e.g. [11, 9]. Our specific interest lies in enterprise search, which is different from Web search and has attracted less attention [5]. The benefit of this context is that we can expect a more homogeneous population of searchers who are likely to share interests and information needs. Our hypothesis is that profile-based summarisation can help a user in this process and guide the user to the right documents more easily (e.g. by presenting the summaries instead of or alongside snippets).

2. METHODS AND EXAMPLES
The demo presents an integrated Solr-based search system applying a number of different methods for building summaries for search results. The first two algorithms were designed for traditional (generic) summarisation, and they represent widely used baselines, e.g. [12]. The other three are all variations of an approach that has been proposed in the literature for building an adaptive community profile/domain model, a “biologically inspired model based on ant colony optimisation applied to query logs as an adaptive learning process” [1]. The approach is simple to implement: query logs are segmented into sessions and then turned into a graph structure. Figure 1 gives an example of part of the profile as it has been derived from our query logs. We used the log files collected on the existing search engine over a period of three years¹ to bootstrap this ant colony optimisation (ACO) model, i.e. our profile. The example illustrates the domain-specific nature of the derived profiles, e.g. the University library is named after Albert Sloman.

Figure 1: Partial profile derived from query logs.

A profile-based (extractive) summary of a document is then generated by turning the profile into a flat list of terms (we use three different methods to do this, as explained further down) and selecting those sentences from the document that are most similar to the profile using cosine similarity. Figure 2 shows an architectural diagram for our profile-based summarisation system. Following DUC 2002 convention we select 100-word abstracts [7]. This gives us the following five methods:

Figure 2: Architecture of profile-based single-document summariser.

1. Random: Selects sentences from the document randomly [12].
2. Centroid: A centroid-based approach to summarisation [10]. This algorithm takes into account first-sentence overlap and positional value, which are then used to generate a coherent summary.
3. ACO: A query graph built by processing the log data according to [1]. The entire model is turned into a flat list of terms for summarisation.
4. ACO trimmed: Starting with ACO we trim all those edges whose weights fall below the overall average weight of an edge. The remaining model is turned into a flat list of terms for summarisation.
5. ACO query refinements: The terms used for summarisation are all those that are directly linked to the query node in the ACO model.

Note that ACO and ACO trimmed are query-independent as they use the entire model to generate the summary, whereas ACO query refinements is query-specific (for the frequently submitted query “library” the selected terms are library, albert sloman library, library homepage, library opening times and catalogue, see Figure 1).

To illustrate the different summaries obtained using three of the summarisation methods, we apply the methods to the University of Essex Library homepage² and get the following summaries:

• Random: We provide you with first class library facilities to complement and assist your studies. The Library allows you to access more than one million books, periodicals and microfilms. Our collections covering Latin America, Russia and Eastern Europe are of national significance. Viewing facilities for DVDs and videos are also available.
• Centroid: We provide you with first class library facilities to complement and assist your studies. You can find out more by accessing our University library services. Our Albert Sloman Library is just a few minutes' walk from teaching buildings and student accommodation at our Essex Campus.
• ACO query refinements: In addition, 110 networked PCs and terminals provide access to over 47,000 online journals, databases, e-books and library catalogues. Students at our Essex Campus can visit the Albert Sloman Library or borrow books from its collection via a daily dispatch service. The Albert Sloman Library has long opening hours, a total of 84 hours over seven days a week during term and 42.5 to 84 hours in vacations.

Obviously, the actual usefulness of such summaries can only be assessed in a realistic search setting. In a pilot study we found that the ACO-based summaries have the potential of outperforming the different baselines [2]. A task-based evaluation using TREC Interactive Track guidelines is currently being conducted. As the immediate next step, we are interested in investigating how the profile can be integrated into multi-document summarisation.

¹ More than 1.5 million queries, described in more detail elsewhere [6].
² http://www.essex.ac.uk/life/library/

3. REFERENCES
[1] M-D. Albakour, U. Kruschwitz, N. Nanas, D. Song, M. Fasli, and A. De Roeck. Exploring ant colony optimisation for adaptive interactive search. In Proceedings of ICTIR, pages 213–224. Springer, 2011.
[2] A. Alhindi, U. Kruschwitz, and C. Fox. A pilot study on using profile-based summarisation for interactive search assistance. In Proceedings of ECIR, pages 672–675, 2013.
[3] A. Díaz and P. Gervás. User-model based personalized summarization. Information Processing & Management, 43(6):1715–1734, 2007.
[4] S. Gauch, M. Speretta, A. Chandramouli, and A. Micarelli. User profiles for personalized information access. The Adaptive Web, pages 54–89, 2007.
[5] D. Hawking. Enterprise Search. In R. Baeza-Yates and B. Ribeiro-Neto, editors, Modern Information Retrieval, pages 641–683. Addison-Wesley, 2nd edition, 2011.
[6] U. Kruschwitz, D. Lungley, M-D. Albakour, and D. Song. Deriving Query Suggestions for Site Search. JASIST, 2013. Forthcoming.
[7] C.Y. Lin and E. Hovy. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of HLT-NAACL, pages 71–78. ACL, 2003.
[8] A. Nenkova and K. McKeown. Automatic summarization. Now Publishers, 2011.
[9] S. Park. Personalized summarization agent using non-negative matrix factorization. PRICAI 2008: Trends in Artificial Intelligence, pages 1034–1038, 2008.
[10] D.R. Radev, H. Jing, M. Stys, and D. Tam. Centroid-based summarization of multiple documents. Information Processing & Management, 40(6):919–938, 2004.
[11] C. Wang, F. Jing, L. Zhang, and H.J. Zhang. Learning query-biased web page summarization. In Proceedings of CIKM, 2007.
[12] R. Yan, J.Y. Nie, and X. Li. Summarize what you are interested in: An optimization framework for interactive personalized summarization. In Proceedings of EMNLP, pages 1342–1351, 2011.
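
The profile-based sentence selection described in Section 2 (flatten the profile graph into a term list, then pick the sentences most similar to it under cosine similarity, up to a word budget) could be sketched roughly as follows. This is a minimal illustration and not the system presented in the demo: the function names, the regular-expression tokeniser, the raw term-count vectors, the greedy word budget and the toy edge weights are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: simple lower-cased alphanumeric tokens, no stemming or stop-word removal.
    return re.findall(r"[a-z0-9]+", text.lower())


def trim_below_average(edges):
    # "ACO trimmed": drop edges whose weight falls below the average edge weight.
    # edges maps (source_query, target_query) -> weight (hypothetical representation).
    if not edges:
        return {}
    average = sum(edges.values()) / len(edges)
    return {edge: w for edge, w in edges.items() if w >= average}


def profile_terms(edges, query=None):
    # Flatten the graph into a bag of terms.
    # query=None: use every node (query-independent variants).
    # query given: use the query node and the nodes directly linked from it.
    if query is None:
        nodes = {node for edge in edges for node in edge}
    else:
        nodes = {query} | {dst for (src, dst) in edges if src == query}
    return Counter(tok for node in sorted(nodes) for tok in tokenize(node))


def cosine(a, b):
    # Cosine similarity between two term-count vectors stored as Counters.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def summarise(sentences, profile, max_words=100):
    # Rank sentences by similarity to the profile, then greedily add the
    # highest-ranked ones that still fit into the word budget.
    ranked = sorted(
        enumerate(sentences),
        key=lambda pair: cosine(Counter(tokenize(pair[1])), profile),
        reverse=True,
    )
    chosen, used = [], 0
    for index, sentence in ranked:
        length = len(sentence.split())
        if used + length <= max_words:
            chosen.append(index)
            used += length
    # Return the selected sentences in their original document order.
    return [sentences[i] for i in sorted(chosen)]


if __name__ == "__main__":
    # Toy profile fragment; the weights are invented for illustration only.
    toy_edges = {
        ("library", "albert sloman library"): 0.9,
        ("library", "catalogue"): 0.6,
        ("library", "sports"): 0.1,
        ("timetable", "exam timetable"): 0.4,
    }
    page = [
        "The Albert Sloman Library has long opening hours.",
        "Parking is available on campus.",
        "Search the library catalogue online.",
    ]
    refinements = profile_terms(trim_below_average(toy_edges), query="library")
    print(summarise(page, refinements, max_words=13))
```

Passing `query=None` to `profile_terms` gives the query-independent variants (the whole model, optionally trimmed first), while passing a query restricts the term list to that node's direct neighbours. Returning the chosen sentences in document order is a design choice for readability; the source does not specify how selected sentences are ordered.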