New Research Directions in Search Results Clustering Claudio Carpineto, Andrea Bernardini, Massimiliano D’Amico, Gianni Romano Fondazione Ugo Bordoni Rome, Italy {carpinet, abernardini, romano}@fub.it mas.damico@gmail.com ABSTRACT to description-centric techniques, depending on whether the We discuss which are the main research themes in the field priority is given to cluster formation or cluster labeling. of search results clustering and report some recent results One of the most recent examples of the latter category is achieved by the Information Mining group at Fondazione KeySRC (Keyphrase-based Search Results Clustering), de- Ugo Bordoni. scribed in [1]. This system generates clusters labeled by keyphrases. The keyphrases are extracted from the gener- alized suffix tree built from the search results and merged Categories and Subject Descriptors through an improved hierarchical agglomerative clustering H.3.3 [Information Storage and Retrieval]: Information procedure, representing each phrase as a weighted docu- Search and Retrieval—Clustering ment vector and making use of a variable dendrogram cut-off value. KeySRC is available at http://keysrc.fub.it. 1. SEARCH RESULTS CLUSTERING Search results clustering organizes search results by topic, 1.2 Performance evaluation measures thus providing a complementary view to the flat list returned Internal validity measures and comparison with ground by document ranking systems. This approach is especially truth results are two common ways of evaluating clustering useful when document ranking fails. Besides allowing direct partitions, but they have the disadvantage that the perfor- subtopic access, search results clustering reduces informa- mance of the system in which the document partition is tion overlook, helps filtering out irrelevant items, and favors encompassed is not explicitly taken into account. As the in- exploration of unknown or dynamic domains. tended use of search results clustering is to find documents Search results clustering is related to, but distinct from, relevant to the single query’s subtopic, it may be more con- conventional document clustering. When clustering takes venient to evaluate the performance on a retrieval oriented place as a post-processing step on the set of results retrieved task. However, the classical measures related to subtopic by an information retrieval system on a query, it may be retrieval, such as subtopic recall, subtopic precision, and both more efficient, because the input consists of few hun- subtopic MRR, assume that the system output consists of dred of snippets, and more effective, because query-specific a ranked list and thus they are not directly or easily appli- text features are used. On the other hand, search results cable to clustered results, Furthermore, they strictly focus clustering must fulfill a number of more stringent require- on subtopic coverage; i.e., retrieving at least one relevant ments raised by the nature of the application in which it document per subtopic. is embedded; e.g., meaningful cluster labels, low response To address these limitations, we presented a new evalua- times, short input data description, unknown number of tion measure inspired by Cooper’s expected search length: clusters, overlapping clusters. Subtopic Search Length under k document sufficiency (kSSL). A comprehensive survey of search results clustering, in- The idea is to consider the number of elements (cluster la- cluding issues, techniques, and systems is given in [4]. In bels or search results) that the user must examine to retrieve the remainder of this paper we point out interesting research a specified number (k) of documents relevant to the single directions. subtopics of a query. The shorter the search length, the bet- ter the system performance. It is assumed that both cluster 1.1 Description-centric clustering algorithms labels and search results are read sequentially from top to Given that search results clustering systems are primarily bottom, and that only cluster with labels relevant to the intended for browsing retrieval, a critical part is the quality subtopic at hand are opened. The main advantages of kSSL of cluster labels, as opposed to optimizing only the clustering are that it is suitable for both ranked lists and clustered re- structure. In fact, the algorithms for performing search re- sults and that it allows evaluation of full subtopic retrieval sults clustering cover a spectrum ranging from data-centric (i.e., retrieval of multiple documents relevant to a query’s subtopic). A full description of kSSL is given in [1]. 1.3 Test collections There is almost a complete lack of test collections with Appears in the Proceedings of the 1st Italian Information Retrieval subtopic relevance judgments. Two exceptions are the col- Workshop (IIR’10), January 27–28, 2010, Padova, Italy. lections developed at the TREC Interactive track, which is http://ims.dei.unipd.it/websites/iir10/index.html Copyright owned by the authors. small and primarily focuses on the instances of a given con- bilistic ranking principle when a topic has multiple aspects cept (e.g., ‘what tropical storms – hurricanes and typhoons of potential interest and the relevance criterion alone is not – have caused property damage and/or loss of life’), and sufficient. at Image CLEF, which is mainly about geographical diver- These two techniques have not been compared so far. We sity of photos associated with a given topic (e.g., ‘images of performed a systematic evaluation of several clustering and beaches in Brazil’). diversification algorithms using multiple test collections and We created two new test collections for evaluating subtopic evaluation measures [2]. It turns out that diversification retrieval, namely AMBIENT and ODP-239. AMBIENT works well when one wants to get a quick overview of doc- (AMBIguous ENTries) consists of 44 topics extracted from uments relevant to distinct subtopics, whereas clustering is the ambiguous Wikipedia entries, each with a set of subtopics more useful when one is interested in retrieving multiple and a list of 100 ranked search results manually annotated documents relevant to each subtopic. with subtopic relevance judgments. AMBIENT is fully de- scribed in [3] and is available at http://credo.fub.it/ambient. 1.7 Other research directions ODP-239 consists of 239 topics, each with about 10 subtopics There are further directions that have started to be ex- and 100 documents associated with the subtopics. The top- plored recently by other research groups. They mainly aim ics, subtopics, and their associated documents were selected to improve the quality and effectiveness of the search results from the Open Directory Project (www.dmoz.org). The dis- clustering process. A non-exhaustive list is given below. tribution of documents across subtopics reflects the relative – Personalized search results clustering importance of subtopics. ODP-239 can be downloaded from – Integrating external knowledge (e.g., thesauri, meta- http://credo.fub.it/odp239. data, folksonomies, past queries) with search results clus- tering 1.4 Applications in mobile search – Semi-supervised search results clustering The features of search results clustering appear very suit- – Temporal search results clustering able for mobile information retrieval, where a minimization – Visualization of clustered search results of user actions (such as scrolling and typing), device re- – Search results clustering and faceted hierarchies sources, and amount of data to be downloaded are primary concerns. Furthermore, such features seem to nicely comply 2. REFERENCES with the most recently observed usage patterns of mobile [1] A. Bernardini, C. Carpineto, and M. D’Amico. searchers. Full-Subtopic Retrieval with Keyphrase-Based Search We implemented two mobile clustering engines (for PDAs Results Clustering. In Proceedings of 2009 and cellphones) and evaluated their retrieval performance IEEE/WIC/ACM International Conference on Web [3]. We found that mobile clustering engines can be faster Intelligence, Milan, Italy, pages 206–213. IEEE and more accurate than the corresponding mobile search en- Computer Society, 2009. gines, especially for subtopic retrieval tasks. We also found [2] C. Carpineto, M. D’Amico, and G. Romano. Evaluating that although mobile retrieval becomes, in general, less ef- subtopic retrieval system performance: clustering fective as the search device gets smaller, the adoption of versus diversification. Submitted. clustering may help expand the usage patterns beyond mere [3] C. Carpineto, S. Mizzaro, G. Romano, and M. Snidero. informational search while mobile. Mobile Information Retrieval with Search Results 1.5 Meta search results clustering Clustering: Prototypes and Evaluations. Journal of American Society for Information Science and Just as the results of several search engines can be com- Technology (JASIST), 60(5):877–895, 2009. bined into a meta search engine, the outputs produced by [4] C. Carpineto, S. Osiński, G. Romano, and D. Weiss. A distinct clustering engines can be merged into a meta cluster- survey of Web clustering engines. ACM Computing ing engine. Currently, there are many different web cluster- Survey, 41(3), 2009. ing engines but no attempts has still been made to combine them, to the best of our knowledge. [5] C. Carpineto and G. Romano. Optimal Meta Search We studied the problem of meta search results clustering, Results Clustering. Submitted. that has unique features with respect to the relatively well understood field of general meta clustering. After showing that the combination of multiple search results clustering al- gorithms is empirically justified, we developed a novel meta clustering algorithm that maximizes the agreement between the outputs produced by the input clustering algorithms [5]. The novel meta clustering algorithm applied to web search results is both efficient and effective. 1.6 Clustering versus diversification of search results Re-ranking search results to promote diversity of top el- ements is another approach to subtopic retrieval that has received much attention lately. Clustering and diversifica- tion of search results are thus different techniques with a similar goal, i.e., addressing the limitations of the proba-