Exploiting Information Needs and Bibliographics for Polyrepresentative Document Clustering Muhammad Kamran Abbasi and Ingo Frommholz Institute for Research in Applicable Computing University of Bedfordshire {muhammad.abbasi|ingo.frommholz}@beds.ac.uk Abstract. In this paper we explore the potential of combining the prin- ciple of polyrepresentation with document clustering. Our idea is dis- cussed and evaluated for polyrepresentation of information needs as wells as for document-based polyrepresentation where bibliographic informa- tion is used as representation. The main idea is to present the user with the highly ranked polyrepresentative clusters to support the search pro- cess. Our evaluation suggests that our approach is capable of increasing retrieval performance, but performance varies for queries with a high or low number of relevant documents. 1 Introduction The key objective of information retrieval systems is to satisfy the user’s in- formation need in the provided context. In particular Interactive Information Retrieval (IIR) is supposed to support the user beyond just typing in queries. In this respect the principle of polyrepresentation is a highly recognized approach to IIR [1]. It suggests to use various information/data representations to integrate the context and interpretation of different actors into the information retrieval process. The representations can come from the same actor but for different pur- poses and are functionally different, or they can come from various actors and are cognitively different. Polyrepresentation may refer to the information space, the user’s cognitive space or a combination thereof. In Bibliometrics, references cit- ing another source establish a cognitively different representation of a document. Examples for different information need representations are work task descrip- tions and the query. The principle of polyrepresentation in IR could be described as follows: if multiple representations are pointing towards an information object it is more likely to be relevant to the user’s information need. This is depicted in Figure 1. Let us assume R represents the relevance of a representation, hence R1 denotes the documents relevant to representation 1, R2 to representation 2 and so on, so the documents in R1 , R2 and R3 are only relevant to these indi- vidual representations. The intersection of the two representations i.e, R12 R13 and R23 holds the documents relevant to the two respective representations, and the intersection of all three representation, R123 is the total cognitive overlap. According to the principle of polyrepresentation this set is supposed to hold the most relevant documents as evaluated in [2,3,4]. Fig. 1. Polyrepresentation and Clustering. The left hand side shows the relevance sets w.r.t. combinations of different representations. The right hand side shows the induced rankings. Small circles denote relevant documents. The principle of polyrepresentation has been evaluated so far in ad hoc re- trieval. However, while different combinations of representations were evaluated in a more static fashion, some open problems remain from a user perspective. A system may be able to combine different representations, but the system ini- tially does not know whether and to what degree the user prefers some of them. A weighting mechanism as proposed in [5] may mitigate the situation, but still the system needs additional information. We argue that instead of presenting users with a ranked list of results, we may present them with clusters they can choose from. Clustering and polyrepresentation both create a partitioning of the information objects under consideration. One approach to combine document clustering and polyrepresentation is described by a “polyrepresentation cluster hypothesis”: documents relevant to the same representation should appear in the same cluster [6]. The possible application of the above discussed approach could be presenting the clusters to the user in a way depicted in Figure 1. In this case, we present the user with a cluster containing a ranked list of documents representing the total cognitive overlap first. We assume the user browses the top k documents and then moves on to the next cluster (one where only a subset of all representations is relevant), depending on her representation preferences, where she examines again some top k documents. 2 Bibliometrics, Polyrepresentation and Document Clustering The science modeling and the bibliometrics provides the means to analyze and quantify the structure and process of scholarly communication [7]. The methods range from citation analysis [8], co-citation clustering [9] to bradfordizing [10]. The developments in e-publishing and availability of the full text and meta-data regarding the scientific information objects increased the scope and applicability of the bibliometrics, hence more refined measures are needed [7]. The connec- tions between IR, bibliometrics and relevance theory are discussed in [11]. The citation networks and clustering methods suitable for block modelling in similar context are discussed in [12]. The suitability of the bibliometirc measures for en- hancement of retrieval in scholarly systems is presented and evaluated in [13,10]. The authors have evaluated bibliometrics approaches like bradfordizing for re- ranking and co-word model for query expansion in search term recommender and report performance improvement. The principle of polyrepresentation in similar context has been discussed in [14] using the references and citation information. We look at the principle of polyrepresentation as a method go along with the bib- liometric approaches, because it allows the use of various representations, hence the bibliographic information i.e. authors, references and the citation context could be exploited as representations. The Optimum Clustering Framework (OCF) [15] provides a theoretical jus- tification for clustering and is based on the notion of using so-called query sets. We propose to apply the OCF for inferring the suitable candidate clusters rep- resenting the various polyrepresentation sets R. The OCF operates on the prob- ability of relevance Pr(R|d, qi ) of the document d with respect to a query qi in the query set; since in polyrepresentation we are dealing with measuring the degree or probability of relevance for each representation, OCF is a suitable framework for our ideas. From this, each document is represented by the vector τ T (d) = (Pr(R|d, q1 ), . . . , Pr(R|d, qn )) with n as the number of queries in the query set. The vectors are now used to cluster documents with the choice of clustering function depending on the overall setup. If each term in the collec- tion is regarded as a query qi the OCF simply models classical clustering using term-based similarity. To set up the OCF for polyrepresentation we need to distinguish between polyrepresentation of information needs on the one hand and polyrepresentation of documents on the other hand. In order to apply clustering to information need polyrepresentation let REPin be the representations of an information need in. Pr(R|d, ri ) is computed for each document d and ri ∈ REPin . From this we create a vector τ T (d) = (Pr(R|d, r1 ), . . . , Pr(R|d, rn )) with n = |REPin |. When applying polyrepresentation of documents, REPd consists of the different repre- sentations rdi of a document d. In our case we assume that the information need is represented by the query q alone1 . We therefore need to compute Pr(R|rdi , q) and we get τ T (d) = (Pr(R|rd1 , q), . . . , Pr(R|rdn , q)) with n = |REPd |. 1 The combination of document and information need polyrepresentation is subject to future work. 3 Evaluation Collection and clustering approach In order to evaluate the proposed ap- proach, the PF part of the iSearch2 [16] collection is used. This sub collection contains full text articles related to the physics domain. The collection comes with 65 search tasks, each search task comprises upon five information need representations i.e. Search Terms, Work Task, Current Information Need, Ideal Answer and Background Knowledge. These five representation make REPin for the R set for information need based polyrepresentation. In a first step to esti- mate Pr(R|d, ri ) BM25 based document weights were computed for every infor- mation need representation ri using the Terrier IR platform [17]. The weights for each representation were combined using CombSum [18] to get a retrieval weight for each document based on the single representations. OCF-based clus- tering was performed with k-means. k (in k-means) was set to 2|REPin | to create exactly as many clusters as there are combinations of relevant representation in polyrepresentation. The second exploration has been about document based polyrepresentation, for this the representations i.e title, abstract, body, context, and references were extracted from the documents. The context c of a document d has been ex- tracted from all the articles cited in d, initially only the title and abstract of the cited document were combined and used as a context. Hence, the context for a document d becomes the concatenation of all abstracts and titles of the cited documents in it, as depicted in Figure 2. The derived representations in this Fig. 2. Document Context part make the REPd part for the R set. The representation here has also been indexed and the BM25 weights were computed as discussed above. The Search Term part of the all 65 topics in the iSearch collection has been used as a query. Baseline and Polyrepresentative Clustering The goal of our evaluation is to verify the effect of clustering on polyrepresentation. Therefore our baseline is to combine the BM25 weights mentioned above from each representation using CombSum, which creates a ranked list that utilises different representations but no clustering. 2 http://itlb.dbit.dk/~isearch/ To compare our polyrepresentative clustering approach against the baseline we create a new cluster-based ranking. As mentioned before we assume a system that presents the user with the total cognitive overlap cluster first, i.e. the user is presented with a ranked list of documents in this cluster. We further assume that the user examines the top k documents and then moves on to a different cluster. To simulate this behaviour for evaluation purposes our strategy is as follows. We first create a clustering C as described above. From this we produce an artificial ranking as described in Algorithm 1. The ranking simulates the documents that a user would see along the path when examining the top k documents of each cluster. We rank the clusters in C so that the top ranked cluster is an approximation of the total cognitive overlap (please see [6] for a further discussion). We now process the cluster ranking in descending order. We rank the documents in each cluster and append the topk documents to the ranking. Require: Clustering C, k r ← () {The ranking, initially an empty list} C ← ranked list of clusters in C (using eF or SD) for all cluster c ∈ C do l ← ranked list of documents in c {process C in descending weight order} for i = 1 to k do r ← r + l[i] {append document at rank i to r} end for end for return r Algorithm 1: Cluster-based ranking To create the ranking of the clusters, we have used two measures. One is the OCF based expected F-measure (eF ) as described in [15]. The second is the sparsity density of the REP weight matrix constituting the cluster C, with the idea to identify the clusters where many or all representations have contributed and the corresponding cluster point matrix is less or not sparse. The sparsity density (SD) of the cluster matrix is computed by counting the non zero elements in the cluster matrix divided by the size of the matrix. In this approach the matrix size is equal to the number of documents in the cluster (rows) multiplied with the number of REP s (columns). The range of the density of the cluster is between 0 and 1 where values closer to 1 show higher density. We have used eF and SD to rank the clusters. The top k documents then were extracted form each cluster and merged to create the rank for the trec eval, as described in the Algorithm 1. For comparing the baseline and proposed approach we used k = 5 and k = 10. We extracted binary relevance judgements from the grades iSearch ones with a value > 1 meaning relevance. Table 1. Queries with high number of relevant documents runid IN eF SD eF SD DOC eF SD eF SD BM25 k=5 k=5 k=10 k=10 BM25 k=5 k=5 k=10 k=10 High High num q 19 19 19 19 19 19 19 19 19 19 num ret 27277933040 3040 6080 6080 2442335 3040 3040 5856 5856 num rel 1130 1130 1130 1130 1130 1130 1130 1130 1130 1130 num rel ret 1128 23 23 33 33 1120 193 193 255 255 map 0.0153 0.0072 0.0072 0.0081 0.0081 0.0976 0.0637 0.0637 0.0709 0.0709 P@5 0.0421 0.0632 0.0632 0.0632 0.0632 0.3263 0.3053 0.3053 0.3053 0.3053 P@10 0.0316 0.0474 0.0474 0.0368 0.0368 0.3 0.2895 0.2895 0.3 0.3 P@15 0.0246 0.0386 0.0386 0.0351 0.0351 0.2877 0.2702 0.2702 0.2772 0.2772 P@20 0.0263 0.0342 0.0342 0.0368 0.0368 0.2447 0.2395 0.2395 0.2447 0.2447 P@30 0.0246 0.0263 0.0263 0.0316 0.0316 0.214 0.2018 0.2018 0.2123 0.2123 Table 2. Queries with low number of relevant documents runid IN eF SD eF SD DOC eF SD eF SD BM25 k=5 k=5 k=10 k=10 BM25 k=5 k=5 k=10 k=10 Low Low num q 46 46 46 46 46 46 46 46 46 46 num ret 64606057200 7200 14400 14400 4015889 7185 7185 13953 13952 num rel 246 268 246 246 246 246 246 246 246 246 num rel ret 246 3 2 7 7 238 94 94 112 111 map 0.0027 0.0001 0.0001 0.0016 0.0016 0.0745 0.0792 0.0792 0.0805 0.0694 P@5 0 0 0 0 0 0.08 0.0844 0.0844 0.0844 0.08 P@10 0.0022 0 0 0.0022 0.0022 0.0689 0.0756 0.0756 0.0756 0.0733 P@15 0.0015 0.0015 0 0.0015 0.0015 0.0593 0.0696 0.0696 0.0681 0.0667 P@20 0.0022 0.0011 0 0.0022 0.0022 0.0544 0.0633 0.0633 0.0611 0.06 P@30 0.0015 0.0007 0 0.0015 0.0015 0.0489 0.0489 0.0489 0.0519 0.0511 Results The iSearch collection consists of queries with a high and a low number of relevant documents. It turns out the number of relevant documents for a query has an effect on the results. To document this, we provide figures for queries with a high and a low number of relevant documents as well as for all queries. The evaluation results for information need and document-based polyrepresentation are shown in Table 1 for the queries where positive relevance judgments for 20 or more documents are available (we call this High). The Table 2 holds the results for the queries where positive relevance assessment for less than 20 documents were available, we refer to this as Low. The results for all the queries combined are presented in Table 3, we refer to that as All. In the left half the tables the results for information need based polrepresentation are given and the right half holds the results for document and context-based polyrepresentation. In Table1 for information need based polyrepresentation, the proposed method performs better than the BM25 baseline (IN BM25) at prec@n, prec@10, prec@20 and so on, for both eF and SD, for k = 5 and k = 10. For the document-based polyrepresentation part the performance of eF and SD at prec@5, prec@10, prec@20 for both k = 5 and k = 10 is poorer than the baseline (DOC BM25). In Table 2 for the Low queries for k = 5 and k = 10 the values for all prec@n are very poor compared the baseline for information need polyrepresentation. In the document-based polyrepresentation part the eF and SD for both k = 5 and k = 10 show a performance improvement over the baseline at all prec@n. In Table 3 for the information need part there is no improvement, for the document-based polyrepresentation part we can see a slight improvement at prec@5, prec@10, prec@15 and prec@20 but at prec@30 the performance is bit poorer. Discussion The results show that our approach is able to improve the effec- tiveness, but there are interesting differences when it comes to queries with a low and a high number of relevant documents. For queries with high numbers of relevant documents, polyrepresentative clustering based on documents and bibliographic context does not perform well, whereas information need polyrep- resentation reacts well on clustering. It seems that with polyrepresentation of information needs relevant documents are not necessarily found in the total cog- nitive overlap and are unearthed by means of clustering. An explanation may be that some combinations of information needs representations are not benefi- cial [19] and clustering is a means to mitigate this. We see a different picture for queries with only few relevant documents. Here, document-based polyrepresen- tative clustering outperforms the polyrepresentation baseline, but information need based clustering does not. For this reason the approach delivers a rather mixed result when we consider all queries. We also observe that clustering has a negative effect on mean average precision (map) and recall, which is due to the fact that we only select the top k documents from each cluster and drop the rest. Table 3. All the queries runid IN eF SD eF SD DOC eF SD eF SD BM25 k=5 k=5 k=10 k=10 BM25 k=5 k=5 k=10 k=10 All All num q 65 65 65 65 65 65 65 65 65 65 num ret 918841610240 10240 20480 20480 5708875 10225 10225 19809 19808 num rel 1376 1376 1376 1376 1376 1376 1376 1376 1376 1376 num rel ret 1376 25 25 40 40 1317 287 287 367 366 map 0.007 0.0022 0.0022 0.0035 0.0035 0.0816 0.0746 0.0746 0.0776 0.0698 P@5 0.0187 0.0187 0.0187 0.0187 0.0187 0.1469 0.15 0.15 0.15 0.1469 P@10 0.0125 0.0141 0.0141 0.0125 0.0125 0.1375 0.1391 0.1391 0.1422 0.1406 P@15 0.0104 0.0115 0.0115 0.0115 0.0115 0.124 0.1292 0.1292 0.1302 0.1292 P@20 0.0102 0.0102 0.0102 0.0125 0.0125 0.1117 0.1156 0.1156 0.1156 0.1148 P@30 0.0094 0.0078 0.0078 0.0104 0.0104 0.1 0.0943 0.0943 0.0995 0.099 4 Conclusion In this study we have evaluated the suitability of combining principle of polyrep- resentation with document clustering. The idea is that instead of a ranked list we present the user with clustering reflecting the relevance of documents w.r.t. different representations. We report some insights on the performance of adding bibliographic information as the representation as well as looking at informa- tion need polyrepresentation. Apparently the aspect of presenting the results to the user when various representations of information need as well as informa- tion object come into play is a complex task. The initial exploration show that presenting information to the user in illustrated way has its potentials and draw- backs. We intend to test the approach for variable size k and the comparison of the cluster ranking strategies, i.e. arithmetic mean, geometric mean etc, in the future. References 1. Ingwersen, P., Järvelin, K.: The turn: integration of information seeking and re- trieval in context. Springer-Verlag New York, Inc., Secaucus, NJ, USA (2005) 2. Kelly, D., Fu, X.: Eliciting better information need descriptions from users of information search systems. Information Processing & Management 43(1) (2007) 30–46 3. Skov, M., Larsen, B., Ingwersen, P.: Inter and intra-document contexts applied in polyrepresentation for best match IR. Information Processing & Management 44(5) (2008) 1673–1683 4. Larsen, B., Ingwersen, P., Kekäläinen, J.: The polyrepresentation continuum in IR. In: Proceedings IIiX 2006, New York, NY, USA, ACM (2006) 88–96 5. Frommholz, I., Larsen, B., Piwowarski, B., Lalmas, M., Ingwersen, P., van Rijs- bergen, K.: Supporting Polyrepresentation in a Quantum-inspired Geometrical Retrieval Framework. In: Proceedings IIiX 2010, ACM (2010) 115–124 6. Frommholz, I., Abbasi, M.K.: On clustering and polyrepresentation. In: Proceed- ings ECIR2014. To appear. 7. Borgman, C., Furner, J.: Scholarly communication and bibliometrics. Annual Review of Information Science and Technology 36 (2002) 3–72 8. Garfield, E.: Citation analysis as a tool in journal evaluation. Science (New York, N.Y.) 178(4060) (November 1972) 471–9 9. Chen, C., Ibekwe-SanJuan, F., and Jianhua Hou: The structure and dynamics of cocitation clusters: A multipleperspective cocitation analysis. JASIST 61(7) (2010) 1386–1409 10. Mayr, P., Mutschke, P.: Bibliometric-enhanced retrieval models for big scholarly information systems. IEEE International Conference on Big Data (2013) 11. White, H.: Combining bibliometrics, information retrieval, and relevance theory, Part 2: Some implications for information science. JASIST 58(4) (2007) 583–605 12. Doreian, P., Batagelj, V., Ferligoj, A.: Generalized blockmodeling. (2005) 13. Mutschke, P., Mayr, P., Schaer, P., Sure, Y.: Science models as value-added services for scholarly information systems. Scientometrics 89(1) (June 2011) 349–364 14. Larsen, B., Ingwersen, P.: The Boomerang Effect : Retrieving Scientific Documents via the Network of References and Citations. (2002) 2–3 15. Fuhr, N., Lechtenfeld, M., Stein, B., Gollub, T.: The Optimum Clustering Frame- work: Implementing the Cluster Hypothesis. Information Retrieval 14 (2011) 16. Lykke, M., Larsen, B., Lund, H., Ingwersen, P.: Developing a Test Collection for the Evaluation of Integrated Search. In: Proceedings ECIR 2010. (2010) 627–630 17. Ounis, I., Amati, G., Plachouras, V., He, B., Macdonald, C., Lioma, C.: Terrier: A High Performance and Scalable Information Retrieval Platform. In: Proceedings SIGIR Workshop on Open Source Information Retrieval (OSIR 2006). (2006) 18. Wu, S.: Data Fusion in Information Retrieval. Volume 13. Springer (2012) 19. Lioma, C., Larsen, B., Ingwersen, P.: Preliminary experiments using subjective logic for the polyrepresentation of information needs. In: Proceedings IIiX 2012, ACM (2012) 174–183