From Entities to Geometry: Towards exploiting Multiple Sources to Predict Relevance Emanuele Di Buccio Mounia Lalmas Massimo Melucci Department of Information Department of Computing Department of Information Engineering Science Engineering University of Padua, Italy University of Glasgow, UK University of Padua, Italy dibuccio@dei.unipd.it mounia@acm.org melo@dei.unipd.it ABSTRACT described in terms of interaction features – display time, The goal of an Information Retrieval (IR) system is to pre- click-through data, amount of scrolling, or other features dict which information objects can help users in satisfy- e.g. [2]. These features have been adopted as sources of ev- ing their information needs, i.e. predict relevance. Differ- idence to estimate relevance, e.g. display-time in [3], click- ent sources of evidence can be exploited for this purpose. through data in [4], or a combination of several features These sources are the properties of the different entities in- in [5, 6]. Nowadays commercially available devices, e.g. mo- volved when retrieving and accessing information, where ex- bile phones, are equipped with tools that can capture infor- amples of entities include the information objects, the task, mation about the user location and from the surrounding the user, or the location. The main hypothesis of this pa- environment, besides having access to all the information per is that, to exploit the variety of entities and sources, provided by the web or the user personal data. it is necessary to model the relationships existing between The various sources may not have the same impact in the entities and those existing between the properties of the predicting relevance, and as such their relative contributions entities. Such relationships are themselves possible sources should be investigated. For instance ranking algorithms that that can be used to predict relevance. This paper proposes are based on different object representations will usually re- a methodology that supports the design of an IR system turn sets of relevant information objects with little over- able to model in a uniform way the properties of the enti- lap [8]. It is therefore important, as stated in [8], to “ex- ties involved, the properties of their relationships and the plicitly describe and combine multiple sources of evidence relationships between the different properties. The method- about relevance” when developing ranking algorithms. More ology is structured in four steps, aiming, respectively, at sup- precisely, it is important to explicitly consider the relation- porting the selection of the sources, collecting the evidence, ships existing between sources. However, the design and modeling the sources and their relationships, and using the the implementation of distinct ranking algorithms, one for latter two to predict relevance. Sources and relationships each type of sources, may not allow for considering relation- are modeled and then exploited through a previously pro- ships between sources. It is thus important to investigate posed geometric framework, which provides a uniform and approaches that combine evidences rather than approaches concrete representation in terms of vector subspaces. that combine ranking algorithms. This would allow for the relationships between sources to be explicitly integrated in the ranking algorithm. 1. INTRODUCTION This paper proposes a methodology that supports the de- The goal of an IR system is to predict which information sign of an IR system able to model in a uniform way the objects can help users in satisfying their information needs. properties of the entities involved, the properties of their re- For instance, if the information need is expressed by the user lationships and the relationships between the different prop- as a textual query, the IR system has to predict which doc- erties. The methodology is structured in four steps, aiming, uments are relevant to the formulated query. According to respectively, at supporting the selection of the sources, col- this interpretation, IR can be framed as a problem of evi- lecting the evidence, modeling the sources and their relation- dence and prediction [1]. The prediction can be performed ships, and using the latter two to predict relevance. The last through the different sources of evidence involved in the re- two steps are based on the geometric framework proposed trieval process. Content, meta-data and annotations of the in [9], which provides a uniform and concrete representa- information objects are examples of such sources, and have tion of the sources and their relationships in terms of vector been used by many retrieval systems. subspaces. These sources have been shown to be effective to predict The methodology aims at being general, in the sense that relevance, but other sources exist. An example is the be- it is not related to a specific source or set of sources. How- havior of the user during the search process, for instance ever, for illustration purpose, two sources will be considered in this paper, namely, the content of the information objects to be ranked and the behavior of the users when accessing or retrieving information. The former has been selected be- cause past research in IR provides a number of representa- tions of the content that have been shown to lead to effective Appears in the Proceedings of the 1st Italian Information Retrieval retrieval [8]. The latter has been extensively investigated in Workshop (IIR’10), January 27–28, 2010, Padova, Italy. http://ims.dei.unipd.it/websites/iir10/index.html Copyright owned by the authors. Information Science (IS) and has in the last decade become a the properties to describe the relationship between the en- subject of investigation in IR. Indeed, experimental evalua- tity user and the entity result; such property constitutes tion has shown how usage data stored in transaction logs [3, a source that can be exploited to predict relevance. In- 4, 6, 10] or so-called interactive IR systems [11, 12] can effec- deed, research in Interactive IR has shown that a retrieval tively predict relevance. The use of the Entity-Relationship system can benefit from evidence gathered from the infor- database model for describing IR objects was introduced mation seeking activities of a user. For example, Implicit in [13] for automatic hypertext construction purpose – this Relevance Feedback (IRF) algorithms [10] exploit the infor- paper enlarges that view and connect the entities and re- mation gathered from the interactions between the user and lationship at the conceptual level to a mathematical model the documents to recommend query expansion terms or to which provides a language at the logical level. re-rank documents. Even the concept of relevance can be defined as “a relation between a document and a person, relative to a given information need” [1], the document and 2. MOTIVATIONS AND METHODOLOGY the person being two entities. RATIONALE The set of entities and relationships, and their properties, IR systems can exploit the evidence provided by differ- are neither fixed nor unique, as they depend on the specific ent sources to improve retrieval effectiveness. In [8] the retrieval application – e.g. the entity location is crucial for author considers several document representations and dis- search carried out on a mobile phone or to customize search cusses approaches to combine the contribution provided by results according to the country where the search originates. each representation. In [14] the inference network framework Therefore, the selection of the sources is an important issue is adopted to combine link-based evidence with content- that needs to be addressed. based evidence for web retrieval. Evidence on the structure Once the appropriate sources have been identified, each of the documents can be incorporated, for instance, using of them has to be modeled, so that to be exploited for re- the Dempster-Shafer theory of evidence [15]. However, the trieval. In this work, we refer to the model of a source as a different document representations are only a subset of the dimension. A first step to obtain a dimension is to identify available sources. a set of features that describe it. Feature here refers to the Let us consider, for instance, the scenario where a user information obtained by the observation of a property of an is looking for information about restaurants in London. If entity or a relationship. For an entity “location” described Venice is the location where the search is performed, this by the dimension “GPS position”, the features are the GPS probably suggests that the user is planning a trip in Lon- position components. For a “web result” entity, the keywords don, and restaurants in an arbitrary London area may be in the title, the snippet or the URL of the result are example of interest. If the search is performed on a mobile phone of features. Since the features constitute the evidence that and the GPS position indicates that the user is in London, model a source, a procedure to select and collect features has probably the user is more interested in restaurants near his to be designed and implemented. current position. We can see that in this scenario, other The description (model) of the sources is what get used units besides the information objects are involved. In this to predict relevance. In this work the framework adopted paper, we refer to units as entities. For instance, in our sce- to build the description is the vector subspace formalism nario, the entities involved are the user, the location, the proposed in [9]. The basic rationale for this is that we want task the user is performing when looking for information – to map the collected data, prepared in a matrix, in a new i.e. “travel in London” – and the specific topic within the vector space basis – the vector subspace spanned by the basis task1 – i.e. “finding restaurants in London”. is the model of the source. Each entity is characterized by a number of properties. Once a representation in terms of subspaces has been built When the entity is an “information object”, examples of both for the sources and the information objects, a trace- properties include content, meta-data and annotation. For based function, the one exploited in [9], can be adopted the entity “location”, instances of properties are the GPS to rank information objects by exploiting the information position or the IP address. about the different sources of evidence that have been mod- Each entity exists independently of the properties we can eled. In other words the trace-based function, which we observe about it, but the observed properties are the evi- briefly describe in Section 4, is a tool to handle the predic- dence that can be used to build a model of the entity, that tion problem. is to obtain a description of the entity – in this work a math- In summary, four steps have been identified, and each of ematical description – that can be used to predict relevance. them needs to be addressed to be able to predict relevance In other words, the properties of the entities are the sources using multiple sources of evidence, namely, sources selec- of evidence that can be exploited to help predicting the rele- tion, features collection, source modeling and relevance pre- vance of information objects. diction. Figure 1 illustrates these four steps for the relation- Not only the properties of the entities are sources of ev- ship between the entities “user” and “information objects”; idence, but also the relationships between entities (if any) here, the relationship is characterized by the source “user can provide additional evidence to predict relevance. Let behavior” described in terms of “interaction features”. us consider a list of results returned by an IR system in re- In this paper we will focus on two of the above steps, sponse to a query and the user who formulated the query. specifically evidence collection and source modelling, which The behavior of the user when examining a result is one of will be discussed respectively in Section 3 and Section 4; some remarks on the implementation of these methodology 1 We take the definition of task and topic from [2]: “Task steps and their evaluation are reported in Section 5. was defined for this study as the goal of information-seeking behavior, and topic was defined as the specific subject within a task.” Source Selection Feature Selection and Collection Source Modeling Prediction Selected Entities y (b) Matrix Transformation, e.g. PCA - User u11 ...... u1k (c) Selection of the eigenvectors - Documents Selected Source Raw data - User Behavior (a) (b) (c) yB un1 ...... unk L(B) Collection of f11 ...... f1k 5 Interaction Features Logical view of data 3 4 1 2 e.g. eigenvectors L(B) fn1 ...... fnk computed by PCA y: vector representation of the information object - k interaction features User behavior source yB: projection of the vector y - n visited documents modeled as subspace on the subspace L(B) Figure 1: Methodology steps and specific application to the user interaction behavior. 3. EVIDENCE COLLECTION Individual users and user groups, does not necessarily Let us return to the scenario of a user looking for infor- need to be considered as mutually exclusive sources for in- mation about restaurants in London. Let us suppose the teraction features. For instance, in [5] user behavior models user, to satisfy his information need, interacts with a search to predict user preferences for web ranking are learned by engine and submits the query “restaurants in London”. The exploiting simultaneously feature values derived from the in- search engine returns a ranked list of results. For simplic- dividual’s behavior and those aggregated across all the users ity, we focus on two entities only, namely, the user and the and search session for each query-URL pair. result. When examining the returned results, the user inter- The selection of the features of a source to then be gath- acts with them and with the information objects the results ered affects the modeling step, since they constitute the ev- refer to. In this scenario the behavior of the user when ex- idence used to build a model of the source. However, the amining and (eventually) accessing the results can be con- procedure to collect features is part of the design of the sidered as a property to describe the involved entities and, IR system, in particular, the components aimed at gather- particularly, as a source to assist relevance prediction. In ing the selected features and managing them. For instance, the above scenario another source available is the content of when interaction features have been selected as implicit in- the abstracts (title, snippet and URL) of the results and the dicators, a browser extension can be used to monitor the content of the corresponding information objects. gathering of such features. This is the approach adopted in Once the sources have been selected, the next step is to the Lemur Query Log Project2 , a study to gather the query collect the evidence to build the model of these sources. This logs from users of the Lemur Query Log Toolbar34 . It should step consists of selecting the features to be gathered to build be noted that the development of an extension that stores a model of these sources, and then the actual collection of the usage data on the client side may encourage the user to the selected features. adopt this monitoring tool since no personal data need to In the event of the source “user behavior” a possible choice, be provided to the server. as depicted in step two of Figure 1, is the adoption of so- called interaction features. This is for instance the approach 4. SOURCE MODELING AND PREDICTION adopted in [5, 6] where several interaction features are ex- Once the evidence has been gathered, the next step con- ploited simultaneously. In particular, in [6] a subset of the sists of modeling the evidence so that it can be used to pre- features gathered in the user study described in [2] was ex- dict relevance. In this work the mathematical construct of ploited to obtain a vector subspace representation of the user the vector subspace is used for this purpose. behavior. When using a representation personalized for each In this paper, the evidence gathered by the different sources user and tailored on the specific search task to re-rank the is exploited to rank information objects with respect to a documents, the keywords extracted from the top re-ranked given query. This is done by using the different representa- documents were shown to be effective as source for query ex- tions of the objects generated from the sources. For instance, pansion. The methodology proposed in that work assumed if the user “interaction behavior” is a considered source, an that the interaction features were available for all the doc- information object can be described in terms of the interac- uments to be re-ranked. But this assumption does not hold tion features monitored when a user is visiting the object — in our considered scenario, unless the documents have been e.g. an object being displayed for 30 seconds, clicked 3 times already visited with regard to past queries when performing and on which 5 scrolling actions have been performed, can the same task. Therefore, the availability of the interaction be represented as the vector y = (30, 3, 5). The same ob- features is an issue to address. A possible solution is not to ject, if the source “content” is considered, can be described consider the features with regard to a single user, but with as the vector of the TF·IDF weights of the terms appearing regard to a group of users, e.g. performing the same task. in it. The construct of the vector space basis is particularly Another reason to exploit group interaction data is the suitable to model these multiple representations. Indeed, reliability of the interaction features. The features need to be intuitively, the same information object can be represented reliable indicators of the user needs, interests or intents. To with regard to different sources in the same way the same clarify what we mean by “reliable feature”, let us consider the vector can be generated by different vector space basis. display-time: this feature, when considered in isolation and A second reason to adopt the construct of the vector space referring to a single user, is subject to variations. Exploiting basis is that some of the vector subspace representations this feature when predicting relevance may be difficult [3], thus making it not reliable. But in [3] the authors found 2 http://lemurstudy.cs.umass.edu/ that display-time, when used as implicit measure, is more 3 http://www.lemurproject.org/querylogtoolbar/ consistent when referring to multiple subjects performing 4 The goal of the study is to create a database of web search the same task, than when personalized to each user. activities that will be provided to the information retrieval research community. may reveal the logical structure underlying the collected ev- work, we consider that the relationships are themselves sources. idence. The collected data, prepared in a matrix, is a vector Therefore, it is better to not consider distinct mappings, one representation of the source. This data often may be noisy. for each source, but to compute a single vector space basis A matrix transformation, namely a change of basis, can be to represent the relationships between sources. applied to map the original view of the data to one that is The model of the sources can be used in the retrieval pro- less noisy. Let us consider the re-evaluation of the Vector cess once the information objects have been represented by Space Model (VSM) proposed in [16]. The authors point out the features selected to describe the sources. Indeed, the how some assumptions underlying the traditional VSM [17] measure of the degree to which the modeled source occurs – e.g. that the terms are orthogonal – may suggest that in an information object can be computed as the distance the vector was interpreted as a data structure and not as a between the vector representation of the information object, logical construct. Subsequent developments show how the which corresponds to a one-dimensional subspace, and the vector can be used as a logical construct able to capture de- subspace modeling the source(s) spanned by the vector space pendencies between terms and between documents [16, 18]. basis computed in the source modeling step. This motivates The “latent semantics” [18] of the terms in the documents, the function proposed in [9], where the author showed how that is the dependencies between terms, was used as a source such function can be interpreted as a trace-based function for implementing a Pseudo Relevance Feedback algorithm [9] and that the measure is a probability measure. The idea of and an Explicit Relevance Feedback algorithm [19] based on using trace in IR, and in particular the density operators, the geometric framework adopted in this work. was originally introduced in [21], and one of its important To explain the role of the matrix transformation tech- consequence – subsequently exploited in [9] – was to “es- niques in the modeling step, we use the example of informa- tablish a link between geometry and probability in vector tion behavior as a source, where the latter is described in spaces” [21]. terms of interaction features. A matrix A can be prepared where the element (i, j) is feature j observed during the visit 5. IMPLEMENTATION AND EVALUATION of object i, e.g. a display-time of 30 seconds. The matrix The specific implementation we are investigating concern A, as mentioned above, can be a noisy vector-based repre- the two mentioned sources, that is, the user behavior and sentation of the observed data. A matrix transformation the latent semantic of the terms in the information objects. technique such as Principal Component Analysis (PCA) of With respect to user behavior, we are focusing on two AT A can be used to compute a new vector space basis – issues. The first is the selection of the source for interac- this is actually the approach proposed in [6]. PCA provides tion features since, as discussed in Section 3, both individ- a set of eigenvectors and a subset of them can be used to ual and user groups interaction data can be exploited to obtain the user interaction behavior dimension – the model prepare the matrix A and to build the source model. In par- of the source is the subspace spanned by the eigenvectors. ticular, we are investigating the difference between the two As suggested by this example, this geometric framework al- contributions in terms of retrieval effectiveness when PCA lows us to achieve one of our goals, which is to generate a is adopted as the matrix transformation technique. PCA representation of the properties of the relationships between allows handing dimensionality reduction and capturing the entities – in the example mentioned above the user behavior relationships among the features in an unsupervised manner. was the property to be modeled. However, as stated in [6], the problem is that the eigenvector The two mentioned approaches, that is the one adopted whose components best combine the interaction features, is in [9, 19] and that adopted in [6], provide a solution for not necessarily the first principal eigenvector, and the best two distinct sources. In the former case the modeled source performance are achieved when the eigenvector is manually is a property of an entity, namely the latent semantics of selected. For this reason we are investigating other unsuper- the terms in the documents. In the latter case, the mod- vised methods to obtain a vector subspace representation of eled source is a property of a relationships between entities, the interaction data. namely the user interaction behavior. However, we are also With respect to the latent semantics of terms, one issue interested in modeling relationships (if any) existing between under investigation is the selection of the terms in the feed- the properties of the entities, namely between sources, e.g. back documents. Indeed, if the terms appearing in these between the latent semantics of the terms and the user inter- documents are adopted as evidence to build a source model, action behavior – this is different from modeling properties one issue, particularly when real-time feedback is required, of relationships, e.g. the user interaction behavior. is to handle matrices whose dimensions are the number of Let us return to the scenario of a user looking for infor- distinct terms in the feedback documents. In this case a mation about restaurants in London and suppose the term possible solution is the selection of a subset of the terms, “jazz” appears in the abstract of one of the displayed re- e.g. the top weighted ones. However, this strategy has been sults. The user when examining the result may realize that shown to not be effective [19]; therefore, we are investigating he is more interested in jazz restaurants than in general ones. selection criteria for “good terms”. This example also emphasizes how different sources are not Since the main objective of the methodology is to model necessarily independent from each other. Indeed, the fea- relationships, we will look into the relationships between tures observed for a source (e.g. the user behavior) can be sources, and investigate their implementation using the pro- “entangled” with the features observed for another source posed geometric framework, and their impact on retrieval (e.g. the particular meaning of a query feature in the se- effectiveness. Two approaches are possible. The first ap- lected results). proach is to rank information objects separately according The design of one approach per source may not be able to different dimensions and then combine the rankings into to model relationships that may occur between sources and one. The second approach is to model all the sources as a consequently to exploit them, as reported in [20]. In this unique vector subspace and then rank the information ob- jects against such subspace. The latter approach has the Technology: Research and Development, 1(1):1–21, advantage of exploiting all the dimensions simultaneously, 1982. thus avoiding any loss of information that may arise from not [2] D. Kelly. Understanding implicit feedback and considering relationships between sources (which is the case document preference: a naturalistic user study. PhD thesis, New Brunswick, NJ, USA, 2004. with the first approach). In particular, as for the user be- [3] R. W. White and D. Kelly. A study on the effects of havior source, we are investigating unsupervised approaches personalization and task information on implicit to model relationships among sources. feedback performance. In Proceedings of CIKM’06, Evaluation is crucial to validate the implementation of pages 297–306, New York, NY, USA, 2006. ACM. the methodology. The main problem is the availability of [4] T. Joachims. Optimizing search engines using datasets where information about user interaction behavior, clickthrough data. In Proceedings of KDD ’02, pages the content of results and information objects are available. 133–142, New York, NY, USA, 2002. ACM. Transaction logs [7] can provide this data, but no explicit [5] E. Agichtein, E. Brill, S. Dumais, and R. Ragno. Learning user interaction models for predicting web relevance judgments are available to validate the effective- search result preferences. In Proceedings of SIGIR ’06, ness of the approaches under investigation; existing datasets pages 3–10, New York, NY, USA, 2006. ACM. with this information are not publicly available. [6] M. Melucci and R.W. White. Utilizing a geometry of context for enhanced implicit feedback. In Proceedings 6. CONCLUDING REMARKS of CIKM’07, pages 273–282, Lisbon, Portugal, 2007. [7] B. Jansen. Search log analysis: What it is, what’s The purpose of this work was the introduction of a method- been done, how to do it. Library & Information ology that aims at exploiting evidence coming from multiple Science Research, 28(3):407–432, 2006. sources to predict the relevance of information objects for [8] W.B. Croft. Combining approaches to information given queries. Four methodological steps are required to retrieval. Advances in information retrieval, 7:1–36, achieve this goal, namely, sources selection, features collec- 2000. tion, dimension modeling and relevance prediction. The ge- [9] M. Melucci. A basis for information retrieval in context. ACM TOIS, 26(3):1–41, 2008. ometric framework proposed in [9] was chosen to implement [10] D. Kelly and J. Teevan. Implicit feedback for inferring the last two steps because it provides a uniform model for user preference: a bibliography. SIGIR Forum, the sources, which can be used by to rank objects according 37(2):18–28, 2003. to their estimated relevance. [11] R.W. White, J.M. Jose, and I. Ruthven. An implicit Moreover, we discussed some issues to be addressed when feedback approach for interactive information implementing the methodology for two specific sources, that retrieval. IP&M, 42(1):166–190, 2006. is the user interaction behavior and the latent semantic of [12] N. Fuhr. A probability ranking principle for the terms in the information objects. The issues specifically interactive information retrieval. Information Retrieval, 11(3):251–265, 2008. concern the evidence collection and source modeling steps. [13] M. Agosti, and F. Crestani A methodology for the In future work we want to further investigate the concepts automatic construction of a hypertext for information adopted in this paper, namely, entity, relationship, dimen- retrieval Proc. of ACM SAC, 745–753, Indianapolis, sion and feature. We chose these concepts as they relate to Indiana, United States, 1993. the view of the world to be modeled – in our case in order [14] T. Tsikrika and M. Lalmas. Combining evidence for to predict relevance – which consists of entities and rela- web retrieval using the inference network model: an tionships, where the entities exists independently of their experimental study. IP&M, 40(5):751–772, 2004. properties. The properties, namely the sources, are the in- [15] M. Lalmas and I. Ruthven. Representing and retrieving structured documents using the formation that can be obtained by the observation of entities Dempster-Shafer theory of evidence: modelling and and relationships between them. This is the same view of the evaluation. Journal of Documentation, 54:529–565, world adopted in the Entity-Relationship (ER) model [22], 1998. the most widely used data model for the conceptual design [16] S. K. M. Wong and V. V. Raghavan. Vector space of databases. In the ER model the result of the observation model of information retrieval: a reevaluation. In is a value and the mapping from the entities set (or the rela- Proc. of SIGIR ’84, pages 167–185, Swinton, UK, tionship set) to the value set is named attribute. The notion 1984. British Computer Society. of feature adopted in this work can be compared to the ER [17] G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Communications of the notion of value set. Moreover the notion of dimension can ACM, 18(11):613–620, 1975. be compared to the notion of attribute, since both refers to [18] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. properties of entities and relationships. Landauer, and R. Harshman. Indexing by latent The above discussion suggests investigate the relation- semantic analysis. JASIS, 41:391–407, 1990. ships among the ER model, the geometric framework pro- [19] E. Di Buccio and M. Melucci. University of Padua at posed in [9] and the methodology proposed in this paper. TREC 2009: Relevance Feedback Track. In Proc. of TREC 2009, Washington, DC, USA, 2009. To Appear. Acknowledgements This research is partly funded by a [20] E. Di Buccio and M. Melucci. Towards a Methodology for Contextual Information Retrieval. In Proc. of Royal Society International Joint Project (2008/R4). Mou- CIRSE 2009, Tolouse, France, 2009. nia Lalmas is currently funded by Microsoft Research/Royal [21] C.J. van Rijsbergen. The Geometry of Information Academy of Engineering. Retrieval. Cambridge University Press, New York, NY, USA, 2004. 7. REFERENCES [22] P.P. Chen. The entity-relationship model—toward a unified view of data. ACM TODS, 1(1):9–36, 1976. [1] S. E. Robertson, M. E. Maron, and W. S. Cooper. Probability of relevance: A unification of two competing models for document retrieval. Information