From Entities to Geometry: Towards exploiting Multiple
                  Sources to Predict Relevance

               Emanuele Di Buccio                                Mounia Lalmas                    Massimo Melucci
             Department of Information                      Department of Computing            Department of Information
                   Engineering                                       Science                         Engineering
             University of Padua, Italy                     University of Glasgow, UK          University of Padua, Italy
              dibuccio@dei.unipd.it                            mounia@acm.org                     melo@dei.unipd.it

ABSTRACT                                                                   described in terms of interaction features – display time,
The goal of an Information Retrieval (IR) system is to pre-                click-through data, amount of scrolling, or other features
dict which information objects can help users in satisfy-                  e.g. [2]. These features have been adopted as sources of ev-
ing their information needs, i.e. predict relevance. Differ-               idence to estimate relevance, e.g. display-time in [3], click-
ent sources of evidence can be exploited for this purpose.                 through data in [4], or a combination of several features
These sources are the properties of the different entities in-             in [5, 6]. Nowadays commercially available devices, e.g. mo-
volved when retrieving and accessing information, where ex-                bile phones, are equipped with tools that can capture infor-
amples of entities include the information objects, the task,              mation about the user location and from the surrounding
the user, or the location. The main hypothesis of this pa-                 environment, besides having access to all the information
per is that, to exploit the variety of entities and sources,               provided by the web or the user personal data.
it is necessary to model the relationships existing between                   The various sources may not have the same impact in
the entities and those existing between the properties of the              predicting relevance, and as such their relative contributions
entities. Such relationships are themselves possible sources               should be investigated. For instance ranking algorithms that
that can be used to predict relevance. This paper proposes                 are based on different object representations will usually re-
a methodology that supports the design of an IR system                     turn sets of relevant information objects with little over-
able to model in a uniform way the properties of the enti-                 lap [8]. It is therefore important, as stated in [8], to “ex-
ties involved, the properties of their relationships and the               plicitly describe and combine multiple sources of evidence
relationships between the different properties. The method-                about relevance” when developing ranking algorithms. More
ology is structured in four steps, aiming, respectively, at sup-           precisely, it is important to explicitly consider the relation-
porting the selection of the sources, collecting the evidence,             ships existing between sources. However, the design and
modeling the sources and their relationships, and using the                the implementation of distinct ranking algorithms, one for
latter two to predict relevance. Sources and relationships                 each type of sources, may not allow for considering relation-
are modeled and then exploited through a previously pro-                   ships between sources. It is thus important to investigate
posed geometric framework, which provides a uniform and                    approaches that combine evidences rather than approaches
concrete representation in terms of vector subspaces.                      that combine ranking algorithms. This would allow for the
                                                                           relationships between sources to be explicitly integrated in
                                                                           the ranking algorithm.
1.   INTRODUCTION                                                             This paper proposes a methodology that supports the de-
   The goal of an IR system is to predict which information                sign of an IR system able to model in a uniform way the
objects can help users in satisfying their information needs.              properties of the entities involved, the properties of their re-
For instance, if the information need is expressed by the user             lationships and the relationships between the different prop-
as a textual query, the IR system has to predict which doc-                erties. The methodology is structured in four steps, aiming,
uments are relevant to the formulated query. According to                  respectively, at supporting the selection of the sources, col-
this interpretation, IR can be framed as a problem of evi-                 lecting the evidence, modeling the sources and their relation-
dence and prediction [1]. The prediction can be performed                  ships, and using the latter two to predict relevance. The last
through the different sources of evidence involved in the re-              two steps are based on the geometric framework proposed
trieval process. Content, meta-data and annotations of the                 in [9], which provides a uniform and concrete representa-
information objects are examples of such sources, and have                 tion of the sources and their relationships in terms of vector
been used by many retrieval systems.                                       subspaces.
   These sources have been shown to be effective to predict                   The methodology aims at being general, in the sense that
relevance, but other sources exist. An example is the be-                  it is not related to a specific source or set of sources. How-
havior of the user during the search process, for instance                 ever, for illustration purpose, two sources will be considered
                                                                           in this paper, namely, the content of the information objects
                                                                           to be ranked and the behavior of the users when accessing
                                                                           or retrieving information. The former has been selected be-
                                                                           cause past research in IR provides a number of representa-
                                                                           tions of the content that have been shown to lead to effective
Appears in the Proceedings of the 1st Italian Information Retrieval
                                                                           retrieval [8]. The latter has been extensively investigated in
Workshop (IIR’10), January 27–28, 2010, Padova, Italy.
http://ims.dei.unipd.it/websites/iir10/index.html
Copyright owned by the authors.
Information Science (IS) and has in the last decade become a        the properties to describe the relationship between the en-
subject of investigation in IR. Indeed, experimental evalua-        tity user and the entity result; such property constitutes
tion has shown how usage data stored in transaction logs [3,        a source that can be exploited to predict relevance. In-
4, 6, 10] or so-called interactive IR systems [11, 12] can effec-   deed, research in Interactive IR has shown that a retrieval
tively predict relevance. The use of the Entity-Relationship        system can benefit from evidence gathered from the infor-
database model for describing IR objects was introduced             mation seeking activities of a user. For example, Implicit
in [13] for automatic hypertext construction purpose – this         Relevance Feedback (IRF) algorithms [10] exploit the infor-
paper enlarges that view and connect the entities and re-           mation gathered from the interactions between the user and
lationship at the conceptual level to a mathematical model          the documents to recommend query expansion terms or to
which provides a language at the logical level.                     re-rank documents. Even the concept of relevance can be
                                                                    defined as “a relation between a document and a person,
                                                                    relative to a given information need” [1], the document and
2.   MOTIVATIONS AND METHODOLOGY                                    the person being two entities.
     RATIONALE                                                         The set of entities and relationships, and their properties,
   IR systems can exploit the evidence provided by differ-          are neither fixed nor unique, as they depend on the specific
ent sources to improve retrieval effectiveness. In [8] the          retrieval application – e.g. the entity location is crucial for
author considers several document representations and dis-          search carried out on a mobile phone or to customize search
cusses approaches to combine the contribution provided by           results according to the country where the search originates.
each representation. In [14] the inference network framework        Therefore, the selection of the sources is an important issue
is adopted to combine link-based evidence with content-             that needs to be addressed.
based evidence for web retrieval. Evidence on the structure            Once the appropriate sources have been identified, each
of the documents can be incorporated, for instance, using           of them has to be modeled, so that to be exploited for re-
the Dempster-Shafer theory of evidence [15]. However, the           trieval. In this work, we refer to the model of a source as a
different document representations are only a subset of the         dimension. A first step to obtain a dimension is to identify
available sources.                                                  a set of features that describe it. Feature here refers to the
   Let us consider, for instance, the scenario where a user         information obtained by the observation of a property of an
is looking for information about restaurants in London. If          entity or a relationship. For an entity “location” described
Venice is the location where the search is performed, this          by the dimension “GPS position”, the features are the GPS
probably suggests that the user is planning a trip in Lon-          position components. For a “web result” entity, the keywords
don, and restaurants in an arbitrary London area may be             in the title, the snippet or the URL of the result are example
of interest. If the search is performed on a mobile phone           of features. Since the features constitute the evidence that
and the GPS position indicates that the user is in London,          model a source, a procedure to select and collect features has
probably the user is more interested in restaurants near his        to be designed and implemented.
current position. We can see that in this scenario, other              The description (model) of the sources is what get used
units besides the information objects are involved. In this         to predict relevance. In this work the framework adopted
paper, we refer to units as entities. For instance, in our sce-     to build the description is the vector subspace formalism
nario, the entities involved are the user, the location, the        proposed in [9]. The basic rationale for this is that we want
task the user is performing when looking for information –          to map the collected data, prepared in a matrix, in a new
i.e. “travel in London” – and the specific topic within the         vector space basis – the vector subspace spanned by the basis
task1 – i.e. “finding restaurants in London”.                       is the model of the source.
   Each entity is characterized by a number of properties.             Once a representation in terms of subspaces has been built
When the entity is an “information object”, examples of             both for the sources and the information objects, a trace-
properties include content, meta-data and annotation. For           based function, the one exploited in [9], can be adopted
the entity “location”, instances of properties are the GPS          to rank information objects by exploiting the information
position or the IP address.                                         about the different sources of evidence that have been mod-
   Each entity exists independently of the properties we can        eled. In other words the trace-based function, which we
observe about it, but the observed properties are the evi-          briefly describe in Section 4, is a tool to handle the predic-
dence that can be used to build a model of the entity, that         tion problem.
is to obtain a description of the entity – in this work a math-        In summary, four steps have been identified, and each of
ematical description – that can be used to predict relevance.       them needs to be addressed to be able to predict relevance
In other words, the properties of the entities are the sources      using multiple sources of evidence, namely, sources selec-
of evidence that can be exploited to help predicting the rele-      tion, features collection, source modeling and relevance pre-
vance of information objects.                                       diction. Figure 1 illustrates these four steps for the relation-
   Not only the properties of the entities are sources of ev-       ship between the entities “user” and “information objects”;
idence, but also the relationships between entities (if any)        here, the relationship is characterized by the source “user
can provide additional evidence to predict relevance. Let           behavior” described in terms of “interaction features”.
us consider a list of results returned by an IR system in re-          In this paper we will focus on two of the above steps,
sponse to a query and the user who formulated the query.            specifically evidence collection and source modelling, which
The behavior of the user when examining a result is one of          will be discussed respectively in Section 3 and Section 4;
                                                                    some remarks on the implementation of these methodology
1
  We take the definition of task and topic from [2]: “Task          steps and their evaluation are reported in Section 5.
was defined for this study as the goal of information-seeking
behavior, and topic was defined as the specific subject within
a task.”
                                Source Selection                  Feature Selection and Collection                                Source Modeling                                       Prediction

                              Selected Entities                                                                                                                                              y
                                                                                                                                      (b) Matrix Transformation, e.g. PCA
                              - User
                                                                                                                 u11 ...... u1k       (c) Selection of the eigenvectors
                              - Documents
                              Selected Source                                        Raw data
                              - User Behavior                             (a)                         (b)                           (c)                                                     yB
                                                                                                                 un1 ...... unk                                                  L(B)
                                                       Collection of                 f11 ...... f1k                                                                                                              5
                                                   Interaction Features                                     Logical view of data                                                                         3
                                                                                                                                                                                                             4
                                                                                                                                                                                                     1 2
                                                                                                              e.g. eigenvectors
                                                                                                                                             L(B)
                                                                                     fn1 ...... fnk           computed by PCA                                               y:  vector representation of
                                                                                                                                                                                the information object
                                                                                - k interaction features                                  User behavior source              yB: projection of the vector y
                                                                                - n visited documents                                     modeled as subspace                   on the subspace L(B)


                    Figure 1: Methodology steps and specific application to the user interaction behavior.


3.   EVIDENCE COLLECTION                                                                                                      Individual users and user groups, does not necessarily
   Let us return to the scenario of a user looking for infor-                                                               need to be considered as mutually exclusive sources for in-
mation about restaurants in London. Let us suppose the                                                                      teraction features. For instance, in [5] user behavior models
user, to satisfy his information need, interacts with a search                                                              to predict user preferences for web ranking are learned by
engine and submits the query “restaurants in London”. The                                                                   exploiting simultaneously feature values derived from the in-
search engine returns a ranked list of results. For simplic-                                                                dividual’s behavior and those aggregated across all the users
ity, we focus on two entities only, namely, the user and the                                                                and search session for each query-URL pair.
result. When examining the returned results, the user inter-                                                                  The selection of the features of a source to then be gath-
acts with them and with the information objects the results                                                                 ered affects the modeling step, since they constitute the ev-
refer to. In this scenario the behavior of the user when ex-                                                                idence used to build a model of the source. However, the
amining and (eventually) accessing the results can be con-                                                                  procedure to collect features is part of the design of the
sidered as a property to describe the involved entities and,                                                                IR system, in particular, the components aimed at gather-
particularly, as a source to assist relevance prediction. In                                                                ing the selected features and managing them. For instance,
the above scenario another source available is the content of                                                               when interaction features have been selected as implicit in-
the abstracts (title, snippet and URL) of the results and the                                                               dicators, a browser extension can be used to monitor the
content of the corresponding information objects.                                                                           gathering of such features. This is the approach adopted in
   Once the sources have been selected, the next step is to                                                                 the Lemur Query Log Project2 , a study to gather the query
collect the evidence to build the model of these sources. This                                                              logs from users of the Lemur Query Log Toolbar34 . It should
step consists of selecting the features to be gathered to build                                                             be noted that the development of an extension that stores
a model of these sources, and then the actual collection of                                                                 the usage data on the client side may encourage the user to
the selected features.                                                                                                      adopt this monitoring tool since no personal data need to
   In the event of the source “user behavior” a possible choice,                                                            be provided to the server.
as depicted in step two of Figure 1, is the adoption of so-
called interaction features. This is for instance the approach                                                              4.             SOURCE MODELING AND PREDICTION
adopted in [5, 6] where several interaction features are ex-
                                                                                                                               Once the evidence has been gathered, the next step con-
ploited simultaneously. In particular, in [6] a subset of the
                                                                                                                            sists of modeling the evidence so that it can be used to pre-
features gathered in the user study described in [2] was ex-
                                                                                                                            dict relevance. In this work the mathematical construct of
ploited to obtain a vector subspace representation of the user
                                                                                                                            the vector subspace is used for this purpose.
behavior. When using a representation personalized for each
                                                                                                                               In this paper, the evidence gathered by the different sources
user and tailored on the specific search task to re-rank the
                                                                                                                            is exploited to rank information objects with respect to a
documents, the keywords extracted from the top re-ranked
                                                                                                                            given query. This is done by using the different representa-
documents were shown to be effective as source for query ex-
                                                                                                                            tions of the objects generated from the sources. For instance,
pansion. The methodology proposed in that work assumed
                                                                                                                            if the user “interaction behavior” is a considered source, an
that the interaction features were available for all the doc-
                                                                                                                            information object can be described in terms of the interac-
uments to be re-ranked. But this assumption does not hold
                                                                                                                            tion features monitored when a user is visiting the object —
in our considered scenario, unless the documents have been
                                                                                                                            e.g. an object being displayed for 30 seconds, clicked 3 times
already visited with regard to past queries when performing
                                                                                                                            and on which 5 scrolling actions have been performed, can
the same task. Therefore, the availability of the interaction
                                                                                                                            be represented as the vector y = (30, 3, 5). The same ob-
features is an issue to address. A possible solution is not to
                                                                                                                            ject, if the source “content” is considered, can be described
consider the features with regard to a single user, but with
                                                                                                                            as the vector of the TF·IDF weights of the terms appearing
regard to a group of users, e.g. performing the same task.
                                                                                                                            in it. The construct of the vector space basis is particularly
   Another reason to exploit group interaction data is the
                                                                                                                            suitable to model these multiple representations. Indeed,
reliability of the interaction features. The features need to be
                                                                                                                            intuitively, the same information object can be represented
reliable indicators of the user needs, interests or intents. To
                                                                                                                            with regard to different sources in the same way the same
clarify what we mean by “reliable feature”, let us consider the
                                                                                                                            vector can be generated by different vector space basis.
display-time: this feature, when considered in isolation and
                                                                                                                               A second reason to adopt the construct of the vector space
referring to a single user, is subject to variations. Exploiting
                                                                                                                            basis is that some of the vector subspace representations
this feature when predicting relevance may be difficult [3],
thus making it not reliable. But in [3] the authors found                                                                    2
                                                                                                                              http://lemurstudy.cs.umass.edu/
that display-time, when used as implicit measure, is more                                                                    3
                                                                                                                              http://www.lemurproject.org/querylogtoolbar/
consistent when referring to multiple subjects performing                                                                   4
                                                                                                                              The goal of the study is to create a database of web search
the same task, than when personalized to each user.                                                                         activities that will be provided to the information retrieval
                                                                                                                            research community.
may reveal the logical structure underlying the collected ev-     work, we consider that the relationships are themselves sources.
idence. The collected data, prepared in a matrix, is a vector     Therefore, it is better to not consider distinct mappings, one
representation of the source. This data often may be noisy.       for each source, but to compute a single vector space basis
A matrix transformation, namely a change of basis, can be         to represent the relationships between sources.
applied to map the original view of the data to one that is         The model of the sources can be used in the retrieval pro-
less noisy. Let us consider the re-evaluation of the Vector       cess once the information objects have been represented by
Space Model (VSM) proposed in [16]. The authors point out         the features selected to describe the sources. Indeed, the
how some assumptions underlying the traditional VSM [17]          measure of the degree to which the modeled source occurs
– e.g. that the terms are orthogonal – may suggest that           in an information object can be computed as the distance
the vector was interpreted as a data structure and not as a       between the vector representation of the information object,
logical construct. Subsequent developments show how the           which corresponds to a one-dimensional subspace, and the
vector can be used as a logical construct able to capture de-     subspace modeling the source(s) spanned by the vector space
pendencies between terms and between documents [16, 18].          basis computed in the source modeling step. This motivates
The “latent semantics” [18] of the terms in the documents,        the function proposed in [9], where the author showed how
that is the dependencies between terms, was used as a source      such function can be interpreted as a trace-based function
for implementing a Pseudo Relevance Feedback algorithm [9]        and that the measure is a probability measure. The idea of
and an Explicit Relevance Feedback algorithm [19] based on        using trace in IR, and in particular the density operators,
the geometric framework adopted in this work.                     was originally introduced in [21], and one of its important
   To explain the role of the matrix transformation tech-         consequence – subsequently exploited in [9] – was to “es-
niques in the modeling step, we use the example of informa-       tablish a link between geometry and probability in vector
tion behavior as a source, where the latter is described in       spaces” [21].
terms of interaction features. A matrix A can be prepared
where the element (i, j) is feature j observed during the visit   5.   IMPLEMENTATION AND EVALUATION
of object i, e.g. a display-time of 30 seconds. The matrix
                                                                     The specific implementation we are investigating concern
A, as mentioned above, can be a noisy vector-based repre-
                                                                  the two mentioned sources, that is, the user behavior and
sentation of the observed data. A matrix transformation
                                                                  the latent semantic of the terms in the information objects.
technique such as Principal Component Analysis (PCA) of
                                                                     With respect to user behavior, we are focusing on two
AT A can be used to compute a new vector space basis –
                                                                  issues. The first is the selection of the source for interac-
this is actually the approach proposed in [6]. PCA provides
                                                                  tion features since, as discussed in Section 3, both individ-
a set of eigenvectors and a subset of them can be used to
                                                                  ual and user groups interaction data can be exploited to
obtain the user interaction behavior dimension – the model
                                                                  prepare the matrix A and to build the source model. In par-
of the source is the subspace spanned by the eigenvectors.
                                                                  ticular, we are investigating the difference between the two
As suggested by this example, this geometric framework al-
                                                                  contributions in terms of retrieval effectiveness when PCA
lows us to achieve one of our goals, which is to generate a
                                                                  is adopted as the matrix transformation technique. PCA
representation of the properties of the relationships between
                                                                  allows handing dimensionality reduction and capturing the
entities – in the example mentioned above the user behavior
                                                                  relationships among the features in an unsupervised manner.
was the property to be modeled.
                                                                  However, as stated in [6], the problem is that the eigenvector
   The two mentioned approaches, that is the one adopted
                                                                  whose components best combine the interaction features, is
in [9, 19] and that adopted in [6], provide a solution for
                                                                  not necessarily the first principal eigenvector, and the best
two distinct sources. In the former case the modeled source
                                                                  performance are achieved when the eigenvector is manually
is a property of an entity, namely the latent semantics of
                                                                  selected. For this reason we are investigating other unsuper-
the terms in the documents. In the latter case, the mod-
                                                                  vised methods to obtain a vector subspace representation of
eled source is a property of a relationships between entities,
                                                                  the interaction data.
namely the user interaction behavior. However, we are also
                                                                     With respect to the latent semantics of terms, one issue
interested in modeling relationships (if any) existing between
                                                                  under investigation is the selection of the terms in the feed-
the properties of the entities, namely between sources, e.g.
                                                                  back documents. Indeed, if the terms appearing in these
between the latent semantics of the terms and the user inter-
                                                                  documents are adopted as evidence to build a source model,
action behavior – this is different from modeling properties
                                                                  one issue, particularly when real-time feedback is required,
of relationships, e.g. the user interaction behavior.
                                                                  is to handle matrices whose dimensions are the number of
   Let us return to the scenario of a user looking for infor-
                                                                  distinct terms in the feedback documents. In this case a
mation about restaurants in London and suppose the term
                                                                  possible solution is the selection of a subset of the terms,
“jazz” appears in the abstract of one of the displayed re-
                                                                  e.g. the top weighted ones. However, this strategy has been
sults. The user when examining the result may realize that
                                                                  shown to not be effective [19]; therefore, we are investigating
he is more interested in jazz restaurants than in general ones.
                                                                  selection criteria for “good terms”.
This example also emphasizes how different sources are not
                                                                     Since the main objective of the methodology is to model
necessarily independent from each other. Indeed, the fea-
                                                                  relationships, we will look into the relationships between
tures observed for a source (e.g. the user behavior) can be
                                                                  sources, and investigate their implementation using the pro-
“entangled” with the features observed for another source
                                                                  posed geometric framework, and their impact on retrieval
(e.g. the particular meaning of a query feature in the se-
                                                                  effectiveness. Two approaches are possible. The first ap-
lected results).
                                                                  proach is to rank information objects separately according
   The design of one approach per source may not be able
                                                                  to different dimensions and then combine the rankings into
to model relationships that may occur between sources and
                                                                  one. The second approach is to model all the sources as a
consequently to exploit them, as reported in [20]. In this
                                                                  unique vector subspace and then rank the information ob-
jects against such subspace. The latter approach has the              Technology: Research and Development, 1(1):1–21,
advantage of exploiting all the dimensions simultaneously,            1982.
thus avoiding any loss of information that may arise from not     [2] D. Kelly. Understanding implicit feedback and
considering relationships between sources (which is the case          document preference: a naturalistic user study. PhD
                                                                      thesis, New Brunswick, NJ, USA, 2004.
with the first approach). In particular, as for the user be-
                                                                  [3] R. W. White and D. Kelly. A study on the effects of
havior source, we are investigating unsupervised approaches           personalization and task information on implicit
to model relationships among sources.                                 feedback performance. In Proceedings of CIKM’06,
   Evaluation is crucial to validate the implementation of            pages 297–306, New York, NY, USA, 2006. ACM.
the methodology. The main problem is the availability of          [4] T. Joachims. Optimizing search engines using
datasets where information about user interaction behavior,           clickthrough data. In Proceedings of KDD ’02, pages
the content of results and information objects are available.         133–142, New York, NY, USA, 2002. ACM.
Transaction logs [7] can provide this data, but no explicit       [5] E. Agichtein, E. Brill, S. Dumais, and R. Ragno.
                                                                      Learning user interaction models for predicting web
relevance judgments are available to validate the effective-          search result preferences. In Proceedings of SIGIR ’06,
ness of the approaches under investigation; existing datasets         pages 3–10, New York, NY, USA, 2006. ACM.
with this information are not publicly available.                 [6] M. Melucci and R.W. White. Utilizing a geometry of
                                                                      context for enhanced implicit feedback. In Proceedings
6.   CONCLUDING REMARKS                                               of CIKM’07, pages 273–282, Lisbon, Portugal, 2007.
                                                                  [7] B. Jansen. Search log analysis: What it is, what’s
   The purpose of this work was the introduction of a method-         been done, how to do it. Library & Information
ology that aims at exploiting evidence coming from multiple           Science Research, 28(3):407–432, 2006.
sources to predict the relevance of information objects for       [8] W.B. Croft. Combining approaches to information
given queries. Four methodological steps are required to              retrieval. Advances in information retrieval, 7:1–36,
achieve this goal, namely, sources selection, features collec-        2000.
tion, dimension modeling and relevance prediction. The ge-        [9] M. Melucci. A basis for information retrieval in
                                                                      context. ACM TOIS, 26(3):1–41, 2008.
ometric framework proposed in [9] was chosen to implement
                                                                 [10] D. Kelly and J. Teevan. Implicit feedback for inferring
the last two steps because it provides a uniform model for            user preference: a bibliography. SIGIR Forum,
the sources, which can be used by to rank objects according           37(2):18–28, 2003.
to their estimated relevance.                                    [11] R.W. White, J.M. Jose, and I. Ruthven. An implicit
   Moreover, we discussed some issues to be addressed when            feedback approach for interactive information
implementing the methodology for two specific sources, that           retrieval. IP&M, 42(1):166–190, 2006.
is the user interaction behavior and the latent semantic of      [12] N. Fuhr. A probability ranking principle for
the terms in the information objects. The issues specifically         interactive information retrieval. Information
                                                                      Retrieval, 11(3):251–265, 2008.
concern the evidence collection and source modeling steps.
                                                                 [13] M. Agosti, and F. Crestani A methodology for the
   In future work we want to further investigate the concepts         automatic construction of a hypertext for information
adopted in this paper, namely, entity, relationship, dimen-           retrieval Proc. of ACM SAC, 745–753, Indianapolis,
sion and feature. We chose these concepts as they relate to           Indiana, United States, 1993.
the view of the world to be modeled – in our case in order       [14] T. Tsikrika and M. Lalmas. Combining evidence for
to predict relevance – which consists of entities and rela-           web retrieval using the inference network model: an
tionships, where the entities exists independently of their           experimental study. IP&M, 40(5):751–772, 2004.
properties. The properties, namely the sources, are the in-      [15] M. Lalmas and I. Ruthven. Representing and
                                                                      retrieving structured documents using the
formation that can be obtained by the observation of entities         Dempster-Shafer theory of evidence: modelling and
and relationships between them. This is the same view of the          evaluation. Journal of Documentation, 54:529–565,
world adopted in the Entity-Relationship (ER) model [22],             1998.
the most widely used data model for the conceptual design        [16] S. K. M. Wong and V. V. Raghavan. Vector space
of databases. In the ER model the result of the observation           model of information retrieval: a reevaluation. In
is a value and the mapping from the entities set (or the rela-        Proc. of SIGIR ’84, pages 167–185, Swinton, UK,
tionship set) to the value set is named attribute. The notion         1984. British Computer Society.
of feature adopted in this work can be compared to the ER        [17] G. Salton, A. Wong, and C. S. Yang. A vector space
                                                                      model for automatic indexing. Communications of the
notion of value set. Moreover the notion of dimension can             ACM, 18(11):613–620, 1975.
be compared to the notion of attribute, since both refers to     [18] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K.
properties of entities and relationships.                             Landauer, and R. Harshman. Indexing by latent
   The above discussion suggests investigate the relation-            semantic analysis. JASIS, 41:391–407, 1990.
ships among the ER model, the geometric framework pro-           [19] E. Di Buccio and M. Melucci. University of Padua at
posed in [9] and the methodology proposed in this paper.              TREC 2009: Relevance Feedback Track. In Proc. of
                                                                      TREC 2009, Washington, DC, USA, 2009. To Appear.
Acknowledgements This research is partly funded by a             [20] E. Di Buccio and M. Melucci. Towards a Methodology
                                                                      for Contextual Information Retrieval. In Proc. of
Royal Society International Joint Project (2008/R4). Mou-             CIRSE 2009, Tolouse, France, 2009.
nia Lalmas is currently funded by Microsoft Research/Royal       [21] C.J. van Rijsbergen. The Geometry of Information
Academy of Engineering.                                               Retrieval. Cambridge University Press, New York, NY,
                                                                      USA, 2004.
7.   REFERENCES                                                  [22] P.P. Chen. The entity-relationship model—toward a
                                                                      unified view of data. ACM TODS, 1(1):9–36, 1976.
 [1] S. E. Robertson, M. E. Maron, and W. S. Cooper.
     Probability of relevance: A unification of two
     competing models for document retrieval. Information