<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>From Entities to Geometry: Towards exploiting Multiple Sources to Predict Relevance</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Emanuele Di Buccio</string-name>
          <email>dibuccio@dei.unipd.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mounia Lalmas</string-name>
          <email>mounia@acm.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Massimo Melucci</string-name>
          <email>melo@dei.unipd.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computing Science</institution>
          ,
          <institution>University of Glasgow</institution>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Information Engineering, University of Padua</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2010</year>
      </pub-date>
      <fpage>27</fpage>
      <lpage>28</lpage>
      <abstract>
        <p>The goal of an Information Retrieval (IR) system is to predict which information objects can help users in satisfying their information needs, i.e. predict relevance. Different sources of evidence can be exploited for this purpose. These sources are the properties of the different entities involved when retrieving and accessing information, where examples of entities include the information objects, the task, the user, or the location. The main hypothesis of this paper is that, to exploit the variety of entities and sources, it is necessary to model the relationships existing between the entities and those existing between the properties of the entities. Such relationships are themselves possible sources that can be used to predict relevance. This paper proposes a methodology that supports the design of an IR system able to model in a uniform way the properties of the entities involved, the properties of their relationships and the relationships between the different properties. The methodology is structured in four steps, aiming, respectively, at supporting the selection of the sources, collecting the evidence, modeling the sources and their relationships, and using the latter two to predict relevance. Sources and relationships are modeled and then exploited through a previously proposed geometric framework, which provides a uniform and concrete representation in terms of vector subspaces.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The goal of an IR system is to predict which information
objects can help users in satisfying their information needs.
For instance, if the information need is expressed by the user
as a textual query, the IR system has to predict which
documents are relevant to the formulated query. According to
this interpretation, IR can be framed as a problem of
evidence and prediction [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The prediction can be performed
through the different sources of evidence involved in the
retrieval process. Content, meta-data and annotations of the
information objects are examples of such sources, and have
been used by many retrieval systems.
      </p>
      <p>
        These sources have been shown to be effective in predicting
relevance, but other sources exist. An example is the
behavior of the user during the search process, for instance
described in terms of interaction features such as display time,
click-through data, amount of scrolling, or other features,
e.g. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. These features have been adopted as sources of
evidence to estimate relevance, e.g. display time in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],
click-through data in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], or a combination of several features
in [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ]. Nowadays, commercially available devices, e.g.
mobile phones, are equipped with tools that can capture
information about the user's location and the surrounding
environment, in addition to having access to the information
provided by the web and to the user's personal data.
      </p>
      <p>
        The various sources may not have the same impact in
predicting relevance, and as such their relative contributions
should be investigated. For instance, ranking algorithms that
are based on different object representations will usually
return sets of relevant information objects with little
overlap [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. It is therefore important, as stated in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], to
"explicitly describe and combine multiple sources of evidence
about relevance" when developing ranking algorithms. More
precisely, it is important to explicitly consider the
relationships existing between sources. However, the design and
the implementation of distinct ranking algorithms, one for
each type of source, may not allow for considering
relationships between sources. It is thus important to investigate
approaches that combine evidence rather than approaches
that combine ranking algorithms. This would allow for the
relationships between sources to be explicitly integrated in
the ranking algorithm.
      </p>
      <p>
        This paper proposes a methodology that supports the
design of an IR system able to model in a uniform way the
properties of the entities involved, the properties of their
relationships and the relationships between the different
properties. The methodology is structured in four steps, aiming,
respectively, at supporting the selection of the sources,
collecting the evidence, modeling the sources and their
relationships, and using the latter two to predict relevance. The last
two steps are based on the geometric framework proposed
in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], which provides a uniform and concrete
representation of the sources and their relationships in terms of vector
subspaces.
      </p>
      <p>
        The methodology aims at being general, in the sense that
it is not related to a specific source or set of sources.
However, for illustration purposes, two sources will be considered
in this paper, namely, the content of the information objects
to be ranked and the behavior of the users when accessing
or retrieving information. The former has been selected
because past research in IR provides a number of
representations of the content that have been shown to lead to effective
retrieval [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The latter has been extensively investigated in
Information Science (IS) and has in the last decade become a
subject of investigation in IR. Indeed, experimental
evaluation has shown how usage data stored in transaction logs [
        <xref ref-type="bibr" rid="ref10 ref3 ref4 ref6">3,
4, 6, 10</xref>
        ] or so-called interactive IR systems [
        <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
        ] can effectively
predict relevance. The use of the Entity-Relationship
database model for describing IR objects was introduced
in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] for the purpose of automatic hypertext construction; this
paper enlarges that view and connects the entities and
relationships at the conceptual level to a mathematical model
which provides a language at the logical level.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. MOTIVATIONS AND METHODOLOGY</title>
    </sec>
    <sec id="sec-3">
      <title>RATIONALE</title>
      <p>
        IR systems can exploit the evidence provided by different
sources to improve retrieval effectiveness. In [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] the
author considers several document representations and
discusses approaches to combine the contribution provided by
each representation. In [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] the inference network framework
is adopted to combine link-based evidence with
content-based evidence for web retrieval. Evidence on the structure
of the documents can be incorporated, for instance, using
the Dempster-Shafer theory of evidence [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. However, the
different document representations are only a subset of the
available sources.
      </p>
      <p>Let us consider, for instance, the scenario where a user
is looking for information about restaurants in London. If
Venice is the location where the search is performed, this
probably suggests that the user is planning a trip to
London, and restaurants in an arbitrary London area may be
of interest. If the search is performed on a mobile phone
and the GPS position indicates that the user is in London,
the user is probably more interested in restaurants near his
current position. We can see that in this scenario, other
units besides the information objects are involved. In this
paper, we refer to these units as entities. For instance, in our
scenario, the entities involved are the user, the location, the
task the user is performing when looking for information
(i.e. "travel in London") and the specific topic within the
task<sup>1</sup> (i.e. "finding restaurants in London").</p>
      <p>Each entity is characterized by a number of properties.
When the entity is an "information object", examples of
properties include content, meta-data and annotation. For
the entity "location", instances of properties are the GPS
position or the IP address.</p>
      <p>Each entity exists independently of the properties we can
observe about it, but the observed properties are the
evidence that can be used to build a model of the entity, that
is to obtain a description of the entity (in this work a
mathematical description) that can be used to predict relevance.
In other words, the properties of the entities are the sources
of evidence that can be exploited to help predict the
relevance of information objects.</p>
      <p>
        Not only are the properties of the entities sources of
evidence, but the relationships between entities (if any)
can also provide additional evidence to predict relevance. Let
us consider a list of results returned by an IR system in
response to a query and the user who formulated the query.
The behavior of the user when examining a result is one of
the properties that describe the relationship between the
entity user and the entity result; such a property constitutes
a source that can be exploited to predict relevance.
Indeed, research in Interactive IR has shown that a retrieval
system can benefit from evidence gathered from the
information-seeking activities of a user. For example, Implicit
Relevance Feedback (IRF) algorithms [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] exploit the
information gathered from the interactions between the user and
the documents to recommend query expansion terms or to
re-rank documents. Even the concept of relevance can be
defined as "a relation between a document and a person,
relative to a given information need" [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], the document and
the person being two entities.
        <fn id="fn1">
          <label>1</label>
          <p>We take the definition of task and topic from [<xref ref-type="bibr" rid="ref2">2</xref>]: "Task was defined for this study as the goal of information-seeking behavior, and topic was defined as the specific subject within a task."</p>
        </fn>
      </p>
      <p>The set of entities and relationships, and their properties,
are neither fixed nor unique, as they depend on the specific
retrieval application; e.g. the entity location is crucial for
searches carried out on a mobile phone or for customizing search
results according to the country where the search originates.
Therefore, the selection of the sources is an important issue
that needs to be addressed.</p>
      <p>Once the appropriate sources have been identified, each
of them has to be modeled so that it can be exploited for
retrieval. In this work, we refer to the model of a source as a
dimension. A first step to obtain a dimension is to identify
a set of features that describe it. Feature here refers to the
information obtained by the observation of a property of an
entity or a relationship. For an entity "location" described
by the dimension "GPS position", the features are the GPS
position components. For a "web result" entity, the keywords
in the title, the snippet or the URL of the result are examples
of features. Since the features constitute the evidence that
models a source, a procedure to select and collect features has
to be designed and implemented.</p>
      <p>
        The description (model) of the sources is what is used
to predict relevance. In this work the framework adopted
to build the description is the vector subspace formalism
proposed in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. The basic rationale for this is that we want
to map the collected data, prepared in a matrix, onto a new
vector space basis; the vector subspace spanned by the basis
is the model of the source.
      </p>
      <p>
        Once a representation in terms of subspaces has been built
both for the sources and the information objects, a
trace-based function, the one exploited in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], can be adopted
to rank information objects by exploiting the information
about the different sources of evidence that have been
modeled. In other words, the trace-based function, which we
briefly describe in Section 4, is a tool to handle the
prediction problem.
      </p>
      <p>In summary, four steps have been identified, and each of
them needs to be addressed to be able to predict relevance
using multiple sources of evidence, namely, source
selection, feature collection, source modeling and relevance
prediction. Figure 1 illustrates these four steps for the
relationship between the entities "user" and "information objects";
here, the relationship is characterized by the source "user
behavior" described in terms of "interaction features".</p>
      <p>In this paper we will focus on two of the above steps,
specifically evidence collection and source modeling, which
will be discussed respectively in Section 3 and Section 4;
some remarks on the implementation of these methodology
steps and their evaluation are reported in Section 5.</p>
      <p>[Figure 1: the four methodology steps for the relationship between the entities "user" and "information objects". Source selection: selected entities (user, documents) and selected source (user behavior). Collection of interaction features: (a) raw data, a matrix of k interaction features f<sub>i1</sub>, ..., f<sub>ik</sub> for each of the n visited documents; (b) logical view of the data, e.g. eigenvectors computed by PCA.]</p>
    </sec>
    <sec id="sec-4">
      <title>3. EVIDENCE COLLECTION</title>
      <p>Let us return to the scenario of a user looking for
information about restaurants in London. Let us suppose the
user, to satisfy his information need, interacts with a search
engine and submits the query "restaurants in London". The
search engine returns a ranked list of results. For
simplicity, we focus on two entities only, namely, the user and the
result. When examining the returned results, the user
interacts with them and with the information objects the results
refer to. In this scenario the behavior of the user when
examining and (possibly) accessing the results can be
considered as a property to describe the involved entities and,
particularly, as a source to assist relevance prediction. In
the above scenario another source available is the content of
the abstracts (title, snippet and URL) of the results and the
content of the corresponding information objects.</p>
      <p>Once the sources have been selected, the next step is to
collect the evidence to build the model of these sources. This
step consists of selecting the features to be gathered to build
a model of these sources, and then the actual collection of
the selected features.</p>
      <p>
        In the case of the source "user behavior", a possible choice,
as depicted in step two of Figure 1, is the adoption of
so-called interaction features. This is for instance the approach
adopted in [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ] where several interaction features are
exploited simultaneously. In particular, in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] a subset of the
features gathered in the user study described in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] was
exploited to obtain a vector subspace representation of the user
behavior. When using a representation personalized for each
user and tailored to the specific search task to re-rank the
documents, the keywords extracted from the top re-ranked
documents were shown to be effective as a source for query
expansion. The methodology proposed in that work assumed
that the interaction features were available for all the
documents to be re-ranked. But this assumption does not hold
in our considered scenario, unless the documents have been
already visited with regard to past queries when performing
the same task. Therefore, the availability of the interaction
features is an issue to address. A possible solution is not to
consider the features with regard to a single user, but with
regard to a group of users, e.g. performing the same task.
      </p>
      <p>
        Another reason to exploit group interaction data is the
reliability of the interaction features. The features need to be
reliable indicators of the user needs, interests or intents. To
clarify what we mean by "reliable feature", let us consider the
display time: this feature, when considered in isolation and
referring to a single user, is subject to variations. Exploiting
this feature when predicting relevance may be difficult [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],
thus making it unreliable. However, in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] the authors found
that display time, when used as an implicit measure, is more
consistent when referring to multiple subjects performing
the same task, than when personalized to each user.
      </p>
      <p>
        Individual users and user groups do not necessarily
need to be considered as mutually exclusive sources for
interaction features. For instance, in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] user behavior models
to predict user preferences for web ranking are learned by
exploiting simultaneously feature values derived from the
individual's behavior and those aggregated across all the users
and search sessions for each query-URL pair.
      </p>
      <p>The selection of the features to be gathered for a source
affects the modeling step, since they constitute the
evidence used to build a model of the source. However, the
procedure to collect features is part of the design of the
IR system, in particular of the components aimed at
gathering the selected features and managing them. For instance,
when interaction features have been selected as implicit
indicators, a browser extension can be used to monitor the
user and gather such features. This is the approach adopted in
the Lemur Query Log Project<sup>2</sup>, a study to gather the query
logs from users of the Lemur Query Log Toolbar<sup>3,4</sup>. It should
be noted that the development of an extension that stores
the usage data on the client side may encourage the user to
adopt this monitoring tool since no personal data need to
be provided to the server.</p>
    </sec>
    <sec id="sec-5">
      <title>4. SOURCE MODELING AND PREDICTION</title>
      <p>Once the evidence has been gathered, the next step
consists of modeling the evidence so that it can be used to
predict relevance. In this work the mathematical construct of
the vector subspace is used for this purpose.</p>
      <p>In this paper, the evidence gathered from the different sources
is exploited to rank information objects with respect to a
given query. This is done by using the different
representations of the objects generated from the sources. For instance,
if the user "interaction behavior" is a considered source, an
information object can be described in terms of the
interaction features monitored when a user is visiting the object;
e.g. an object displayed for 30 seconds, clicked 3 times
and on which 5 scrolling actions have been performed can
be represented as the vector y = (30, 3, 5). The same
object, if the source "content" is considered, can be described
as the vector of the TF-IDF weights of the terms appearing
in it. The construct of the vector space basis is particularly
suitable to model these multiple representations. Indeed,
intuitively, the same information object can be represented
with regard to different sources in the same way the same
vector can be generated by different vector space bases.</p>
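      <p>The remark that one vector can be generated by different bases can be illustrated with a minimal Python sketch using NumPy. The interaction vector y is the one from the example above; the alternative basis U (a rotation of the first two axes) and all variable names are hypothetical and chosen only for illustration.</p>
      <preformat preformat-type="code">
```python
import numpy as np

# Interaction vector from the text: display time (s), clicks, scrolling actions.
y = np.array([30.0, 3.0, 5.0])

# Hypothetical alternative basis: the columns of U are orthonormal vectors
# obtained by rotating the first two canonical axes by 45 degrees.
c, s = np.cos(np.pi / 4), np.sin(np.pi / 4)
U = np.array([
    [c, -s, 0.0],
    [s, c, 0.0],
    [0.0, 0.0, 1.0],
])

# Coordinates of y with respect to the basis U (U is orthogonal, so its
# inverse is its transpose).
coords = U.T @ y

# Combining the basis vectors with these coordinates generates y again.
rebuilt = U @ coords

print("coordinates in the canonical basis:", y)
print("coordinates in the basis U:", coords)
print("same vector regenerated:", np.allclose(rebuilt, y))
```
</preformat>
      <p>The coordinates change with the basis while the vector itself does not, which is the sense in which one object can carry several source-specific representations.</p>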
      <p>
        A second reason to adopt the construct of the vector space
basis is that some of the vector subspace representations
may reveal the logical structure underlying the collected
evidence. The collected data, prepared in a matrix, is a vector
representation of the source. This data may often be noisy.
A matrix transformation, namely a change of basis, can be
applied to map the original view of the data to one that is
less noisy. Let us consider the re-evaluation of the Vector
Space Model (VSM) proposed in [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. The authors point out
how some assumptions underlying the traditional VSM [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]
(e.g. that the terms are orthogonal) may suggest that
the vector was interpreted as a data structure and not as a
logical construct. Subsequent developments show how the
vector can be used as a logical construct able to capture
dependencies between terms and between documents [
        <xref ref-type="bibr" rid="ref16 ref18">16, 18</xref>
        ].
The "latent semantics" [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] of the terms in the documents,
that is the dependencies between terms, was used as a source
for implementing a Pseudo Relevance Feedback algorithm [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]
and an Explicit Relevance Feedback algorithm [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] based on
the geometric framework adopted in this work.
        <fn id="fn2"><label>2</label><p>http://lemurstudy.cs.umass.edu/</p></fn>
        <fn id="fn3"><label>3</label><p>http://www.lemurproject.org/querylogtoolbar/</p></fn>
        <fn id="fn4"><label>4</label><p>The goal of the study is to create a database of web search activities that will be provided to the information retrieval research community.</p></fn>
      </p>
      <p>
        To explain the role of the matrix transformation
techniques in the modeling step, we use the example of
information behavior as a source, where the latter is described in
terms of interaction features. A matrix A can be prepared
where the element (i, j) is feature j observed during the visit
of object i, e.g. a display time of 30 seconds. The matrix
A, as mentioned above, can be a noisy vector-based
representation of the observed data. A matrix transformation
technique such as Principal Component Analysis (PCA) of
A<sup>T</sup>A can be used to compute a new vector space basis;
this is actually the approach proposed in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. PCA provides
a set of eigenvectors and a subset of them can be used to
obtain the user interaction behavior dimension: the model
of the source is the subspace spanned by the selected eigenvectors.
As suggested by this example, this geometric framework
allows us to achieve one of our goals, which is to generate a
representation of the properties of the relationships between
entities; in the example mentioned above the user behavior
was the property to be modeled.
      </p>
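      <p>As an illustration of this modeling step, the following minimal Python sketch (using NumPy) computes a candidate basis from a small, made-up interaction matrix A. The data values, the function name source_basis, the parameter k and the centering of the columns are assumptions of this sketch and are not taken from [<xref ref-type="bibr" rid="ref6">6</xref>]; the sketch keeps the leading eigenvectors, whereas any other subset could be selected.</p>
      <preformat preformat-type="code">
```python
import numpy as np

# Hypothetical interaction matrix A: one row per visited object, one column
# per interaction feature (display time in seconds, clicks, scrolling actions).
A = np.array([
    [30.0, 3.0, 5.0],
    [12.0, 1.0, 0.0],
    [45.0, 4.0, 9.0],
    [8.0, 0.0, 1.0],
])


def source_basis(A, k):
    """Return k orthonormal eigenvectors of A^T A as columns, largest eigenvalue first."""
    # Centering the columns is an assumption of this sketch (common PCA practice).
    centered = A - A.mean(axis=0)
    # The product is symmetric, so eigh applies; eigenvalues come back ascending.
    eigenvalues, eigenvectors = np.linalg.eigh(centered.T @ centered)
    order = np.argsort(eigenvalues)[::-1][:k]
    return eigenvectors[:, order]


# One-dimensional model of the "user behavior" source.
B = source_basis(A, k=1)
print(B.ravel())
```
</preformat>
      <p>The columns of B span the subspace that plays the role of the dimension; the sign of each eigenvector is arbitrary and does not change the subspace.</p>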
      <p>
        The two mentioned approaches, that is the one adopted
in [
        <xref ref-type="bibr" rid="ref19 ref9">9, 19</xref>
        ] and that adopted in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], provide a solution for
two distinct sources. In the former case the modeled source
is a property of an entity, namely the latent semantics of
the terms in the documents. In the latter case, the
modeled source is a property of a relationship between entities,
namely the user interaction behavior. However, we are also
interested in modeling relationships (if any) existing between
the properties of the entities, namely between sources, e.g.
between the latent semantics of the terms and the user
interaction behavior; this is different from modeling properties
of relationships, e.g. the user interaction behavior.
      </p>
      <p>Let us return to the scenario of a user looking for
information about restaurants in London and suppose the term
"jazz" appears in the abstract of one of the displayed
results. The user, when examining the result, may realize that
he is more interested in jazz restaurants than in general ones.
This example also emphasizes how different sources are not
necessarily independent of each other. Indeed, the
features observed for a source (e.g. the user behavior) can be
"entangled" with the features observed for another source
(e.g. the particular meaning of a query feature in the
selected results).</p>
      <p>
        The design of one approach per source may not be able
to model relationships that may occur between sources and
consequently to exploit them, as reported in [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. In this
work, we consider that the relationships are themselves sources.
Therefore, it is preferable not to consider distinct mappings, one
for each source, but to compute a single vector space basis
to represent the relationships between sources.
      </p>
      <p>
        The model of the sources can be used in the retrieval
process once the information objects have been represented by
the features selected to describe the sources. Indeed, the
measure of the degree to which the modeled source occurs
in an information object can be computed as the distance
between the vector representation of the information object,
which corresponds to a one-dimensional subspace, and the
subspace modeling the source(s) spanned by the vector space
basis computed in the source modeling step. This motivates
the function proposed in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], where the author showed how
such a function can be interpreted as a trace-based function
and that the measure is a probability measure. The idea of
using the trace in IR, and in particular density operators,
was originally introduced in [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], and one of its important
consequences, subsequently exploited in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], was to
"establish a link between geometry and probability in vector
spaces" [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ].
      </p>
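      <p>A minimal Python sketch of such a trace-based score is given below. It assumes that the score takes the form tr(P yy^T), where P is the orthogonal projector onto the subspace modeling the source and y is the object vector normalized to unit length; the exact function used in [<xref ref-type="bibr" rid="ref9">9</xref>] may differ. The function names, the basis B and the object vectors are hypothetical, and the object vector is assumed to be non-zero.</p>
      <preformat preformat-type="code">
```python
import numpy as np


def projector(B):
    """Orthogonal projector onto the subspace spanned by the orthonormal columns of B."""
    return B @ B.T


def trace_score(y, B):
    """Trace of P times the rank-one operator of the unit vector y / |y|; lies in [0, 1]."""
    y = np.asarray(y, dtype=float)
    unit = y / np.linalg.norm(y)    # assumes y is not the zero vector
    density = np.outer(unit, unit)  # one-dimensional subspace of the object
    return float(np.trace(projector(B) @ density))


# Hypothetical one-dimensional source model spanned by a single unit vector.
B = np.array([[0.6], [0.8], [0.0]])

print(trace_score([3.0, 4.0, 0.0], B))  # object lying inside the subspace
print(trace_score([0.0, 0.0, 2.0], B))  # object orthogonal to the subspace
print(trace_score([1.0, 0.0, 0.0], B))  # object partially aligned with it
```
</preformat>
      <p>Under these assumptions the score equals the squared length of the projection of the unit object vector onto the subspace, so it is 1 for objects inside the subspace and 0 for objects orthogonal to it.</p>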
    </sec>
    <sec id="sec-6">
      <title>5. IMPLEMENTATION AND EVALUATION</title>
      <p>The specific implementation we are investigating concerns
the two mentioned sources, that is, the user behavior and
the latent semantics of the terms in the information objects.</p>
      <p>
        With respect to user behavior, we are focusing on two
issues. The first is the selection of the source for
interaction features since, as discussed in Section 3, both
individual and user-group interaction data can be exploited to
prepare the matrix A and to build the source model. In
particular, we are investigating the difference between the two
contributions in terms of retrieval effectiveness when PCA
is adopted as the matrix transformation technique. PCA
allows handling dimensionality reduction and capturing the
relationships among the features in an unsupervised manner.
However, as stated in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], the problem is that the eigenvector
whose components best combine the interaction features is
not necessarily the first principal eigenvector, and the best
performance is achieved when the eigenvector is manually
selected. For this reason we are investigating other
unsupervised methods to obtain a vector subspace representation of
the interaction data.
      </p>
      <p>
        With respect to the latent semantics of terms, one issue
under investigation is the selection of the terms in the
feedback documents. Indeed, if the terms appearing in these
documents are adopted as evidence to build a source model,
one issue, particularly when real-time feedback is required,
is to handle matrices whose dimensions are the number of
distinct terms in the feedback documents. In this case a
possible solution is the selection of a subset of the terms,
e.g. the top weighted ones. However, this strategy has been
shown not to be effective [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]; therefore, we are investigating
selection criteria for "good terms".
      </p>
      <p>Since the main objective of the methodology is to model
relationships, we will look into the relationships between
sources, and investigate their implementation using the
proposed geometric framework, and their impact on retrieval
effectiveness. Two approaches are possible. The first
approach is to rank information objects separately according
to different dimensions and then combine the rankings into
one. The second approach is to model all the sources as a
unique vector subspace and then rank the information
objects against such subspace. The latter approach has the
advantage of exploiting all the dimensions simultaneously,
thus avoiding any loss of information that may arise from not
considering relationships between sources (which is the case
with the first approach). In particular, as for the user
behavior source, we are investigating unsupervised approaches
to model relationships among sources.</p>
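      <p>The two approaches can be contrasted with a rough Python sketch using NumPy. Everything in it is an assumption made for illustration: the toy feature values, the helper names, the fusion of the separate scores by summation, and the construction of the single subspace by concatenating the features of the two sources. It is not the implementation under investigation, and the features are not rescaled.</p>
      <preformat preformat-type="code">
```python
import numpy as np


def top_eigenvectors(X, k):
    """k leading eigenvectors of X^T X (uncentered), as orthonormal columns."""
    eigenvalues, eigenvectors = np.linalg.eigh(X.T @ X)
    return eigenvectors[:, np.argsort(eigenvalues)[::-1][:k]]


def scores(X, B):
    """Squared length of each unit-normalized row of X after projection onto span(B)."""
    unit_rows = X / np.linalg.norm(X, axis=1, keepdims=True)
    return np.sum((unit_rows @ B) ** 2, axis=1)


# Hypothetical toy data: 4 objects, 2 content features and 2 interaction features.
content = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.1, 0.9]])
behavior = np.array([[30.0, 3.0], [5.0, 0.0], [40.0, 5.0], [8.0, 1.0]])

# Approach 1: one subspace per source, then fuse the score lists (here by summing).
content_scores = scores(content, top_eigenvectors(content, 1))
behavior_scores = scores(behavior, top_eigenvectors(behavior, 1))
ranking_fused = np.argsort(-(content_scores + behavior_scores))

# Approach 2: a single subspace for all sources, built from concatenated features.
joint = np.hstack([content, behavior])
ranking_joint = np.argsort(-scores(joint, top_eigenvectors(joint, 1)))

print("fused ranking:", ranking_fused)
print("joint ranking:", ranking_joint)
```
</preformat>
      <p>In the second approach the eigenvectors mix content and interaction features, so correlations across sources can influence the ranking; in the first approach each subspace is computed from one source only.</p>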
      <p>
        Evaluation is crucial to validate the implementation of
the methodology. The main problem is the availability of
datasets that include information about user interaction behavior
as well as the content of results and information objects.
Transaction logs [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] can provide this data, but no explicit
relevance judgments are available to validate the effectiveness
of the approaches under investigation; existing datasets
with this information are not publicly available.
      </p>
    </sec>
    <sec id="sec-7">
      <title>6. CONCLUDING REMARKS</title>
      <p>
        The purpose of this work was the introduction of a
methodology that aims at exploiting evidence coming from multiple
sources to predict the relevance of information objects for
given queries. Four methodological steps are required to
achieve this goal, namely, source selection, feature
collection, dimension modeling and relevance prediction. The
geometric framework proposed in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] was chosen to implement
the last two steps because it provides a uniform model for
the sources, which can be used to rank objects according
to their estimated relevance.
      </p>
      <p>Moreover, we discussed some issues to be addressed when
implementing the methodology for two specific sources,
namely the user interaction behavior and the latent semantics of
the terms in the information objects. The issues specifically
concern the evidence collection and source modeling steps.</p>
      <p>
        In future work we want to further investigate the concepts
adopted in this paper, namely, entity, relationship,
dimension and feature. We chose these concepts as they relate to
the view of the world to be modeled (in our case, in order
to predict relevance), which consists of entities and
relationships, where the entities exist independently of their
properties. The properties, namely the sources, are the
information that can be obtained by the observation of entities
and relationships between them. This is the same view of the
world adopted in the Entity-Relationship (ER) model [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ],
the most widely used data model for the conceptual design
of databases. In the ER model the result of the observation
is a value, and the mapping from the entity set (or the
relationship set) to the value set is called an attribute. The notion
of feature adopted in this work can be compared to the ER
notion of value set. Moreover, the notion of dimension can
be compared to the notion of attribute, since both refer to
properties of entities and relationships.
      </p>
      <p>
        The above discussion suggests investigating the
relationships among the ER model, the geometric framework
proposed in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and the methodology proposed in this paper.
      </p>
      <p>
        Acknowledgements. This research is partly funded by a
Royal Society International Joint Project (2008/R4).
Mounia Lalmas is currently funded by Microsoft Research/Royal
Academy of Engineering.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Robertson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Maron</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W. S.</given-names>
            <surname>Cooper</surname>
          </string-name>
          .
          <article-title>Probability of relevance: A unification of two competing models for document retrieval</article-title>
          .
          <source>Information Technology: Research and Development</source>
          ,
          <volume>1</volume>
          (
          <issue>1</issue>
          ):1–
          <fpage>21</fpage>
          ,
          <year>1982</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kelly</surname>
          </string-name>
          .
          <article-title>Understanding implicit feedback and document preference: a naturalistic user study</article-title>
          .
          <source>PhD thesis</source>
          , New Brunswick, NJ, USA,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R. W.</given-names>
            <surname>White</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Kelly</surname>
          </string-name>
          .
          <article-title>A study on the effects of personalization and task information on implicit feedback performance</article-title>
          .
          <source>In Proceedings of CIKM'06</source>
          , pages
          <fpage>297</fpage>
          –
          <fpage>306</fpage>
          , New York, NY, USA,
          <year>2006</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Joachims</surname>
          </string-name>
          .
          <article-title>Optimizing search engines using clickthrough data</article-title>
          .
          <source>In Proceedings of KDD '02</source>
          , pages
          <fpage>133</fpage>
          –
          <fpage>142</fpage>
          , New York, NY, USA,
          <year>2002</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E.</given-names>
            <surname>Agichtein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Brill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dumais</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Ragno</surname>
          </string-name>
          .
          <article-title>Learning user interaction models for predicting web search result preferences</article-title>
          .
          <source>In Proceedings of SIGIR '06</source>
          , pages
          <fpage>3</fpage>
          –
          <fpage>10</fpage>
          , New York, NY, USA,
          <year>2006</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Melucci</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.W.</given-names>
            <surname>White</surname>
          </string-name>
          .
          <article-title>Utilizing a geometry of context for enhanced implicit feedback</article-title>
          .
          <source>In Proceedings of CIKM'07</source>
          , pages
          <fpage>273</fpage>
          –
          <fpage>282</fpage>
          , Lisbon, Portugal,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>B.</given-names>
            <surname>Jansen</surname>
          </string-name>
          .
          <article-title>Search log analysis: What it is, what's been done, how to do it</article-title>
          .
          <source>Library &amp; Information Science Research</source>
          ,
          <volume>28</volume>
          (
          <issue>3</issue>
          ):
          <fpage>407</fpage>
          –
          <fpage>432</fpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>W.B.</given-names>
            <surname>Croft</surname>
          </string-name>
          .
          <article-title>Combining approaches to information retrieval</article-title>
          .
          <source>Advances in information retrieval</source>
          ,
          <volume>7</volume>
          :1–
          <fpage>36</fpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Melucci</surname>
          </string-name>
          .
          <article-title>A basis for information retrieval in context</article-title>
          .
          <source>ACM TOIS</source>
          ,
          <volume>26</volume>
          (
          <issue>3</issue>
          ):1–
          <fpage>41</fpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kelly</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Teevan</surname>
          </string-name>
          .
          <article-title>Implicit feedback for inferring user preference: a bibliography</article-title>
          .
          <source>SIGIR Forum</source>
          ,
          <volume>37</volume>
          (
          <issue>2</issue>
          ):
          <fpage>18</fpage>
          –
          <fpage>28</fpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>R.W.</given-names>
            <surname>White</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.M.</given-names>
            <surname>Jose</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I.</given-names>
            <surname>Ruthven</surname>
          </string-name>
          .
          <article-title>An implicit feedback approach for interactive information retrieval</article-title>
          .
          <source>IP&amp;M</source>
          ,
          <volume>42</volume>
          (
          <issue>1</issue>
          ):
          <fpage>166</fpage>
          –
          <fpage>190</fpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>N.</given-names>
            <surname>Fuhr</surname>
          </string-name>
          .
          <article-title>A probability ranking principle for interactive information retrieval</article-title>
          .
          <source>Information Retrieval</source>
          ,
          <volume>11</volume>
          (
          <issue>3</issue>
          ):
          <fpage>251</fpage>
          –
          <fpage>265</fpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Agosti</surname>
          </string-name>
          and
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          .
          <article-title>A methodology for the automatic construction of a hypertext for information retrieval</article-title>
          .
          <source>In Proc. of ACM SAC</source>
          , pages
          <fpage>745</fpage>
          –
          <fpage>753</fpage>
          , Indianapolis, Indiana, United States,
          <year>1993</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>T.</given-names>
            <surname>Tsikrika</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Lalmas</surname>
          </string-name>
          .
          <article-title>Combining evidence for web retrieval using the inference network model: an experimental study</article-title>
          .
          <source>IP&amp;M</source>
          ,
          <volume>40</volume>
          (
          <issue>5</issue>
          ):
          <fpage>751</fpage>
          –
          <fpage>772</fpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lalmas</surname>
          </string-name>
          and
          <string-name>
            <given-names>I.</given-names>
            <surname>Ruthven</surname>
          </string-name>
          .
          <article-title>Representing and retrieving structured documents using the Dempster-Shafer theory of evidence: modelling and evaluation</article-title>
          .
          <source>Journal of Documentation</source>
          ,
          <volume>54</volume>
          :
          <fpage>529</fpage>
          –
          <fpage>565</fpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S. K. M.</given-names>
            <surname>Wong</surname>
          </string-name>
          and
          <string-name>
            <given-names>V. V.</given-names>
            <surname>Raghavan</surname>
          </string-name>
          .
          <article-title>Vector space model of information retrieval: a reevaluation</article-title>
          .
          <source>In Proc. of SIGIR '84</source>
          , pages
          <fpage>167</fpage>
          –
          <fpage>185</fpage>
          , Swinton, UK,
          <year>1984</year>
          . British Computer Society.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>G.</given-names>
            <surname>Salton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wong</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Yang</surname>
          </string-name>
          .
          <article-title>A vector space model for automatic indexing</article-title>
          .
          <source>Communications of the ACM</source>
          ,
          <volume>18</volume>
          (
          <issue>11</issue>
          ):
          <fpage>613</fpage>
          –
          <fpage>620</fpage>
          ,
          <year>1975</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>S.</given-names>
            <surname>Deerwester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. T.</given-names>
            <surname>Dumais</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. W.</given-names>
            <surname>Furnas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. K.</given-names>
            <surname>Landauer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Harshman</surname>
          </string-name>
          .
          <article-title>Indexing by latent semantic analysis</article-title>
          .
          <source>JASIS</source>
          ,
          <volume>41</volume>
          :
          <fpage>391</fpage>
          –
          <fpage>407</fpage>
          ,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>E.</given-names>
            <surname>Di Buccio</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Melucci</surname>
          </string-name>
          .
          <article-title>University of Padua at TREC 2009: Relevance Feedback Track</article-title>
          .
          <source>In Proc. of TREC 2009</source>
          , Washington, DC, USA,
          <year>2009</year>
          . To Appear.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>E.</given-names>
            <surname>Di Buccio</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Melucci</surname>
          </string-name>
          .
          <article-title>Towards a Methodology for Contextual Information Retrieval</article-title>
          .
          <source>In Proc. of CIRSE 2009</source>
          , Toulouse, France,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>C.J.</given-names>
            <surname>van Rijsbergen</surname>
          </string-name>
          .
          <source>The Geometry of Information Retrieval</source>
          . Cambridge University Press, New York, NY, USA,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>P.P.</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <article-title>The entity-relationship model: toward a unified view of data</article-title>
          .
          <source>ACM TODS</source>
          ,
          <volume>1</volume>
          (
          <issue>1</issue>
          ):9–
          <fpage>36</fpage>
          ,
          <year>1976</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>