=Paper= {{Paper |id=Vol-2621/CIRCLE20_04 |storemode=property |title=Proof of Concept and Evaluation of Eye Gaze Enhanced Relevance Feedback in Ecological Context |pdfUrl=https://ceur-ws.org/Vol-2621/CIRCLE20_04.pdf |volume=Vol-2621 |authors=Vaynee Sungeelee,Francis Jambon,Philippe Mulhem |dblpUrl=https://dblp.org/rec/conf/circle/SungeeleeJM20 }} ==Proof of Concept and Evaluation of Eye Gaze Enhanced Relevance Feedback in Ecological Context== https://ceur-ws.org/Vol-2621/CIRCLE20_04.pdf

Proof of concept and evaluation
of eye gaze enhanced relevance feedback
in ecological context
Vaynee Sungeelee, Francis Jambon, Philippe Mulhem
vaynee.sungeelee@etu.univ-grenoble-alpes.fr,Francis.Jambon@imag.fr,Philippe.Mulhem@imag.fr
Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG, 38000 Grenoble, France

ABSTRACT
The major method for evaluating Information Retrieval systems still relies nowadays on the “Cranfield paradigm”, supported by test collections. This sheds light on the fact that human behaviour is not considered central to Information Retrieval. For instance, some Information Retrieval systems that need user feedback to improve the relevance of results cannot be completely evaluated with classical test collections (since the interaction itself is not a part of the evaluation). Our goal is to work toward the integration of specific human behaviour in Information Retrieval. More precisely, we studied the impact of eye gaze analysis on information retrieval. The hypothesis is that acquiring the terms read by a user on the displayed result page may be beneficial for a relevance feedback mechanism, without any explicit intervention of the user. We have implemented a proof of concept which allows us to experiment with this new method of interaction with a search engine. The contributions of our work are twofold. First, the proof of concept we created shows that eye gaze enhanced relevance feedback information retrieval systems could be implemented and that its evaluation gives interesting results. Second, we propose the basis of an evaluation platform for Information Retrieval systems that takes into account user behaviour in ecological contexts.

CCS CONCEPTS
• Information systems → Query reformulation; Test collections; Users and interactive retrieval.

KEYWORDS
Relevance feedback, eye tracking, user behaviour, ecological context, proof of concept.

1 INTRODUCTION
One fundamental concern in Information Retrieval (IR) is the question of what makes documents relevant to an information need [15]. Since the 1970s, the major method for evaluating Information Retrieval systems, and therefore checking whether a system provides relevant documents, has relied heavily on the “Cranfield paradigm” [12], supported by test collections such as TREC¹. These collections consist of a set of documents, a set of queries, and assessments corresponding to relevance judgements. Queries are chosen and written by experts, and the relevance of documents is also evaluated by experts.

¹ https://trec.nist.gov

However, such an approach does not consider specific aspects related to humans (see [12]), and does not tackle Web searches:
• only the first few snippets (document excerpts) are really considered by a user looking at a Search Engine Result Page (SERP) [3];
• actual document relevance assessment by users is a sequential two-stage process: a user first looks at snippets, and then may consult the corresponding documents [15]. This is not really consistent with classical assessment, where experts go through full documents to check relevance;
• the behaviour of users changes and adapts to the quality of the search engine [3];
• a real life Web search usually does not consist of a single query, but is composed of a set of progressively manually refined queries [9].

Our goal here is to complement the classical evaluation of IR systems via test collections by adding some of the specifics of human behaviour to the evaluation method. Formally speaking, our objective is to search for human behaviour indicators that could have a positive or negative impact on the efficiency of search engines at large, and to promote their usage in addition to test collections.

To do so, we develop an original instrumented platform that mimics a classic Web search engine. Such a platform is configurable to work with research (i.e. Terrier) and commercial (i.e. Qwant) search engines. The platform could also be tuned to implement an ad-hoc snippet generator and relevance feedback engine. To analyse user behaviour, the platform could collect the user’s actions and his/her perceptions of the result page (via an off-the-shelf eye tracking system), at different levels of granularity. Moreover, the platform could be deployed simply, so as to allow user evaluation at a large scale in ecological context.

The concept of “ecological context” is widely used in research on the design and evaluation of user interfaces. For instance, [11] proposes the following definition: “the ecological context is a set of conditions for a user test experiment that gives it a degree of validity. An experiment with real users to possess ecological validity must use methods, materials, and settings that approximate the real-life situation that is under study.”

Our first implementation of this platform, described in this paper, is a mock-up of a search engine enhanced with eye gaze assisted relevance feedback. More specifically, the search engine analyses the user’s visual behaviour and tries to refine the user’s search intention. Such a specific IR system cannot be evaluated with test collections only, since the user’s feedback is a key element used by the IR system to improve the relevance of the documents returned.

"Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)."
Sungeelee, Jambon, Mulhem

This paper is structured in 5 sections. In this first section we provide an introduction to the general scientific motivation for the platform. Then, in the next section, we present the eye gaze enhanced relevance feedback use case. In the third section, we describe the design and implementation of the proof of concept. Next, the fourth section provides and discusses the results obtained after testing this proof of concept with user experiments. Finally, we conclude and propose future work in the fifth section.

2 EYE GAZE ENHANCED RELEVANCE FEEDBACK
In classical Web information retrieval systems, a user’s query is used to filter and sort a corpus of documents, returning a list of results ordered by decreasing relevance. This list of results is collectively called a Search Engine Results Page (SERP), where each result is usually composed of a summary of the corresponding document, called a Snippet.

As with any information retrieval system, a Web search engine does not provide only relevant results in a SERP. Reasons for this are numerous: polysemy of the query, indexing of documents, matching, etc. In addition, it can be difficult for users to phrase what they are looking for until they see the results [2]. Usually, Web retrieval systems do not provide simple ways for a user to give his/her feedback during the search process. To improve this, some Web search engines use the clicks on the displayed snippets in a SERP as relevance clues.

Such explicit user feedback can be used for relevance feedback, by modifying the initial query, with the expected consequence of improving the relevance of documents returned by the system. This use is consistent with the IR literature [10, 19], which indicates that user feedback tends to improve the overall quality of the search. However, asking for explicit feedback can be a burden to the user [6], especially in the context of Web search where users are looking for fast and simple interactions (i.e. by providing very short queries and by looking at very few results in SERPs).

Therefore, we believe that automatically interpreting user behaviour at a fine grain while the user reads a SERP, thus allowing precise implicit feedback, is a promising approach. More precisely, our hypothesis is that the analysis of user perceptions (via analysis of his/her eye movements) and actions while reading a SERP could be used to implement an effective relevance feedback mechanism. Our objective is to implement and validate this hypothesis in an ecological context.

Literature shows that the acquisition of eye movements is indeed an interesting means of personalising information retrieval. For instance, in a Critiquing-Based Recommender System, the study by Chen and Wang [7] shows the feasibility of inferring users’ feedback based on their eye movements. Buscher et al. [6] verified that analysing the display time of documents while scrolling provides valuable data for retrieval purposes. They also captured eye movements over single lines of text in documents. In another study [5] they show a relation between eye movement measures and user-perceived relevance of read text passages in documents. They also demonstrate the effect of using reading behaviour of documents as implicit relevance feedback. Their studies concluded that these methods can significantly improve retrieval performance. This raised the question of whether analysis on a finer grain (words instead of lines) could lead to even better results. More recently, Y. Chen et al. [8] proposed the analysis of documents at the word granularity and concluded that this level of analysis was in fact a good idea. However, as we said, they still deal with full documents and not snippets in SERPs.

Closer to our hypothesis is the work of Eickhoff et al. [9], in which users use a search engine to answer given questions, and reformulate their queries several times to refine them. Eye movement analysis showed that there is a close link between the words used to reformulate the query and those read in the SERP. However, the work of Eickhoff et al. has an explanatory purpose, i.e. the reformulation process is performed by users themselves, and there is no relevance feedback mechanism proposed.

Our current research draws elements from our previous work in Albarede et al. [1], which involved studying how eye gaze information could be used to provide a relevance feedback mechanism at the granularity of words in SERPs. We used the following assumptions: (1) the word read for the longest duration undergoes a deeper cognitive process; (2) the last word read before clicking on a document is the one which triggers the decision to select that document. These assumptions lead to the definition of two corresponding metrics: (1) the word read for the longest duration in a snippet, and (2) the last word read by users. We associated with each word of the SERP the notion of positive, neutral and negative feature that reflects the potential contribution of each term in the relevance feedback. In the experiment, users were asked to choose the most relevant snippet for given queries. We found that (a) the detection of terms read by users in snippets is sensitive to eye tracking system hardware performance and can be fairly precise with high end devices; (b) the last and, mostly, the longest read terms are relevant for assessing the relevance of a document; and (c) while positive terms could give interesting clues to the relevance of a document, the metrics that were proposed to use negative terms were inconclusive.

3 PROOF OF CONCEPT
Our previous research objective in Albarede et al. [1] was to identify useful metrics from the analysis of user behaviour with a SERP, and to derive optimal parameter settings for relevance feedback. Our present objective is to demonstrate, by a proof of concept, that a search engine enhanced by implicit relevance feedback driven by eye gaze analysis could be implemented and used in ecological context. Such an enhanced search engine is then used to experimentally evaluate whether the quality of a search engine is improved in the real world with the metrics and optimal setup we identified in [1]. To our knowledge, there is currently no working information retrieval system application described in the literature which implements an implicit relevance feedback mechanism assisted by the analysis of eye movements (at word level).

The core hypothesis of this Proof of Concept (PoC) is that eye gaze analysis can identify words to be used for the implicit relevance feedback mechanism. Therefore our objective is, on the one hand, to design and implement an application that mimics a Web search engine, and on the other hand, to analyse user interactions (actions and perceptions) with the search engine user interface as a means for implicit feedback to reformulate user queries. We
Proof of concept and evaluation of eye gaze enhanced relevance feedback...

Figure 1: Software architecture overview of the application.

implement metrics derived from [1] to study their usefulness in an almost real context of information retrieval usage.

In short, we aim at evaluating whether the relevance feedback assisted by the analysis of eye gaze increases the precision of the expanded query results in the real world. The implementation of such a PoC raised questions that are tackled in the following subsections.

3.1 Software architecture overview
To structure the application, we make use of the Seeheim software architecture model [17], a reference model in the Human-Computer Interface domain. The Seeheim model is a high-level model which was designed for single user systems with a graphical user interface. It leads to an application whose components (the graphical user interface and/or the functional core) can easily be replaced if a different implementation is required. The Seeheim model splits the application into 3 main parts: the User Interface, the Dialog Controller and the Application Interface. The User Interface manages the user inputs and the outputs of the application. The Dialog Controller is a mediator between the user (via the User Interface) and the functional core (via the Application Interface) and is responsible for defining the structure of the exchanges between them. The Application Interface defines the software interfaces with the processes that can be initiated by the Dialog Controller.

Separating the User Interface from the rest of the application protects the application from modifications of the User Interface (e.g. changing the display of snippets). Similarly, changes in the process (e.g. modification of the search engine) are hidden from the rest of the application by the Application Interface. Finally, evolutions of user interaction (e.g. modification of the indicators used for relevance feedback) do not require significant changes to the rest of the application, since the Dialog Controller and the Application Interface live in their own separate components.

The PoC is implemented in Java as a standalone application. We have chosen not to develop a plugin in a Web browser to remain independent of the evolutions of Web browsers. The software architecture of the application (see fig. 1) is composed of the 3 main components of the Seeheim model: (1) the User Interface, (2) the Dialog Controller, and (3) the Application Interface. The User Interface is associated with an Eye Tracking Analysis component whose role is to manage the eye gaze analysis. The Dialog Controller is associated with two separate modules: the Calculation of Indicators component, whose role is to refine the query, and the Data Collection component, which stores the data (user actions and perceptions, queries, results) exchanged between the User Interface and the Application Interface in a log file for analysis purposes. These modules are detailed in the following subsections.

3.2 User Interface and Eye Tracking Analysis modules
The User Interface aims at mimicking classical Web search engines: this interface features a text field for the query and a "Search" button. Once the query is processed by the IR system, a SERP (composed of snippets) is displayed, with a "Refine" button at the right of each snippet (see fig. 2). For its implementation, we used the Java Swing graphical widget library.

The Eye Tracking Analysis module aims at detecting the zones viewed by the user. Any graphical element of the user interface, i.e. any widget, could potentially be defined as a zone. In our current implementation, zones only refer to text elements of the result page, i.e. snippets and their titles. Since the indicators are based on the semantics of the text, document URLs were excluded. Zones are defined as rectangles around one word or a list of words (e.g. an entire snippet). Each word zone is represented by a bounding box around the word. For a list of words, the zone is defined as the union of the words it contains. The Eye Tracking Analysis module tracks these zones, named Areas of Interest, in the SERP displayed by the User Interface (see fig. 3), and detects when the user's eye gaze is inside one of these zones. The Eye Tracking Analysis module is

Figure 2: Simulated search engine Web page user interface.

Figure 3: Areas Of Interest defined around words of figure 2.
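
The Areas of Interest of Figure 3 can be illustrated with a short hit-test sketch. It is only an illustration under stated assumptions: the PoC itself is written in Java, Python is used here for brevity, and the names `Rect`, `WordZone`, `zone_at` and `in_word_list`, as well as the pixel coordinates, are hypothetical and not taken from the application.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Rect:
    """Axis-aligned bounding box in screen pixels (hypothetical layout)."""
    x: float
    y: float
    width: float
    height: float

    def contains(self, px: float, py: float) -> bool:
        """True when the point (px, py) lies inside or on the box border."""
        return (self.x <= px <= self.x + self.width
                and self.y <= py <= self.y + self.height)


@dataclass(frozen=True)
class WordZone:
    """One word of a snippet or title together with its bounding box."""
    word: str
    box: Rect


def zone_at(zones: List[WordZone], px: float, py: float) -> Optional[WordZone]:
    """Return the first word zone that contains the gaze point, or None."""
    for zone in zones:
        if zone.box.contains(px, py):
            return zone
    return None


def in_word_list(zones: List[WordZone], px: float, py: float) -> bool:
    """A list-of-words zone (e.g. a whole snippet) is the union of its word boxes."""
    return zone_at(zones, px, py) is not None


# Two adjacent words of one snippet; coordinates are made up for the example.
snippet = [
    WordZone("buddhist", Rect(100, 200, 80, 20)),
    WordZone("monks", Rect(185, 200, 60, 20)),
]
hit = zone_at(snippet, 190, 210)  # gaze sample falling on the second word
```

With an eye tracker whose on-screen error is close to the height of a text line, a real implementation would probably need some tolerance around each box; the sketch deliberately leaves that out.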

also responsible for information exchange with the Eye Tracking System and with the Dialog Controller module. For exchanging messages between the Eye Tracking System and the application, we used the Usybus library, which is based on the Ivy library. Usybus is a framework that associates a type with messages so that devices know which messages to subscribe to, and Ivy² is a middleware which facilitates data exchange between applications on a network.

² https://www.eei.cena.fr/products/ivy/

3.3 Dialog Controller, Calculation of Indicators and Data Collection modules
The Dialog Controller, as specified in the Seeheim model, has a central role in the application operation. This component is in charge of managing the sequence of communications between the User Interface and the Application Interface during query/results operations and is responsible for the dissemination of information to other modules.

The Calculation of Indicators component implements the algorithm for the estimation of the metrics. These metrics are used to determine the new query when the user requests a refinement of the current results. We created an abstract class to manage the metric to be used. Subclasses are implemented for each metric. The name of the metric to use is provided as a String object, possibly from a configuration file, and the corresponding class is initialised in the Dialog Controller when the application is launched. This allows better management of multiple metrics and makes them easier to test.

The Data Collection component is in charge of recording data for the experiments. In order to facilitate the testing process further, we

can trace the program execution by creating two log files. The first log file keeps track of requests as they evolve during the refining mechanism of the application. The purpose of this log is to keep a trace of expanded queries to analyse them with information retrieval evaluation measures. The second log file records viewed words and user actions. The latter contains a comprehensive view of the application execution and can be consulted for debugging purposes.

3.4 Application Interface and Information Retrieval System
The Application Interface component is responsible for establishing the network connection to the Information Retrieval System and contains data structures to represent the query and the corresponding results. This structure contains the query, with each result consisting of a document id, a title, a URL and a snippet. Data exchange between these modules is in XML defined by an internal DTD declaration.

We use the Terrier V4.0³ [14] information retrieval platform, an open source platform which we adapted with Python and Perl add-ons to retrieve queries, generate snippets from documents and get back the response constituting the SERP in XML format. The snippet generation (see Algorithm 1) is inspired by [4, 18]: it consists in finding the text window of size lmax, from a document doc, that contains the largest number of query terms wqset. This generator assumes that the interest of a snippet only depends on the occurrences of query terms. Other additional elements, like the topical link between document words and the query or the impact of the snippet for query disambiguation, may be used in the future.

³ http://terrier.org

Algorithm 1: Snippet generator (simplified).
  Data: document source text: doc;
        query terms set: wqset;
        length max of snippet: lmax.
  Result: The excerpt for the document source
  initialization: wdoc ← split doc in words
  p ← 0
  curr_mscore ← 0
  mp ← 0
  while p < length(wdoc)-lmax do
      curr_score ← sum of query words occurrences in wdoc[p, p+lmax-1]
      if curr_mscore < curr_score then
          curr_mscore ← curr_score
          mp ← p
      end
      p++
  end
  Return wdoc[mp,min(length(doc),mp+lmax)]

4 USER EXPERIMENTS
The objectives of these user experiments are twofold. First, we aim to make sure that the experimental configuration is technically effective, namely that the words are correctly detected by the eye movements analysis (i.e. "Functional tests"). Secondly, we want to evaluate whether the relevance feedback mechanism we propose effectively increases the relevance of the query results (i.e. "Relevance tests").

4.1 Experimental Setup
The experimental setup (see fig. 4) is composed of a classic desktop computer configuration with central unit, monitor, keyboard and mouse. The eye tracker device is attached under the screen, and does not have any impact on the natural interaction of the user with the search engine. Our PoC simulates modern Web search engine user interfaces to give users a real search engine experience. It is important to provide such an ecological setting to be able to analyse user behaviour with this new IR system.

Figure 4: Experimental setup showing the monitor with the eye tracker device attached under the screen.

The query results are displayed as in typical SERPs, with a clickable URL which allows the user to consult documents (see fig. 2). The text of the SERP is in Arial 15 font as it was shown to allow good reading by the user and good detection by the eye tracking analysis [1]. The display parameters, such as font-style, font-size and text colours, are customizable. We also provide a refining mechanism per snippet by adding a "Refine" button next to each snippet. After scanning the SERP, the user can identify a relevant snippet and refine his/her original query by clicking on "Refine". In our case, this adds a relevant word to the original query in the search bar and automatically relaunches the search. This relevant word is chosen based on a configurable metric, for instance: the longest fixation in the selected snippet, or the last fixation in the SERP.
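
The "Refine" step and the configurable metric can be sketched as follows. This is a minimal illustration, not the PoC code (which is in Java): the `Fixation` record, the class names, the `METRICS` registry and the metric names are hypothetical, and "longest fixation" is read here as the longest single fixation, whereas a cumulative reading time per word would be an equally plausible interpretation.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List, Type


@dataclass(frozen=True)
class Fixation:
    """One detected fixation on a word (hypothetical log record)."""
    word: str
    snippet_id: int
    start_ms: int
    duration_ms: int


class Metric(ABC):
    """Base class: picks the word to add to the query from the gaze data."""

    @abstractmethod
    def pick(self, fixations: List[Fixation], selected_snippet: int) -> str:
        ...


class LongestFixation(Metric):
    """Word with the longest single fixation inside the selected snippet."""

    def pick(self, fixations: List[Fixation], selected_snippet: int) -> str:
        in_snippet = [f for f in fixations if f.snippet_id == selected_snippet]
        # max() raises ValueError when no fixation was recorded in the snippet.
        return max(in_snippet, key=lambda f: f.duration_ms).word


class LastFixation(Metric):
    """Word fixated last anywhere in the SERP before the click."""

    def pick(self, fixations: List[Fixation], selected_snippet: int) -> str:
        return max(fixations, key=lambda f: f.start_ms).word


# The metric is looked up by name, e.g. a string read from a configuration file.
METRICS: Dict[str, Type[Metric]] = {
    "longest_fixation": LongestFixation,
    "last_fixation": LastFixation,
}


def refine(query: str, metric_name: str,
           fixations: List[Fixation], selected_snippet: int) -> str:
    """Append the word chosen by the configured metric to the original query."""
    metric = METRICS[metric_name]()
    return f"{query} {metric.pick(fixations, selected_snippet)}"


# Made-up gaze log: two fixations in snippet 0, then one in snippet 1.
gaze_log = [
    Fixation("monks", 0, 0, 180),
    Fixation("automobile", 0, 200, 420),
    Fixation("school", 1, 700, 250),
]
expanded = refine("ceremonial suicides", "longest_fixation", gaze_log, 0)
```

Keeping each metric behind a common base class and selecting it by name means a new metric only needs a new subclass and a registry entry, which is the benefit Section 3.3 attributes to the abstract class.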

For the eye tracker device, we opted for the Eye Tribe ET1000⁴, which is a low cost and popular device for human-computer interaction experiments. It has a sampling frequency of 30 Hz and an average accuracy between 0.5° and 1° of visual angle, which corresponds to an on-screen average error of 0.5 to 1 cm if the user sits about 60 cm away from the monitor. It has an acceptable precision for fixation analyses provided it is properly calibrated and tested in a proper setup. We used a 1280x1024 resolution 19" LCD monitor. The participants were instructed to keep a fixed position for best results and maintained a distance of 60-70 cm from the monitor during the experiment. The calibration error for participants varied between 0.37° and 0.48°, an acceptable range for reading research, with 0.5° being the maximum acceptable value [13].

⁴ https://theeyetribe.com

The corpus of documents used is the TIME⁵ collection, which consists of queries on articles from Time magazine. This collection is rather small, but adequate for a proof of concept. We experimented with the following indicator: the longest fixation duration in the selected snippet. This indicator was found to be effective by [1], enabling up to 87% of success in identifying positive words.

⁵ http://ir.dcs.gla.ac.uk/resources/test_collections/time

For the experiments, the task protocol is as follows: each participant is asked to run two different queries (chosen from a set of three possible queries); then, for each query, he/she looks through the SERP and chooses the snippet containing information he/she considers relevant to the query. The duration of an experiment is about 10 minutes per participant. The results of the experiments are then analysed through the logs recorded by the Data Collection module.

We conducted experiments with a total of 9 participants. Due to eye tracking device limits (the device fails to detect eyes), only 7 of the participants were retained for analysis. We agree that, for now, the small number of participants does not allow us to conclude definitively on the question of whether or not eye tracking improves the relevance feedback mechanism of search engines, but it is adequate for a first evaluation of this proof of concept.

4.2 Functional tests
The purpose of this first experiment was to evaluate the eye tracker's ability to correctly detect the words users gazed at. Users were told to pose the query, and then to search the SERP for a specific word (the target). Once they found it, they were advised to click immediately on the "Refine" button. Since searching for a specific word involves a cognitive effort, this protocol tends to simulate the activity of a user looking for words in snippets that could help him/her to assess the relevance of documents. The results (binary values) indicate whether or not the target was detected. For this experiment, the Calculation of Indicators module correctly detected the target for 3 out of 7 participants.

These limited results could be explained in two ways. From a technical point of view, the eye tracker device used, the Eye Tribe ET1000, is known to perform differently for a variety of participants and environmental conditions. For instance, the device is very sensitive to light conditions. Moreover, the tracking box, i.e. the area in which the user’s head must stay to allow the detection of his/her eyes, is fairly small (30x40 cm). So, even though the participants usually sat attentively during the whole calibration, they could have shifted position unconsciously during the actual experiments. This may have caused loss of gaze tracking. In addition, from a behavioural point of view, some participants identified the target word and clicked on the "Refine" button before they had finished reading the word. In that case, this counteracted the metric used and the Calculation of Indicators module did not return the correct word. However, even if they are not particularly good, these results are fairly consistent with our previous findings [1], in which 3 out of 6 correct words were detected with this eye tracker device.

4.3 Refinement tests
In order to evaluate the relevance of the expanded queries generated thanks to eye movement analysis, we used information retrieval evaluation measures based on recall and precision. The goal of this second experiment was to verify whether better results could be obtained with the expanded queries.

After posing the query, participants were asked to judge the most relevant result pertaining to their information need on the SERP. To do so, they had to look at one word which helped identify the most relevant result in a given snippet, and then click on the corresponding "Refine" button. Each user query is then expanded with the term which received the longest attention, gathered by implicit feedback. In case this term is already present in the original query, the next best term is considered.

To evaluate the performance of a query with respect to its initial performance before expansion, we use the following evaluation measures [16]: Precision at 5 (P@5), Precision at 10 (P@10) and Reciprocal Rank (RRank), each evaluating a different aspect of search engine performance. Thus, to compare relevance scores for the expanded query with the initial query, we calculate the above-mentioned measures and verify which ones yield an improvement.

P@5 corresponds to the proportion of relevant documents among the first 5 documents and P@10 corresponds to the proportion of relevant documents among the first 10: Precision@k = (# of results @k that are relevant) / k. The reciprocal rank is a statistical measure which takes the order of correctness into account and evaluates the result lists of a sample of queries. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer: 1 for first place, 1/2 for second place, 1/3 for third place and so on. We do not use measures such as Average Precision (AP) or Mean Average Precision (MAP), as they are not appropriate in our case because we only focus on the top results.

The results obtained are detailed for each of the three queries selected in table 1. The initial scores of each query (without expansion) are given on the first line for P@5, P@10 and Reciprocal Rank, followed by their corresponding scores after expansion with the given term. A (+) sign next to each score denotes an improvement compared to the initial query’s corresponding score. Similarly, (-) indicates a decrease and (=) indicates that the score has not changed.

As stated before, even if our experiments do not consider a large enough number of events to conclude, we believe that these results give interesting clues about the expected performance of this relevance feedback mechanism: 4 out of 7 experiments show improvements

Query                                    Term added       P@5          P@10         RRank
Baath party                              -                0.0000       0.0000       0.0164
                                         settle           0.0000 (=)   0.0000 (=)   0.0244 (+)
                                         self-isolation   0.0000 (=)   0.0000 (=)   0.0227 (+)
U.S. policy toward South Viet Nam        -                0.0000       0.1000       0.1429
                                         conference       0.0000 (=)   0.1000 (=)   0.1000 (-)
                                         Military         0.0000 (=)   0.0000 (-)   0.0714 (-)
                                         misinformed      0.0000 (=)   0.1000 (=)   0.1429 (=)
Ceremonial suicides of buddhists monks   -                0.0000       0.0000       0.0227
                                         automobile       0.2000 (+)   0.1000 (+)   0.3333 (+)
                                         school           0.2000 (+)   0.1000 (+)   0.2000 (+)

Table 1: Precision@5, Precision@10, Reciprocal Rank scores before/after expansion; A (+) sign denotes an improvement compared to the initial query’s corresponding score, (-) indicates a decrease, and (=) indicates that the score has not changed.
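
The measures reported in Table 1 can be computed with a few lines of code. The sketch below is an illustration only: the function names and the document identifiers are made up, and the example ranking is merely shaped so that its scores match the pattern of the "automobile" row (a single relevant document, found at rank 3).

```python
from typing import Sequence, Set


def precision_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int) -> float:
    """Precision@k = (number of relevant results among the first k) / k."""
    top_k = ranked_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant) / k


def reciprocal_rank(ranked_ids: Sequence[str], relevant: Set[str]) -> float:
    """Inverse of the rank of the first relevant result, 0.0 if there is none."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical result list d1..d10 with one relevant document at rank 3.
ranking = [f"d{i}" for i in range(1, 11)]
relevant_docs = {"d3"}

p5 = precision_at_k(ranking, relevant_docs, 5)
p10 = precision_at_k(ranking, relevant_docs, 10)
rr = reciprocal_rank(ranking, relevant_docs)
```

Because all three measures only look at the top of the ranking, moving the first relevant document by a few positions can change RRank without changing P@5 or P@10, which is the pattern visible in several rows of Table 1.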

after query expansion for at least one of the scores; and 2 out of 7 experiments show a
degradation in performance.

We also note that the results are not uniform across the queries. The third query expansion
has a positive impact on all measures, which is not the case for the others. These findings
are not really a surprise in the domain, in which many elements impact the quality of the
result. It therefore seems that the query matters strongly, possibly because of the query
topic, the query formulation, the snippet generation, the nature of the documents, etc. We do
not have enough data here to clearly identify the element(s) that cause this disparity.

4.4 Comparison with our previous research presented in Albarede et al. 2019 [1]
In our previous research [1], we obtained significantly higher results: up to 87% of the words
looked at in snippets were positive words, whereas in this PoC we obtained improvements in
relevance after query expansion for only 4 out of 7 experiments (57%). However, these results
should not be compared directly.

First of all, the element of comparison is not the same in the two approaches. In [1] we
compared detections of positive terms, whereas for this PoC we have compared improvements in
relevance after query expansion. The two results could nevertheless be compared if we assume
that a positive word added to a query systematically improves the relevance of the results
after query expansion. This is probably not always true but, conversely, non-positive words
may also improve the relevance of the results after query expansion.

Moreover, the eye tracking device used in the two approaches is different. In [1] we used a
Tobii Pro X3-120⁶ device, a professional class device, while for the PoC we use an Eye Tribe
ET1000 device, a consumer class device. We chose the latter for the PoC because this low cost
eye tracker represents a class of devices that could be used by end-users in an ecological
context.

Since in [1] we compared these two devices at a functional level in a preliminary experiment,
it is possible to extrapolate from those results what would have been obtained if the less
efficient eye tracker had been used to detect positive words. If we assume that a positive
word will actually improve the query, we can check whether the results of [1] match our PoC
results.

In [1], preliminary experiments showed that the ET1000 could detect a correct word for 3 out
of 6 participants, whereas the X3-120 could do so for 5 out of 6. To compute the percentage of
positive words that could be expected for the ET1000, we multiply the results obtained for the
X3-120 by the ratio of the detection performance of the ET1000 to the detection performance of
the X3-120, both obtained from the first experiment of [1]:

\[
\begin{aligned}
\mathrm{Res}(\mathrm{ET1000})
  &= \mathrm{Res}_{[\text{Albarede et al. 2019}]}(\text{X3-120}) \times
     \frac{\mathrm{Detect}_{[\text{Albarede et al. 2019}]}(\mathrm{ET1000})}
          {\mathrm{Detect}_{[\text{Albarede et al. 2019}]}(\text{X3-120})} \\
  &= 87\% \times \frac{3/6}{5/6} \\
  &\approx 52\%
\end{aligned}
\]

This extrapolation gives 52% of correct word detection for the Eye Tribe, which is close to
the result we actually obtain (4/7 ≈ 57%, see the subsection on Functional Tests). Even if we
do not have enough data to draw a definitive conclusion, we estimate (subject to the
limitations of our assumptions) that our results are consistent with our previous research [1].

As a corollary, with a more efficient eye tracking device, around 87% of positive words could
probably be obtained and, as a consequence, better performance of eye gaze enhanced relevance
feedback could be achieved.

6 https://www.tobiipro.com/product-listing/tobii-pro-x3-120/

5 CONCLUSION AND FUTURE WORK
In this paper, we presented the implementation and the user evaluation of a proof of concept
for a novel search engine enhanced with eye gaze assisted relevance feedback. To our
knowledge, there is no similar implementation in the literature. We showed that: (a) there is
a potential benefit in using eye gaze analysis as an implicit relevance feedback method; (b)
the results we obtained are consistent with those we previously obtained in Albarede et al.
[1], and so better performance could be expected in the future. Because of the limited number
of participants (7 users) and tasks (2 out of 3 queries per participant), the results of this
study are indicative only,
but we noted a tendency for the number of relevant documents to increase after query expansion.

Our prototype and the experimental setup suffer from some limitations that could have
negatively interfered with the results we obtained. One technical limitation was that the
participant had to keep his/her head relatively still to get good eye gaze detection with the
eye tracking device we used. While this might not be a realistic setup for search engine use
in an ecological context, the experiments showed that word detection is possible and could
probably be significantly improved with more robust (better tracking box) and more precise
(better angular precision) eye tracking devices. In a future implementation, we will consider
replacing the Eye Tribe ET1000 with a more powerful eye tracker device, such as the Tobii Pro
X3-120, to yield better results.

Another limitation deals with the precision of eye gaze tracking. Most devices have an average
accuracy between 0.5° and 1° of visual angle, which corresponds to an average on-screen error
of 0.5 to 1 cm if the user sits about 60 cm away from the monitor. For short words displayed
at the usual character size, this spatial accuracy does not make it possible to distinguish
between two short words. This limitation has no simple answer since it is mostly linked to the
size of the human fovea. Increasing the screen and character size could be a solution, but
such changes would alter the ecological validity of the context. This is why a certain degree
of uncertainty remains in the detection of the words read by users, and this aspect must be
taken into account when using this technique.

In addition, the corpus of documents used (the TIME collection) was not ideal for user tests,
given that it covers only old historical events: users might not have been able to understand
the context of these information needs. In a future user experiment, it would thus be
desirable to have a collection of a more general and recent nature.

It will also be interesting to study and implement other metrics to test their usefulness in
different information search contexts. Another possible pathway worth exploring would be
testing new snippet generators, e.g. the generators provided by Terrier V5 or Apache Lucene⁷,
as the words in documents selected by the snippet generator may have a significant impact on
the words that could be viewed by users and, as a consequence, on the metrics used.

Another experimental track could be to explore situations even closer to usual user searches
on the Web, for instance when a user makes multiple queries for the same information need.

We proposed here the basis of a modular platform for the evaluation of information retrieval
systems that takes into account both user behaviour and classical test collections. Ideally,
if classical search engines provided standardised SERPs, the platform would be usable with any
engine. A more concrete way to integrate existing systems will be to provide tunable wrappers
that adapt easily to any search engine. Extensions could integrate task oriented sequences of
queries (with document display tracking) so that other features may be provided.

ACKNOWLEDGMENTS
The research presented in this article was partly funded by the Gelati Emergence project of
the Grenoble Informatics Laboratory (UMR 5217).

REFERENCES
[1] Lucas Albarede, Francis Jambon, and Philippe Mulhem. 2019. Exploration de l’apport de l’analyse des perceptions oculaires : étude préliminaire pour le bouclage de pertinence. In COnférence en Recherche d’Informations et Applications - CORIA 2019, 16th French Information Retrieval Conference. Lyon, France, May 25-29, 2019. Proceedings. https://doi.org/doi:10.24348/coria.2019.CORIA_2019_paper_1
[2] Hiteshwar Kumar Azad and Akshay Deepak. 2019. Query expansion techniques for information retrieval: a survey. Information Processing & Management 56, 5 (2019), 1698–1735.
[3] Ricardo Baeza-Yates. 2018. Bias on the Web. Commun. ACM 61, 6 (May 2018), 54–61. https://doi.org/10.1145/3209581
[4] Lorena Leal Bando, Falk Scholer, and Andrew Turpin. 2010. Constructing query-biased summaries: a comparison of human and system generated snippets. In IIiX.
[5] Georg Buscher, Andreas Dengel, Ralf Biedert, and Ludger V. Elst. 2012. Attentive Documents: Eye Tracking As Implicit Feedback for Information Retrieval and Beyond. ACM Trans. Interact. Intell. Syst. 1, 2 (Jan. 2012), 9:1–9:30. https://doi.org/10.1145/2070719.2070722 http://gbuscher.com/publications/BuscherDengel12_AttentiveDocuments.pdf.
[6] Georg Buscher, Ludger Van Elst, and Andreas Dengel. 2009. Segment-level display time as implicit feedback: a comparison to eye tracking. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval. ACM, 67–74.
[7] Li Chen and Feng Wang. 2016. An Eye-Tracking Study: Implication to Implicit Critiquing Feedback Elicitation in Recommender Systems. In Proceedings of the 2016 Conference on User Modeling Adaptation and Personalization (UMAP ’16). Association for Computing Machinery, Halifax, Nova Scotia, Canada, 163–167. https://doi.org/10.1145/2930238.2930286
[8] Yongqiang Chen, Peng Zhang, Dawei Song, and Benyou Wang. 2015. A real-time eye tracking based query expansion approach via latent topic modeling. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. ACM, 1719–1722.
[9] Carsten Eickhoff, Sebastian Dungs, and Vu Tran. 2015. An eye-tracking study of query reformulation. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 13–22.
[10] Liana Ermakova and Josiane Mothe. 2016. Document re-ranking based on topic-comment structure. In 2016 IEEE Tenth International Conference on Research Challenges in Information Science (RCIS). IEEE, 1–10.
[11] Francesco Bellotti, Riccardo Berta, Alessandro De Gloria, and Massimiliano Margarone. 2008. Widely Usable User Interfaces on Mobile Devices with RFID. In Handbook of Research on User Interface Design and Evaluation for Mobile Technology, Joanna Lumsden (Ed.). IGI Global, Hershey, PA, USA, 657–672. https://doi.org/10.4018/978-1-59904-871-0.ch039
[12] Donna Harman. 2010. Is the Cranfield Paradigm Outdated?. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval (Geneva, Switzerland) (SIGIR ’10). ACM, New York, NY, USA, 1–1.
[13] Kenneth Holmqvist, Marcus Nyström, Richard Andersson, Richard Dewhurst, Halszka Jarodzka, and Joost Van de Weijer. 2011. Eye tracking: A comprehensive guide to methods and measures. OUP Oxford.
[14] Craig Macdonald, Richard McCreadie, Rodrygo LT Santos, and Iadh Ounis. 2012. From puppy to maturity: Experiences in developing Terrier. Proc. of OSIR at SIGIR (2012), 60–63.
[15] Stefano Mizzaro. 1997. Relevance: The whole history. Journal of the American Society for Information Science 48, 9 (1997), 810–832.
[16] IC Mogotsi, Christopher D Manning, Prabhakar Raghavan, and Hinrich Schütze. 2010. Introduction to information retrieval. Information Retrieval 13, 2 (2010), 192–195.
[17] Günther E Pfaff et al. 1985. User interface management systems. Vol. 1. Springer.
[18] Anastasios Tombros and Mark Sanderson. 1998. Advantages of Query Biased Summaries in Information Retrieval. 2–10. https://doi.org/10.1145/290941.290947
[19] ChengXiang Zhai and Sean Massung. 2016. Text data management and analysis: a practical introduction to information retrieval and text mining. Morgan & Claypool.
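
The extrapolation of Section 4.4 can be checked numerically. The short Python sketch below reproduces it with exact fractions, using only the figures quoted in the text (87%, 3 out of 6, 5 out of 6, and 4 out of 7). The function and variable names are ours and purely illustrative, and the computation inherits the assumption of Section 4.4 that a positive word always improves the query.

```python
from fractions import Fraction


def extrapolate(res_reference: Fraction,
                detect_target: Fraction,
                detect_reference: Fraction) -> Fraction:
    """Scale a result obtained on a reference device by the ratio of
    detection performances (target device / reference device)."""
    return res_reference * detect_target / detect_reference


res_x3_120 = Fraction(87, 100)   # positive words looked at with the X3-120 in [1]
detect_et1000 = Fraction(3, 6)   # participants with a correct word, ET1000
detect_x3_120 = Fraction(5, 6)   # participants with a correct word, X3-120

expected_et1000 = extrapolate(res_x3_120, detect_et1000, detect_x3_120)
observed_poc = Fraction(4, 7)    # experiments improved by query expansion

print(f"extrapolated: {float(expected_et1000):.1%}")  # extrapolated: 52.2%
print(f"observed:     {float(observed_poc):.1%}")     # observed:     57.1%
```

With so few participants, the gap of about five percentage points between the two values should be read as compatible with the extrapolation rather than as a confirmation of it.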

7 https://lucene.apache.org
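
The correspondence stated in Section 5 between angular accuracy and on-screen error can be verified with basic trigonometry. The sketch below assumes a flat screen viewed head-on and uses the approximation error = distance × tan(angle); the function name and the 60 cm default are illustrative choices, not part of the prototype.

```python
import math


def on_screen_error_cm(angle_deg: float, distance_cm: float = 60.0) -> float:
    """On-screen error (cm) for a given angular error, assuming the gaze
    direction is perpendicular to a flat screen at the given distance."""
    return distance_cm * math.tan(math.radians(angle_deg))


for angle in (0.5, 1.0):
    error = on_screen_error_cm(angle)
    print(f"{angle:.1f} deg at 60 cm -> {error:.2f} cm")
```

At 60 cm this gives roughly 0.52 cm for 0.5° and 1.05 cm for 1°, in line with the 0.5 to 1 cm range quoted in the text.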