=Paper= {{Paper |id=Vol-2621/CIRCLE20_04 |storemode=property |title=Proof of Concept and Evaluation of Eye Gaze Enhanced Relevance Feedback in Ecological Context |pdfUrl=https://ceur-ws.org/Vol-2621/CIRCLE20_04.pdf |volume=Vol-2621 |authors=Vaynee Sungeelee,Francis Jambon,Philippe Mulhem |dblpUrl=https://dblp.org/rec/conf/circle/SungeeleeJM20 }} ==Proof of Concept and Evaluation of Eye Gaze Enhanced Relevance Feedback in Ecological Context== https://ceur-ws.org/Vol-2621/CIRCLE20_04.pdf
                                    Proof of concept and evaluation
                               of eye gaze enhanced relevance feedback
                                         in ecological context
                                         Vaynee Sungeelee, Francis Jambon, Philippe Mulhem
                  vaynee.sungeelee@etu.univ-grenoble-alpes.fr,Francis.Jambon@imag.fr,Philippe.Mulhem@imag.fr
                             Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG, 38000 Grenoble, France

ABSTRACT                                                                                However, such an approach does not consider specific aspects
The major method for evaluating Information Retrieval systems                        related to humans (see [12]), and does not tackle Web searches:
still relies nowadays on the “Cranfield paradigm", supported by
test collections. This sheds light on the fact that human behaviour                      • only the first few snippets (document excerpts) are really
is not considered central to Information Retrieval. For instance,                          considered by a user looking at a Search Engine Result Page
some Information Retrieval systems that need user feedback to                              (SERP) [3];
improve result relevance cannot be completely evaluated with                             • actual document relevance assessment by users is a sequen-
classical test collections (since the interaction itself is not a part                     tial two-stage process: a user first looks at snippets, and then
of the evaluation). Our goal is to work toward the integration of                          may consult the corresponding documents [15]. This is not
specific human behaviour in Information Retrieval. More precisely,                         really consistent with classical assessment, where experts
we studied the impact of eye gaze analysis on information retrieval.                       are passing through full documents to check relevance;
The hypothesis is that acquiring the terms read by a user on the                         • the behaviour of users changes and adapts to the quality of
result page displayed may be beneficial for a relevance feedback                           the search engine [3];
mechanism, without any explicit intervention of the user. We have                        • a real life Web search usually does not consist of a single
implemented a proof of concept which allows us to experiment with                          query, but is composed of a set of progressively manually
this new method of interaction with a search engine. The contribu-                         refined queries [9].
tions of our work are twofold. First, the proof of concept we created
                                                                                         Our goal is here to complement classical IR systems evaluation
shows that eye gaze enhanced relevance feedback information re-
                                                                                     via test collections, by adding some of the specifics of human be-
trieval systems could be implemented and that its evaluation gives
                                                                                     haviour to the evaluation method. Formally speaking, our objective
interesting results. Second, we propose the basis of an evaluation
                                                                                     is to search for human behaviour indicators that could have a posi-
platform for Information Retrieval systems that take into account
                                                                                     tive or negative impact on the efficiency of search engines at large,
user behaviour in ecological contexts.
                                                                                     and to promote their usage in addition to test collections.
                                                                                         To do so, we develop an original instrumented platform that
CCS CONCEPTS                                                                         mimics a classic Web search engine. Such a platform is configurable
• Information systems → Query reformulation; Test collec-                            to work with research (e.g. Terrier) and commercial (e.g. Qwant)
tions; Users and interactive retrieval.                                              search engines. The platform could also be tuned to implement an
KEYWORDS                                                                             ad hoc snippet generator and relevance feedback engine. To analyse
Relevance feedback, eye tracking, user behaviour, ecological con-                    user behaviour, the platform could collect the user’s actions and his/her
text, proof of concept.                                                              perceptions –via an off-the-shelf eye tracking system– of the result
                                                                                     page, at different levels of granularity. Moreover, the platform could
                                                                                     be deployed simply, in a way to allow user evaluation at a large
1    INTRODUCTION                                                                    scale in ecological context.
One fundamental concern in Information Retrieval (IR) raises the                         The concept of “ecological context" is widely used in research
question: what makes documents relevant to an information need                       on the design and evaluation of user interfaces. For instance [11]
[15]. Since the 70’s, the major method for evaluating Information                    proposes the following definition: “the ecological context is a set of
Retrieval systems, and therefore checking if a system provides                       conditions for a user test experiment that gives it a degree of validity.
relevant documents, relies heavily on the “Cranfield paradigm"                       An experiment with real users to possess ecological validity must
[12], supported by test collections such as TREC1 . These collections                use methods, materials, and settings that approximate the real-life
consist of a set of documents, a set of queries, and assessments                     situation that is under study."
corresponding to relevance judgements. Queries are chosen and                            Our first implementation of this platform –described in this
written by experts, and the relevance of documents is also                           paper– is a mock-up of a search engine enhanced with eye gaze
evaluated by experts.                                                                assisted relevance feedback. More specifically, the search engine
                                                                                     analyses user visual behaviour and tries to refine the user’s search intention.
                                                                                     Such a specific IR system cannot be evaluated with test collections
     "Copyright © 2020 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0)."                          only since the user’s feedback is a key element used by the IR
   1 https://trec.nist.gov                                                           system to improve the relevance of the documents returned.
                                                                                                                       Sungeele, Jambon, Mulhem


   This paper is structured in 5 sections. In this first section we      raised the question of whether analysis on a finer grain –words
provide an introduction to the general scientific motivation for         instead of lines– could lead to even better results. More recently,
the platform. Then, in the next section, we present the eye gaze         Y. Chen et al. [8] proposed the analysis of documents at the word
enhanced relevance feedback use case. In the third section, we           granularity and concluded that this level of analysis was in fact a
describe the design and implementation of the proof of concept.          good idea. However, as we said, they still deal with full documents
Next, the fourth section provides and discusses the results obtained     and not snippets in SERPs.
after testing this proof of concept with user experiments. Finally,         Closer to our hypothesis is the work of Eickhoff et al. [9], in
we conclude and propose future work in the fifth section.                which users use a search engine to answer given questions, and
                                                                         reformulate their queries several times to refine them. Eye move-
                                                                         ments analysis showed that there is a close link between the words
2   EYE GAZE ENHANCED                                                    used to reformulate the query, and those read in the SERP. However,
    RELEVANCE FEEDBACK                                                   Eickhoff et al.'s work has an explanatory purpose, i.e. the reformu-
In classical Web information retrieval systems, a user’s query is        lation process is performed by users themselves, and there is no
used to filter and sort a corpus of documents, returning a list of       relevance feedback mechanism proposed.
results ordered by decreasing order of relevance. This list of results      Our current research draws elements from our previous work in
is collectively called a Search Engine Results Page (SERP), where        Albarede et al. [1], which involved studying how eye gaze informa-
each result is usually composed of a summary of the corresponding        tion could be used to provide a relevance feedback mechanism at the
document, called a Snippet.                                              granularity of words in SERPs. We used the following assumptions:
   As with any information retrieval system, a Web search engine         (1) the word read for the longest duration undergoes a deeper cogni-
does not provide only relevant results in a SERP. Reasons for this are   tive process; (2) the last word read before clicking on a document is
numerous: polysemy of query, indexing of documents, matching,            the one which triggers the decision to select that document. These
etc. In addition, it can be difficult for users to phrase what they      assumptions lead to the definition of two corresponding metrics:
are looking for until they see the results [2]. Usually, Web retrieval   (1) the word read for the longest duration in a snippet, and (2) the
systems do not provide simple ways for a user to give his/her            last word read by users. We associated with each word of the SERP
feedback during the search process. To improve this, some Web            the notion of positive, neutral and negative feature that reflects the
search engines use the clicks on the displayed snippets in a SERP        potential contribution of each term in the relevance feedback. In the
as relevance clues.                                                      experiment, users were asked to choose the most relevant snippet
   Such explicit user feedback can be used for relevance feedback,       for given queries. We found that (a) the detection of terms read
by modifying the initial query, with the expected consequence of         by users in snippets is sensitive to eye tracking system hardware
improving the relevance of documents returned by the system.             performance and can be fairly precise with high end devices; (b)
This use is consistent with the IR literature [10, 19], which indicates that    the last and mostly the longest read terms are relevant to assess
user feedback tends to improve the overall quality of the search.        the relevance of a document and (c) while positive terms could give
However, asking for explicit feedback can be a burden to the user [6],   interesting clues to the relevance of a document, the metrics that
especially in the context of Web search where users are looking for      were proposed to use negative terms were inconclusive.
fast and simple interactions (i.e. by providing very short queries
and by looking at very few results in SERPs).
   Therefore, we believe that automatically interpreting user            3   PROOF OF CONCEPT
behaviour at a fine grain while reading a SERP, thus allowing precise    Our previous research objective in Albarede et al. [1] was to iden-
implicit feedback, is a promising approach. More precisely, our          tify useful metrics from the analysis of user behaviour with a SERP,
hypothesis is that the analysis of user perceptions –via analysis of     and to derive optimal parameter settings for relevance feedback.
his/her eye movements– and actions while reading a SERP could            Our present objective is to demonstrate –by a proof of concept– that
be used to implement an effective relevance feedback mechanism.          a search engine enhanced by implicit relevance feedback driven by
Our objective is to implement and validate this hypothesis in an         eye gaze analysis could be implemented and used in ecological con-
ecological context.                                                      text. Such an enhanced search engine is then used to experimentally
   Literature shows that the acquisition of eye movements is in-         evaluate whether the quality of a search engine is improved in the
deed an interesting means of personalising information retrieval.        real world with the metrics and optimal setup we identified in [1].
For instance, in a Critiquing-Based Recommender System, the study of Chen    To our knowledge, there is currently no working information re-
and Wang [7] shows the feasibility of inferring users’ feed-             trieval system application which implements an implicit relevance
back based on their eye movements. Buscher et al. [6] verified that      feedback mechanism assisted by the analysis of eye movements (at
analysing the display time of documents while scrolling provides         word level) described in the literature.
valuable data for retrieval purposes. They also captured eye move-          The core hypothesis of this Proof of Concept (PoC) is that eye
ments over single lines of text in documents. In another study [5]       gaze analysis can identify words to be used for the implicit rele-
they show a relation between eye movement measures and user-             vance feedback mechanism. Therefore our objective is, on the one
perceived relevance of read text passages in documents. They also        hand, to design and implement an application that mimics a Web
demonstrate the effect of using reading behaviour of documents as        search engine, and on the other hand, to analyse user interactions
implicit relevance feedback. Their studies concluded that these          –actions and perceptions– with the search engine user interface
methods can significantly improve retrieval performance. This            as a means for implicit feedback to reformulate user queries. We
Proof of concept and evaluation of eye gaze enhanced relevance feedback...




                                             Figure 1: Software architecture overview of the application.


implement metrics derived from [1] to study their usefulness in an           independent of the evolutions of Web browsers. The software ar-
almost real context of information retrieval usage.                          chitecture of the application (see fig. 1) is composed of the 3 main
   In short, we aim at evaluating whether the relevance feedback             components of the Seeheim model: (1) the User Interface, (2) the
assisted by the analysis of eye gaze increases the precision of the ex-      Dialog Controller, and (3) the Application Interface. With the User
panded query results in the real world. The implementation of such           Interface is associated an Eye Tracking Analysis component whose
a PoC raised questions that are tackled in the following subsections.        role is to manage the eye gaze analysis. With the Dialog Controller
                                                                             are associated two separate modules, the Calculation of Indicators
                                                                             component whose role is to refine the query, and the Data Col-
3.1     Software architecture overview                                       lection component that stores data (user actions and perceptions,
To structure the application, we make use of the Seeheim software            queries, results) exchanged between User Interface and Applica-
architecture model [17], a reference model in the Human Computer             tion Interface in a log file for analysis purposes. These modules are
Interface domain. The Seeheim model is a high-level model which              detailed in the following subsections.
was designed for single user systems with a graphical user inter-
face. It leads to an application whose components –the graphical
user interface and/or the functional core– can easily be replaced if
                                                                             3.2    User Interface and Eye Tracking
a different implementation is required. The Seeheim model splits                    Analysis modules
the application into 3 main parts: the User Interface, the Dialog            The User Interface aims at mimicking classical Web search engines:
Controller and the Application Interface. The User Interface man-            this interface features a text field for the query and a "Search" button.
ages the user inputs and the outputs of the application. The Dialog          Once the query is processed by the IR system, a SERP (composed
Controller is a mediator between the user (via the User Interface)           of snippets) is displayed, with a "Refine" button at the right of each
and the functional core (via the Application Interface) and is re-           snippet (see fig. 2). For its implementation, we used the Java Swing
sponsible for defining the structure of the exchanges between them.          graphical widget library.
The Application Interface defines the software interfaces with the               The Eye Tracking Analysis module aims at detecting the zones
processes that can be initiated by the Dialog Controller.                    viewed by the user. Any graphical element of the user interface, i.e.
   Separating the User Interface from the rest of the application            any widget, could potentially be defined as a zone. In our current
preserves the application from modifications of the User Interface           implementation, zones only refer to text elements of the result
(e.g. changing the display of snippets). Similarly, changes in the           page, i.e. snippets and their titles. Since the indicators are based
process (e.g. modification of the search engine) are hidden from the rest    on the semantics of texts, document URLs were excluded. Zones
of the application by the Application Interface. Finally, evolutions         are defined as rectangles around one word or a list of words (e.g.
of user interaction (e.g. modification of the indicators used for            an entire snippet). Each word zone is represented by a bounding
relevance feedback) do not require significant changes to the rest           box around the word. For a list of words, the zone is defined as the
of the application, since the Dialog Controller and the Application          union of the words it contains. The Eye Tracking Analysis module
Interface live in their own separate components.                             tracks these zones, named Areas of Interest, in the SERP displayed
   The PoC is implemented in Java as a standalone application. We            by the User Interface (see fig. 3), and detects when user eye gaze
have chosen not to develop a plugin in a Web browser to remain               is inside one of these zones. The Eye Tracking Analysis module is




                                             Figure 2: Simulated search engine Web page user interface.




                                             Figure 3: Areas Of Interest defined around words of figure 2.
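
As a rough illustration of the Areas of Interest described in section 3.2, the sketch below represents each word zone as a bounding box, approximates the zone of a list of words by the rectangle enclosing its word boxes, and tests whether a gaze point falls inside a zone. The names Box, union and zone_at, as well as the pixel coordinates, are assumptions made for this example; they are not taken from the PoC, which is implemented in Java.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in screen pixels (hypothetical representation)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        """True when the gaze point (x, y) lies inside the rectangle."""
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def union(boxes: Iterable[Box]) -> Box:
    """Smallest rectangle enclosing all boxes (assumed zone for a list of words)."""
    items = list(boxes)
    return Box(min(b.left for b in items), min(b.top for b in items),
               max(b.right for b in items), max(b.bottom for b in items))


def zone_at(zones: Dict[str, Box], x: float, y: float) -> Optional[str]:
    """Return the name of the first zone containing the gaze point, if any."""
    for name, box in zones.items():
        if box.contains(x, y):
            return name
    return None


# Two word zones on the same line, and the snippet zone built from them.
words = {"eye": Box(10, 10, 40, 30), "gaze": Box(45, 10, 90, 30)}
snippet = union(words.values())
```

With such a layout, a gaze point that falls in the gap between two words is outside every word zone but still inside the snippet zone, which is one reason to keep both levels of granularity.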


also responsible for information exchange with the Eye Tracking               Interface and Application Interface during query/results operations
System and with the Dialog Controller module. For exchanging                  and is responsible for the dissemination of information to other
messages between the Eye Tracking System and the application,                 modules.
we used the Usybus library, which is based on the Ivy library. Usybus            The Calculation of Indicators component implements the algo-
is a framework that associates a type to messages so that devices             rithm for the estimation of the metrics. These metrics are used to
know which messages to subscribe to, and Ivy2 is a middleware                 determine the new query when the user requests a refinement of the
which facilitates data exchange between applications on a network.            current results. We created an abstract class to manage the metric
                                                                              to be used. Subclasses are implemented for each metric. The name
3.3    Dialog Controller, Calculation of Indicators                           of the metric to use is provided as a String object, possibly from a
       and Data Collection modules                                            configuration file and the corresponding class is initialised in the
                                                                              Dialog Controller when the application is launched. This allows
The Dialog Controller, as specified in the Seeheim model, has a
                                                                              better management of multiple metrics and makes them easier to
central role in the application operation. This component is in
                                                                              test.
charge of managing the sequence of communications between User
                                                                                 The Data Collection component is in charge of recording data for
   2 https://www.eei.cena.fr/products/ivy/                                    the experiments. In order to facilitate the testing process further, we


can trace the program execution by creating two log files. The first         movements analysis (i.e. "Functional tests"). Secondly, we want to
log file keeps track of requests as they evolve during the refining          evaluate whether the relevance feedback mechanism we propose ef-
mechanism of the application. The purpose of this log is to keep a trace     fectively increases the relevance of the query results (i.e. "Relevance
of expanded queries to analyse them with information retrieval               tests").
evaluation measures. The second log file records viewed words
and user actions. The latter contains a comprehensive view of the            4.1    Experimental Setup
application execution and can be consulted for debugging purposes.
                                                                             The experimental setup (see fig. 4) is composed of a classic desktop
                                                                             computer configuration with central unit, monitor, keyboard and
3.4     Application Interface and Information
                                                                             mouse. The eye tracker device is attached under the screen, and
        Retrieval System                                                     does not have any impact on the natural interaction of the user
The Application Interface component is responsible for establish-            with the search engine. Our PoC simulates modern Web search
ing the network connection to the Information Retrieval System               engine user interfaces to give users a real search engine experience.
and contains data structures to represent the query and the cor-             It is important to provide such an ecological setting to be able to
responding results. This structure contains the query, with each             analyse user behaviour with this new IR system.
result consisting of a document id, a title, a URL and a snippet.
Data exchanges between these modules are in XML, defined by an
internal DTD declaration.
   We use the Terrier V4.03 [14] information retrieval platform,
an open source platform which we adapted with Python and Perl
add-ons to retrieve queries, generate snippets from documents
and get back the response constituting the SERP in XML format.
The snippet generation (see Algorithm 1) is inspired by [4, 18]: it
consists in finding the text window of size lmax, from a document
doc, that contains the largest number of query terms wqset. This
generator assumes that the interest of a snippet only depends on
the query term occurrences. Other additional elements, like the
topical link between document words and the query or the impact
of the snippet for query disambiguation, may be used in the future.
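
The window search described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the generator used in the PoC: the function name snippet is hypothetical, words are obtained by whitespace splitting, and matching ignores case and surrounding punctuation, none of which is specified in the text. Ties are resolved in favour of the earliest window.

```python
import string
from typing import Set


def snippet(doc: str, query_terms: Set[str], lmax: int) -> str:
    """Return the lmax-word window of doc with the most query term occurrences."""
    words = doc.split()
    terms = {t.lower() for t in query_terms}

    def is_hit(word: str) -> bool:
        # Assumption: compare case-insensitively, ignoring edge punctuation.
        return word.strip(string.punctuation).lower() in terms

    best_start, best_score = 0, 0
    # Scan every window start, including the last full window; a document
    # shorter than lmax words yields a single, shorter window.
    for start in range(max(1, len(words) - lmax + 1)):
        score = sum(is_hit(w) for w in words[start:start + lmax])
        if score > best_score:
            best_start, best_score = start, score
    return " ".join(words[best_start:best_start + lmax])


text = "the cat sat on the mat while the dog chased the cat around the garden"
print(snippet(text, {"dog", "cat"}, 4))  # prints "dog chased the cat"
```

Recomputing the score for each window keeps the sketch close to the description; a sliding count updated as the window moves would give the same result in linear time.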

 Algorithm 1: Snippet generator (simplified).
  Data: document source text : 𝑑𝑜𝑐;
            query terms set : 𝑤𝑞𝑠𝑒𝑡;
            length max of snippet : 𝑙𝑚𝑎𝑥.
  Result: The excerpt for the document source
  initialization : 𝑤𝑑𝑜𝑐 ← split 𝑑𝑜𝑐 in words
  p←0
  curr_mscore ← 0
  mp ← 0
  while p ≤ length(wdoc)-lmax do                                             Figure 4: Experimental setup showing the monitor with the
       curr_score ← sum of query words occurrences in                        eye tracker device attached under the screen.
        wdoc[p, p+lmax-1]
       if curr_mscore < curr_score then
           curr_mscore ← curr_score
                                                                                The query results are displayed as in typical SERPs, with a click-
           mp ← p
                                                                             able URL which allows the user to consult documents (see fig. 2).
       end
                                                                             The text of the SERP is in Arial 15 font as it was shown to allow good
       p++
                                                                             reading by the user and good detection by the eye tracking analy-
  end
                                                                             sis [1]. The display parameters, such as font-style, font-size and text
  Return wdoc[mp, min(length(wdoc), mp+lmax)-1]
                                                                             colours are customizable. We also provide a refining mechanism
                                                                             per snippet by adding a "Refine" button next to each snippet. After
                                                                             scanning the SERP, the user can identify a relevant snippet and
4     USER EXPERIMENTS                                                       refine his/her original query by clicking on "Refine".
The objectives of these user experiments are twofold. First, we              In our case, this adds a relevant word to the original query in the
aim to make sure that the experimental configuration is technically          search bar and automatically relaunches the search. This relevant
effective, namely that the words are correctly detected by the eye           word is chosen based on a configurable metric, for instance: the
                                                                             longest fixation in the selected snippet, or the last fixation in the
    3 http://terrier.org                                                     SERP.
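
The refinement step can be sketched as follows: a metric, selected by name, picks one word from the recorded fixations, and that word is appended to the query. This is a minimal illustration, assuming fixations are available as a chronological list of (word, snippet, duration) records; the names Fixation, Metric, METRICS and refine, the metric keys "longest" and "last", and the sample data are hypothetical and do not come from the PoC. For simplicity, the sketch leaves the query unchanged when the chosen word is already part of it.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List, Optional, Type


@dataclass(frozen=True)
class Fixation:
    word: str           # word whose zone received the fixation
    snippet_id: int     # snippet containing that word
    duration_ms: float  # fixation duration


class Metric(ABC):
    """Base class: one subclass per metric."""

    @abstractmethod
    def pick(self, fixations: List[Fixation], selected: int) -> Optional[str]:
        """Return the word to add to the query, or None if nothing was read."""


class LongestFixation(Metric):
    """Word with the longest fixation inside the selected snippet."""

    def pick(self, fixations, selected):
        in_snippet = [f for f in fixations if f.snippet_id == selected]
        if not in_snippet:
            return None
        return max(in_snippet, key=lambda f: f.duration_ms).word


class LastFixation(Metric):
    """Last word fixated anywhere in the SERP (list assumed chronological)."""

    def pick(self, fixations, selected):
        return fixations[-1].word if fixations else None


# The metric is looked up by name, e.g. a name read from a configuration file.
METRICS: Dict[str, Type[Metric]] = {"longest": LongestFixation, "last": LastFixation}


def refine(query: str, metric_name: str,
           fixations: List[Fixation], selected: int) -> str:
    """Append the word chosen by the named metric to the query."""
    word = METRICS[metric_name]().pick(fixations, selected)
    if word is None or word in query.split():
        return query
    return f"{query} {word}"


log = [Fixation("kennedy", 0, 180), Fixation("vietnam", 1, 420),
       Fixation("troops", 1, 250), Fixation("policy", 2, 90)]
print(refine("war", "longest", log, selected=1))  # prints "war vietnam"
```

Keeping each metric behind the same interface means a new indicator only requires a new subclass and a new entry in the lookup table.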


   For the eye tracker device, we opted for the Eye Tribe ET10004 ,      of his/her eyes, is fairly small (30x40 cm). So, even though the par-
which is a low cost and popular device for human-computer inter-         ticipants usually sat attentively during the whole calibration, they
action experiments. It has a sampling frequency of 30 Hz and an          could have shifted position unconsciously during the actual experiments.
average accuracy between 0.5° and 1° of visual angle, which corre-       This may have caused loss of gaze tracking. In addition, and from a
sponds to an on-screen average error of 0.5 to 1 cm if the user sits     behavioural point of view, some participants identified the target
about 60 cm away from the monitor. It has an acceptable precision        word and clicked on the "Refine" button before the end of reading
for fixation analyses provided it is properly calibrated and tested in   of the word. In that case, this counteracted the metric used and the
a proper setup. We used a 1280x1024 resolution 19" LCD monitor.          Calculation of Indicators module did not return the correct word.
The participants were instructed to keep a fixed position for best       However, even if they are not particularly good, these results are
results and maintained a distance of 60-70 cm from the monitor           fairly consistent with our previous findings [1] in which 3 out of 6
during the experiment. The calibration error for participants varied     correct words have been detected with this eye tracker device.
between 0.37° and 0.48°, an acceptable range for reading research,
with 0.5° being the maximum acceptable value [13].
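
The relation between angular accuracy and on-screen error quoted above follows from simple trigonometry: an angle θ seen from a distance d covers about d·tan(θ) on the screen. The sketch below reproduces this arithmetic and, as an additional assumption, converts the result to pixels for a 19" monitor at 1280x1024 with square pixels and a diagonal of exactly 19 inches. The function names are hypothetical.

```python
import math


def gaze_error_cm(angle_deg: float, distance_cm: float) -> float:
    """On-screen distance covered by a visual angle at a given viewing distance."""
    return distance_cm * math.tan(math.radians(angle_deg))


def pixels_per_cm(h_px: int, v_px: int, diagonal_in: float) -> float:
    """Pixel density, assuming square pixels and the nominal diagonal size."""
    return math.hypot(h_px, v_px) / (diagonal_in * 2.54)


density = pixels_per_cm(1280, 1024, 19)
for angle in (0.5, 1.0):
    err = gaze_error_cm(angle, 60)
    print(f"{angle} deg at 60 cm: {err:.2f} cm, about {err * density:.0f} px")
```

At 60 cm this gives roughly 0.52 cm for 0.5° and 1.05 cm for 1°, in line with the 0.5 to 1 cm range stated in the text.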
   The corpus of documents used is the TIME5 collection, which
consists of queries on articles from Time magazine. This collection is   4.3    Refinement tests
rather small, but adequate for a proof of concept. We experimented       In order to evaluate the relevance of the expanded queries generated
with the following indicator: the longest fixation duration in the selected   thanks to eye movements analysis, we used information retrieval
snippet. This indicator was found to be effective in [1], enabling       evaluation measures based on recall and precision. The goal of this
up to 87% of success in identifying positive words.                      second experiment was to verify whether better results could be
   For the experiments, the task protocol is as follows: each partic-    obtained with the expanded queries.
ipant is asked to run two different queries (chosen from a set of           After posing the query, participants were asked to judge the
three possible queries); then, for each query, he/she looks through      most relevant result pertaining to their information need on the
the SERP and chooses the snippet containing information he/she           SERP. To do so, they have to look at one word which helped identify
considered relevant to the query. The duration of an experiment is       the most relevant result in a given snippet, and then click on the
about 10 minutes per participant. The results of the experiments         corresponding "Refine" button. Each user query is then expanded
are then analysed thru the logs recorded by the Data Collection          with the term which received the longest attention, gathered by
module.                                                                  implicit feedback. In case this term is already present in the original
   We conducted experiments for a total of 9 participants. Due to        query, the next best term is considered.
eye tracking device limits (the device fails to detect eyes), only          To evaluate a query performance with respect to its initial per-
7 of the participants were retained for analysis. We agree that,         formance before expansion, we use the following evaluation mea-
for now, the small number of participants does not allow us to           sures [16]: Precision at 5 (P@5), Precision at 10 (P@10) and Recip-
conclude definitively on the question of whether or not eye tracking     rocal Rank (RRank), each evaluating a different aspect of search
improves the relevance feedback mechanism of search engines, but         engine performance. Thus, to compare relevance scores for the
is adequate for a first evaluation of this proof of concept.             expanded query with the initial query, we calculate the measures
                                                                         above-mentioned and verify which ones yield an improvement.
4.2    Functional tests                                                     P@5 corresponds to the number of relevant documents among
                                                                         the first 5 documents and P@10 corresponds to the number of
The purpose of this first experiment was to evaluate the eye tracker
                                                                         relevant documents among the first 10: Precision@k = (# of results
ability to correctly detect the words users gazed at. Users were told
                                                                         @k that are relevant) / k. The reciprocal rank is a statistical measure
to posed the query, and then to search in the SERP a specific word
                                                                         which takes the order of correctness into account and evaluates the
(target). Once they found it, they were advised to click immediately
                                                                         result lists of a sample of queries. The reciprocal rank of a query
on the "Refine" button. Since searching a specific word involves a
                                                                         response is the multiplicative inverse of the rank of the first correct
cognitive effort, using this protocol tends to simulate the user ac-
                                                                         answer: 1 for first place, 1/2 for second place, 1/3 for third place
tivity to look for words in snippets that could help him/her to asset
                                                                         and so on. We do not use measures such as Average Precision (AP)
the relevance of documents. The results (binary values) indicate
                                                                         or Mean Average Precision (MAP), as they are not appropriate in
whether or not the target was detected. For this experiment, the
                                                                         our case because we only focus on the top results.
Calculation of Indicator module correctly detected the target for 3
                                                                            The results obtained are detailed for each of the three queries se-
out of 7 participants.
                                                                         lected in table 1. The initial scores of each query –without expansion–
   These limited results could be explained in two ways. From a
                                                                         is given on the first line for P@5, P@10 and Reciprocal Rank fol-
technical point of view, the eye tracker device used, the Eye Tribe
                                                                         lowed by their corresponding scores after expansion with the given
ET1000, is known to perform differently for a variety of partici-
                                                                         term. A (+) sign next to each score denotes an improvement com-
pants and environmental conditions. For instance, the device is
                                                                         pared to the initial query’s corresponding score. Similarly, (-) indi-
very sensitive to light conditions. Moreover, the tracking box, i.e.
                                                                         cates a decrease and (=) indicates that the score has not changed.
the area in which the user’s head must stay to allow the detection
                                                                            As stated before, even if our experiments do not consider a large
                                                                         number of events to conclude, we believe that these results give
   4 https://theeyetribe.com                                             interesting clues about the expected performance of this relevance
   5 http://ir.dcs.gla.ac.uk/resources/test_collections/time             feedback mechanism: 4 out of 7 experiments show improvements


Query                                    Term added       P@5          P@10         RRank
Baath party                              -                0.0000       0.0000       0.0164
                                         settle           0.0000 (=)   0.0000 (=)   0.0244 (+)
                                         self-isolation   0.0000 (=)   0.0000 (=)   0.0227 (+)
U.S. policy toward South Viet Nam        -                0.0000       0.1000       0.1429
                                         conference       0.0000 (=)   0.1000 (=)   0.1000 (-)
                                         Military         0.0000 (=)   0.0000 (-)   0.0714 (-)
                                         misinformed      0.0000 (=)   0.1000 (=)   0.1429 (=)
Ceremonial suicides of buddhists monks   -                0.0000       0.0000       0.0227
                                         automobile       0.2000 (+)   0.1000 (+)   0.3333 (+)
                                         school           0.2000 (+)   0.1000 (+)   0.2000 (+)

Table 1: Precision@5, Precision@10 and Reciprocal Rank scores before/after expansion. A (+) sign
denotes an improvement compared to the initial query's corresponding score, (-) indicates a
decrease, and (=) indicates that the score has not changed.
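
The measures reported in Table 1 can be illustrated with a minimal sketch. It assumes that a ranked result list is represented as a sequence of binary relevance judgements, which is our own simplification and not a description of the evaluation code actually used; the function names and the example ranking are hypothetical.

```python
from typing import Sequence


def precision_at_k(relevance: Sequence[bool], k: int) -> float:
    """Precision@k = (# of relevant results among the first k) / k."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    return sum(1 for is_relevant in relevance[:k] if is_relevant) / k


def reciprocal_rank(relevance: Sequence[bool]) -> float:
    """Inverse of the rank of the first relevant result, or 0.0 if there is none."""
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


def change_marker(before: float, after: float, tol: float = 1e-9) -> str:
    """Return the (+), (-) or (=) marker used in Table 1."""
    if after > before + tol:
        return "(+)"
    if after < before - tol:
        return "(-)"
    return "(=)"


# Hypothetical ranking (an assumption for illustration): only the third result is relevant.
ranking = [False, False, True] + [False] * 7

p5 = precision_at_k(ranking, 5)    # one relevant result in the top 5
p10 = precision_at_k(ranking, 10)  # one relevant result in the top 10
rr = reciprocal_rank(ranking)      # first relevant result at rank 3

print(f"P@5={p5:.4f} P@10={p10:.4f} RRank={rr:.4f}")
print(change_marker(0.0227, rr))
```

With this hypothetical ranking the sketch gives P@5 = 0.2000, P@10 = 0.1000 and RRank = 0.3333, the kind of profile one would expect when a single relevant document sits at rank 3 of the top 10.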



after query expansion for at least one of the scores, and 2 out of 7 experiments show a
degradation of performance.
   We also note that the results are not uniform among the queries. The third query expansion
has a positive impact on all measures, which is not the case for the others. These findings are
not really a surprise in the domain, in which many elements impact the quality of the result. It
seems that the query strongly matters, maybe due to the query topic, the query formulation, the
snippet generation, the nature of the documents, etc. We do not have enough data here to clearly
identify the element(s) that cause this disparity.

4.4     Comparison with our previous research presented in Albarede et al. 2019 [1]

In our previous research [1], we obtained significantly higher results: up to 87% of the words
looked at in snippets were positive words, whereas here we obtained improvements in relevance
after query expansion for only 4 out of 7 experiments (57%). However, these results should not
be compared directly.
   First of all, the element of comparison is not the same in the two approaches. In [1] we
compared detections of positive terms, whereas for this PoC we compared improvements in
relevance after query expansion. However, the two results could possibly be compared if we make
the assumption that a positive word added to a query systematically improves the relevance of
the results after query expansion. This is probably not always true but, conversely,
non-positive words may also improve the relevance of the results after query expansion.
   Moreover, the eye tracking device used in the two approaches is different. In [1] we used a
Tobii Pro X3-120⁶ device, a professional class device, while for the PoC we use an Eye Tribe
ET1000 device, a consumer class device. We chose the latter for the PoC because this low cost
eye tracker represents a class of devices that could be used by end-users in an ecological
context.
   Since in [1] we compared these two devices at a functional level in a preliminary experiment,
it is possible to extrapolate from those results what would have been obtained had the less
efficient eye tracker been used to detect positive words. If we assume that a positive word will
actually improve the query, we can verify whether the results of [1] match our PoC results.
   In [1], preliminary experiments showed that the ET1000 could detect a correct word for 3 out
of 6 participants, whereas the X3-120 could detect a correct word for 5 out of 6. To compute the
percentage of positive words that could be expected for the ET1000, we multiply the results
obtained for the X3-120 by the ratio of the detection performance of the ET1000 to the detection
performance of the X3-120, both obtained from the first experiment of [1]:

\[
\begin{aligned}
\mathrm{Res}(\text{ET1000}) &= \mathrm{Res}_{[\text{Albarede et al. 2019}]}(\text{X3-120}) \times \frac{\mathrm{Detect}_{[\text{Albarede et al. 2019}]}(\text{ET1000})}{\mathrm{Detect}_{[\text{Albarede et al. 2019}]}(\text{X3-120})} \\
&= 87\% \times \frac{3/6}{5/6} \\
&= 52\%
\end{aligned}
\]

   These extrapolated results for the Eye Tribe give 52% of correct word detection, which is
close to the results we actually obtained (4/7 ≈ 57%, see Section 4.3, Refinement tests). Even
if we do not have enough data to draw a definitive conclusion, we estimate, subject to the
limitations of our assumptions, that our results are consistent with our previous research [1].
   As a corollary, with a more efficient eye tracking device, about 87% of positive results
could probably be obtained and, as a consequence, better performance of eye gaze enhanced
relevance feedback could be achieved.

    ⁶ https://www.tobiipro.com/product-listing/tobii-pro-x3-120/

5    CONCLUSION AND FUTURE WORK

In this paper, we presented the implementation and the user evaluation of a proof of concept for
a novel search engine enhanced with eye gaze assisted relevance feedback. To our knowledge,
there is no similar implementation in the literature. We showed that: (a) there is a potential
benefit in using eye gaze analysis as an implicit relevance feedback method; (b) the results we
obtained are consistent with those we previously obtained in Albarede et al. [1], so better
performance could be expected in the future. Because of the limited number of participants
(7 users) and tasks (2 out of 3 queries per participant), the results of this study are
indicative only,


but we noted a tendency for the number of relevant documents to increase after query expansion.
   Our prototype and the experimental setup suffer from some limitations that could have
negatively interfered with the results we obtained. One technical limitation was that the
participant had to keep his/her head relatively still to get good eye gaze detection with the
eye tracking device we used. While this might not be a realistic setup for search engine use in
an ecological context, the experiments showed that word detection is possible and could probably
be significantly improved with more robust (better tracking box) and more precise (better
angular precision) eye tracking devices. In a future implementation, we will consider replacing
the Eye Tribe ET1000 with a more powerful eye tracker device, such as the Tobii Pro X3-120, to
yield better results.
   Another limitation deals with the precision of eye gaze tracking. Most devices have an
average accuracy between 0.5° and 1° of visual angle, which corresponds to an average on-screen
error of 0.5 to 1 cm if the user sits about 60 cm away from the monitor. With the usual
character size, this spatial accuracy does not make it possible to distinguish between two short
words. This limitation has no simple answer since it is mostly linked to the size of the human
fovea. Increasing screen and character size could be a solution, but such changes alter the
ecological validity of the context. This is why a certain degree of uncertainty still remains in
the detection of words read by users, and this aspect must be taken into account when using
this technique.
   In addition, the corpus of documents used (the TIME collection) was not ideal for user tests,
given that it covers dated historical events only: users might not have been able to understand
the context of these information needs. In a future user experiment, it would thus be desirable
to have a collection of a more general and recent nature.
   It will also be interesting to study and implement other metrics to test their usefulness in
different information search contexts. Another possible pathway worth exploring would be testing
new snippet generators, e.g. the generators provided by Terrier V5 or Apache Lucene⁷, as the
words in documents selected by the snippet generator may have a significant impact on the words
that could be viewed by users and, as a consequence, on the metrics used.
   Another experimental track could be to explore situations even closer to usual user searches
on the Web, for instance when a user makes multiple queries for the same information need.
   We proposed here the basis of a modular platform for the evaluation of information retrieval
systems that takes into account both user behaviour and classical test collections. Ideally, if
classical search engines provided standardised SERPs, it would be usable on any engine. A more
concrete way to integrate existing systems would be to provide tunable wrappers that adapt
easily to any search engine. Extensions could integrate task oriented sequences of queries (with
document display tracking) so that other features may be provided.

   ⁷ https://lucene.apache.org

ACKNOWLEDGMENTS
The research presented in this article was partly funded by the Gelati Emergence project of the
Grenoble Informatics Laboratory (UMR 5217).

REFERENCES
 [1] Lucas Albarede, Francis Jambon, and Philippe Mulhem. 2019. Exploration de l’apport de
     l’analyse des perceptions oculaires : étude préliminaire pour le bouclage de pertinence. In
     COnférence en Recherche d’Informations et Applications - CORIA 2019, 16th French Information
     Retrieval Conference. Lyon, France, May 25-29, 2019. Proceedings.
     https://doi.org/doi:10.24348/coria.2019.CORIA_2019_paper_1
 [2] Hiteshwar Kumar Azad and Akshay Deepak. 2019. Query expansion techniques for information
     retrieval: a survey. Information Processing & Management 56, 5 (2019), 1698–1735.
 [3] Ricardo Baeza-Yates. 2018. Bias on the Web. Commun. ACM 61, 6 (May 2018), 54–61.
     https://doi.org/10.1145/3209581
 [4] Lorena Leal Bando, Falk Scholer, and Andrew Turpin. 2010. Constructing query-biased
     summaries: a comparison of human and system generated snippets. In IIiX.
 [5] Georg Buscher, Andreas Dengel, Ralf Biedert, and Ludger V. Elst. 2012. Attentive Documents:
     Eye Tracking As Implicit Feedback for Information Retrieval and Beyond. ACM Trans. Interact.
     Intell. Syst. 1, 2 (Jan. 2012), 9:1–9:30. https://doi.org/10.1145/2070719.2070722
     http://gbuscher.com/publications/BuscherDengel12_AttentiveDocuments.pdf
 [6] Georg Buscher, Ludger Van Elst, and Andreas Dengel. 2009. Segment-level display time as
     implicit feedback: a comparison to eye tracking. In Proceedings of the 32nd international
     ACM SIGIR conference on Research and development in information retrieval. ACM, 67–74.
 [7] Li Chen and Feng Wang. 2016. An Eye-Tracking Study: Implication to Implicit Critiquing
     Feedback Elicitation in Recommender Systems. In Proceedings of the 2016 Conference on User
     Modeling Adaptation and Personalization (UMAP ’16). Association for Computing Machinery,
     Halifax, Nova Scotia, Canada, 163–167. https://doi.org/10.1145/2930238.2930286
 [8] Yongqiang Chen, Peng Zhang, Dawei Song, and Benyou Wang. 2015. A real-time eye tracking based
     query expansion approach via latent topic modeling. In Proceedings of the 24th ACM
     International on Conference on Information and Knowledge Management. ACM, 1719–1722.
 [9] Carsten Eickhoff, Sebastian Dungs, and Vu Tran. 2015. An eye-tracking study of query
     reformulation. In Proceedings of the 38th International ACM SIGIR Conference on Research and
     Development in Information Retrieval. ACM, 13–22.
[10] Liana Ermakova and Josiane Mothe. 2016. Document re-ranking based on topic-comment
     structure. In 2016 IEEE Tenth International Conference on Research Challenges in Information
     Science (RCIS). IEEE, 1–10.
[11] Francesco Bellotti, Riccardo Berta, Alessandro De Gloria, and Massimiliano Margarone. 2008.
     Widely Usable User Interfaces on Mobile Devices with RFID. In Handbook of Research on User
     Interface Design and Evaluation for Mobile Technology, Joanna Lumsden (Ed.). IGI Global,
     Hershey, PA, USA, 657–672. https://doi.org/10.4018/978-1-59904-871-0.ch039
[12] Donna Harman. 2010. Is the Cranfield Paradigm Outdated?. In Proceedings of the 33rd
     International ACM SIGIR Conference on Research and Development in Information Retrieval
     (Geneva, Switzerland) (SIGIR ’10). ACM, New York, NY, USA, 1–1.
[13] Kenneth Holmqvist, Marcus Nyström, Richard Andersson, Richard Dewhurst, Halszka Jarodzka,
     and Joost Van de Weijer. 2011. Eye tracking: A comprehensive guide to methods and measures.
     OUP Oxford.
[14] Craig Macdonald, Richard McCreadie, Rodrygo LT Santos, and Iadh Ounis. 2012. From puppy to
     maturity: Experiences in developing Terrier. Proc. of OSIR at SIGIR (2012), 60–63.
[15] Stefano Mizzaro. 1997. Relevance: The whole history. Journal of the American Society for
     Information Science 48, 9 (1997), 810–832.
[16] IC Mogotsi, Christopher D Manning, Prabhakar Raghavan, and Hinrich Schütze. 2010.
     Introduction to information retrieval. Information Retrieval 13, 2 (2010), 192–195.
[17] Günther E Pfaff et al. 1985. User interface management systems. Vol. 1. Springer.
[18] Anastasios Tombros and Mark Sanderson. 1998. Advantages of Query Biased Summaries in
     Information Retrieval. 2–10. https://doi.org/10.1145/290941.290947
[19] ChengXiang Zhai and Sean Massung. 2016. Text data management and analysis: a practical
     introduction to information retrieval and text mining. Morgan & Claypool.