<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of the CLEF 2018 Personalised Information Retrieval Lab (PIR-CLEF 2018)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gabriella Pasi</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gareth J. F. Jones</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Keith Curtis</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefania Marrara</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Camilla Sanvitto</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Debasis Ganguly</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Procheta Sen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Consorzio C2T</institution>
          ,
          <addr-line>Milan</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Dublin City University</institution>
          ,
          <addr-line>Dublin</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>IBM Research Labs</institution>
          ,
          <addr-line>Dublin</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University of Milano Bicocca</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>At CLEF 2018, the Personalised Information Retrieval Lab (PIR-CLEF 2018) was conceived as an initiative aimed at both providing and critically analysing a new approach to the evaluation of Personalised Information Retrieval (PIR). PIR-CLEF 2018 is the first edition of this Lab after the successful Pilot Lab organised at CLEF 2017. PIR-CLEF 2018 provided registered participants with the data sets originally developed for the PIR-CLEF 2017 Pilot task; the data relate to real search sessions over a subset of the ClueWeb12 collection, undertaken by 10 volunteer searchers using a novel methodology. Activities during these search sessions included relevance assessment of the retrieved documents by the searchers. Sixteen groups registered to participate in PIR-CLEF 2018 and were provided with the data set to allow them to work on PIR-related tasks and to provide feedback on our proposed PIR evaluation methodology, with the aim of creating an effective evaluation task.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The PIR-CLEF Lab organized within CLEF 2018 aims to provide a framework
for the evaluation of Personalised Information Retrieval (PIR). PIR systems
seek to enhance traditional IR systems to better satisfy the user's information
needs by providing search results that are relevant not only to the query but
more specifically to the interests of the user who submitted the query. In order
to provide a personalised service, a PIR system can leverage various kinds of
information about the current user and their preferences and interests. These
can be stated directly or be inferred through a variety of interactions of the user
with the system. This information is then represented in a user model, which can
be employed either to improve the user's query or to re-rank a list of retrieved
results so that documents more relevant to the user are presented in the top
positions of the list.</p>
      <p>Evaluating the effectiveness of personalised approaches to search has been
investigated for many years within studies of interactive information retrieval.
In this work, the notion of relevance has been user-centered, with potential
variation during a search session depending both on the task at hand and on the
user's interactions with the search system. This work has mostly been based on user
studies, an approach that involves real users undertaking search tasks in a
supervised environment. By placing the user at the centre of the evaluation activity,
these studies have produced valuable insights and feedback. However, while this
methodology has the advantage of enabling the detailed study of the activities of
real users, it has the significant drawback of not being easily reproducible, thus
greatly limiting the scope for algorithmic exploration of technologies for search
personalisation. Among previous attempts to define PIR benchmark tasks
based on the Cranfield paradigm, the closest experiment to the PIR Lab is the
TREC Session track<sup>1</sup>, conducted annually between 2010 and 2014. This track
focused on stand-alone search sessions, where a "session" is a continuous sequence
of query reformulations on the same topic, along with any user interaction with
the retrieved results in service of satisfying a specific information need; however,
no details of the searcher undertaking the task were made available. Thus,
the TREC Session track did not exploit any user model to personalise the search
experience, nor did it allow user actions over multiple search sessions to be taken
into consideration in the ranking of the search output.</p>
      <p>The PIR-CLEF 2018 Lab provided search data gathered in search sessions
carried out by ten volunteer users; the data were originally collected for
the Pilot Lab run in 2017. We plan in the future to gather data across multiple
sessions to enable the construction and exploitation of persistent user behaviour
data collected from each user across multiple search sessions. This year the
data were provided to the 16 groups registered for the task, with the objective of
allowing them to attempt the proposed tasks. An evaluation using this collection
was run to allow research groups working on PIR to both experiment with and
provide feedback on our proposed PIR evaluation methodology. Two papers
were submitted and accepted for presentation at the workshop related to the
Lab; both report on the use of the collected data to perform different
tasks, and the work they describe is summarized later in this overview.
The papers offer some useful suggestions for improving the data gathering process,
which will give rise to interesting discussions during the Lab.</p>
      <p>The remainder of this paper is organised as follows: Section 2 outlines
existing related work, Section 3 provides an overview of the PIR-CLEF 2018 task,
Section 4 discusses the metrics available for the evaluation of the task, Section
5 overviews papers submitted by task participants, and Section 6 concludes the
paper.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Recent years have seen increasing interest in the study of contextual search: in
particular, several research contributions have addressed the task of
personalizing search by incorporating knowledge of user preferences into the search process
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. This user-centered approach to search has raised the related issue of how to
properly evaluate the effectiveness of personalized search in a scenario where
relevance is strongly dependent on the interpretation of the individual user. To
this end, several user-based evaluation frameworks have been developed, as
discussed in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
        <fn id="fn1"><label>1</label><p>http://trec.nist.gov/data/session.html</p></fn>
      </p>
      <p>
        A first category of approaches to evaluating personalized search
systems (PIRS, Personalized Information Retrieval Systems) focuses on
performing a user-centered evaluation by extending the
laboratory-based evaluation paradigm. The TREC Interactive track [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and the
TREC HARD track [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] are examples of this kind of evaluation framework, which
aimed at involving users in interactive tasks to get additional information about
them and the query context. The evaluation was done by comparing a baseline
run ignoring the user/topic metadata with another run considering it.
      </p>
      <p>
        The more recent TREC Contextual Suggestion track [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] was proposed with
the purpose of investigating search techniques for complex information needs
that are highly dependent on both context and users' interests. Participants in the
track were given, as input, a set of geographical contexts and a set of user profiles
that contain a list of attractions the user has previously rated. The task was to
produce a list of ranked suggestions for each profile-context pair by exploiting
the given contextual information. However, despite these extensions, the overall
evaluation was still system-controlled and only a few contextual features were
available in the process.
      </p>
      <p>
        TREC also introduced a Session track [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] the focus of which was to exploit
user interactions during a query session to incrementally improve the results
within that session. The novelty of this task was the evaluation of system
performance over entire sessions instead of a single query.
      </p>
      <p>However, the above tasks are limited in how satisfactorily they inject
user behaviour into the evaluation process; for this reason, the problem of
defining a standard approach to the evaluation of personalized search remains an open
research topic that needs effective solutions.</p>
      <p>
        A first attempt to create a collection satisfactorily accounting for user
behaviour in search was made at the FIRE conference held in 2011. The
Personalised and Collaborative Information Retrieval track [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] was organised with the
aim of extending a standard IR ad-hoc test collection by gathering additional
meta-information during the topic development process to facilitate research on
personalised and collaborative IR. However, since no runs were submitted to this
track, only preliminary studies have been carried out and reported using it.
      </p>
      <p>Within CLEF 2017, we launched the PIR-CLEF benchmark with a pilot
study and workshop (PIR-CLEF 2017) for the purpose of providing a forum for
the exploration of the evaluation of PIR. The Pilot Lab provided a preliminary
edition of the 2018 PIR-CLEF Lab. One of the achievements of the PIR-CLEF
2017 Pilot Task was the setting up of an evaluation benchmark which seeks to
combine user-centered methods with the Cranfield evaluation paradigm, with the
key potential benefit of producing evaluation results that are easily reproducible.</p>
      <p>The Pilot task was based on search sessions over a subset of the ClueWeb12
document collection, undertaken by 10 users using a clearly defined and
novel methodology. The collection was defined by relying on data gathered from
activities undertaken during the search sessions by each participant, including
details of relevant documents as marked by the searchers. It is important to note
that the collection was prepared but not used by any group
participating in the Pilot task. For this reason, at PIR-CLEF 2018 we relied on
this data collection and distributed it to the 16 groups registered for the Lab.
We have also prepared a second collection, as well as a prototype system for the
comparative evaluation of systems developed by participating groups. The data
collection is described in more detail in Section 3.1.</p>
    </sec>
    <sec id="sec-3">
      <title>Overview of the task</title>
      <p>The goal of the PIR-CLEF 2018 Task is to investigate the use of a
laboratory-based method to enable comparative evaluation of PIR methods. The collection
defined within the PIR-CLEF 2017 Pilot Study and the PIR-CLEF 2018 Lab
was created with the cooperation of volunteer users, and was organized into two
sequential phases:
        <list list-type="bullet">
          <list-item><p>Data gathering. This phase involved the volunteer users carrying out a
task-based search session during which a set of activities performed by the user
were recorded (e.g., formulated queries, bookmarked documents, etc.). Each
search session was composed of a phase of query development, refinement
and modification, with an associated search for each query on a specific
topical domain selected by the user, followed by a relevance assessment phase
in which the user indicated the relevance of documents returned in response to
each query, and a short report-writing activity based on the search activity
undertaken.</p></list-item>
          <list-item><p>Data cleaning and preparation. This phase took place once the data gathering
had been completed, and did not involve any user participation. It consisted
of filtering and elaborating the information collected in the previous phase
in order to prepare a dataset with various kinds of information related to
the specific user's preferences. In addition, a bag-of-words representation of
the participant's user profile was created to allow comparative evaluation of
PIR algorithms using the same simple user model.</p></list-item>
        </list>
      </p>
      <p>For the Task we made available the user profile data and raw search data
produced by guided search sessions undertaken by the 10 volunteer users, as
detailed in Section 3.1.</p>
      <p>The aim of the task was to use the provided information to improve the
ranking of a search results list over a baseline ranking of documents judged
relevant to the query by the user who entered the query.</p>
      <p>The Task data were provided in CSV format to registered participants in the
task. Access to the search service for the indexed subset of the ClueWeb12
collection was provided by Dublin City University via an API.</p>
      <sec id="sec-3-1">
        <title>Dataset</title>
        <p>For the PIR-CLEF 2018 Task we made available both user profile data and raw
search data produced by guided search sessions undertaken by the 10 volunteer
users. The data provided included the submitted queries, the baseline ranked lists
of documents retrieved in response to each query by a standard search
system, the items clicked by the user in the result list, and the document relevance
assessments provided by the user on a 4-grade scale. Each session was performed
by the user on a topic of her choice selected from a provided list of broad topics,
and the search was carried out over a subset of the ClueWeb12 web collection.</p>
        <p>The data were extracted and stored in 7 CSV files in a zip folder,
which was provided to participants. The contents of the CSV files
are as follows.</p>
        <p>csv1: The file user's session contains the information about each phase of the
query sessions performed by each user. Each row of the file contains:
          <list list-type="bullet">
            <list-item><p>username: the user who performed the session</p></list-item>
            <list-item><p>query session: id of the performed query session</p></list-item>
            <list-item><p>category: the top-level search domain of the session</p></list-item>
            <list-item><p>task: the description of the search task fulfilled by the user</p></list-item>
            <list-item><p>start time: starting time of the query session</p></list-item>
            <list-item><p>close time: closing time of the search phase</p></list-item>
            <list-item><p>evaluated time: closing time of the assessment phase</p></list-item>
            <list-item><p>end time: closing time of the topic evaluation and the whole session.</p></list-item>
          </list>
        </p>
        <p>csv2: The file user's log contains the search logs of each user, i.e. every search
event that was triggered by a user's action. Each row of the file contains:
          <list list-type="bullet">
            <list-item><p>username: the user who performed the session</p></list-item>
            <list-item><p>query session: id of the query session within which the search was performed</p></list-item>
            <list-item><p>category: the top-level search domain</p></list-item>
            <list-item><p>query text: the submitted query</p></list-item>
            <list-item><p>document id: the document on which a particular action was performed</p></list-item>
            <list-item><p>rank: the retrieval rank of the document on which a particular action was performed</p></list-item>
            <list-item><p>action type: the type of the action executed by the user (query submission, open document, close document, bookmark)</p></list-item>
            <list-item><p>time stamp: the timestamp of the action.</p></list-item>
          </list>
        </p>
        <p>csv3: The file user's assessment contains the relevance assessments of a pool of
documents with respect to every single query developed by each user to fulfil
the given task:
          <list list-type="bullet">
            <list-item><p>username: the user who performed the session</p></list-item>
            <list-item><p>query session: id of the query session within which the evaluation was performed</p></list-item>
            <list-item><p>query text: the query on which the evaluation is based</p></list-item>
            <list-item><p>document id: the id of the document for which the evaluation was provided</p></list-item>
            <list-item><p>rank: the retrieval rank of the document on which a particular action was performed</p></list-item>
            <list-item><p>relevance score: the relevance of the document to the topic (1 off-topic, 2 not relevant, 3 somewhat relevant, 4 relevant).</p></list-item>
          </list>
        </p>
        <p>csv4: The file user's info contains some personal information about the users:
          <list list-type="bullet">
            <list-item><p>username</p></list-item>
            <list-item><p>age range</p></list-item>
            <list-item><p>gender</p></list-item>
            <list-item><p>occupation</p></list-item>
            <list-item><p>native language.</p></list-item>
          </list>
        </p>
        <p>csv5: The file user's topic contains the TREC-style final topic descriptions
of the users' information needs that were developed in the final step of each
search session:
          <list list-type="bullet">
            <list-item><p>username: the user who formulated the topic</p></list-item>
            <list-item><p>query session: id of the query session to which the topic refers</p></list-item>
            <list-item><p>title: a short phrase defining the topic, provided by the user</p></list-item>
            <list-item><p>description: a detailed sentence describing the topic, provided by the user</p></list-item>
            <list-item><p>narrative: a description of which documents are relevant to the topic and which are not, provided by the user.</p></list-item>
          </list>
        </p>
        <p>csv6a: The file simple user profile contains, for each user, the following information
(simple version: the applied indexing included tokenization, shingling, and index
term weighting):
          <list list-type="bullet">
            <list-item><p>username: the user whose interests are represented</p></list-item>
            <list-item><p>category: the search domain of interest</p></list-item>
            <list-item><p>a list of triples constituted by:
              <list list-type="bullet">
                <list-item><p>a term: a word or n-gram related to the user's searches</p></list-item>
                <list-item><p>a normalised score: the term weight, computed as the mean of the term
frequencies in the user's documents of interest, where the term frequency is the
ratio of the number of occurrences of the term in a document to the
number of occurrences of the most frequent term in the same document.</p></list-item>
              </list>
            </p></list-item>
          </list>
        </p>
        <p>csv6b: The file complex user profile contains, for each user, the same information
provided in csv6a, with the difference that the applied indexing was enriched by
also including stop word removal:
          <list list-type="bullet">
            <list-item><p>username: the user whose interests are represented</p></list-item>
            <list-item><p>category: the search domain of interest</p></list-item>
            <list-item><p>a list of triples constituted by:
              <list list-type="bullet">
                <list-item><p>term: a word or a set of words related to the user's searches</p></list-item>
                <list-item><p>normalised score</p></list-item>
              </list>
            </p></list-item>
          </list>
        </p>
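        <p>The normalised score described for the profile files can be illustrated with a minimal Python sketch. The function name, the whitespace tokenisation, and the choice to average over all of a user's documents (counting a term frequency of 0 where a term is absent) are assumptions made for illustration; they are not taken from the tool used to build the profiles.</p>
        <preformat>
```python
from collections import Counter


def normalised_scores(documents):
    """Mean max-normalised term frequency per term over a user's documents."""
    totals = Counter()
    for text in documents:
        counts = Counter(text.lower().split())  # naive tokenisation (assumption)
        if not counts:
            continue
        max_count = max(counts.values())
        for term, count in counts.items():
            # Term frequency: occurrences of the term / occurrences of the most frequent term.
            totals[term] += count / max_count
    # Assumption: the mean is taken over all documents, counting 0 where a term is absent.
    return {term: total / len(documents) for term, total in totals.items()}


docs = ["jazz piano jazz", "piano lessons"]  # hypothetical documents of interest
scores = normalised_scores(docs)
print(scores)  # {'jazz': 0.5, 'piano': 0.75, 'lessons': 0.5}
```
        </preformat>
        <p>In this toy example "piano" obtains the highest weight because it appears in both documents, while "jazz" dominates only the first one.</p>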
        <p>Task participants could contribute in two different ways:
          <list list-type="bullet">
            <list-item><p>The two user profile files (csv6a and csv6b) provide bag-of-words profiles for
each of the 10 volunteer searchers in the data collection. The profiles were
created by applying different indexing procedures to the documents that the
searcher assessed as relevant during the search session. The searcher's log
file (csv2) contains all the queries she formulated during the query session.
Task participants could compare the results obtained by applying their
personalisation algorithms to these queries with the results obtained and
evaluated by the searchers on the same queries (and included in the user
assessment file csv3). The search had to be carried out on the ClueWeb12
collection, by using the API provided by DCU. Then, by using the 4-grade
evaluations of the documents (relevant, somewhat relevant, not relevant,
off-topic) provided by the users and contained in the user assessment file csv3, it
was possible to compute Average Precision (AP) and Normalized Discounted
Cumulative Gain (NDCG) using the standard NIST trec_eval tool. Note that
documents that do not appear in csv3 were considered non-relevant.</p></list-item>
            <list-item><p>The challenge here was to use the raw data provided in csv1, csv2, csv3, csv4,
and csv5 to create user profiles. A user profile is a formal representation of the
user's interests and preferences; the more accurate the representation of the
user model, the higher the probability of improving the search process. In the
approaches proposed in the literature, user profiles are formally represented
as bags of words, as vectors, or as conceptual taxonomies, generally defined
on the basis of external knowledge resources (such as WordNet and the ODP,
the Open Directory Project). The task request here was more research-oriented:
is the provided information sufficient to create a useful profile? Which
information is missing? The expected outcome was a report of up to 6 pages in which
the participant discusses user information for profiling purposes,
proposes possible additions to the provided data, and suggests a
way to collect them in a controlled Cranfield-style experiment.</p></list-item>
          </list>
        </p>
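        <p>As an illustration of how the graded judgments in csv3 can feed AP and NDCG, the following Python sketch computes both for a single query. It is not trec_eval and may differ from it in detail. The threshold that treats grades 3 and 4 as relevant, the gain mapping max(grade - 2, 0), and all names are assumptions made for this example; unjudged documents are treated as non-relevant, as stated above.</p>
        <preformat>
```python
import math


def average_precision(ranking, grades, min_grade=3):
    """AP with documents graded >= min_grade treated as relevant (assumption)."""
    relevant = {d for d, g in grades.items() if g >= min_grade}
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def ndcg(ranking, grades):
    """NDCG with gain = max(grade - 2, 0); unjudged documents gain 0 (assumption)."""
    def gain(doc):
        return max(grades.get(doc, 0) - 2, 0)

    def dcg(gains):
        return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

    ideal = dcg(sorted((max(g - 2, 0) for g in grades.values()), reverse=True))
    return dcg([gain(d) for d in ranking]) / ideal if ideal else 0.0


grades = {"d1": 4, "d2": 2, "d3": 3}  # hypothetical csv3 judgments for one query
ranking = ["d2", "d1", "d9", "d3"]    # d9 is unjudged, so it counts as non-relevant
print(average_precision(ranking, grades))  # 0.5
print(round(ndcg(ranking, grades), 3))     # 0.643
```
        </preformat>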
        <p>We encouraged participants to be involved in this task by using existing or
new algorithms and/or to explore new ideas. We also welcomed contributions
that analyse the task and/or the dataset.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Evaluation and Analysis</title>
      <p>The metrics and methodology used to evaluate and analyze the PIR-CLEF task
pose significant challenges. It is not at all obvious how to properly
compare and contrast the behaviour of alternative methods of integrating
personalization into search sessions. While we can start off using standard metrics, such as
Average Precision (AP) and Normalized Discounted Cumulative Gain (NDCG),
for individual queries, these will not be sufficient to enable a detailed session-based
analysis.</p>
      <p>As a starting point for the development of a formal methodology for the analysis
and evaluation of our framework for laboratory-based evaluation of PIR, we have
developed a prototype evaluation tool, which we describe in the remainder of this
section.</p>
      <sec id="sec-4-1">
        <title>PIR Evaluation Tool</title>
        <p>
          Our proposed evaluation tool is designed to provide a repeatable approach to the
evaluation of PIR. A description of the analysis, design and implementation
of the current version of this tool is provided in [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
        </p>
        <p>This evaluation tool consists of three sequential phases:
          <list list-type="order">
            <list-item><p>File Extraction: During this phase the data are extracted from the standard
TREC format results files created for each retrieval query operation. The
extracted data are stored in efficient data structures to accelerate the next
steps.</p></list-item>
            <list-item><p>Metric Calculation: During this phase a set of standard IR evaluation
measures is calculated, including Precision, Recall, Precision@K, NDCG,
etc. Novel approaches for the evaluation of a retrieval session are also applied.</p></list-item>
            <list-item><p>Output Generation: During this phase the evaluation tool produces a
report which shows a set of standard IR evaluation measures. These evaluation
measures are computed and compared in such a way as to highlight differences in
performance between alternative results files. Additionally, a set
of charts is included in the results to display graphically the evolution of
each evaluation measure through a retrieval session.</p></list-item>
          </list>
        </p>
        <p>We now describe each phase in more detail.</p>
        <p>File Extraction. In this phase all information contained in the input files is
extracted and efficient data structures are created to accelerate the evaluation
process. These files contain the information required to estimate the performance
of a system, and therefore need to be created in a specified format:
          <list list-type="bullet">
            <list-item><p>TREC format results file: containing the runs performed by a PIR system
using the personalised data collection. The tool supports processing
multiple results files concurrently.</p></list-item>
            <list-item><p>Relevance judgments: containing the relevant documents marked by the
volunteer searchers.</p></list-item>
            <list-item><p>Search logs: containing all searcher activities recorded during the search
sessions.</p></list-item>
            <list-item><p>Commands file: allowing the results to be tailored.</p></list-item>
          </list>
        </p>
        <p>TREC Format Results File: The evaluation tool requires TREC format results
files containing the ranked lists computed by the PIR system. The TREC format
was chosen because it is well known in the IR community. Results files must
have the following fields: user id, topic id, query id, document id, rank, name.</p>
        <p>Relevance Judgments File: The relevance judgments file contains the relevant
documents marked by participants during the third phase of the PIR-CLEF
experiment. This file has the following fields: user id, topic id, query id, document
id, relevance score.</p>
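        <p>A results file with the fields listed above could be loaded as in the following Python sketch. The whitespace-separated layout, the column order, and the function name are assumptions for illustration; the text does not specify the separator used by the tool.</p>
        <preformat>
```python
from collections import defaultdict


def parse_results(lines):
    """Group ranked document ids by (user id, topic id, query id)."""
    runs = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        user_id, topic_id, query_id, doc_id, rank, _name = parts
        runs[(user_id, topic_id, query_id)].append((int(rank), doc_id))
    # Sort each list by rank and keep only the document ids.
    return {key: [doc for _, doc in sorted(docs)] for key, docs in runs.items()}


sample = [  # hypothetical lines: user id, topic id, query id, document id, rank, name
    "u1 t1 q1 docB 2 runA",
    "u1 t1 q1 docA 1 runA",
    "u1 t1 q2 docC 1 runA",
]
runs = parse_results(sample)
print(runs[("u1", "t1", "q1")])  # ['docA', 'docB']
```
        </preformat>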
        <p>Search Logs: The search logs contain the user information gathered during the
search sessions. This information is used by the PIR system to create the user
representation that will be used in the retrieval process and by the evaluation
tool to simulate the user behaviour.</p>
        <p>Commands File: This allows the evaluation process to be tailored by
specifying which measures and charts are to be computed by the tool.</p>
        <p>Metric Calculation. During this phase metrics are computed to enable the
performance of the PIR systems to be evaluated and compared. The evaluation
tool first computes per-query measures, which allow the evaluation of
effectiveness considering a single ranked list. It then applies novel approaches to the
evaluation of retrieval sessions considering multiple ranked lists.</p>
        <p>We next describe the process undertaken by the evaluation tool when computing
per-query measures. We then detail the approaches defined for the evaluation of
retrieval sessions.</p>
        <p>Per-Query Measures. The evaluation tool computes the following measures for
each query in the personalised data collection: Precision, Recall, Precision@K,
Recall@K, F-measure, R-precision, Average Precision, and NDCG.</p>
        <p>The computation of each metric follows these steps:
          <list list-type="order">
            <list-item><p>The set of ranked lists related to the queries is
extracted from the results file provided by the PIR system, so as to have a
ranked list for each query.</p></list-item>
            <list-item><p>The set of relevance judgments containing the judged documents
for each query is extracted from the relevance judgments file.</p></list-item>
            <list-item><p>The evaluation measure is calculated by comparing the list of retrieved documents
with the set of relevance judgments for each query.</p></list-item>
          </list>
        </p>
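        <p>The comparison in the final step can be sketched in Python for several of the listed measures. This is a minimal illustration with hypothetical names and binary relevance; it does not reproduce the tool's implementation.</p>
        <preformat>
```python
def per_query_measures(ranking, relevant, k=10):
    """Set-based and cutoff measures for one ranked list and one judgment set."""
    def precision_at(n):
        top = ranking[:n]
        return sum(doc in relevant for doc in top) / n if n else 0.0

    hits = sum(doc in relevant for doc in ranking)
    precision = hits / len(ranking) if ranking else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    denom = precision + recall
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / denom if denom else 0.0,
        "precision_at_k": precision_at(k),
        # R-precision: precision at rank R, where R is the number of relevant documents.
        "r_precision": precision_at(len(relevant)),
    }


measures = per_query_measures(["a", "b", "c", "d"], {"a", "c", "e"}, k=2)
print(measures["precision"], measures["precision_at_k"])  # 0.5 0.5
```
        </preformat>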
        <p>For each measure-user-topic triple, the evaluation tool generates both a line
chart and a bar chart to show the evolution of the measure through the retrieval
session as well as to compare the performance of the different input algorithms.</p>
        <p>The created charts have a y-axis, which represents the measure value, and
an x-axis, which represents the queries belonging to the same topic sorted by
time using the timestamp contained in the search logs.</p>
        <p>
          Session Measures. Real users often begin an interaction with a search engine with
a query which they need to reformulate one or more times. The ability of the PIR
system to improve results after query reformulation cannot be easily assessed by
the measures normally used for measuring system effectiveness, but requires new
approaches that take into consideration the user behaviour triggered by the system.
Three alternative approaches are proposed to simulate user behaviour through
a retrieval session:
          <list list-type="order">
            <list-item><p>Using the logs file: user behaviour is simulated using information contained
in the search logs.</p></list-item>
            <list-item><p>Looking for a non-relevant document: ranked lists extracted from the
results file are considered to simulate the user path through the session.</p></list-item>
            <list-item><p>Using the probabilistic distribution: user behaviour is defined using
the probability distribution defined in [<xref ref-type="bibr" rid="ref12">12</xref>].</p></list-item>
          </list>
        </p>
        <p>Using only the logs file: Given the set of queries performed
by a user for a topic, and the ranked list for each of these queries extracted from the
results file, the search logs provide a decision point for each ranked list. This information
represents the behaviour of the participant who
performed the search session in the experiment. The evaluation of a session is
carried out by applying the following steps:
          <list list-type="order">
            <list-item><p>Each ranked list is cut at position j, where j is the position of the
last document opened by the user.</p></list-item>
            <list-item><p>The cut ranked lists are merged to build the session ranked list.</p></list-item>
            <list-item><p>Precision and Recall are computed using the session ranked list derived in
the previous steps.</p></list-item>
          </list>
        </p>
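        <p>The three steps above can be sketched in Python as follows. The merge strategy (concatenation in query order, keeping each document once) and the use of a single set of relevant documents for the whole session are assumptions, since the text does not specify them; all names are hypothetical.</p>
        <preformat>
```python
def session_precision_recall(ranked_lists, last_opened, relevant):
    """Cut each list at its decision point, merge, then score the merged list."""
    session = []
    for ranking, j in zip(ranked_lists, last_opened):
        for doc in ranking[:j]:       # j is the 1-based rank of the last opened document
            if doc not in session:    # assumption: duplicates are kept once
                session.append(doc)
    hits = sum(doc in relevant for doc in session)
    precision = hits / len(session) if session else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return session, precision, recall


ranked_lists = [["a", "b", "c"], ["b", "d", "e"]]  # one ranked list per query
last_opened = [2, 2]                               # decision points from the logs
session, precision, recall = session_precision_recall(
    ranked_lists, last_opened, {"a", "d", "e"}
)
print(session)  # ['a', 'b', 'd']
```
        </preformat>
        <p>Here two of the three documents in the session list are relevant, and two of the three relevant documents are found, so both Precision and Recall equal 2/3.</p>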
        <p>This approach has the drawback that it assumes that the user behaviour triggered
by the PIR system which retrieved the ranked list is the same as that
caused by the IR system used to gather the data in the PIR-CLEF
experiment. However, user behaviour is strongly related to the retrieval system,
since it depends on the position of the relevant documents in the ranked list.</p>
        <p>Looking for a non-relevant document: This approach cuts the ranked lists at the
position of the first non-relevant document found after the decision point. In this
way the documents in the ranked list are used to simulate user behaviour,
making it dependent on the PIR system which retrieved the ranked lists. This
is based on the following assumptions:
          <list list-type="order">
            <list-item><p>The decision point suggests the number of documents that the user considers
in the ranked list.</p></list-item>
            <list-item><p>The user continues to look at the ranked list if the last document
opened is relevant.</p></list-item>
          </list>
        </p>
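        <p>One possible reading of this cutting rule is sketched below in Python. Whether the first non-relevant document is itself included in the cut list, and what happens when the document at the decision point is not relevant, are not stated in the text; the choices made here are assumptions, and the function name is hypothetical.</p>
        <preformat>
```python
def cut_position(ranking, decision_point, relevant):
    """Return the 1-based rank at which one ranked list is cut."""
    j = min(decision_point, len(ranking))
    # If nothing was opened, or the last opened document is not relevant, stop there.
    if j == 0 or ranking[j - 1] not in relevant:
        return j
    # Otherwise read on; the first non-relevant document is included in the cut (assumption).
    for rank in range(j + 1, len(ranking) + 1):
        if ranking[rank - 1] not in relevant:
            return rank
    return len(ranking)


ranking = ["a", "b", "c", "d", "e"]
print(cut_position(ranking, 2, {"b", "c"}))  # 4
print(cut_position(ranking, 1, {"b", "c"}))  # 1
```
        </preformat>
        <p>Because the cut depends on which documents the evaluated system placed after the decision point, two systems with the same logged decision point can yield session lists of different lengths.</p>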
        <p>This approach considers the documents in the ranked list. The procedure
ensures that different ranked lists lead to different user behaviour, making the
evaluation process as realistic as possible. Although the simulated user
behaviour is more closely tailored to each system, this approach is still based on the decision points in the
search logs. Consequently, an approach that does not use the information from
the search logs has been designed to provide a simulation of user behaviour
that is as independent as possible of the retrieval system used.</p>
        <p>
          Using a probabilistic distribution: The probabilistic distribution defined in [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]
was used to simulate user behaviour independently of the search logs. That
work suggests that a user progresses from one document in the ranked list
to the next with probability p, and ends their examination of the ranking at that
point with probability 1 - p, under the following assumptions:
          <list list-type="order">
            <list-item><p>Each decision is made independently of the current depth reached
in the ranking, independently of previous decisions, and independently of
whether or not the document examined is relevant.</p></list-item>
            <list-item><p>The user always looks at the first document, looks at the second with
probability p, at the third with probability p<sup>2</sup>, and at the i-th with
probability p<sup>i-1</sup>.</p></list-item>
          </list>
        </p>
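        <p>The user model above can be illustrated with a short Python sketch that lists the viewing probabilities and samples a stopping rank for one simulated user. The function names and the value of p are chosen for illustration only and are not taken from the evaluation tool.</p>
        <preformat>
```python
import random


def view_probabilities(n, p):
    """Probability that the user looks at each of the first n ranks: p ** (i - 1)."""
    return [p ** (i - 1) for i in range(1, n + 1)]


def sample_stop_rank(n, p, rng):
    """Simulate one user: always read rank 1, then continue with probability p."""
    rank = 1
    while n > rank and p > rng.random():
        rank += 1
    return rank


print(view_probabilities(3, 0.5))  # [1.0, 0.5, 0.25]
rng = random.Random(0)  # fixed seed so the simulation is repeatable
cut = sample_stop_rank(10, 0.8, rng)
```
        </preformat>
        <p>The sampled rank can then play the role of the cut position used by the other two approaches, without consulting the search logs.</p>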
        <p>The user model proposed is a reasonable approximation of how people consult
ranked lists and similar behaviour has been observed in user experiments.</p>
        <p>Output Generation. The evaluation tool produces a report showing a set of standard IR
evaluation measures, computed and compared to highlight the differences
between the input algorithms. A set of charts is also generated to display the
evolution of the evaluation measures through a session.</p>
        <p>The evaluation tool generates a report with the following fields:
1. An initial row providing all the information about the query:
(a) User ID
(b) Topic ID
(c) Query ID
2. Measure name: The name of the calculated measure.
3. Algorithm name: The retrieval algorithm used for the run.
4. Measure value: The performance of the algorithm that retrieved
the ranked list used to compute the measure.
5. S1 - Si: The difference between the best-performing algorithm
for the measure considered and the algorithm considered.
</p>
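        <p>As a rough illustration of the report layout, the Python sketch below builds one row per algorithm for a single measure and fills the S1 - Si field as the gap to the best score. All field names, identifiers, algorithm names, and scores are hypothetical, and the sketch assumes that a higher measure value is better; the actual tool may structure its output differently.</p>
        <preformat>
```python
def build_report_rows(measure_name, scores):
    """Build one report row per algorithm for a single measure.

    `scores` maps an algorithm name to its measure value; a higher value
    is assumed to be better.
    """
    best = max(scores.values())
    return [
        {
            "measure": measure_name,
            "algorithm": name,
            "value": value,
            # S1 - Si: gap between the best algorithm and this one.
            "diff_from_best": best - value,
        }
        for name, value in scores.items()
    ]


# Hypothetical query header (initial row) and scores for two runs.
header = {"user_id": "u01", "topic_id": "t03", "query_id": "q1"}
rows = build_report_rows("P@10", {"baseline": 0.5, "personalised": 0.75})

print(header)
for row in rows:
    print(row)
```
        </preformat>
        <p>The best-performing algorithm gets a difference of zero, so the last field can be read directly as the loss relative to the best run for that measure.</p>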
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Participant Submissions</title>
      <p>Two participant papers were accepted for presentation at the PIR-CLEF session
at the CLEF 2018 conference. The papers used the provided data
collection to examine two different tasks, and both present interesting and useful
findings and suggestions on how to improve the PIR-CLEF dataset.</p>
      <p>
        The paper titled ECNU at CLEF PIR 2018: Evaluation of personalised
information retrieval [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] presents a study exploring the potential of the dataset
provided by PIR-CLEF. The authors report two different
experiments based on two distinct baselines, a "query level baseline" and a "session
level baseline": in the first, each query in a session is evaluated
independently, while in the second all queries in a session are combined to
define a single query. The authors then report experiments in
which they applied two query expansion methods and an approach to define a
"topic sensitive user model" based on search sessions. In their discussion,
they make the following suggestions to improve the dataset: i) to provide more
relevance labels, ii) to increase the amount of user-related
information provided (i.e. information on more user profiles), and iii) to
define richer query logs.
      </p>
      <p>
        The paper titled PIR based on explicit and implicit feedback [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] addresses
Task 2 of the PIR-CLEF lab, i.e. user profiling. More specifically, the authors
explored the use of explicit and implicit feedback to define user profiles.
For explicit feedback, the subjective relevance judgments provided by the
searchers for a given set of documents are used to train a text classifier, thus
treating the PIR task as a text classification task. For implicit
feedback, the correlation between information inferred from the data and the
relevance judgments provided by the users is analysed. Several analyses
and useful remarks are reported in the paper.
      </p>
    </sec>
    <sec id="sec-6">
      <title>Conclusions and Future Work</title>
      <p>In this work the PIR-CLEF 2018 Personalised Information Retrieval task was
presented. This task is the first edition of a lab dedicated to the theme of
personalised search, following the successful pilot held at CLEF 2017. It is the first
evaluation benchmark in this field based on the Cranfield paradigm, with the
significant benefit of producing easily reproducible results. The PIR-CLEF 2018
workshop provided a Lab task based on a test collection that was
generated using a well-defined methodology. An evaluation using this collection
was run to allow research groups working on PIR both to gain experience with
and to provide feedback on our proposed PIR evaluation methodology. We also
introduced our current work on a prototype system for the comparative analysis
of PIR systems across search sessions.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>C.</given-names>
            <surname>Sanvitto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ganguly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. J. F.</given-names>
            <surname>Jones</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Pasi</surname>
          </string-name>
          .
          <article-title>A Laboratory-Based Method for the Evaluation of Personalised Search</article-title>
          .
          <source>Proceedings of the Seventh International Workshop on Evaluating Information Access (EVIA 2016), a Satellite Workshop of the NTCIR-12 Conference</source>
          , Tokyo, Japan, June 7,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>G.</given-names>
            <surname>Pasi</surname>
          </string-name>
          .
          <article-title>Issues in personalising information retrieval</article-title>
          .
          <source>IEEE Intelligent Informatics Bulletin</source>
          ,
          <volume>11</volume>
          (
          <issue>1</issue>
          ):
          <fpage>3</fpage>
          -
          <lpage>7</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>L.</given-names>
            <surname>Tamine-Lechani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Boughanem</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Daoud</surname>
          </string-name>
          .
          <article-title>Evaluation of contextual information retrieval effectiveness: overview of issues and research</article-title>
          .
          <source>Knowledge and Information Systems</source>
          ,
          <volume>24</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>34</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>D.</given-names>
            <surname>Harman</surname>
          </string-name>
          .
          <article-title>Overview of the fourth text retrieval conference (TREC-4)</article-title>
          . In D. K. Harman, editor,
          <source>TREC, volume Special Publication 500-236. National Institute of Standards and Technology (NIST)</source>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>J.</given-names>
            <surname>Allan</surname>
          </string-name>
          .
          <article-title>HARD track overview in TREC 2003: High accuracy retrieval from documents</article-title>
          .
          In
          <source>Proceedings of The Twelfth Text REtrieval Conference (TREC 2003)</source>
          , pages
          <fpage>24</fpage>
          -
          <lpage>37</lpage>
          , Gaithersburg, Maryland, USA,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Adriel</given-names>
            <surname>Dean-Hall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Charles L. A.</given-names>
            <surname>Clarke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jaap</given-names>
            <surname>Kamps</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Paul</given-names>
            <surname>Thomas</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ellen M.</given-names>
            <surname>Voorhees</surname>
          </string-name>
          .
          <article-title>Overview of the TREC 2012 contextual suggestion track</article-title>
          .
          In
          <source>Voorhees and Buckland</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>B.</given-names>
            <surname>Carterette</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Kanoulas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Hall</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P. D.</given-names>
            <surname>Clough</surname>
          </string-name>
          .
          <article-title>Overview of the TREC 2014 session track</article-title>
          .
          In
          <source>Proceedings of The Twenty-Third Text REtrieval Conference (TREC 2014)</source>
          , Gaithersburg, Maryland, USA,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Debasis</given-names>
            <surname>Ganguly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Johannes</given-names>
            <surname>Leveling</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Gareth J. F.</given-names>
            <surname>Jones</surname>
          </string-name>
          .
          <article-title>Overview of the personalized and collaborative information retrieval (PIR) track at FIRE-2011</article-title>
          . In Prasenjit Majumder, Mandar Mitra, Pushpak Bhattacharyya, L. Venkata Subramaniam, Danish Contractor, and Paolo Rosso, editors,
          <source>Multilingual Information Access in South Asian Languages - Second International Workshop, FIRE 2010, Gandhinagar, India, February 19-21, 2010 and Third International Workshop, FIRE 2011, Bombay, India, December 2-4, 2011, Revised Selected Papers</source>
          , volume
          <volume>7536</volume>
          of Lecture Notes in Computer Science, pages
          <fpage>227</fpage>
          -
          <lpage>240</lpage>
          . Springer,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>M.</given-names>
            <surname>Villegas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Puigcerver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. H.</given-names>
            <surname>Toselli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Sánchez</surname>
          </string-name>
          and
          <string-name>
            <given-names>E.</given-names>
            <surname>Vidal</surname>
          </string-name>
          .
          <article-title>Overview of the ImageCLEF 2016 Handwritten Scanned Document Retrieval Task</article-title>
          .
          <source>In Proceedings of CLEF</source>
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>S.</given-names>
            <surname>Robertson</surname>
          </string-name>
          .
          <article-title>A new interpretation of Average Precision</article-title>
          .
          <source>In Proceedings of the International ACM SIGIR conference on Research and development in information retrieval (SIGIR '08)</source>
          . pp.
          <fpage>689</fpage>
          -
          <lpage>690</lpage>
          . ACM, New York, NY, USA (
          <year>2008</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>A.</given-names>
            <surname>Angiolillo</surname>
          </string-name>
          ,
          <article-title>Comparative Evaluation of Personalised Search Systems</article-title>
          , Università degli Studi di Milano Bicocca, Milano, Italy,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>A.</given-names>
            <surname>Moffat</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Zobel</surname>
          </string-name>
          ,
          <article-title>Rank-biased precision for measurement of retrieval effectiveness</article-title>
          ,
          <source>ACM Transactions on Information Systems (TOIS)</source>
          ,
          <volume>27</volume>
          (
          <issue>1</issue>
          ), ACM,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>Q.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Hu</surname>
          </string-name>
          and
          <string-name>
            <given-names>L.</given-names>
            <surname>He</surname>
          </string-name>
          .
          <article-title>ECNU at CLEF PIR 2018: Evaluation of Personalized Information Retrieval</article-title>
          . In
          <source>Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum</source>
          , Avignon, France,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>A.</given-names>
            <surname>Andreu-Marín</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Martínez-Santiago</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Ureña-López</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. C.</given-names>
            <surname>Díaz-Galiano</surname>
          </string-name>
          .
          <article-title>PIR Based in Explicit and Implicit Feedback</article-title>
          . In
          <source>Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum</source>
          , Avignon, France,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>