<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Cultural Heritage in CLEF (CHiC) Overview 2012</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vivien Petras</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicola Ferro</string-name>
          <email>ferro@dei.unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maria Gäde</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antoine Isaac</string-name>
          <email>aisaac@few.fu.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Kleineberg</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ivano Masiero</string-name>
          <email>masieroi@dei.unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mattia Nicchio</string-name>
          <email>nicchio@dei.unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Juliane Stiller</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Berlin School of Library and Information Science, Humboldt-Universität zu Berlin</institution>
          ,
          <addr-line>Dorotheenstr. 26, 10117 Berlin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Information Engineering, University of Padova</institution>
          ,
          <addr-line>Via Gradenigo 6/B, 35131 Padova</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Europeana</institution>
          ,
          <addr-line>The Europeana Office, Koninklijke Bibliotheek, Prins Willem-Alexanderhof 5, 2595 BE Den Haag</addr-line>
          ,
          <country country="NL">Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The paper for the CHiC pilot lab describes the motivation, tasks, Europeana collections and topics, and evaluation measures, as well as the submitted and analyzed information retrieval runs. In its first year, CHiC offered three tasks: ad-hoc, which measured retrieval effectiveness according to the relevance of the ranked retrieval results (standard 1000-document TREC output); variability, which required participants to present a list of 12 records that represent diverse information contexts; and semantic enrichment, which asked participants to provide a list of 10 concepts semantically related to the one in the query, to be used in query expansion experiments. All tasks were offered in monolingual, bilingual and multilingual modes. 126 different experiments from 6 participants were evaluated using the DIRECT system.</p>
      </abstract>
      <kwd-group>
        <kwd>cultural heritage</kwd>
        <kwd>Europeana</kwd>
        <kwd>variability</kwd>
        <kwd>diversity</kwd>
        <kwd>semantic enrichment</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Cultural heritage content is often multilingual and multimedia (e.g. text, photographs,
images, audio recordings, and videos), usually described with metadata in multiple
formats and of different levels of complexity. Institutions in this domain have
different approaches to managing information and serve diverse user communities, often
with specialized needs and information contexts (native language, search
environment, etc.).</p>
      <p>Evaluation approaches (particularly system-oriented evaluation) in this domain
have been fragmentary and often non-standardized. The CHiC 2012 pilot evaluation
lab aimed at moving towards a systematic and large-scale evaluation of cultural
heritage digital libraries and information access systems. The lab's goal is to increase our
understanding of how to integrate examples from the cultural heritage community
into a CLEF-style evaluation framework and how results can be fed back into the CH
community.</p>
      <p>The CHiC lab researches information retrieval systems for the cultural heritage
environment by using real data, real user queries and real tasks. CHiC has teamed up
with Europeana<sup>1</sup>, Europe’s largest digital library, museum and archive for cultural
heritage objects, to provide a realistic environment for experiments.</p>
      <p>At the CLEF 2011 conference, a first workshop on information retrieval evaluation
was put on by the organizers of the lab to discuss information needs, search practices
and appropriate information retrieval tasks for this domain. The outcome of this
workshop was a pilot lab proposal for the CLEF conference series suggesting three
tasks relevant for cultural heritage information systems. Even as a pilot lab, CHiC was
able to use real data and real search topics gathered from Europeana.</p>
      <p>The paper is structured as follows: sections 2-4 explain the data collection, the
preparation of topics and the CHiC tasks as well as the evaluation measures used.
Sections 5 and 6 provide an overview of the participants and submitted experiments
and describe the relevance assessment process. Section 7 discusses the experimental
results, whereas section 8 provides an outlook for the next lab.</p>
    </sec>
    <sec id="sec-2">
      <title>Collection</title>
      <p>In March 2012, the complete Europeana data index was downloaded for collection
preparation. The Europeana index as used in Europeana’s Solr search portal contained
23,300,932 documents with a size of 132 GB.</p>
      <p>Europeana data consists of metadata records describing digital representations of
cultural heritage objects, e.g. the scanned version of a manuscript, an image of a
painting or sculpture, or an audio or video recording. Roughly 62% of the metadata
records describe images, 35% describe text, 2% describe audio and 1% video
recordings. The metadata contains title and description data, media type and chronological
data as well as provider information. For ca. 30% of the records, content-related
enrichment keywords were added automatically by Europeana.</p>
      <p>The original Europeana index contained fields from different schemas: Simple
Dublin Core, e.g. dc:title, dc:description; Qualified Dublin Core, e.g.
dcterms:provenance, dcterms:spatial; and Europeana Semantic Elements, e.g.
europeana:type, europeana:isShownAt. On top of these schema-related fields, there
were additional fields used internally in the Lucene index to improve search
performance or to support specific application functionalities.</p>
      <p>These fields were removed from the data collection and the index data was
wrapped in a special XML format. The whole collection was then divided into 14
subcollections according to the language of the content provider of the record (which
usually indicates the language of the metadata record). If all the provider languages
had been used, the number of subcollections would have reached 30. Thus, in order to
reduce this number, a threshold was set: all the languages with fewer than 100,000
documents were grouped together under the name “Other”.
1 http://www.europeana.eu</p>
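      <p>The grouping rule described above can be illustrated with a minimal sketch. It assumes that each record carries a provider language code and that the threshold is applied to the per-language record count; the function name subcollection_names and the toy data are hypothetical and are not part of the CHiC tooling.</p>

```python
from collections import Counter

# Threshold from the text: languages with fewer documents go to "Other".
MIN_DOCS = 100_000


def subcollection_names(provider_languages, min_docs=MIN_DOCS):
    """Map each provider language to the subcollection it is filed under."""
    counts = Counter(provider_languages)
    return {
        lang: lang if count >= min_docs else "Other"
        for lang, count in counts.items()
    }


# Toy data with a tiny threshold so the example runs instantly.
toy_languages = ["en"] * 5 + ["de"] * 4 + ["mt"] + ["is"] * 2
toy_mapping = subcollection_names(toy_languages, min_docs=3)
print(toy_mapping)
```

      <p>In this toy run the two small languages are both filed under “Other”, so four provider languages yield three subcollections, in the same way that 30 provider languages were reduced to 14.</p>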
      <p>The resultant 14 subcollections are listed in Table 1. For the CHiC 2012
experiments, only the English, French and German subcollections as well as the entire
collection were used.</p>
      <p>The XML data for all collections were made available and released to participants.
Figure 1 shows an extract from an example record in the Europeana CHiC collection.</p>
      <p>&lt;ims:metadata ims:identifier="http://www.europeana.eu/resolve/record/10105/5E1618BFAF072B8953B30701A6A6C3BB655ACF9D" ims:namespace="http://www.europeana.eu/" ims:language="eng"&gt;
&lt;ims:fields&gt;
&lt;dc:identifier&gt;Orn.0240&lt;/dc:identifier&gt;
&lt;dc:subject&gt;Tachymarptis melba&lt;/dc:subject&gt;
&lt;dc:title&gt;Rundun Zaqqu Bajda (Orn.0240)&lt;/dc:title&gt;
&lt;dc:title&gt;Alpine Swift (Orn.0240)&lt;/dc:title&gt;
&lt;dc:type&gt;mounted specimen&lt;/dc:type&gt;
&lt;europeana:country&gt;malta&lt;/europeana:country&gt;
&lt;europeana:dataProvider&gt;Heritage Malta&lt;/europeana:dataProvider&gt;
&lt;europeana:isShownAt&gt;http://www.heritagemalta.org/sterna/orn.php?id=0240&lt;/europeana:isShownAt&gt;
&lt;europeana:language&gt;en&lt;/europeana:language&gt;
&lt;europeana:provider&gt;STERNA&lt;/europeana:provider&gt;
&lt;europeana:type&gt;IMAGE&lt;/europeana:type&gt;
&lt;europeana:uri&gt;http://www.europeana.eu/resolve/record/10105/5E1618BFAF072B8953B30701A6A6C3BB655ACF9D&lt;/europeana:uri&gt;
&lt;/ims:fields&gt;
&lt;/ims:metadata&gt;</p>
      <sec id="sec-2-1">
        <title>Fig. 1. Europeana CHiC Collection Sample Record</title>
        <p>In the Europeana portal, object records commonly also contain thumbnails of the
object if it is an image and links to related records. The thumbnails were not
contained in the collection given to CHiC participants, but relevance assessors were able
to look at them at the original source.</p>
        <p>Finally, each file in the collection contained specific copyright information about
the metadata records themselves and their providers. The XML code shown in Figure 2
was used for this purpose.</p>
        <p>&lt;dc:rights&gt;The metadata contained in this file is made available by Europeana
(http://europeana.eu) only to the members of the Europeana Network
(http://pro.europeana.eu/about/network) that have agreed to use it for the research purposes of
the CLEF initiative (http://www.clef-initiative.eu). This usage falls within the more general
conditions of the Europeana Terms for Re-use of Europeana Metadata
(http://pro.europeana.eu/terms-of-use).&lt;/dc:rights&gt;</p>
        <p>For all experiments, original user queries were extracted from Europeana query logs.
From all user search sessions in August 2010, those queries were extracted that
resulted in a user viewing at least one complete object (in order to ensure that the
session contained more than one user-system interaction). The queries were then further
filtered to exclude queries containing wildcards and automatically generated queries (for example,
those produced by Europeana features).</p>
        <p>Over 500 queries were then annotated according to their query category, i.e.
topical, personal name, geographical name, work title or other. Queries were either in
English or ambiguous in language, i.e. they could equally occur in English.
Ambiguous queries include personal or location names that do not change
across languages, e.g. William Shakespeare.</p>
        <p>For CHiC, 50 queries were selected that covered a wide range of topics and
represented a distribution of query categories that was found in a previous study [9]. For
later relevance assessments, descriptions of the underlying information need were
added, but these were not admissible for information retrieval. The underlying information
need for a query can be ambiguous if the intention of the query is not clear. In this
case, the research group discussed the query and agreed on the most likely
information need. Figure 3 shows an example of an English query.</p>
        <p>&lt;topic lang="en"&gt;
&lt;identifier&gt;CHIC-004&lt;/identifier&gt;
&lt;title&gt;silent film&lt;/title&gt;
&lt;description&gt;documents on the history of silent film, silent film videos, biographies of actors and directors, characteristics of silent film and decline of this genre&lt;/description&gt;
&lt;/topic&gt;</p>
        <p>All 50 queries were then translated into French and German. For the variability and
semantic enrichment tasks, only the first 25 topics were used for the experiments.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>CHiC Tasks</title>
      <p>For the pilot lab of CHiC, three experimental tasks were selected that represented
realistic use cases for cultural heritage information systems like Europeana but were
also relatively simple in their set-up and to evaluate. The goal for this year’s lab was
to create baselines for topic and task development but also generate ground-truth in
relevance assessments for experimental results.</p>
      <p>All tasks were offered with the same set of topics and in three language modes: (i)
monolingual (query and document language are the same), (ii) bilingual (query and
document languages are different), (iii) multilingual (documents in multiple
languages, i.e. the whole Europeana collection is searched). This allowed the
participants to experiment with a number of language variations (Table 2).</p>
      <p>Participants were asked to submit at least one monolingual experiment in any
language per chosen task and were allowed to submit up to 4 experiments in the same
language mode and combination.</p>
      <p>The ad-hoc task is a standard retrieval task, which measures information retrieval
effectiveness with respect to user input in the form of queries. No further user-system
interaction is assumed, although automatic blind feedback or query expansion
mechanisms are allowed to improve the system ranking. The ad-hoc setting is the standard
setting for an information retrieval system: without prior knowledge about the user
need or context, the system is required to produce a relevance-ranked list of
documents based entirely on the query and the features of the collection documents.
Participants were allowed to use all collection fields and had to submit 1000 ranked
documents (TREC-style) for relevance assessment.</p>
      <sec id="sec-3-1">
        <title>Variability</title>
        <p>A particular user type - the casual user or “information tourist” - does not follow the
conventional pattern of a targeted information need being expressed in a targeted
query but poses particular challenges for access or entry points and result
presentation.</p>
        <p>
          The variability task required systems to present a list of 12 objects (representing the
first Europeana results page) that are relevant to the query and give a
particularly good overview of the different object types and categories, targeted
towards a casual user, who might like the "best" documents possibly sorted into "must
sees" and "other possibilities." This task is about returning diverse objects and
resembles the diversity tasks of the Interactive TREC track or the CLEF Image photo
tracks and other research [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], [5], [7-8], [11].
        </p>
        <p>For CHiC, this task reflects a typical user of a cultural heritage information
system, who would like to get an overview of what the system has with respect to a
certain concept or what the best alternatives are. It is also a pilot task for this type of
data collection using different assumptions about diversity or variability. Documents
returned should be relevant but also as diverse as possible with respect to:
• media type of object (text, image, audio, video)
• content provider
• query category
• field match (which metadata field contains a query term)</p>
        <p>Several approaches or measures have been suggested to measure diversity in an
information retrieval result set [2-4], [6], [10], [12]. For the pilot variability task, we
decided to measure cluster recall, i.e. the number of retrieved diverse categories
(media type, content providers, query categories etc.) divided by the number of possible
diverse categories per query. The evaluation of the results of this task was therefore
two-fold. First, all returned documents were assessed for their relevance and then the
cluster recall for relevant documents in the 4 categories above was determined.</p>
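        <p>The cluster recall definition above can be sketched as follows for a single diversity dimension. This is an illustrative sketch only: the function name cluster_recall, the record layout and the assumption that the set of possible categories is known per query are hypothetical, and the sketch is not the evaluation code used for the lab.</p>

```python
def cluster_recall(retrieved, relevant_ids, possible_clusters, key):
    """Share of the possible clusters covered by relevant retrieved records."""
    if not possible_clusters:
        return 0.0
    # Only records judged relevant contribute a cluster.
    covered = {key(rec) for rec in retrieved if rec["id"] in relevant_ids}
    return len(covered.intersection(possible_clusters)) / len(possible_clusters)


# Toy result list for one query, using media type as the diversity dimension.
results = [
    {"id": "r1", "media_type": "image"},
    {"id": "r2", "media_type": "text"},
    {"id": "r3", "media_type": "image"},
    {"id": "r4", "media_type": "video"},
]
relevant = {"r1", "r3", "r4"}  # r2 was judged not relevant
media_types = {"text", "image", "audio", "video"}

score = cluster_recall(results, relevant, media_types, key=lambda rec: rec["media_type"])
print(score)  # 0.5
```

        <p>Here the relevant records cover two of the four media types (image and video), giving a cluster recall of 0.5. The same function could be applied to content provider, query category or field match by changing the key.</p>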
      </sec>
      <sec id="sec-3-2">
        <title>Semantic Enrichment</title>
        <p>Semantic enrichment is an important task in cultural heritage information systems
such as Europeana, where queries are short and ambiguous. It supports the information
retrieval process either interactively (the user is asked for clarification, e.g. "Did you
mean?") or automatically (the query is automatically expanded with semantically
related concepts to increase the likely search success).</p>
        <p>The semantic enrichment task required systems to present a ranked list of at most
10 related concepts for a query to semantically enrich the query and / or guess the
user's information need or original query intent. For CHiC, this task resembles a
typical user interaction, where the system should react to an ambiguous query with a
clarification request (or a result output from an expanded query).</p>
        <p>Related concepts could be extracted from Europeana data (internal information) or
from other resources in the LOD cloud or other external resources (e.g. Wikipedia).
Europeana already enriches about 30% of its metadata objects with concepts, names
and places (included in the test collection). It uses the vocabularies GeoNames,
GEMET and DBpedia for its included semantic enrichments, which could be
explored further as well.</p>
        <p>For the semantic enrichment task, participants could also use the Europeana Linked
Open Data collections. Europeana released metadata on 2.5 million objects as linked
open data in a pilot project<sup>2</sup>. The data is represented in the Europeana Data Model
(RDF) and encompasses collections from ca. 300 content providers. Other external
resources were allowed but needed to be specified in the participants' descriptions.
The objects described in the LOD dataset are included in the Europeana test
collection, but the RDF format might be convenient for accessing object enrichments.</p>
        <p>System effectiveness was assessed in two phases. First, all submitted enrichments
were assessed manually for use in an interactive query expansion environment (e.g.
"does this suggestion make sense with respect to the original query?").</p>
        <p>During the second phase, the submitted terms and phrases were used in a query
expansion experiment, i.e. the enrichments were added to the query and submitted as
new experimental runs. All new topics were searched against the same standard
Lucene indexes of the Europeana collections (according to the language of the
enrichments). The results of those runs were then assessed according to ad-hoc retrieval
standards.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>CHiC Participation and Experiments</title>
      <p>Although 21 groups registered for participation in CHiC, only 6 research groups
submitted experimental results for evaluation. Table 3 shows the experiment participants
for CHiC.</p>
      <p>Humboldt Universität (one of the organizers) also submitted experiments for
assessment, which can be seen as baselines, because these multilingual ad-hoc runs used
Europeana’s Solr index to retrieve results. Two multilingual Europeana experiments
were submitted, one using Solr’s standard vector space ranking model, the other an
adapted version of the BM-25 ranking model. Table 4 shows the number of experiments per
task and language.</p>
      <p>DIRECT<sup>3</sup> (Distributed Information Retrieval Evaluation Campaign Tool)
supported the different stages of the CHiC evaluation activity, from the experiment
submission phase to the relevance assessment and metrics computation. DIRECT manages
different types of users, i.e. participants, assessors, organizers, and visitors, who need
to have access to different kinds of features and capabilities. A personal username and
password were assigned to each participant/assessor [13].
2 http://pro.europeana.eu/web/guest/linked-open-data</p>
    </sec>
    <sec id="sec-5">
      <title>Relevance Assessments</title>
      <sec id="sec-5-1">
        <title>Pooling</title>
        <p>The number of documents in large test collections such as CLEF makes it impractical
to judge every document for relevance. Instead, approximate recall values are
calculated using pooling techniques. The results submitted by the groups participating in
the tasks are used to form a pool of documents for each topic and language by
collecting the highly ranked documents from selected runs according to a set of predefined
criteria. One important limitation when forming the pools is the number of documents
to be assessed. Traditionally, the top 100 ranked documents from each of the runs
selected are included in the pool; in such a case we say that the pool is of depth 100.
This pool is then used for subsequent relevance judgments. After calculating the
effectiveness measures, the results are analyzed and run statistics produced and
distributed. The main criteria used when constructing the pools in CLEF are:
• favor diversity among approaches adopted by participants, according to the
descriptions that they provide of their experiments;
• for each task, include at least one experiment from every participant, selected from
the experiments indicated by the participants as having highest priority;
• ensure that, for each participant, at least one mandatory title+description
experiment is included, even if not indicated as having high priority;
• add manual experiments, when provided;
• for bilingual tasks, ensure that each source topic language is represented.</p>
        <p>This year, we produced three pools, one for each target language (English, French,
and German) using a depth of 100. The pools have been created using all the runs in
the ad-hoc monolingual and variability tasks, two runs per participant in the ad-hoc
bilingual tasks, and all the runs in the bilingual variability task. A fourth pool, for the
multilingual task, is the union of the three pools described above.
3 http://direct.dei.unipd.it/</p>
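        <p>The pooling procedure can be sketched in a few lines. The sketch below assumes that the runs to be pooled have already been selected according to the criteria above and that each run maps a topic identifier to a ranked list of document identifiers; the names build_pool and union_pools and the toy identifiers are hypothetical and do not describe the DIRECT implementation.</p>

```python
def build_pool(runs, depth=100):
    """Collect the top `depth` documents per topic from every selected run.

    runs maps run id -> {topic id -> list of document ids in rank order}.
    """
    pool = {}
    for ranked_by_topic in runs.values():
        for topic_id, ranking in ranked_by_topic.items():
            pool.setdefault(topic_id, set()).update(ranking[:depth])
    return pool


def union_pools(*pools):
    """Merge per-language pools, e.g. to obtain a multilingual pool."""
    merged = {}
    for pool in pools:
        for topic_id, doc_ids in pool.items():
            merged.setdefault(topic_id, set()).update(doc_ids)
    return merged


# Toy example with depth 2 instead of 100.
english_runs = {
    "runA": {"CHIC-001": ["en1", "en2", "en3"]},
    "runB": {"CHIC-001": ["en2", "en4", "en5"]},
}
french_runs = {"runC": {"CHIC-001": ["fr1", "fr2", "fr3"]}}

english_pool = build_pool(english_runs, depth=2)
french_pool = build_pool(french_runs, depth=2)
multilingual_pool = union_pools(english_pool, french_pool)
print(sorted(multilingual_pool["CHIC-001"]))
```

        <p>Documents ranked below the pool depth in every run (en3, en5 and fr3 in the toy data) are never judged, which is why recall values obtained from pooled judgments are approximate.</p>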
        <p>Table 5 provides details about the created pools, their size, the number of relevant
and not relevant documents, and the pooled runs. Note that for the English and
French pools one run from the monolingual tasks was not pooled: it arrived late,
after the closure of the submission phase.</p>
        <p>The box plot of Fig. 4 compares the distributions of the relevant documents across the
topics of each pool for the different CHiC pools; the boxes are ordered by decreasing
mean number of relevant documents per topic.</p>
        <p>We see that the French and German distributions appear similar and are slightly
asymmetric towards topics with a greater number of relevant documents whereas the
English distribution is almost balanced. All the distributions show some upper
outliers, i.e. topics with a greater number of relevant documents with respect to the
behavior of the other topics in the distribution. These outliers are probably due to the fact
that CHiC topics have to be able to retrieve relevant documents in all the collections;
therefore, they may be considerably broader than typical monolingual topics.</p>
        <p>During the relevance assessment phase, all eight assessors followed the same
guidelines for relevance. Unclear or ambiguous cases were discussed within the group. In a
final validation, one of the organizers went through all relevant documents to check
for consistency among the assessments.</p>
        <p>The following general assumption guided the decision process: a record is relevant
when it fulfills the information need represented by the original query (in title) and by
the suggested information need description (in description). Three relevance criteria
were defined:
• Not relevant – the record does not fulfill the information need, the information is
not relevant,
• Relevant – the record as represented in the DIRECT system fulfills the information
need,
• Europeana relevant – the record only as represented in the Europeana portal fulfills
the information need (only the whole Europeana record, i.e. the thumbnail and
other related documents, contains enough information to make this object relevant, not
just the record in the DIRECT system).</p>
        <p>For the analysis, documents judged Europeana relevant or not relevant were counted
as not relevant, and the remaining documents as relevant.</p>
      </sec>
      <sec id="sec-5-2">
        <title>The Assessment Interface</title>
        <p>Figure 5 shows the main assessment interface of the DIRECT framework. It
provides the assessor with an overview of the status of each pool. In particular, it
displays the current number of relevance judgments for each topic in a specific pool.</p>
        <p>The assessment stage is supported by the interface shown in Figure 6. The
assessor can easily navigate through the list of documents for a given topic. The interface
includes a set of buttons to select relevance criteria for each document (yellow
for documents not yet assessed, red for not relevant documents, green for relevant
documents, grey for Europeana relevant documents). The document preview displays
two direct links to:
1. the original record in the Europeana website;
2. the content of the original europeana_isShownAt field.</p>
        <p>The semantic enrichment task results were first evaluated for the relevance or
“semantic appropriateness” of the individual suggested terms or phrases. All enrichments
for a query were assessed by the same assessor.</p>
        <p>All submitted enrichments were assessed on a 3-point scale: definitely relevant as
enrichment to the query, maybe relevant, and not relevant. If more than 10
suggestions were submitted, the additional suggestions were not included. If fewer than 10
suggestions were submitted, all suggestions were counted.</p>
      </sec>
      <sec id="sec-5-3">
        <title>Ad-hoc Information Retrieval</title>
      </sec>
      <sec id="sec-5-4">
        <title>Monolingual Experiments.</title>
        <p>Monolingual retrieval was offered for the following target collections: English,
German, and French.</p>
        <p>Table 6 shows the top five groups for each target collection, ordered by mean
average precision. Note that only the best run is selected for each group, even if the group
may have more than one top run. The table reports: the short name of the participating
group; the experiment identifier; the mean average precision achieved by the
experiment; and the performance difference between the first and the last participant.</p>
        <p>Fig. 8. Monolingual German Top Groups. Interpolated Recall vs. Average Precision.
Plot title: CHIC Ad-Hoc Monolingual German Task Top 5 Participants - Standard Recall Levels vs Mean Interpolated Precision.
Legend: clef@tu-chemnitz.de [Experiment QE_NO; MAP 60.39%; Pooled]; philipp.schaer@gesis.org [Experiment GESIS_WIKI_ENTITY_DE_DE; MAP 54.80%; Pooled].</p>
        <p>Fig. 9. Monolingual French Top Groups. Interpolated Recall vs. Average Precision</p>
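        <p>The mean average precision (MAP) by which the groups in Table 6 are ordered can be illustrated with the textbook, non-interpolated definition. The sketch below is an assumption about the general measure, not a description of how DIRECT computes it; the function names and toy data are hypothetical.</p>

```python
def average_precision(ranking, relevant):
    """Non-interpolated average precision of one ranked list for one topic."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(run, qrels):
    """Mean of the per-topic average precision over all judged topics."""
    scores = [average_precision(run.get(topic, []), rel) for topic, rel in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Toy run with two topics.
toy_run = {
    "CHIC-001": ["d1", "d2", "d3", "d4"],
    "CHIC-002": ["d5", "d6"],
}
toy_qrels = {
    "CHIC-001": {"d1", "d3"},  # AP = (1/1 + 2/3) / 2
    "CHIC-002": {"d6"},        # AP = (1/2) / 1
}
toy_map = mean_average_precision(toy_run, toy_qrels)
print(round(toy_map, 4))  # 0.6667
```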
      </sec>
      <sec id="sec-5-5">
        <title>Bilingual Experiments.</title>
        <p>The bilingual task was structured in three subtasks (X → DE, EN, or FR target
collection). Table 7 shows the best results for this task. For bilingual retrieval evaluation, a
common method is to compare results against monolingual baselines:
• X → EN: 86.40% of best monolingual English IR system
• X → DE: 63.52% of best monolingual German IR system
• X → FR: 81.32% of best monolingual French IR system</p>
        <p>Plot title: CHIC Ad-Hoc Bilingual English Task Top 5 Participants - Standard Recall Levels vs Mean Interpolated Precision.
Legend: nada.naji@unine.ch [Experiment UNINEDEEN3; MAP 44.59%; Not Pooled]; clef@tu-chemnitz.de [Experiment FR2EN_QE_DBPEDIA_SUBJECTS_MICROSOFT; MAP 35.49%; Pooled].</p>
      </sec>
      <sec id="sec-5-6">
        <title>Multilingual Experiments.</title>
        <p>Table 8 shows the best results for this task, following the same logic as Tables 6 and 7.</p>
        <p>Unfortunately, at the time of writing, the cluster recall analysis was not completed, so
that only the first phase evaluation results (retrieval effectiveness in finding relevant
documents) can be shown.</p>
        <p>For now, we report precision@5 and precision@15 values. Recall that participants
were asked to submit 12 results for each query, representing a Europeana result page.
The calculated P@15 measure comes closest to evaluating how many relevant
documents were found, even though its cutoff of 15 exceeds the 12 submitted results. The
corrected evaluation measures will be published on the CHiC website<sup>4</sup>.</p>
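        <p>The effect of a cutoff that exceeds the length of the result list can be made concrete with a small sketch. It assumes the common convention that precision@k divides by k even when fewer than k results were submitted; the function name precision_at_k and the toy result page are hypothetical.</p>

```python
def precision_at_k(ranking, relevant, k):
    """Relevant records among the first k ranks, divided by k.

    Ranks beyond the end of a short list count as misses (assumed convention).
    """
    hits = sum(1 for doc_id in ranking[:k] if doc_id in relevant)
    return hits / k


# Toy result page of 12 records, all of them judged relevant.
page = [f"rec{i}" for i in range(1, 13)]
relevant = set(page)

p_at_5 = precision_at_k(page, relevant, 5)
p_at_15 = precision_at_k(page, relevant, 15)
print(p_at_5, p_at_15)
```

        <p>Under this convention even a perfect 12-record page cannot exceed P@15 = 12/15 = 0.8, which is why the reported P@15 values understate the quality of the submitted result pages.</p>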
      </sec>
      <sec id="sec-5-7">
        <title>Monolingual Experiments.</title>
        <p>Monolingual retrieval was offered for the following target collections: English,
German, and French. Table 9 shows the best results for this task.
4 http://www.culturalheritageevaluation.org</p>
      </sec>
      <sec id="sec-5-8">
        <title>Bilingual and Multilingual Experiments.</title>
        <p>Only one group (Chemnitz) submitted results for these tasks, so Table 10 shows the
best runs without the difference to other runs. For bilingual retrieval evaluation, a
common method is to compare results against monolingual baselines:
• Mean of P@5
─ X → EN: 17.83% of best monolingual English IR system
─ X → DE: 70.00% of best monolingual German IR system
─ X → FR: 87.49% of best monolingual French IR system
• Mean of P@15
─ X → EN: 74.06% of best monolingual English IR system
─ X → DE: 70.91% of best monolingual German IR system
─ X → FR: 73.14% of best monolingual French IR system</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>Multilingual 1st</title>
      <sec id="sec-8-2">
        <title>Part.</title>
        <p>We first report the overall results of the first phase evaluation of the semantic
relevance (appropriateness) of the enrichments, then the overall results of the query
expansion runs using the semantic enrichments.</p>
      </sec>
      <sec id="sec-8-3">
        <title>Semantic Relevance.</title>
        <p>For the evaluation of the “semantic appropriateness” of the suggested enrichments,
two relevance levels were used, definitely relevant and maybe relevant, to be
able to distinguish a strict and a relaxed evaluation. Precision (strong) is the average
precision (over 25 queries) of "definitely relevant" suggestions over all suggestions. Precision
(weak) is the average precision (over 25 queries) of "definitely relevant" and "maybe relevant"
suggestions over all suggestions.</p>
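        <p>The two precision variants can be sketched as follows. The sketch assumes that precision is computed per query as the share of accepted suggestions and then averaged over queries, and that queries without suggestions are skipped; the label strings, the function name enrichment_precision and the toy judgments are hypothetical.</p>

```python
STRICT = {"definitely"}
RELAXED = {"definitely", "maybe"}


def enrichment_precision(judgments, accepted):
    """Mean over queries of the share of suggestions whose label is accepted.

    judgments maps query id -> list of assessor labels, one per suggestion.
    """
    per_query = [
        sum(1 for label in labels if label in accepted) / len(labels)
        for labels in judgments.values()
        if labels  # assumption: queries without suggestions are skipped
    ]
    return sum(per_query) / len(per_query) if per_query else 0.0


# Toy judgments for two queries.
toy_judgments = {
    "CHIC-001": ["definitely", "maybe", "not", "definitely"],
    "CHIC-002": ["maybe", "not"],
}
precision_strong = enrichment_precision(toy_judgments, STRICT)
precision_weak = enrichment_precision(toy_judgments, RELAXED)
print(precision_strong, precision_weak)  # 0.25 0.625
```

        <p>Because every suggestion accepted under the strict measure is also accepted under the relaxed one, precision (weak) can never be lower than precision (strong).</p>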
        <p>Table 11 shows average precision numbers (over all topics and all runs) for each
language mode in this task. The weaker precision measure is, as should be expected,
higher than the strict precision measure, by an average of 10 percentage points. The
strict precision measure shows that on average about half of the suggested terms or
phrases can be considered a good fit for the query.</p>
        <p>German monolingual suggestions appear to have a lower precision than the other
experiments. The reason is that two of the submitted experiments contained errors that
assigned enrichments to the wrong queries after about half of the topics. We
nevertheless kept these experiments in the analysis for completeness.</p>
        <p>Bilingual and multilingual experiments also appear to perform better on average than
the monolingual experiments. This is probably an effect of averaging, as most of the
bilingual and multilingual runs were submitted by one group (Chemnitz Univ. of
Techn.), which achieved higher results.</p>
        <p>The detailed results for every run can be found on the CHiC website.</p>
      </sec>
      <sec id="sec-8-4">
        <title>Monolingual Experiments.</title>
        <p>Table 12 shows the best results for each group in this task.</p>
      </sec>
      <sec id="sec-8-5">
        <title>Bilingual and Multilingual Experiments.</title>
        <p>Only one group (Chemnitz) submitted results for these tasks, so Table 13 lists its
best runs without a difference to other groups' runs.</p>
      </sec>
      <sec id="sec-8-6">
        <title>Monolingual Experiments.</title>
        <p>Monolingual retrieval was offered for the following target collections: English,
German, and French. Table 14 shows the best results for this task. As can be seen, the
original topic runs (without expansion), denoted by the ORIGINALQUERIES
identifier, outperform all other runs.</p>
        <table-wrap>
          <label>Table 14</label>
          <table>
            <thead>
              <tr><th>Language</th><th>Rank</th><th>Participant</th><th>Experiment</th><th>MAP</th></tr>
            </thead>
            <tbody>
              <tr><td>English</td><td>1st</td><td>chictest</td><td>ORIGINALQUERIESEN−se</td><td>34.11%</td></tr>
              <tr><td>English</td><td>2nd</td><td>nitish.aggarwal@deri.org</td><td>DERI_SE_CLEF_R1−se</td><td>30.23%</td></tr>
              <tr><td>English</td><td>3rd</td><td>arantza.otegi@ehu.es</td><td>UKBWIKI−se</td><td>29.05%</td></tr>
              <tr><td>English</td><td>4th</td><td>philipp.schaer@gesis.org</td><td>GESIS_WIKI_ENTITY_EN_EN−se</td><td>23.38%</td></tr>
              <tr><td>English</td><td>5th</td><td>clef@tu−chemnitz.de</td><td>CUT_T3_EN_EN_R1−se</td><td>10.92%</td></tr>
              <tr><td>English</td><td>Difference</td><td/><td/><td>212.36%</td></tr>
              <tr><td>German</td><td>1st</td><td>chictest</td><td>ORIGINALQUERIESDE−se</td><td>57.01%</td></tr>
              <tr><td>German</td><td>2nd</td><td>philipp.schaer@gesis.org</td><td>GESIS_WIKI_ENTITY_DE_DE−se</td><td>31.92%</td></tr>
              <tr><td>German</td><td>3rd</td><td>clef@tu−chemnitz.de</td><td>CUT_T3_DE_DE_R3−se</td><td>26.00%</td></tr>
              <tr><td>German</td><td>Difference</td><td/><td/><td>119.26%</td></tr>
              <tr><td>French</td><td>1st</td><td>chictest</td><td>ORIGINALQUERIESFR−se</td><td>32.29%</td></tr>
              <tr><td>French</td><td>2nd</td><td>clef@tu−chemnitz.de</td><td>CUT_T3_FR_FR_R1−se</td><td>14.67%</td></tr>
              <tr><td>French</td><td>Difference</td><td/><td/><td>120.10%</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>Fig. 12. Monolingual English Top Groups. Interpolated Recall vs. Average Precision.
(Plot: CHiC Semantic Enrichment Monolingual English Task, Top 5 Participants, Standard
Recall Levels vs. Mean Interpolated Precision; all plotted runs are marked Not Pooled.)</p>
        <p>Fig. 13. Monolingual German Top Groups. Interpolated Recall vs. Average Precision.
(Plot: CHiC Semantic Enrichment Monolingual German Task, Top 5 Participants, Standard
Recall Levels vs. Mean Interpolated Precision; all plotted runs are marked Not Pooled.)</p>
        <p>Fig. 14. Monolingual French Top Groups. Interpolated Recall vs. Average Precision.
(Plot: CHiC Semantic Enrichment Monolingual French Task, Top 5 Participants, Standard
Recall Levels vs. Mean Interpolated Precision; all plotted runs are marked Not Pooled.)</p>
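        <p>The MAP values and the recall-precision curves reported for this task rest on standard definitions, which the following minimal sketch illustrates for a single topic. The document identifiers and the relevance set are invented for illustration, and the code is not the evaluation software used by the lab. MAP is then the mean of the per-topic average precision values, and the plotted curves are the per-level means over all topics.</p>
        <preformat>
```python
def average_precision(ranking, relevant):
    """Non-interpolated average precision of one ranked list."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # precision at the rank of each relevant hit
    return total / len(relevant) if relevant else 0.0


def interpolated_precision(ranking, relevant, levels=11):
    """Interpolated precision at the standard recall levels 0%, 10%, ..., 100%."""
    points = []  # (recall, precision) after each retrieved relevant document
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    curve = []
    for i in range(levels):
        recall_level = i / (levels - 1)
        # Interpolation rule: best precision reached at this recall or higher.
        candidates = [p for r, p in points if r >= recall_level - 1e-12]
        curve.append(max(candidates, default=0.0))
    return curve


# Invented example: three relevant documents, two of them retrieved.
ranking = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d5"}
ap = average_precision(ranking, relevant)
curve = interpolated_precision(ranking, relevant)
print(f"AP = {ap:.4f}")
print([round(p, 2) for p in curve])
```
        </preformat>
        <p>In this invented example the average precision is 0.3333, and the interpolated precision is 0.5 for recall levels up to 60% and 0.0 beyond, because the third relevant document is never retrieved.</p>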
      </sec>
      <sec id="sec-8-7">
        <title>Bilingual and Multilingual Experiments.</title>
        <p>Only one group (Chemnitz) submitted results for these tasks, so Table 15 lists its
best runs without a difference to other groups' runs.</p>
        <p>Five groups submitted experimental results for the ad-hoc experiments, two groups
for the variability task, and five groups for the semantic enrichment task. Most groups
concentrated on the monolingual tasks (mostly English); only Chemnitz participated in
all monolingual, bilingual and multilingual tasks.</p>
        <p>For the ad-hoc task, most groups used open information retrieval systems like
Cheshire, Indri, Lucene (in its Chemnitz Xtrieval implementation) and Solr. Many
ranking algorithms were tested: vector space, language modeling, DFR and Okapi.</p>
        <p>For translations in the bilingual and multilingual tasks, Google Translate,
Wikipedia entries (with associated translations) and Microsoft’s translation service were
used.</p>
        <p>For the variability task, Chemnitz used its ad-hoc retrieval implementation to
retrieve results and then used the least recently used (LRU) algorithm to prioritize
documents describing different media types from different providers. UPV used different
document collection fields and two approaches for retrieving diverse results: using
maximal marginal relevance (MMR) to cluster results and then using cosine similarity
to select the most dissimilar documents.</p>
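        <p>As an illustration of the kind of diversification mentioned above, the following minimal sketch shows the textbook greedy form of maximal marginal relevance with cosine similarity over sparse term vectors. It is not the UPV or Chemnitz implementation; the toy vectors, the function names and the trade-off parameter lam are assumptions made for this example.</p>
        <preformat>
```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse term-weight vectors given as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mmr_rerank(query, docs, k, lam=0.5):
    """Greedy MMR: select up to k documents, trading relevance against redundancy.

    `docs` maps a document id to its term-weight vector.
    `lam` is a hypothetical trade-off weight: 1.0 means pure relevance ranking,
    lower values penalize similarity to already selected documents more strongly.
    """
    selected = []
    remaining = set(docs)
    while remaining and k > len(selected):
        def score(doc_id):
            relevance = cosine(query, docs[doc_id])
            redundancy = max(
                (cosine(docs[doc_id], docs[s]) for s in selected), default=0.0
            )
            return lam * relevance - (1 - lam) * redundancy
        # Iterate over sorted ids so that ties are broken deterministically.
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: "b" is a near-duplicate of "a", "c" covers a different aspect.
query = {"church": 1.0, "painting": 1.0}
docs = {
    "a": {"church": 1.0, "painting": 1.0},
    "b": {"church": 1.0, "painting": 0.9},
    "c": {"painting": 1.0, "portrait": 1.0},
}
diverse = mmr_rerank(query, docs, k=2, lam=0.3)
by_relevance = mmr_rerank(query, docs, k=2, lam=1.0)
print(diverse, by_relevance)
```
        </preformat>
        <p>With lam set to 0.3 the sketch selects documents a and c, skipping the near-duplicate b. With lam set to 1.0 it falls back to a plain relevance ranking and selects a and b.</p>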
        <p>For the semantic enrichment task, the most frequently used external source for terms
was Wikipedia at different levels of detail (article titles, first paragraph, full text).
WordNet and DBpedia (two groups) were also used. GESIS also used co-occurrence
analysis to add related terms from the Europeana collection itself.</p>
        <p>More details on methodologies and approaches can be found in the working papers
of the individual groups.</p>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>Conclusion and Outlook</title>
      <p>The results of this year’s pilot CHiC lab have shown that working with data from the
cultural heritage domain is possible but also poses many challenges due to the
ambiguity of the users’ information needs and the sparseness of the retrievable data. The
preparation of new collections, the extraction of real queries and the organization of
three realistic tasks with their respective evaluation measures were a challenge for
organizers and participants, but they provided considerable insight and experience for
continuing this work next year.</p>
      <p>After reviewing the tasks, their descriptions and the results, we believe that we can
improve the current tasks by fine-tuning both the requirements and the
evaluation measures (especially in the variability and semantic enrichment tasks). For
2012, we used only three of the 14 language subcollections that were prepared
and did not put much focus on the entire collection. Using the other collections to
introduce more languages into the evaluation and putting more focus on the
entire dataset (the actual use case for the Europeana portal) are both viable directions
for additional instances of this lab.</p>
      <p>Europeana is moving towards a linked data model for its objects<sup>5</sup> and one direction
for this lab would be to combine experts from the information retrieval and linked
data domains to research new retrieval approaches for this kind of data.</p>
      <p>Finally, cultural heritage information systems are looking to incorporate more user
interactions into their systems. The information retrieval evaluation field has often
been criticized for viewing the user as outside of the scope of study. This domain
and the available system (Europeana) enable us to combine and collaborate on
information retrieval and information interaction research. CHiC is attempting to move
in this direction.</p>
      <sec id="sec-9-1">
        <title>Acknowledgements.</title>
        <p>This work was supported by PROMISE (Participative Research Laboratory for
Multimedia and Multilingual Information Systems Evaluation), a Network of Excellence
co-funded by the 7th Framework Program of the European Commission, grant agreement
no. 258191. We would like to thank Europeana for providing the data for collection
and topic preparation and for valuable feedback on task refinement and
assessment. Elaine Toms (University of Sheffield, UK) and Birger Larsen (Royal
School of Library and Information Science, Copenhagen, Denmark) have shaped the
lab’s organization from the beginning and are instrumental in integrating more
interactive features into the lab’s tasks in further instances. We would like to thank our
external relevance assessors Anthi Agoropoulou, Christophe Onambélé and Astrid
Winkelmann.</p>
        <p><sup>5</sup> http://pro.europeana.eu/edm-documentation</p>
        <ref-list>
          <ref id="ref2">
            <mixed-citation>2. Carbonell, J.G., Goldstein, J.: The use of MMR, diversity-based reranking for reordering documents and producing summaries. In: SIGIR ’98, pp. 335-336. ACM, New York (1998)</mixed-citation>
          </ref>
          <ref id="ref3">
            <mixed-citation>3. Chen, H., Karger, D.R.: Less is more: probabilistic models for retrieving fewer relevant documents. In: SIGIR ’06, pp. 429-436. ACM, New York (2006)</mixed-citation>
          </ref>
          <ref id="ref4">
            <mixed-citation>4. Clarke, C.L.A., Craswell, N., Soboroff, I.: Overview of the TREC 2009 Web Track. In: Voorhees, E.M., Buckland, L.P. (eds.) TREC 2009. NIST (2009)</mixed-citation>
          </ref>
          <ref id="ref5">
            <mixed-citation>5. Clarke, C.L.A., Kolla, M., Cormack, G.V., Vechtomova, O., Ashkan, A., Büttcher, S., MacKinnon, I.: Novelty and diversity in information retrieval evaluation. In: SIGIR ’08, pp. 659-666. ACM, New York (2008)</mixed-citation>
          </ref>
          <ref id="ref6">
            <mixed-citation>6. Over, P.: TREC-6 interactive track report. In: Voorhees, E.M., Harman, D.K. (eds.) TREC 1998, p. 73. NIST (1998)</mixed-citation>
          </ref>
          <ref id="ref7">
            <mixed-citation>7. Sanderson, M.: Ambiguous queries: test collections need more sense. In: SIGIR ’08, pp. 499-506. ACM, New York (2008)</mixed-citation>
          </ref>
          <ref id="ref8">
            <mixed-citation>8. Sanderson, M., Tang, J., Arni, T., Clough, P.: What Else Is There? Search Diversity Examined. In: Boughanem, M., Berrut, C., Mothe, J., Soule-Dupuy, C. (eds.) ECIR ’09, pp. 562-569. Springer, Heidelberg (2009)</mixed-citation>
          </ref>
          <ref id="ref9">
            <mixed-citation>9. Stiller, J., Gäde, M., Petras, V.: Ambiguity of Queries and the Challenges for Query Language Detection. CLEF 2010 LogCLEF Workshop. In: Braschler, M., Harman, D., Pianta, E. (eds.) CLEF 2010 Labs and Workshops Notebook Papers. Padua, Italy, 22-23 September 2010. (2010)</mixed-citation>
          </ref>
          <ref id="ref10">
            <mixed-citation>10. Voorhees, E.M.: Overview of the TREC 2004 robust retrieval track. In: Voorhees, E.M., Buckland, L.P. (eds.) TREC 2004. NIST (2004)</mixed-citation>
          </ref>
          <ref id="ref11">
            <mixed-citation>11. Xu, Y., Yin, H.: Novelty and topicality in interactive information retrieval. J. Am. Soc. Inf. Sci. Technol. 59(2), 201-215 (2008)</mixed-citation>
          </ref>
          <ref id="ref12">
            <mixed-citation>12. Zhai, C., Cohen, W., Lafferty, J.: Beyond independent relevance: methods and evaluation metrics for subtopic retrieval. In: SIGIR ’03, pp. 10-17. ACM, New York (2003)</mixed-citation>
          </ref>
          <ref id="ref13">
            <mixed-citation>13. Agosti, M., Ferro, N.: Towards an Evaluation Infrastructure for DL Performance Evaluation. In: Tsakonas, G., Papatheodorou, C. (eds.) Evaluation of Digital Libraries: An Insight to Useful Applications and Methods, pp. 93-120. Chandos Publishing, Oxford, UK (2009)</mixed-citation>
          </ref>
        </ref-list>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Agrawal</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gollapudi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Halverson</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ieong</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Diversifying search results</article-title>
          .
          <source>In: Proceedings of the Second ACM International Conference on Web Search and Data Mining</source>
          , pp.
          <fpage>5</fpage>
          -
          <lpage>14</lpage>
          . ACM, New York (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>