Cultural Heritage in CLEF (CHiC) 2013 – Multilingual Task Overview*

Vivien Petras¹, Toine Bogers², Nicola Ferro³, Ivano Masiero³

¹ Berlin School of Library and Information Science, Humboldt-Universität zu Berlin, Dorotheenstr. 26, 10117 Berlin, Germany, vivien.petras@ibi.hu-berlin.de
² Royal School of Library and Information Science, Copenhagen University, Birketinget 6, 2300 Copenhagen S, Denmark, mvs872@iva.ku.dk
³ Department of Information Engineering, University of Padova, Via Gradenigo 6/B, 35131 Padova, Italy, {ferro,masieroi}@dei.unipd.it

* Parts of this paper were already published in the CHiC 2013 LNCS overview paper [6].

Abstract. The Cultural Heritage in CLEF 2013 multilingual task comprised two sub-tasks: multilingual ad-hoc retrieval and semantic enrichment. The multilingual ad-hoc retrieval sub-task evaluated retrieval experiments in 13 languages (Dutch, English, German, Greek, Finnish, French, Hungarian, Italian, Norwegian, Polish, Slovenian, Spanish, Swedish). More than 140,000 documents were assessed for relevance on a three-level scale. The ad-hoc task had 7 participants submitting 30 multilingual and 41 monolingual runs. The semantic enrichment task evaluated monolingual and multilingual semantic enrichments (suggestions based on a query) in the same 13 languages. Two participants submitted 10 runs. Results indicated that different languages contribute differently to the overall retrieval effectiveness, probably depending on collection size. Experiments showed that using more or all of the provided languages usually increases retrieval effectiveness, but not always. For a multilingual task of this scale (13 languages), more participants are necessary to provide enough variation in runs to allow for comparative analyses.

Keywords: cultural heritage, Europeana, ad-hoc retrieval, semantic enrichment, multilingual retrieval

1 Introduction

Cultural heritage collections – preserved by archives, libraries, museums and other institutions – consist of "sites and monuments relating to natural history, ethnography, archaeology, historic monuments, as well as collections of fine and applied arts" [3]. Cultural heritage content is often multilingual and multimedia (e.g. text, photographs, images, audio recordings, and videos), and is usually described with metadata in multiple formats and of different levels of complexity. Cultural heritage institutions have different approaches to managing information and serve diverse user communities, often with specialized needs. The targeted audience of the CHiC lab and its tasks are developers of cultural heritage information systems, information retrieval researchers specializing in domain-specific (cultural heritage) and/or structured information retrieval on sparse text (metadata), and semantic web researchers specializing in semantic enrichment with LOD data. Evaluation approaches (particularly system-oriented evaluation) in this domain have been fragmentary and often non-standardized. CHiC aims at moving towards a systematic and large-scale evaluation of cultural heritage digital libraries and information access systems.

After a pilot lab in 2012, where a standard ad-hoc information retrieval scenario was tested together with two use-case-based scenarios (a diversity task and a semantic enrichment task), the 2013 lab diversifies and becomes more realistic in its task organization. The pilot lab has shown that cultural heritage is a truly multilingual area, where information systems contain objects in many different languages.
Cultural heritage information systems also differ from other, more specialized information systems in that ad-hoc searching might not be the prevalent form of access to this type of content. The 2013 CHiC lab therefore focuses on multilinguality in the retrieval tasks and adds an interactive task, in which different usage scenarios for cultural heritage information systems were tested. The multilingual tasks described in this paper required multilingual retrieval in up to 13 languages, making CHiC the most multilingual CLEF lab ever.

CHiC has teamed up with Europeana², Europe's largest digital library, museum and archive for cultural heritage objects, to provide a realistic environment for experiments. Europeana provided the document collection (digital representations of cultural heritage objects) and queries from its query logs. The interactive task also provided a topic clustering algorithm and a customized browsable portal based on Europeana data.

The paper is structured as follows: Section 2 introduces the Europeana document collection. Sections 3 and 4 describe the sub-tasks multilingual ad-hoc retrieval and multilingual semantic enrichment in detail, with their requirements, participants and results. The conclusion provides an outlook on the future of CHiC and the potential synergies of combining ad-hoc and interactive information retrieval evaluation.

² http://www.europeana.eu

2 The Europeana Collection

The Europeana information retrieval document collection was prepared for the CHiC pilot lab in 2012 [5]. It consists of the complete Europeana metadata index as downloaded from the production system in March 2012. It contains 23,300,932 documents with a size of 132 GB. With the move of Europeana to an open data license in the summer of 2012 and the subsequent changes in content, this test document collection represents a snapshot of Europeana data from a particular point in time. However, the overlap with the current content is about 80%.

The collection consists of metadata records describing cultural heritage objects, e.g. the scanned version of a manuscript, an image of a painting or sculpture, or an audio or video recording. Roughly 62% of the metadata records describe images, 35% describe text, 2% describe audio and 1% video recordings.

The collection was divided into 14 sub-collections according to the language of the content provider of the record (which usually indicates the language of the metadata record). A threshold was set: all languages with fewer than 100,000 documents were grouped together under the name "Others". The 13 language collections included Dutch, English, German, Greek, Finnish, French, Hungarian, Italian, Norwegian, Polish, Slovenian, Spanish and Swedish. For the CHiC 2013 experiments, all sub-collections except "Others" were used, totaling roughly 20 million documents.
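As an illustration of this partitioning step, the following sketch groups records into per-language sub-collections and collapses small languages into "Others". The record structure and the "language" key are hypothetical stand-ins for the provider-language information in the actual metadata; this is not the tooling used to build the collection.

```python
from collections import Counter, defaultdict

def split_by_language(records, threshold=100_000):
    """Group metadata records into per-language sub-collections; languages
    with fewer than `threshold` records are collapsed into "Others".
    `records` is an iterable of dicts with a hypothetical "language" key."""
    records = list(records)
    counts = Counter(r["language"] for r in records)
    collections = defaultdict(list)
    for r in records:
        lang = r["language"]
        key = lang if counts[lang] >= threshold else "Others"
        collections[key].append(r)
    return collections
```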
The 14 sub-collections are listed in Table 1.

Table 1. CHiC Collections by Language and Media Type.

Language    Sound     Text        Image       Video     Total
German      23,370    664,816     3,169,122   8,372     3,865,680
French      13,051    1,080,176   2,439,767   102,394   3,635,388
Swedish     1         1,029,834   1,329,593   622       2,360,050
Italian     21,056    85,644      1,991,227   22,132    2,120,059
Spanish     1,036     1,741,837   208,061     2,190     1,953,124
Norwegian   14,576    207,442     1,335,247   555       1,557,820
Dutch       324       60,705      1,187,256   2,742     1,251,027
English     5,169     45,821      1,049,622   6,564     1,107,176
Polish      230       975,818     117,075     582       1,093,705
Finnish     473       653,427     145,703     699       800,302
Slovenian   112       195,871     50,248      721       246,952
Greek       0         127,369     67,546      2,456     197,371
Hungarian   34        14,134      107,603     0         121,771
Others      375,730   1,488,687   1,106,220   19,870    2,990,507
Total       455,162   8,371,581   14,304,289  169,899   23,300,932

The XML metadata contains title and description data, media type and chronological data as well as provider information. For ca. 30% of the records, content-related enrichment keywords were added automatically by Europeana based on a mapping between metadata terms and terms from controlled lists such as DBpedia names. In the Europeana portal, object records commonly also contain thumbnails of the object (if it is an image) and links to related records. These were not included in the test collection, but relevance assessors were able to look at them at the original source. Figure 1 shows an extract from an example record of the Europeana CHiC collection.

[Figure 1 (record extract): a Heritage Malta record describing a mounted specimen of the Alpine Swift, with the values: Orn.0240; Tachymarptis melba; Rundun Zaqqu Bajda (Orn.0240); Alpine Swift (Orn.0240); mounted specimen; malta; Heritage Malta; http://www.heritagemalta.org/sterna/orn.php?id=0240; en; STERNA; IMAGE; http://www.europeana.eu/resolve/record/10105/5E1618BFAF072B8953B30701A6A6C3BB655ACF9D]

Fig. 1. Europeana CHiC Collection Sample Record

3 The CHiC Multilingual Ad-hoc Task

The sub-tasks are a continuation of the 2012 CHiC lab, using similar task scenarios, but requiring multilingual retrieval and results. Two sub-tasks were defined: multilingual ad-hoc retrieval and multilingual semantic enrichment.

The traditional multilingual ad-hoc retrieval task measures information retrieval effectiveness with respect to user input in the form of queries. The 13 language sub-collections form the multilingual collection (ca. 20 million documents) against which experiments were run. Participants were asked to submit ad-hoc information retrieval runs based on 50 topics (provided in all 13 languages) and including at least 2 and at most all 13 collection languages. For pooling purposes, participants were also asked to submit monolingual runs for any of the collection languages. Because the topics were provided in all collection languages, the focus of the task was not on topic translation, but on multilingual retrieval across different collection languages.

3.1 Topic Creation

A new set of 50 topics was created for the 2013 edition of CHiC, where topic selection was determined partially by the potential for retrieving a sufficient number of relevant documents in each of the collection languages. CHiC 2012 used topics from the Europeana query logs alone, which resulted in zero results for some topics in some of the 3 languages [13]. The problem of having zero relevant results is aggravated when collection languages are varied, especially in the cultural heritage area: many topics are relevant for only a few languages or cultures. For 2013, more emphasis was put on testing all topics in all languages for retrieving relevant documents, which resulted in fewer topics with zero relevant results.
The topic creation process started with a pool of candidate topics drawn from four different sources:

- 15 topics that had shown promising retrieval performance were re-used from the 2012 topic set (available in only 3 languages) to test their performance in 13 languages.
- Another 19 topics that were not specific to only a handful of languages were taken from an annotated snapshot of the Europeana query log (the same procedure was used for the 2012 topics).
- The Polish task also suggested topics; 17 of these were not considered relevant only for Polish and were added to the candidate pool.
- Finally, two of the track organizers generated another 21 test queries covering a wide range of topics contained in Europeana's collections that would span all collection languages.

These 73 candidate topics were then translated into all 13 languages by volunteers. The translated candidate topics were run against the 13 language collections using Indri 5.2 with default settings³. We retained the 50 topics that returned the highest number of relevant documents across all thirteen languages. Another factor that affected the final selection of the 2013 topics was the abundance of named-entity queries (around 60%) in the 2012 topic set. While named-entity queries are a common type of query for Europeana [9], they are less challenging than non-entity queries that describe a more complex information need. We therefore down-sampled the proportion of named-entity queries to around 20%.

The final topic set covers a wide range of topics and consists of 12 topics from the 2012 topic set, 13 log-based topics, 13 topics from the Polish sub-task, and 12 intellectually derived queries. In form and type, the different query types are indistinguishable and usually include 1-3 query terms (e.g. "silent film", "ship wrecks", and "last supper"). The underlying information need for a query can be ambiguous if the intention of the query is not clear. In such cases, the track organizers discussed the query and agreed on the most likely information need. These information-need descriptions were not admissible as input for the retrieval experiments. Figure 2 shows an example of an English query.

[Example topic: identifier CHIC-004; title "silent film"; description "documents on the history of silent film, silent film videos, biographies of actors and directors, characteristics of silent film and decline of this genre"]

Fig. 2. CHiC Sample Query

³ Jelinek-Mercer smoothing with λ set to 0.4 and no stemming or stopword filtering.

3.2 Pooling and Relevance Assessments

This year, we produced 13 pools, one for each target language, using different depths depending on the language and the available number of documents. The pools were created using all the submitted runs. A 14th pool, for the multilingual task, is the union of the 13 pools described above.
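The depth-k pooling procedure can be sketched as follows; the run files are assumed to be in the standard TREC result format, and the file names in the usage comment are hypothetical.

```python
from collections import defaultdict

def build_pool(run_files, depth):
    """Depth-k pooling: collect the union of the top-`depth` documents per
    topic over all submitted runs for one language. Each run file contains
    lines of the form: topic_id Q0 doc_id rank score run_tag."""
    pool = defaultdict(set)  # topic_id -> set of doc_ids to assess
    for run_file in run_files:
        per_topic = defaultdict(list)
        with open(run_file) as f:
            for line in f:
                topic, _, doc, rank, score, _ = line.split()
                per_topic[topic].append((int(rank), doc))
        for topic, ranked in per_topic.items():
            for _, doc in sorted(ranked)[:depth]:
                pool[topic].add(doc)
    return pool

# Example: the Dutch pool used depth 125 (hypothetical file names).
# dutch_pool = build_pool(["run1_nl.txt", "run2_nl.txt"], depth=125)
# The multilingual pool is the per-topic union of the 13 monolingual pools.
```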
Table 2 provides details about the created pools: their depth and size, the number of highly relevant, partially relevant and not relevant documents, the number of topics with relevant documents, and the number of assessors per language.

Table 2. CHiC 2013 Multilingual Pools

Language    Pool depth   Total docs   Highly relevant   Partially relevant   Not relevant   Topics with relevant docs   Assessors
Dutch       125          10,548       1,583             811                  8,154          48 out of 50                2
English     50           16,696       2,530             70                   14,096         49 out of 50                2
Finnish     200          2,465        276               19                   2,170          16 out of 50                1
French      50           17,978       2,508             436                  15,034         50 out of 50                1
German      50           18,460       3,510             50                   14,900         50 out of 50                2
Greek       125          10,032       265               145                  9,622          40 out of 50                1
Hungarian   200          5,834        332               491                  5,011          48 out of 50                1
Italian     75           13,387       2,176             721                  10,490         47 out of 50                1
Norwegian   125          10,287       1,723             289                  8,275          43 out of 50                2
Polish      125          11,342       1,086             624                  9,632          46 out of 50                1
Slovenian   200          6,718        481               195                  6,042          37 out of 50                1
Spanish     100          11,373       1,689             446                  9,238          46 out of 50                1
Swedish     150          11,640       941               342                  10,357         43 out of 50                1

We used graded relevance with three levels: highly relevant, partially relevant, and not relevant. To compute the standard performance measures reported in Section 3.3, we used binary relevance and conflated highly relevant and partially relevant to just relevant.
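For illustration, the conflation of the graded judgments to binary relevance and the computation of (mean) average precision can be sketched as follows; this is the standard textbook definition, not the implementation used in the evaluation infrastructure.

```python
def average_precision(ranked_docs, graded_qrels):
    """AP for one topic under binary relevance. `graded_qrels` maps
    doc_id -> grade (2 = highly relevant, 1 = partially relevant,
    0 = not relevant); grades 1 and 2 are conflated to relevant,
    as was done for the official CHiC measures."""
    relevant = {d for d, g in graded_qrels.items() if g >= 1}
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for i, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / i
    return precision_sum / len(relevant)

def mean_average_precision(run, qrels):
    """MAP over all topics; `run` maps topic_id -> ranked list of doc_ids."""
    return sum(average_precision(docs, qrels.get(t, {}))
               for t, docs in run.items()) / len(run)
```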
The DIRECT system [1] was used to collect runs, perform relevance assessments, and compute performance measures. The system's interfaces and processes were described in last year's CHiC overview paper [5].

For all languages except English, native language speakers performed the relevance assessments. Fifteen assessors took 2 weeks to assess the ca. 140,000 documents. The assessors received detailed instructions on how to use the assessment interface and guidelines on how the relevance assessments were to be approached. Constant communication via a common mailing list ensured that assessors across languages treated topics from the same perspective.

Despite our efforts in topic creation, some topics in some languages did not have any relevant documents in the pool. Besides not all queries having relevant documents in the Europeana collection, the problem was exacerbated by receiving very few monolingual runs that could be used for pooling, which sometimes resulted in very small pools. While 11 languages have at least 40 topics with relevant documents (5 of them with 48 or more), Finnish (only 16 topics with relevant documents) and Slovenian (only 37 topics with relevant documents) give rise to concern for comparative analyses.

3.3 Participants and Runs

Seven different teams participated in the 2013 edition of the ad-hoc track (Table 3).

Table 3. Participating groups and countries.

Group                                                      Country
CEA LIST                                                   France
Department of Computer Science, University of Neuchâtel    Switzerland
MRIM/LIG, University of Grenoble                           France
RSLIS, University of Copenhagen & Aalborg University       Denmark
School of Information, UC Berkeley                         USA
Technical University of Chemnitz                           Germany
University of Westminster                                  Great Britain

Out of the 71 runs submitted, 30 were multilingual runs using at least 2 collection languages; 10 runs used all available languages for both topics and collections. All languages were also represented in the monolingual or bilingual runs (41 in total). English, German, French and Italian were the most popular languages for the monolingual runs; all other languages had only 1 or 2 runs. Toine Bogers (RSLIS) provided 2 additional baseline runs for each language collection, produced with the Indri information retrieval system using language modeling with either Dirichlet smoothing (no stopword list, no stemming) or Jelinek-Mercer smoothing (with stopword list, no stemming); these baselines are used in the comparisons below (a sketch of the two smoothing methods follows Table 4). Table 4 shows the submitted runs and their language combinations, including the baseline runs.

Table 4. Submitted Runs in the CHiC 2013 Multilingual Ad-hoc Retrieval Task

Monolingual runs
Topic language(s)      Collection language(s)   Runs
DE                     DE                       6
EL                     EL                       3
EN                     EN                       10
ES                     ES                       4
FI                     FI                       3
FR                     FR                       6
HU                     HU                       3
IT                     IT                       8
NL                     NL                       4
NO                     NO                       4
PO                     PO                       4
SL                     SL                       3
SV                     SV                       4

Bilingual runs
Topic language(s)      Collection language(s)   Runs
DE                     FR                       1
DE                     EN                       1
EN                     DE                       1
EN                     FR                       1
FR                     DE                       1
FR                     EN                       1

Multilingual runs
Topic language(s)      Collection language(s)   Runs
All                    All                      10
DE                     All                      1
EN                     All                      1
FR                     All                      1
All NOT EL             All NOT EL               1
All NOT EL, HU, SL     All NOT EL, HU, SL       4
All                    DE, EN, FR               1
DE, EN, ES, FR, IT     DE, EN, FR               1
DE, EN, FR             DE, EN, FR               1
DE                     DE, EN, FR               1
EN                     DE, EN, FR               1
ES                     DE, EN, FR               1
FI                     DE, EN, FR               1
FR                     DE, EN, FR               1
IT                     DE, EN, FR               1
NL                     DE, EN, FR               1
EN                     EN, IT                   1
IT                     EN, IT                   1
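To make the two smoothing methods behind the baseline runs concrete, the following sketch scores a query against a document with either Jelinek-Mercer or Dirichlet smoothing. It uses the standard query-likelihood formulas rather than Indri's actual implementation, with the convention that λ weights the collection model (the Jelinek-Mercer baseline used λ = 0.4; the Dirichlet value μ = 2500 here is only a common default, not a value reported for the baselines).

```python
import math

def lm_score(query_terms, doc_tf, doc_len, coll_tf, coll_len,
             smoothing="jm", lam=0.4, mu=2500):
    """Query-likelihood score log P(q|d) for a bag-of-words query.
    doc_tf/coll_tf map term -> frequency in the document / collection."""
    score = 0.0
    for t in query_terms:
        p_coll = coll_tf.get(t, 0) / coll_len          # collection model
        if smoothing == "jm":                          # Jelinek-Mercer
            p = (1 - lam) * doc_tf.get(t, 0) / max(doc_len, 1) + lam * p_coll
        else:                                          # Dirichlet
            p = (doc_tf.get(t, 0) + mu * p_coll) / (doc_len + mu)
        if p <= 0:                                     # term unseen everywhere
            return float("-inf")
        score += math.log(p)
    return score
```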
3.4 Results and Participant Approaches

Because of the many variations in topic and collection language configurations, comparison between runs is difficult. Since the language combinations are additionally varied by different system configurations, the matrix of possible impact factors becomes very large. However, several comparisons can give indications of further research questions that should be analyzed.

3.4.1 Multilingual Runs: All Languages vs. Fewer Languages

Table 5 shows the best multilingual run per participating group, ordered by MAP, together with the topic and collection languages that were used for retrieval. Note that only the best run is selected for each group, even if the group may have more than one top run.

Table 5. Best Multilingual Experiments per Group (in MAP)

Participant   Experiment Identifier        Topic Languages      Collection Languages   MAP
Chemnitz      TUC_ALL_LA                   All                  All                    23.38%
CEA List      MULTILINGUALNOEXPANSION      All NOT EL, HU, SL   All NOT EL, HU, SL     18.78%
Neuchâtel     UNINEMULTIRUN5               All                  All                    15.45%
RSLIS         RSLIS_MULTI_FUSION_COMBSUM   All                  All                    8.37%
Westminster   R005                         EN                   EN, IT                 6.30%
Berkeley      BERKMLENFRDE19               EN, FR, DE           EN, FR, DE             3.93%

[Figure 3 plots interpolated recall against average precision for the best 5 multilingual runs: THOMAS_WILHELM.TUC_ALL, THOMAS_WILHELM.TUC_ALL_LA, THOMAS_WILHELM.TUC_ALL_HS, ADRIANPOPESCU.CEALISTMULTILINGUALNOEXPANSION and MITRA_AKASEREH.UNINEMULTIRUN5.]

Fig. 3. Best 5 Multilingual Runs – Interpolated Recall / Precision

It is difficult to interpret these figures in terms of which languages contribute most to retrieval success, as the applied IR systems play a much bigger role in this cross-system comparison.

UC Berkeley compared experiments with different topic languages against a multilingual collection combining English, French and German. The results show that using exactly the same languages for the topics achieves a slightly higher result than using just one of the topic languages or even more languages (Table 6). In this experiment, the differences between runs are probably not all statistically significant. However, it is interesting to note that English and French seem not to contribute as much to retrieval effectiveness as, for example, German, and that a topic language that is not represented in the collection languages (ES) can still achieve almost as high a MAP as the topic language English.

Table 6. UC Berkeley: Comparing Topic and Collection Languages (in MAP) [4]

Experiment Identifier   Topic Languages      Collection Languages   MAP
BERKMLENFRDE19          EN, FR, DE           EN, FR, DE             3.93%
BERKMLALL17             All                  EN, FR, DE             3.57%
BERKMLSPENFRDEIT18      EN, FR, DE, ES, IT   EN, FR, DE             3.53%
BERKMLDE12              DE                   EN, FR, DE             3.31%
BERKMLFR11              FR                   EN, FR, DE             2.22%
BERKMLEN10              EN                   EN, FR, DE             1.66%
BERKMLSP16              ES                   EN, FR, DE             1.33%

RSLIS used a similar approach with equivalent results: using one topic language against the whole multilingual index resulted in lower retrieval effectiveness than the fusion runs using 3 topic languages (Table 7).

Table 7. RSLIS: Comparing Topic and Collection Languages (in MAP) [8]

Experiment Identifier   Topic Languages   Collection Languages   MAP
MULTI_FUSION_COMBSUM    EN, FR, DE        All                    8.37%
MULTI_FUSION_COMBMNZ    EN, FR, DE        All                    8.36%
MULTI_MONO_GER          DE                All                    6.79%
MULTI_MONO_FRE          FR                All                    4.30%
MULTI_MONO_ENG          EN                All                    3.70%

Both groups found that the German topics seem to have the highest retrieval impact. The Westminster group [11] showed in a similar experiment that English seemed to have a higher impact than Italian. More runs would be necessary to perform a complete analysis.
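The CombSUM and CombMNZ fusion runs in Table 7 combine document scores from several monolingual result lists. A minimal sketch of the two generic fusion methods (not the exact RSLIS configuration, which is described in [8]) could look as follows; scores are assumed to be already normalized to a comparable range.

```python
from collections import defaultdict

def fuse(result_lists, method="combsum"):
    """Score-based fusion. `result_lists` is a list of dicts mapping
    doc_id -> normalized retrieval score (one dict per input run).
    CombSUM sums the scores of a document over all runs; CombMNZ
    additionally multiplies the sum by the number of runs that
    retrieved the document."""
    sums, counts = defaultdict(float), defaultdict(int)
    for scores in result_lists:
        for doc, s in scores.items():
            sums[doc] += s
            counts[doc] += 1
    fused = {doc: (sums[doc] * counts[doc] if method == "combmnz" else sums[doc])
             for doc in sums}
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)
```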
UniNE experimented with removing topic and collection languages equally and with different fusion algorithms (merging results from separate language indexes), and showed that leaving out the smaller collection languages can result in an increase in performance; however, the impact of an individual language is unclear (Table 8).

Table 8. UniNE: Comparing Topic and Collection Languages (in MAP) [2]

Experiment Identifier            Topic Languages      Collection Languages   MAP
UNINEMULTIRUN5                   All                  All                    15.45%
Unofficial UniNE run, Z-score    All NOT EL, HU, SL   All NOT EL, HU, SL     16.22%
Unofficial UniNE run, RR         All                  All                    13.88%
Unofficial UniNE run, RR         All NOT EL           All NOT EL             13.87%
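UniNE's collection fusion merges result lists from separate monolingual indexes after normalizing the scores of each list. A minimal sketch of z-score normalization merging follows; it shows the general technique, not UniNE's exact variant, which is described in [2].

```python
import statistics

def zscore_merge(result_lists):
    """Merge ranked lists from separate (monolingual) indexes: normalize
    each list's scores to z-scores, then sort the union of all documents
    by normalized score. `result_lists` is a list of dicts mapping
    doc_id -> raw retrieval score; doc_ids are assumed to be unique
    across the sub-collections, as in the CHiC language indexes."""
    merged = []
    for scores in result_lists:
        if not scores:
            continue
        values = list(scores.values())
        mean = statistics.mean(values)
        stdev = statistics.pstdev(values) or 1.0   # avoid division by zero
        merged.extend((doc, (s - mean) / stdev) for doc, s in scores.items())
    return sorted(merged, key=lambda item: item[1], reverse=True)
```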
Finally, TU Chemnitz experimented with different stemming algorithms for all languages and found that a less aggressive stemmer worked best compared to the standard rule-based stemmers used in Solr or a no-stemming approach (Table 9).

Table 9. Chemnitz: Comparing Stemming Approaches (in MAP) [12]

Stemming Approach       MAP
Less aggressive         23.38%
Standard (rule-based)   23.36%
No stemmer              15.34%

3.4.2 Monolingual Runs

For pooling purposes, participants submitted monolingual runs as well. We can compare them using the whole multilingual pool (results are also available in the DIRECT system⁴) or using the monolingual pools. While a multilingual pool is what the real use case prescribes (all languages are potentially relevant), we can also look at monolingual pools to achieve an improved system comparison (less variation due to language). We concentrate on the 4 languages with the most submitted experiments: English (10), Italian (8), German and French (6 each). Table 10 shows the best monolingual run for each participant in those languages.

⁴ http://direct.dei.unipd.it

Table 10. Best Monolingual Experiments per Group (in MAP)

Monolingual English
Participant   Experiment Identifier      MAP
MRIM          MRIM_AR_2                  40.43%
Westminster   R001                       28.30%
Berkeley      BERKBIDEEN04               19.42%
RSLIS         BASELINE.ENG1              18.35%
CEA List      CEALISTENGLISHFILTERED     16.68%

Monolingual Italian
Participant   Experiment Identifier      MAP
Westminster   R004                       29.41%
RSLIS         BASELINE.ITA3              24.90%
CEA List      CEALISTITALIANFILTERED     16.50%

Monolingual French
Participant   Experiment Identifier      MAP
CEA List      CEALISTFRENCHNOEXPANSION   27.62%
Berkeley      BERKMONOFR02               20.14%
RSLIS         BASELINE.FRE3              –

Monolingual German
Participant   Experiment Identifier      MAP
RSLIS         BASELINE.GER2              29.79%
CEA List      CEALISTGERMANNOEXPANSION   28.99%
Berkeley      BERKBIENDE09               17.85%

Unfortunately, only 2 groups (RSLIS and CEA List) submitted runs for all 4 languages, so that a comparison among even those 4 languages becomes difficult.

3.4.3 Participant Approaches

Table 11 briefly summarizes the participants' approaches to the ad-hoc track.

Table 11. Participating groups and their approaches to the multilingual ad-hoc track.

Chemnitz: Apache Solr with a special focus on comparing different types of stemmers (generic, rule-based, dictionary-based) [12].

CEA LIST: Query expansion of a vector space model with tf-idf weighting, using related concepts extracted from Wikipedia via Explicit Semantic Analysis [7].

MRIM: Language modeling approach using Dirichlet smoothing and Wikipedia as an external document collection to estimate the word probabilities in case of sparsity of the original term-document matrix [10].

Neuchâtel: Probabilistic IR using the Okapi model with stopword filtering and light stemming. Collection fusion on the result lists from 13 different monolingual indexes using z-score normalization merging [2].

RSLIS: Language modeling with Jelinek-Mercer smoothing and no stopword filtering or stemming. One run each for English, French, and German, where these topic languages are run against a multilingual index. Two fusion runs using the CombSUM and CombMNZ methods, combining these three monolingual runs against the multilingual index [8].

UC Berkeley: Probabilistic text retrieval model based on logistic regression together with pseudo-relevance feedback for all of the runs. Runs with English, French, and German topic sets and sub-collections, as well as translations generated by Google Translate [4].

Westminster: Divergence from randomness model using Terrier on the English and Italian collections [11].

4 The CHiC Multilingual Semantic Enrichment Task

The multilingual semantic enrichment task requires systems to present a ranked list of related concepts for query expansion. Related concepts can be extracted from Europeana data, from resources in the Linked Open Data cloud, or from other external resources (e.g. Wikipedia). Participants were asked to submit up to 10 query expansion terms or phrases per topic. This task included 25 topics in all 13 languages. Participants could choose to experiment with monolingual or multilingual semantic enrichments. The suggested concepts were assessed with respect to their relatedness to the original query terms or query category.

Only 2 groups participated in the semantic enrichment task, making a comparison more difficult. Almost all experiments contained either only English concepts or concepts from several languages (multilingual). In total, 10 experiments were submitted.

MRIM/LIG (Univ. of Grenoble) used Wikipedia as a knowledge base and the query terms in order to identify related Wikipedia articles as enrichment candidates. Both in-links and out-links to and from these related articles (in particular their titles) were then used to extract terms for enrichment [10].

CEA List used Explicit Semantic Analysis (documents are mapped to a semantic structure), also with Wikipedia as a knowledge base. Whereas MRIM/LIG used the titles of Wikipedia articles and their in- and out-links for concept expansion, CEA List concentrated on the categories and the first 150 characters of a Wikipedia article. When Wikipedia category terms overlapped with query terms, these concepts were boosted for expansion. For the ad-hoc retrieval runs, the topic and expanded concepts were matched against the collection, and the results were then matched again to a consolidated version of the topics (favoring more frequent concept phrases) before outputting the result. For multilingual query expansion, the interlingua links to parallel language versions of a Wikipedia article were used in a fusion model. For most expansion experiments, only concepts that appear in at least 3 Wikipedia language versions were considered, allowing for multilingual expansions [7].

The semantic enrichments were evaluated using a three-level relevance assessment (definitely relevant, maybe relevant, not relevant) and the measures P@1, P@3 and P@10.
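The P@k measures under strict and relaxed relevance can be sketched as follows; this is a straightforward re-implementation of the standard definition, not the evaluation code actually used, and the example suggestions are hypothetical.

```python
def precision_at_k(suggestions, judgments, k, strict=True):
    """P@k for one topic's ranked list of enrichment suggestions.
    `judgments` maps suggestion -> grade (2 = definitely relevant,
    1 = maybe relevant, 0 = not relevant). Strict evaluation counts
    only grade 2 as relevant; relaxed evaluation counts grades 1 and 2."""
    threshold = 2 if strict else 1
    hits = sum(1 for s in suggestions[:k] if judgments.get(s, 0) >= threshold)
    return hits / k

# Hypothetical example for the topic "silent film":
# precision_at_k(["Charlie Chaplin", "talkie", "Buster Keaton"],
#                {"Charlie Chaplin": 2, "Buster Keaton": 2, "talkie": 1},
#                k=3, strict=True)   # -> 2/3
```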
Table 12 shows the results for the best 2 runs of each participant, using either the strict relevance measurement (only definitely relevant) or the relaxed relevance measurement (definitely relevant and maybe relevant).

Table 12. Semantic Enrichment: Best 2 Runs for each Participant

Strict relevance
Run name                         P@1      P@3      P@10
ceaListEnglishMonolingual        0.5200   0.5467   0.4680
ceaListEnglishRankMultilingual   0.4800   0.4533   0.3400
MRIM_SE13_EN_WM_1                0.0800   0.0667   0.0522
MRIM_SE13_EN_WM                  0.0400   0.0533   0.0422

Relaxed relevance
Run name                         P@1      P@3      P@10
ceaListEnglishRankMultilingual   0.6800   0.7200   0.5600
ceaListEnglishMonolingual        0.6800   0.7067   0.6600
MRIM_SE13_EN_WM_1                0.2800   0.1467   0.1598
MRIM_SE13_EN_WM                  0.2800   0.1333   0.1448

Only CEA List experimented with multilingual enrichments. Interestingly, a multilingual enrichment run was the best under the relaxed relevance measurement, while a monolingual run was the best under the strict relevance measurement.

5 Conclusion and Outlook

The results of this year's multilingual CHiC task show that multilingual information retrieval experiments are challenging not only because of the number of languages that need to be processed, but also because of the number of participants necessary to produce comparable results. As the number of possible language variations increases (CHiC had 13 source languages and 13 target languages), very few experiments across participants can be compared. While this year's results have shown that searching in several languages increases the overall performance (an obvious result), we could not show which languages contributed more to the retrieval results. Future research in the multilingual task needs to focus on more narrowly defined tasks (e.g. particular source languages against the whole collection) or define a grid experiment in which a particular information retrieval system performs all possible run variations to arrive at better answers.

The interactive study collected a rich data set of questionnaire and log data for further use. Because the task was designed for easy entry (predetermined system and research protocol), it is somewhat different from the traditional lab and is planned to follow a 2-year cycle (assuming the lab's continuation). In year two, the data gathered this year should be released to the community in aggregate form, having been assessed by the user interaction community with the goal of identifying a set of objects that need to be developed. The ad-hoc retrieval tasks can benefit from the interactive task by re-using the real queries in ad-hoc retrieval test scenarios, effectively merging both evaluation methods.

Acknowledgements

This work was supported by PROMISE (Participative Research Laboratory for Multimedia and Multilingual Information Systems Evaluation), a Network of Excellence co-funded by the 7th Framework Programme of the European Commission, grant agreement no. 258191. We would like to thank Europeana for providing the data for collection and topic preparation and for providing valuable feedback on task refinement. We would like to thank Maria Gäde, Preben Hansen, Anni Järvelin, Birger Larsen, Simone Peruzzo, Juliane Stiller, Theodora Tsikrika and Ariane Zambiras for their invaluable help in translating the topics. We would also like to thank our relevance assessors Tom Bekers, Veronica Estrada Galinanes, Vanessa Girnth, Ingvild Johansen, Georgios Katsimpras, Michael Kleineberg, Kristoffer Liljedahl, Giuliano Migliori, Christophe Onambélé, Timea Peter, Oliver Pohl, Siri Soberg, Tanja Špec and Emma Ylitalo.

References

1. Agosti, M., Ferro, N.: Towards an Evaluation Infrastructure for DL Performance Evaluation. In: Tsakonas, G., Papatheodorou, C. (eds.), Evaluation of Digital Libraries: An Insight into Useful Applications and Methods, pp. 93-120. Chandos Publishing, Oxford, UK (2009).
2. Akasereh, M., Naji, N., Savoy, J.: UniNE at CLEF – CHiC 2013. In: Proceedings CLEF 2013, Working Notes (2013).
3. International Council of Museums: Scope Definition of the CIDOC Conceptual Reference Model (2003). http://www.cidoc-crm.org/scope.html
4. Larson, R.: Pseudo-Relevance Feedback for CLEF-CHiC Adhoc. In: Proceedings CLEF 2013, Working Notes (2013).
5. Petras, V., Ferro, N., Gäde, M., Isaac, A., Kleineberg, M., Masiero, I., Nicchio, M., Stiller, J.: Cultural Heritage in CLEF (CHiC) Overview 2012. In: Proceedings CLEF 2012, Working Notes (2012).
6. Petras, V., Bogers, T., Toms, E., Hall, M., Savoy, J., Malak, P., Pawłowski, A., Ferro, N., Masiero, I.: Cultural Heritage in CLEF (CHiC) 2013. In: Proceedings of CLEF 2013, LNCS, Springer (forthcoming).
7. Popescu, A.: CEA LIST's Participation at the CLEF CHiC 2013. In: Proceedings CLEF 2013, Working Notes (2013).
8. Skov, M., Bogers, T., Lund, H., Jensen, M., Wistrup, E., Larsen, B.: RSLIS/AAU at CHiC 2013. In: Proceedings CLEF 2013, Working Notes (2013).
9. Stiller, J., Gäde, M., Petras, V.: Ambiguity of Queries and the Challenges for Query Language Detection. In: CLEF 2010 Labs and Workshops, Notebook Papers (2010). http://clef2010.org/resources/proceedings/clef2010labs_submission_41.pdf
10. Tan, K., Almasri, M., Chevallet, J., Mulhem, P., Berrut, C.: Multimedia Information Modeling and Retrieval (MRIM) / Laboratoire d'Informatique de Grenoble (LIG) at CHiC 2013. In: Proceedings CLEF 2013, Working Notes (2013).
11. Tanase, D.: Using the Divergence Framework for Randomness: CHiC 2013 Lab Report. In: Proceedings CLEF 2013, Working Notes (2013).
12. Wilhelm-Stein, T., Schürer, B., Eibl, M.: Identifying the Most Suitable Stemmer for the CHiC Multilingual Ad-hoc Task. In: Proceedings CLEF 2013, Working Notes (2013).