DCU at VideoCLEF 2009

Ágnes Gyarmati and Gareth J. F. Jones
Centre for Digital Video Processing
Dublin City University, Dublin 9, Ireland
{agyarmati|gjones}@computing.dcu.ie

Abstract

DCU participated in the VideoCLEF 2009 Linking Task. Our approach was based on identifying relevant related content using the Lemur information retrieval toolkit. We implemented two distinct variants of our approach. One version performs the search in the Dutch Wikipedia with the exact words (either stemmed or not) of the search query extracted from the ASR transcription, and returns the corresponding links pointing to the English Wikipedia. The other variant first performs an automatic machine translation of the Dutch query into English, and then uses the translated query to search the English Wikipedia directly. Among our four runs, we achieved the best results with the first approach, when retrieval was based on the stemmed and stopped Dutch Wikipedia. Unfortunately, there is no one-to-one correspondence between the pages of the Dutch and English Wikipedias, so some hits from the Dutch Wikipedia were lost as results due to the lack of an equivalent English article. In extreme cases, our system may return no output at all if none of the hits for a given anchor are linked to a page in the English Wikipedia. Although we included a preprocessing phase before indexing the article collections, some unhelpful but frequently occurring types of page escaped it and had a significant negative impact on our second approach, implemented in Run 3.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software; H.3.7 Digital Libraries; I.2 [Artificial Intelligence]: I.2.7 Natural Language Processing—Speech recognition and synthesis

General Terms

Information Retrieval, Automatic Speech Recognition

Keywords

information retrieval, automatic speech recognition, cross-language description linking

1 Introduction

The VideoCLEF Linking Task involves locating content related to sections of an automatic speech recognition (ASR) transcription cross-lingually. Elements of a Dutch ASR transcription are to be linked to related pages in an English Wikipedia collection. We submitted four runs for the VideoCLEF 2009 Linking Task [1], implementing two different approaches to the task. Because of the difference between the source language (Dutch) and the target language (English), a switch between the languages at some point in the system is inevitable. The two approaches differ in where this switch takes place. One approach performs the search in the Dutch Wikipedia with the exact words (either stemmed or not) of the search query extracted from the ASR transcription, and returns the corresponding links pointing to the English Wikipedia. The other variant first performs an automatic machine translation of the Dutch query into English, and then uses the translated query to search the English Wikipedia directly.

2 System Description

Wikipedia dumps are created and published regularly; for our experiments we used the dump dated May 30th 2009 for the English collection and the dump dated May 31st 2009 for the Dutch language Wikipedia collection. In a simple preprocessing phase, we eliminated information we did not consider relevant to the task, e.g. information about users, comments, and links to other languages we did not need.
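For illustration, the following minimal sketch (in Python; not the code actually used in our system) shows one way such a preprocessing pass could be carried out: it streams a MediaWiki pages-articles dump, skips non-article pages by their title prefixes, and records the interlanguage link to the English Wikipedia where one exists. The namespace prefix list, the export schema version and the function name iter_articles are illustrative assumptions.

    import re
    import xml.etree.ElementTree as ET

    # Hypothetical namespace prefixes to drop (user pages, talk pages, etc.);
    # the exact list depends on the language edition of the dump.
    SKIP_PREFIXES = ("User:", "User talk:", "Talk:", "Wikipedia:",
                     "Template:", "Category:", "File:", "Help:", "Portal:")

    # Interlanguage link to the English Wikipedia embedded in the wikitext,
    # e.g. [[en:Cabinet of curiosities]].
    EN_LINK = re.compile(r"\[\[en:([^\]|]+)\]\]")

    def iter_articles(dump_path):
        """Yield (title, text, english_title_or_None) for ordinary articles."""
        ns = "{http://www.mediawiki.org/xml/export-0.3/}"  # schema version may differ
        for _, elem in ET.iterparse(dump_path):
            if elem.tag != ns + "page":
                continue
            title = elem.findtext(ns + "title") or ""
            text = elem.findtext(ns + "revision/" + ns + "text") or ""
            elem.clear()  # keep memory bounded while streaming the large dump
            if title.startswith(SKIP_PREFIXES):
                continue
            match = EN_LINK.search(text)
            yield title, text, (match.group(1) if match else None)

The retained article text would then be passed to indexing, while the extracted English titles support the link lookup used by the first approach.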
For indexing and retrieval we used the Indri model of the open-source Lemur Toolkit [2], unaltered and off-the-shelf. English text was stemmed with Lemur's built-in stemmer, while Dutch text was stemmed using Oleander's implementation [5] of Snowball's Dutch stemming algorithm [6]. We used the stopword lists provided by Snowball for both languages.

Queries were formed from the sequences of words extracted from the ASR transcripts for each of the anchors defined by the task. As the transcript files contain timing information for each word, and anchors were defined by their start and end points, our system located the given start point of an anchor in the transcript file and took all the consecutive words falling within the period up to the given end point. In the first approach, these sequences were used directly as queries for retrieval from the Dutch collection, and for each anchor and each retrieved article the Dutch Wikipedia's own links pointing to the corresponding articles of the English version were returned as the solution to the task. The other option was to translate the words into English first and then search the English Wikipedia using the translation as the query. Translations were performed automatically using the query translation component developed for the MultiMatch project [3]. This translation tool combines the WorldLingo machine translation engine with a bilingual dictionary from the cultural heritage domain automatically extracted from the multilingual Wikipedia.

3 Run Configurations

Here we describe the four runs we submitted to the Linking Task. The most prominent feature of each run is the choice of the collection used for retrieval, i.e. whether it was the Dutch or the English Wikipedia.

1. Dutch: As Lemur does not have Dutch-specific built-in tools (stemmer), we indexed the Dutch Wikipedia as it was, without stemming or stopping. Retrieval was then performed on the Dutch collection, returning the relevant links found there.

2. Dutch stemmed: The retrieval steps are identical to those of Run 1; the only difference lies in the processing of the collection (and of the queries), whose text is stemmed and stopped.

3. English: This run represents the second approach, with the query translated first and retrieval then performed on the English collection. Text was stemmed and stopped.

4. Dutch with blind relevance feedback: This run is almost identical to Run 1, with the difference that Lemur's parameters were set to perform blind relevance feedback. Lemur/Indri uses a relevance model; for details see [4]. The first ten retrieved documents were assumed relevant and queries were expanded by five terms.

4 Results

In this section we present the results obtained by our runs. The Linking Task was assessed by the organisers as a known-item task. The most relevant link for each anchor is called the primary link, and all other relevant links defined additionally by the assessors are called secondary links [1]. Table 1 lists Recall and Mean Reciprocal Rank (MRR) for primary links; Table 2 shows only MRR values for secondary links, as Recall cannot be computed due to the lack of an exhaustive list of secondary links.

Run      Recall    MRR
Run 1    44/165    0.18202
Run 2    44/165    0.18202
Run 3    13/165    0.05606
Run 4    38/165    0.14434

Table 1: Scores for Primary Links

Runs 1 and 2 achieved the highest scores.
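As a point of reference for how the figures in Table 1 are defined, a minimal sketch of the primary-link scoring follows (in Python). It assumes one primary link per anchor and a ranked list of returned English article titles per anchor; the official scoring was performed by the task organisers, so the dictionary-based run format here is purely illustrative.

    def primary_link_scores(run, qrels):
        """Recall and MRR over primary links.

        run:   {anchor_id: ranked list of English article titles}
        qrels: {anchor_id: primary link (English article title)}
        """
        found, rr_sum = 0, 0.0
        for anchor, primary in qrels.items():
            ranking = run.get(anchor, [])
            if primary in ranking:
                found += 1
                rr_sum += 1.0 / (ranking.index(primary) + 1)  # reciprocal rank
        return found / len(qrels), rr_sum / len(qrels)  # (Recall, MRR)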
Although Runs 1 and 2 yield slightly different output, the decision whether to stem and stop the text does not alter the results to a statistically meaningful extent as far as primary links are concerned, while stemming and stopping (Run 2) improved the results for secondary links. Run 4 used blind relevance feedback to expand the queries; setting the optimal parameters for this process would require further experimentation, and expansion methods other than Indri's relevance model could be used and their results compared.

The main problem of this approach (that is, searching the Dutch collection) lies in the differences between the English and the Dutch versions of Wikipedia. Although the English site is approximately ten times larger than its Dutch counterpart (in terms of the number of articles), there are articles that have no equivalent page in the other language, due, for example, to different structuring on the other side or to cultural differences. Runs 1, 2 and 4 might (and in fact did) come up with relevant links at some point that were lost when checking for direct links on the Dutch page pointing to an English page. A weak point of our system in this approach is that some hits from the Dutch Wikipedia may be lost as results due to the lack of an equivalent English article. In extreme cases, our system may return no output at all if none of the hits for a given anchor are linked to any page in the English Wikipedia.

Run      MRR
Run 1    0.26773
Run 2    0.27475
Run 3    0.08990
Run 4    0.18960

Table 2: Scores for Related Links

Run 3, which implemented the other approach, performed significantly worse. This may be due to two aspects of the switch to the English collection. First, the query text was translated automatically from Dutch into English, which in itself carries a certain risk of losing information through misinterpreting words or expressions, or ignoring words unrecognised by the translation tool. While the MultiMatch translation tool has a vocabulary expanded to include many concepts from the cultural heritage domain, there were many specialist concepts in the ASR transcription that are not included in its translation vocabulary. Approximately 3.5% of the words (names not included) were left unchanged in Dutch by the translator, which might be considered insignificant, but some of them turned out to be key words, e.g. rariteitenkabinet 'cabinet of curiosities', which was in fact retrieved by the system in Runs 1 and 2 (although ranked lower than desired). The other main problem we encountered with Run 3 lay in the size of the English Wikipedia and our insufficient experience of its structure. The downloadable dump includes a large number of pages that look like useful articles but are in fact not: pages that used to be, or were never meant to be, articles at all (discussion pages, articles nominated for deletion). This escaped our attention during the development phase, but had a high impact on our results, as about 18.5% of the links returned by Run 3 proved to be invalid. Cleaning up the data more carefully would give Run 3 a fairer chance.

Results are also (mostly negatively) affected by the quality of the ASR transcripts. As the transcripts were provided with the task, we used them as they were in each of the runs; it was not the transcripts but the steps followed that distinguished the four runs. For more discussion of the relation of the transcripts to the Linking Task, see [1].
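To make the transcript-to-query step from Section 2 concrete, the short sketch below (in Python) cuts an anchor's query out of a time-stamped transcript. The representation of the transcript as (start, end, word) tuples and the example timings are assumptions for illustration only, not a description of the distributed transcript files.

    def anchor_query(transcript, anchor_start, anchor_end):
        """Collect the words whose timestamps fall inside the anchor interval.

        transcript: list of (start, end, word) tuples in seconds, in spoken order
                    (an assumed representation of the ASR transcript).
        """
        return " ".join(word for start, end, word in transcript
                        if start >= anchor_start and end <= anchor_end)

    # Hypothetical example: an anchor spanning 12.0-14.5 seconds.
    words = [(11.8, 12.1, "het"), (12.2, 12.9, "rariteitenkabinet"),
             (13.0, 13.4, "van"), (13.5, 14.2, "de"), (14.3, 14.9, "verzamelaar")]
    print(anchor_query(words, 12.0, 14.5))  # -> "rariteitenkabinet van de"

The resulting query string would then be issued against the Dutch index (Runs 1, 2 and 4) or translated and issued against the English index (Run 3).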
5 Conclusions

In this paper we have outlined the details of our submissions to the Linking Task at VideoCLEF 2009. We described how the data were processed and how the systems worked, and presented the results scored by each run. On the one hand, there is still room for improvement on our side, e.g. by finding better parameter settings and/or other expansion methods, or by preprocessing the data better and eliminating more of the unhelpful information than we had previously done. On the other hand, improvements in factors independent of us might also help our system to achieve better results. To mention one possibility, wikipedia.org keeps growing in every language, and editors may add the links we were missing.

Acknowledgements

This work is funded by a grant under the Science Foundation Ireland Research Frontiers Programme 2008. We are grateful to Eamonn Newman for assistance with the MultiMatch translation tool.

References

[1] Larson, Martha, Newman, Eamonn and Jones, Gareth J. F. Overview of VideoCLEF 2009: New Perspectives on Speech-based Multimedia Content Enrichment. In Borri, Francesca, Nardi, Alessandro and Peters, Carol (eds.), Working Notes of CLEF 2009.

[2] The Lemur Toolkit. http://www.lemurproject.org/

[3] Jones, Gareth J. F., Fantino, Fabio, Newman, Eamonn and Zhang, Ying. Domain-Specific Query Translation for Multilingual Information Access Using Machine Translation Augmented With Dictionaries Mined From Wikipedia. In Proceedings of the 2nd International Workshop on Cross Lingual Information Access - Addressing the Information Need of Multilingual Societies (CLIA-2008), Hyderabad, India, pp. 34-41, 2008.

[4] Metzler, Don. Indri Retrieval Model Overview. http://ciir.cs.umass.edu/~metzler/indriretmodel.html

[5] Oleander Stemming Library. http://sourceforge.net/projects/porterstemmers/

[6] Snowball. http://snowball.tartarus.org/