CACAO PROJECT AT THE LOGCLEF TRACK

Alessio Bosca, Luca Dini
Celi s.r.l. - 10131 Torino - C. Moncalieri, 21
alessio.bosca, dini@celi.it

Abstract

This paper presents the participation of the CACAO prototype in the Log Analysis for Digital Societies (LADS) task of the LogCLEF 2009 track. CACAO (Cross-language Access to Catalogues And On-line libraries) is an EU project devoted to enabling cross-language access to the contents of a federation of digital libraries through a set of software tools for harvesting, indexing and searching such data. In our experiment we investigated the possibility of exploiting the TEL log data as a source for inferring new translations, thus enriching existing translation dictionaries; the proposed approach is based on the assumption that users consulting a multilingual digital collection are likely to repeat the same query in different languages. We applied our approach to the logs from TEL and the results obtained are very promising.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.7 Digital Libraries

General Terms

Measurement, Performance, Experimentation

Keywords

Cross-Language Information Retrieval, Log Analysis, Translations Disambiguation, Digital Libraries

1 Introduction

The Log Analysis for Digital Societies (LADS) task of the LogCLEF track is a new task that focuses on log analysis as a means of inferring new knowledge from user logs (e.g. user behaviours, multilingual resources); in particular, the task asks participants to work with logs from The European Library (TEL). CACAO (Cross-language Access to Catalogues And On-line libraries) is an EU project devoted to enabling cross-language access to the contents of a federation of digital libraries through a set of software tools for harvesting, indexing and searching such data.
In our experiment we focused on the multilingual aspect of log analysis; in particular, we investigated the possibility of exploiting the TEL log data as a source for inferring new translations, thus enriching existing translation resources for dictionary-based cross-language access to digital libraries. The proposed methodology is based on the assumption that when users are aware that they are consulting a multilingual digital collection, they are likely to repeat the same query several times, in several languages. With the proposed algorithm it is possible to discover translationally equivalent queries in logs produced by monitoring user queries.

The basic idea behind our approach (named the TLike algorithm) is to estimate the probability that two queries are translations of each other. In the simple case we expect that if all the words in query QS have a translation in query QT, and if QS and QT have the same number of terms, then QS and QT are translation equivalents. In practice things are more complex, for the following reasons:

• The presence of compound words makes the constraint on the cardinality of search terms defeasible (e.g. the Italian carta di credito vs. the German Kreditkarte).
• One or more words in QS could be absent from the translation dictionaries.
• One or more words in QS could be present in the translation dictionaries, but the contextually correct translation might be missing.
• There might be items which do not need to be translated, notably Named Entities.

This paper is organized as follows. Section 2 presents the architecture of our system, Section 3 describes our experiments and the results obtained, and Section 4 concludes.
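To make the simple case from the introduction concrete, the following minimal Python sketch checks only the two conditions stated there: equal number of terms, and every term of QS having a dictionary translation among the terms of QT. It is not the TLike algorithm; the function name and the toy dictionaries are assumptions introduced purely for illustration, and the sketch deliberately ignores compounds, dictionary gaps and named entities.

```python
# Toy bilingual dictionaries: illustrative assumptions, not CACAO resources.
TOY_FR_IT = {"guerre": {"guerra"}, "mondiale": {"mondiale"}}
TOY_IT_DE = {"carta": {"karte"}, "credito": {"kredit"}}


def simple_translation_match(qs, qt, dictionary):
    """Return True only in the simple case: QS and QT have the same number
    of terms and every term of QS has a dictionary translation in QT."""
    source_terms = qs.lower().split()
    target_terms = qt.lower().split()
    # Cardinality constraint: both queries must have the same number of terms.
    if len(source_terms) != len(target_terms):
        return False
    target_set = set(target_terms)
    # Every source term needs at least one known translation in the target.
    return all(dictionary.get(term, set()) & target_set for term in source_terms)


print(simple_translation_match("guerre mondiale", "guerra mondiale", TOY_FR_IT))  # True
print(simple_translation_match("carta di credito", "Kreditkarte", TOY_IT_DE))     # False
```

The second call fails on the cardinality check alone, which is exactly the compound-word problem listed above; a word-by-word test of this kind also does not enforce a one-to-one alignment between terms.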
2 CACAO Project

CACAO (Cross-language Access to Catalogues And On-line libraries) is an EU project funded under the eContentplus program. It proposes an innovative approach for accessing, understanding and navigating multilingual textual content in digital libraries and OPACs, enabling European users to better exploit the available European electronic content. By coupling sound Natural Language Processing techniques with available information retrieval systems, the project aims to deliver a non-intrusive infrastructure that can be integrated with current OPACs and digital libraries. As a result of this integration, users will be able to type queries in their own language and retrieve volumes and documents in any available language.

CACAO aims at offering cross-lingual and cross-border access to the content of classical and digital libraries, enabling users to find digital content irrespective of its language. In a context of interlaced cross-border libraries, such as the ones proposed by META OPAC, the absence of a cross-language perspective is likely to cause a substantial impasse: a user who wanted to access a META OPAC including the National Libraries of France, Germany, Italy, Poland and Hungary would have to type five queries in five different languages. Much of the advantage of having a single access point is thus lost.

The CACAO project proposes a system based on the assumption that users increasingly search library contents with free keyword queries (like those used with a web search engine) rather than through more traditional library-oriented access (e.g. via Subject Headings); therefore, the only way to address the cross-language issue is to translate the query into all the languages covered by the library/collection (rather than, for instance, translating subject headings, as in the MACS approach, https://macs.vub.ac.be/pub/). The system will then yield results in all desired languages.
2.1 Architecture Overview

The general architecture of the CACAO system can be summarized as the result of the interactions of a few functional subsystems, coordinated by a central manager and reacting to external stimuli in the form of end-user queries:

• Harvesting subsystem: in charge of collecting data from digital libraries, abstracting from the multiplicity of standards and protocols, and storing them in a repository.
• Corpus Analysis subsystem: performs specific analyses on the data collected from libraries and infers new information used to support query processing and resource retrieval (e.g. query expansion, term disambiguation).
• Web Services subsystem: represents third-party software providing specific services (e.g. linguistic analysis, translations).
• Query Processing subsystem: a set of components devoted to processing the original monolingual user query, transforming and enriching it by means of translations and expansions.

Figure 1: CACAO System Architecture

3 Experiments

The first step of our experiment consisted in creating a Lucene search index from the TEL logs. The information contained in the query field of the logs was filtered to remove terms pertaining to the query syntax (restrictions on fields, boolean operators, etc.) and enriched by means of shallow NLP techniques such as lemmatization and named entity recognition, together with a language guesser used to identify the source language of each query.

The second step involved the CACAO search engine in order to create a resource containing all possible translation candidates. The CACAO project (see [2]) aims at providing a European-level answer to cross-language information access to digital libraries by exploiting the now mature technology of cross-language information retrieval. Each distinct query contained in the logs was used as input for the CACAO system in order to obtain a set of translation candidates for the query.
In fact, the CACAO system translated each query from the TEL logs into all the natively supported languages (English, French, German, Polish, Hungarian and Italian) and then used these translations to search for related queries in other languages; the result of this step was a text file containing, for each distinct query, a list of translation candidates proposed by the CACAO engine.

The third step consisted in applying the TLike procedure to evaluate the probabilities associated with the different translation candidates extracted from the logs, thus obtaining a list of proposed translations as well as some statistics on the retrieved translations. The TLike algorithm is based on three main resources:

• A Natural Language Processing system able to perform, for each relevant language, basic tasks such as part-of-speech disambiguation, lemmatization and named entity recognition.
• A set of word-based bilingual translation modules.
• A semantic component able to associate a semantic vector representation with words.

The basic idea behind the TLike algorithm is to estimate the probability that two queries are translations of each other; a detailed description of the strategy adopted can be found in [3].

3.1 Experiment results

Table 1 presents an excerpt of the translation pairs extracted from the TEL logs with our approach, while Table 2 shows some statistical measures on the retrieved translations.
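The precision and recall values reported in Table 2 follow from the standard definitions, precision = TP / (TP + FP) and recall = TP / (TP + FN). As a quick check, the short Python sketch below recomputes them from the counts given in Table 2; the function name is an assumption introduced for illustration and is not part of the CACAO system.

```python
def precision_recall(tp, fp, fn):
    """Standard precision and recall; returns 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Counts taken from Table 2 (true positives, false positives, false negatives).
precision, recall = precision_recall(tp=351, fp=0, fn=80049)
print(precision, recall)
```

With no false positives the precision is 1.0, while the recall is 351 / 80400, roughly 0.0044, in agreement with Table 2.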
Source Query in Logs | Candidate Translations from Logs
the road to glory [en] | en route pour la gloire [fr]
la vita di gesu narratasales [it] | essai sur la vie de jsus [fr]
die russische sprache der gegenwart [de] | russian language composition and exercises [en]
democratie [fr] | the future of democracy [en]
digital image processing [en] | cours de traitement numrique de l image [fr]
biblia krolowej zofii [pl] | simbolis in the bible [en]
architecture [en] | trattato di architettura [it]
inondation [fr] | after the flood [en]
guerre mondiale [fr] | guerra mondiale [it]
quali varieta di meli e di peri [it] | biology of apple and pear [en]
national library of norway [en] | biblioteka narodowa [pl]
portrait de dorian gray [fr] | the portrait of dorian gray [en]
la guerre et la paix [fr] | war+and+peace [en]
production de l espace [fr] | the production of space [en]
exposition universelle 1900 [fr] | esposizione universale di roma [it]
storia della chiesa [it] | church history [en]
firmen landwirtschaftliche maschinen [de] | lagriculture et les machines agricoles [fr]
lord of the rings [en] | le seigneur des anneaux [fr]
dictionnaire biographique [fr] | dizionario biografico [it]
deutsche mythologie [de] | the mythology of aryan nations [en]
ancient maps [en] | carte antique [fr]
round the world in 80 [en] | le tour du monde en 80 [fr]

Table 1: Submitted Experiments

Measure | Value
True positive translations | 351.0
True negative translations | 0.0
False positive translations | 0.0
False negative translations | 80049.0
Precision | 1.0
Recall | 0.004365671641791045

Table 2: Evaluation Measures

4 Conclusions

This paper represents the first step of a research effort on NLP-based query log analysis. The preliminary results are quite encouraging, and in the future we plan to extend this research in two directions:

• We will consider all the information contained in query logs, such as session identifiers, temporal distance, repetition of the same query, semantic distance among similar queries, etc.
• We will try to extend the semantic matching method to cover cases where the semantic vectors are not present in the semantic repository. This will imply the use of the web and web search engines as a dynamic corpus [4].

5 Acknowledgements

This work has been supported and funded by the CACAO EU project (ECP 2006 DILI 510035).

References

[1] Lucene. The Lucene search engine. URL: http://jakarta.apache.org/lucene/.
[2] A. Bosca and L. Dini. Query expansion via library classification systems. LNCS proceedings on CLEF@TEL, 2008.
[3] A. Bosca and L. Dini. The role of logs in improving cross language access in digital libraries. In Proceedings of the International Conference on Semantic Web and Digital Libraries, 2009.
[4] M. Baroni and S. Bisi. Using cooccurrence statistics and the web to discover synonyms in technical language