<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CACAO PROJECT AT THE LOGCLEF TRACK</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alessio Bosca</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Dini</string-name>
          <email>dini@celi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents the participation of the CACAO prototype in the Log Analysis for Digital Societies (LADS) task of the LogCLEF 2009 track. CACAO (Cross-language Access to Catalogues And On-line libraries) is an EU project devoted to enabling cross-language access to the contents of a federation of digital libraries with a set of software tools for harvesting, indexing and searching such data. In our experiment we investigated the possibility of exploiting the TEL log data as a source for inferring new translations, thus enriching existing translation dictionaries; the proposed approach is based on the assumption that users consulting a multilingual digital collection are likely to repeat the same query in different languages. We applied our approach to the logs from TEL, and the results obtained are very promising.</p>
      </abstract>
      <kwd-group>
        <kwd>H.3 [Information Storage and Retrieval]</kwd>
        <kwd>H.3.1 Content Analysis and Indexing</kwd>
        <kwd>H.3.3 Information Search and Retrieval</kwd>
        <kwd>H.3.7 Digital Libraries</kwd>
      </kwd-group>
      <kwd-group>
        <title>General Terms</title>
        <kwd>Measurement</kwd>
        <kwd>Performance</kwd>
        <kwd>Experimentation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The Log Analysis for Digital Societies (LADS) task of the LogCLEF track is a new task that focuses
on log analysis as a means to infer new knowledge from user logs (e.g. user behaviour,
multilingual resources); in particular, the task asks participants to work with logs from
The European Library (TEL).</p>
      <p>CACAO (Cross-language Access to Catalogues And On-line libraries) is an EU project devoted
to enabling cross-language access to the contents of a federation of digital libraries with a set of
software tools for harvesting, indexing and searching such data.</p>
      <p>In our experiment we focused on the multilingual aspect of log analysis; in particular, we
investigated the possibility of exploiting the TEL log data as a source for inferring new translations,
thus enriching existing translation resources for dictionary-based cross-language access
to digital libraries.</p>
      <p>The proposed methodology is based on the assumption that when users are aware of consulting
a multilingual digital collection, they are likely to repeat the same query several times, in several
languages. By adopting the proposed algorithm, it is possible to discover translationally equivalent
queries in logs produced by monitoring user queries.</p>
      <p>The basic idea behind our approach (named the TLike algorithm) is to estimate the probability that
two queries are translations of each other. In the simple case we expect that if all the words
in query QS have a translation in query QT, and if QS and QT have the same number of terms,
then QS and QT are translation equivalents. Things are of course more complex than this, due to
the following facts:</p>
      <p>The presence of compound words makes the constraint on the cardinality of search terms
defeasible (e.g. the Italian carta di credito vs. the German Kreditkarte).</p>
      <p>One or more words in QS could be absent from the translation dictionaries.</p>
      <p>One or more words in QS could be present in the translation dictionaries, but the contextually
correct translation might be missing.</p>
      <p>There might be items which do not need to be translated, notably Named Entities.</p>
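      <p>As an illustration only, the simple case described above can be sketched in Python as follows. The function name is_simple_translation, the whitespace tokenization and the toy French-to-Italian dictionary are assumptions made for this sketch; it is not the TLike implementation and it deliberately ignores the complications just listed.</p>
      <preformat>
```python
# Hypothetical toy French-to-Italian word dictionary; illustrative only.
TOY_FR_IT = {
    "guerre": {"guerra"},
    "mondiale": {"mondiale"},
}


def is_simple_translation(source_query, target_query, dictionary):
    """Naive check for the simple case: both queries have the same number
    of terms and every source term has a translation in the target query."""
    source_terms = source_query.lower().split()
    target_terms = target_query.lower().split()
    if len(source_terms) != len(target_terms):
        return False
    target_set = set(target_terms)
    return all(
        not dictionary.get(term, set()).isdisjoint(target_set)
        for term in source_terms
    )
```
      </preformat>
      <p>With this toy dictionary, the pair "guerre mondiale" and "guerra mondiale" passes the check, while "storia della chiesa" and "church history" is rejected by the term count alone, which is exactly the kind of case that motivates a probabilistic treatment.</p>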
      <p>This paper is organized as follows. Section 2 presents the architecture of our system, Section 3
describes our experiments and the results obtained, and Section 4 concludes.</p>
    </sec>
    <sec id="sec-2">
      <title>CACAO Project</title>
      <p>CACAO (Cross-language Access to Catalogues And On-line libraries) is an EU project funded
under the eContentplus program and proposes an innovative approach for accessing, understanding
and navigating multilingual textual content in digital libraries and OPACs, enabling European
users to better exploit the available European electronic content.</p>
      <p>By coupling sound Natural Language Processing techniques with available information retrieval
systems the project aims at the delivery of a non-intrusive infrastructure to be integrated with
current OPACs and digital libraries. The result of such integration will be the possibility for the user
to type in queries in his/her own language and retrieve volumes and documents in any available
language. CACAO aims at offering cross-lingual and cross-border access to the content of classical
and digital libraries and enabling users to find digital content irrespective of the language. In fact,
in a context of interlaced cross-border libraries, such as the ones proposed by META OPAC, the
absence of a cross-language perspective is likely to cause a substantial impasse: if a user wanted
to access a META OPAC including the National Libraries of France, Germany, Italy, Poland and
Hungary, s/he would have to type five queries in five different languages. Much of the advantage
of having a unique access point is thus lost.</p>
      <p>The CACAO project proposes a system based on the assumption that users increasingly search
library contents with free keyword queries (like those used with a web search engine) rather
than through more traditional library-oriented access (e.g. via Subject Headings); therefore, the only way
to address the cross-language issue is to translate the query into all languages covered by the
library/collection (rather than, for instance, translating subject headings, as in the MACS approach,
https://macs.vub.ac.be/pub/). The system will then yield results in all desired languages.</p>
      <sec id="sec-2-1">
        <title>Architecture Overview</title>
        <p>The general architecture of the CACAO system can be summarized as the interaction
of a few functional subsystems, coordinated by a central manager and reacting to external stimuli,
namely end-user queries:</p>
        <p>Harvesting subsystem: is in charge of collecting data from digital libraries, abstracting from
the multiplicity of standards and protocols, and storing them into a repository.</p>
        <p>Corpus Analysis subsystem: performs specific analysis on the data collected from libraries
and infers new information used to support query processing and resource retrieval (e.g.
query expansion, term disambiguation, ...).</p>
        <p>Web Services subsystem: represents third-party software providing specific services (e.g.
linguistic analysis, translations, ...).</p>
        <p>Query Processing subsystem: a set of components devoted to processing the original
monolingual user query, transforming and enriching it by means of translations and expansions.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiments</title>
      <p>The first step of our experiment consisted in the creation of a Lucene search index starting from
the TEL logs. The information contained in the query field of the logs was filtered in order to
remove terms pertaining to the query syntax (restrictions on fields, boolean operators, ...) and then enriched
by means of shallow NLP techniques, such as lemmatization and named entity recognition, and of a
language guesser used to identify the source language of the query.</p>
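      <p>The filtering part of this step can be pictured with the following minimal Python sketch. It assumes a Lucene-like syntax with field:term restrictions and uppercase boolean operators; the actual syntax of the TEL logs may differ, and the function name clean_query is hypothetical. Lemmatization, named entity recognition and language guessing are not shown.</p>
      <preformat>
```python
import re

# Assumption: boolean operators are written in uppercase, as in Lucene syntax.
BOOLEAN_OPERATORS = {"AND", "OR", "NOT"}
# Assumption: field restrictions look like "title:term".
FIELD_PREFIX = re.compile(r"^\w+:")


def clean_query(raw_query):
    """Strip syntax-related tokens and return the lowercased search terms."""
    # Plus signs, double quotes and parentheses only carry query syntax.
    text = re.sub(r'[+"()]', " ", raw_query)
    terms = []
    for token in text.split():
        token = FIELD_PREFIX.sub("", token)
        if token and token not in BOOLEAN_OPERATORS:
            terms.append(token.lower())
    return " ".join(terms)
```
      </preformat>
      <p>Under these assumptions, a raw entry such as title:war+and+peace is reduced to the plain terms "war and peace"; the lowercase "and" is kept because only uppercase operators are treated as syntax.</p>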
      <p>
        The second step used the CACAO search engine to create a resource containing all
possible translation candidates. The CACAO project (see [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]) aims at providing a European-level
answer to cross-language information access to digital libraries by exploiting the now mature
technology of cross-language information retrieval.
      </p>
      <p>Each distinct query contained in the logs was used as input for the CACAO system in
order to obtain a set of translation candidates for the query. The CACAO system translated
the query from the TEL logs into all the languages it natively supports (English, French, German,
Polish, Hungarian and Italian) and then used these translations to search for related
queries in other languages. The result of this step was a text file containing, for each
distinct query, a list of translation candidates proposed by the CACAO engine.</p>
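      <p>A rough Python sketch of this candidate collection step is given below. It replaces the Lucene index and the CACAO translation services with plain in-memory dictionaries, so the data, the language pairs and the names build_index and translation_candidates are all assumptions made for illustration.</p>
      <preformat>
```python
from collections import defaultdict

# Hypothetical cleaned log entries: (query text, guessed language).
LOG_QUERIES = [
    ("guerre mondiale", "fr"),
    ("guerra mondiale", "it"),
    ("storia della chiesa", "it"),
    ("church history", "en"),
]

# Hypothetical word dictionaries keyed by (source, target) language pair.
DICTIONARIES = {
    ("fr", "it"): {"guerre": ["guerra"], "mondiale": ["mondiale"]},
    ("it", "en"): {"storia": ["history"], "chiesa": ["church"]},
}


def build_index(log_queries):
    """Map (language, term) to the set of log queries containing the term."""
    index = defaultdict(set)
    for text, lang in log_queries:
        for term in text.split():
            index[(lang, term)].add(text)
    return index


def translation_candidates(query, source_lang, index, dictionaries):
    """Collect log queries in other languages that share a translated term."""
    candidates = defaultdict(set)
    for (src, tgt), word_dict in dictionaries.items():
        if src != source_lang:
            continue
        for term in query.split():
            for translated in word_dict.get(term, []):
                candidates[tgt].update(index.get((tgt, translated), set()))
    return {lang: sorted(found) for lang, found in candidates.items() if found}
```
      </preformat>
      <p>Calling translation_candidates("guerre mondiale", "fr", build_index(LOG_QUERIES), DICTIONARIES) on this toy data returns the Italian log query "guerra mondiale" as the only candidate. Any query sharing a single translated term is retained here; ranking the candidates is left to the next step.</p>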
      <p>The third step of the algorithm consisted in applying the TLike procedure in order to evaluate
the probabilities associated with the different translation candidates extracted from the logs and
thus obtain a list of proposed translations as well as some statistics on the retrieved translations.</p>
      <p>The TLike algorithm is based on three main resources:</p>
      <p>A system for Natural Language Processing able to perform, for each relevant language, basic
tasks such as part-of-speech disambiguation, lemmatization and named entity recognition.</p>
      <p>A set of word-based bilingual translation modules.</p>
      <p>A semantic component able to associate a semantic vectorial representation to words.</p>
      <p>
        The basic idea behind the TLike algorithm is to estimate the probability that two queries are
translations of each other; a detailed description of the strategy adopted can be found in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
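      <p>To give a concrete picture of what a semantic vectorial representation allows, the following Python sketch compares two queries through the cosine of their summed word vectors. The cosine measure, the additive composition, the shared cross-lingual dimensions and the toy vectors are all assumptions of this sketch; how TLike actually uses the semantic component is not specified here.</p>
      <preformat>
```python
import math
from collections import Counter

# Hypothetical sparse semantic vectors (dimension name to weight).
# Assumption: words of different languages share the same dimensions.
VECTORS = {
    "flood": {"water": 0.9, "disaster": 0.6},
    "inondation": {"water": 0.8, "disaster": 0.7},
    "architecture": {"building": 1.0},
}


def cosine(vec_a, vec_b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * vec_b.get(dim, 0.0) for dim, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def query_vector(query, vectors):
    """Sum the vectors of the query terms; unknown terms contribute nothing."""
    total = Counter()
    for term in query.split():
        for dim, weight in vectors.get(term, {}).items():
            total[dim] += weight
    return dict(total)
```
      </preformat>
      <p>With these toy vectors, the queries "after the flood" and "inondation" obtain a similarity close to 1, while "architecture" and "inondation" obtain 0, since they share no dimension.</p>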
      <sec id="sec-3-1">
        <title>Experiment results</title>
        <table-wrap>
          <table>
            <thead>
              <tr>
                <th>Source Query in Logs</th>
                <th>Candidate Translations from Logs</th>
              </tr>
            </thead>
            <tbody>
              <tr><td>the road to glory [en]</td><td>en route pour la gloire [fr]</td></tr>
              <tr><td>la vita di gesu narratasales [it]</td><td>essai sur la vie de jésus [fr]</td></tr>
              <tr><td>die russische sprache der gegenwart [de]</td><td>russian language composition and exercises [en]</td></tr>
              <tr><td>democratie [fr]</td><td>the future of democracy [en]</td></tr>
              <tr><td>digital image processing [en]</td><td>cours de traitement numérique de l image [fr]</td></tr>
              <tr><td>biblia krolowej zofii [pl]</td><td>simbolis in the bible [en]</td></tr>
              <tr><td>architecture [en]</td><td>trattato di architettura [it]</td></tr>
              <tr><td>inondation [fr]</td><td>after the flood [en]</td></tr>
              <tr><td>guerre mondiale [fr]</td><td>guerra mondiale [it]</td></tr>
              <tr><td>quali varieta di meli e di peri [it]</td><td>biology of apple and pear [en]</td></tr>
              <tr><td>national library of norway [en]</td><td>biblioteka narodowa [pl]</td></tr>
              <tr><td>portrait de dorian gray [fr]</td><td>the portrait of dorian gray [en]</td></tr>
              <tr><td>la guerre et la paix [fr]</td><td>war+and+peace [en]</td></tr>
              <tr><td>production de l espace [fr]</td><td>the production of space [en]</td></tr>
              <tr><td>exposition universelle 1900 [fr]</td><td>esposizione universale di roma [it]</td></tr>
              <tr><td>storia della chiesa [it]</td><td>church history [en]</td></tr>
              <tr><td>firmen landwirtschaftliche maschinen [de]</td><td>lagriculture et les machines agricoles [fr]</td></tr>
              <tr><td>lord of the rings [en]</td><td>le seigneur des anneaux [fr]</td></tr>
              <tr><td>dictionnaire biographique [fr]</td><td>dizionario biografico [it]</td></tr>
              <tr><td>deutsche mythologie [de]</td><td>the mythology of aryan nations [en]</td></tr>
              <tr><td>ancient maps [en]</td><td>carte antique [fr]</td></tr>
              <tr><td>round the world in 80 [en]</td><td>le tour du monde en 80 [fr]</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
    </sec>
    <sec>
      <title>Conclusions</title>
      <p>This paper represents the first step of our research on NLP-based query log analysis. The preliminary
results are quite encouraging, and in the future we plan to extend this research in two directions:</p>
      <p>We will consider all the information contained in query logs, such as session identifiers,
temporal distance, repetition of the same query, semantic distance among similar queries,
etc.</p>
      <p>
        We will try to extend the semantic matching method to cover cases where the semantic
vectors are not present in the semantic repository. This will imply the use of the web and
web search engines as a dynamic corpus ([
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]).
      </p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgements</title>
      <p>This work has been supported and funded by the CACAO EU project (ECP 2006 DILI 510035).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Lucene</surname>
          </string-name>
          .
          <article-title>The Lucene search engine</article-title>
          . URL: http://jakarta.apache.org/lucene/.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bosca</surname>
          </string-name>
          and
          <string-name>
            <given-names>L.</given-names>
            <surname>Dini</surname>
          </string-name>
          .
          <article-title>Query expansion via library classification systems</article-title>
          .
          <source>LNCS proceedings on CLEF@TEL</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bosca</surname>
          </string-name>
          and
          <string-name>
            <given-names>L.</given-names>
            <surname>Dini</surname>
          </string-name>
          .
          <article-title>The role of logs in improving cross language access in digital libraries</article-title>
          .
          <source>In Proceedings of the International Conference on Semantic Web and Digital Libraries</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Baroni</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Bisi</surname>
          </string-name>
          .
          <article-title>Using cooccurrence statistics and the web to discover synonyms in technical language</article-title>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>