<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Language Identification Strategies for Cross Language Information Retrieval</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alessio Bosca</string-name>
          <email>alessio.bosca@celi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Dini</string-name>
          <email>dini@celi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Celi s.r.l.</institution>
          ,
          <addr-line>Via S.Quintino 31 10131 Torino</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>In our participation in the 2010 LogCLEF track we focused on the analysis of The European Library (TEL) logs, and in particular we experimented with the identification of the natural language used in the queries. Language identification is in fact a key task within Cross-Language Information Retrieval systems, and the challenge is particularly difficult in the case of search queries, where the contextual information available is scarce: function words (grammar particles highly connotative of a specific language, like prepositions, pronouns, conjunctions, etc.) are usually missing, and the frequent presence of Named Entities can be misleading for the correct identification of the language used in the query. In order to face this challenge with acceptable performance, the techniques applied should be different from the ones adopted for language guessing on more extensive and syntactically richer text fragments, like metadata or textual documents. In particular we experimented with combining different strategies: corpus based, character model based, and a priori hypothesis. Since no official evaluation of the task is available, we manually evaluated a sample of 100 queries, and the results obtained are quite promising.</p>
      </abstract>
      <kwd-group>
        <kwd>Cross-Language Information Retrieval</kwd>
        <kwd>Language Identification</kwd>
        <kwd>Log Analysis</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        The LogCLEF track proposes to investigate log analysis as a means to infer
new knowledge, and in particular the task asks participants to deal with logs
from The European Library (TEL). TEL provides access to the national libraries of
several European countries; users and contents therefore span many languages,
and the logs provided in this task constitute a valuable opportunity and test-bed for
evaluating language identification strategies specifically tailored to search queries.
In the last decade the demand for IT systems capable of integrating and correlating
documents expressed in different languages has generated a huge effort in the research
community to support multilingual resources and Cross-Language
Information Retrieval (CLIR) systems, and several EU funded projects have focused on
this challenge, like Europeana[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], CACAO[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] or MICHAEL[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>A key requirement for supporting multilingual resources in IT systems is the
capability of associating textual contents with the language in which they are expressed,
whenever this information is not explicitly included in the meta-data associated
with the resource itself. The same issue emerges in CLIR systems supporting search
query translation as a means to leverage multilinguality and provide access to all the
documents satisfying user informational needs regardless of their language; the
approach of querying in one language and retrieving documents in all the available
languages is particularly significant whenever the contents of the exposed resources
are not textual (images, audio, etc.) and the constraint of being expressed in a specific
language only concerns the meta-data.</p>
      <p>
        Language identification techniques traditionally (see [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] or [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]) include models
based on the statistical distribution of character sequences, on the presence in the text
of function words (grammar particles highly connotative of a given language, like
conjunctions, pronouns, modifiers, etc.), or on comparing the frequency of terms against
language specific corpora. These strategies have different needs with
respect to available resources, computational power and processing time, and yield
different performances in different application contexts; therefore the most efficient
approach to language identification is to select the technique
with the lowest requirements for the given task.
      </p>
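      <p>As an illustration only, the character sequence strategy mentioned above can be
sketched as a rank-order n-gram profile in the style of Cavnar and Trenkle [4]; the toy
corpora, profile size and function names below are our own assumptions, not the actual
components used in our system.</p>
      <preformat>
```python
from collections import Counter

def ngram_profile(text, n=3, top=300):
    # Rank-ordered character n-gram profile (Cavnar & Trenkle style).
    text = " " + text.lower() + " "
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return {g: rank for rank, (g, _) in enumerate(counts.most_common(top))}

def out_of_place(doc_profile, lang_profile, max_rank=300):
    # Sum of rank differences; n-grams unseen in the language profile
    # receive the maximum penalty.
    return sum(abs(r - lang_profile.get(g, max_rank))
               for g, r in doc_profile.items())

def guess_language(text, profiles):
    # The language whose profile is closest to the input text wins.
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))

# Tiny illustrative corpora; a real guesser is trained on large reference corpora.
corpora = {
    "en": "the history of the national library and the european tradition",
    "it": "la storia della biblioteca nazionale e della tradizione europea",
}
profiles = {lang: ngram_profile(txt) for lang, txt in corpora.items()}
print(guess_language("biblioteca nazionale", profiles))  # -> it
```
      </preformat>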
      <p>Language identification for search queries is the most difficult task for
language guesser components: the contextual information available is scarce,
function words (grammar particles highly connotative of a specific language, like
prepositions, pronouns, conjunctions, etc.) are usually missing, and the frequent
presence of Named Entities can be misleading for the correct identification of the
language used in the query.</p>
      <p>In our experiment we investigated the weighted combination of different strategies,
corpus based, character model based and a priori hypotheses, and applied these
techniques to the user queries from the TEL logs; since no official evaluation of the task
is available, we manually evaluated a sample of 100 queries, and the results obtained
are very promising.</p>
      <p>This paper is organized as follows: we describe our experiments in Section 2 and
present conclusions in Section 3.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Experiments Description</title>
      <p>The first step in our investigation consisted in extracting all the distinct user
queries from the TEL logs, along with their IDs and the associated UI language; this
process resulted in about 450,000 user queries.</p>
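      <p>This extraction step can be illustrated as follows; the tab-separated record layout
used here is a deliberately simplified, hypothetical rendering of the TEL log entries,
whose actual format is richer.</p>
      <preformat>
```python
import csv
import io

# Hypothetical, simplified log records: query id, UI language, query string.
# The actual TEL log format is richer than this sketch.
raw_log = (
    "1\ten\tbiblioteca nazionale\n"
    "2\tfr\thistoire de france\n"
    "3\ten\tbiblioteca nazionale\n"
)

distinct = {}
for qid, ui_lang, query in csv.reader(io.StringIO(raw_log), delimiter="\t"):
    # Keep one record per distinct query string.
    distinct.setdefault(query, (qid, ui_lang))

print(len(distinct))  # -> 2 distinct queries out of 3 log lines
```
      </preformat>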
      <p>We then applied different language identification strategies to this list of search
queries in order to evaluate their performance when applied individually or in
combination. Each software module implementing a specific strategy returned as
output a list of languages, each associated with a guess confidence value in the range
[0..1]. In particular we experimented with the following strategies:</p>
      <list list-type="bullet">
        <list-item>
          <p>Pure Corpus Based: languages are guessed by comparing the frequencies of
the query terms within language specific corpora. The guess confidence value
consists in the normalized sum of term frequencies.</p>
        </list-item>
        <list-item>
          <p>Pure Character Model Based: languages are evaluated by comparing the input
text against character models trained on textual contents from language specific
corpora. The guess confidence represents the distance of the input text from a
specific language model.</p>
        </list-item>
        <list-item>
          <p>Mixed Approach: combines the two previous strategies with even weights
(0.5 Corpus Based, 0.5 Character Model Based).</p>
        </list-item>
        <list-item>
          <p>Mixed Approach with A Priori Hypothesis: introduces into the previous
strategy a default guess, here represented by the UI language. In different
application scenarios it could be the default language of the collection, the
language retrieved from the user profile, etc. The weighting scheme used for
this combined strategy is 0.4 Corpus Based, 0.4 Character Model Based, 0.2 A
Priori Hypothesis.</p>
        </list-item>
        <list-item>
          <p>Mixed Approach without NE: investigates the effect on language
identification performance of removing Named Entities (when the search query is not
purely constituted of NEs). Since a real NE recognizer module was
unavailable for our experiments, we emulated its presence by exploiting the
specific query syntax of TEL and removed the query terms pertaining to the
creator by means of the search field prefix “CREATOR ALL”.</p>
        </list-item>
      </list>
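      <p>The weighted combination underlying the mixed approaches can be sketched as a
per-language weighted sum of the confidence maps returned by each module; the
strategy names, data layout and confidence scores below are illustrative only.</p>
      <preformat>
```python
def combine_guesses(guesses, weights):
    # Per-language weighted sum of the [0..1] confidence maps
    # returned by each language identification module.
    combined = {}
    for strategy, lang_scores in guesses.items():
        for lang, score in lang_scores.items():
            combined[lang] = combined.get(lang, 0.0) + weights[strategy] * score
    return max(combined, key=combined.get)

# Illustrative confidences; the a priori hypothesis backs the UI language.
guesses = {
    "corpus":     {"en": 0.7, "fr": 0.3},
    "char_model": {"en": 0.6, "fr": 0.4},
    "a_priori":   {"fr": 1.0},
}
weights = {"corpus": 0.4, "char_model": 0.4, "a_priori": 0.2}
print(combine_guesses(guesses, weights))  # -> en (0.52 vs 0.48)
```
      </preformat>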
      <p>Since no official evaluation of the task is available, we manually evaluated a sample of
100 queries; the results obtained are presented in Table 1. All the search queries
containing only Named Entities have been considered as expressed in the language of
origin of the referenced Named Entity (e.g. Hemingway → 'en').</p>
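      <p>The emulation of Named Entity removal through the TEL query syntax can be
sketched as below; the field pattern is a simplification of the real TEL syntax and the
sample queries are our own.</p>
      <preformat>
```python
import re

def strip_creator(query):
    # Drop the creator field (our NE-removal emulation); the pattern is a
    # simplification of the real TEL field syntax.
    cleaned = re.sub(r'creator\s+all\s+"[^"]*"', " ", query, flags=re.I)
    cleaned = " ".join(cleaned.split())
    # Queries purely constituted of NEs are left untouched.
    return cleaned if cleaned else query

print(strip_creator('creator all "hemingway" fiesta'))  # -> fiesta
print(strip_creator('creator all "hemingway"'))         # unchanged
```
      </preformat>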
      <p>The experimental evidence shows that the most significant contribution to
language identification came from the Corpus Based strategy, although the
contribution of the Character Model Based approach can increase the overall
performance.
Table 2 instead presents the statistical correlation between the language used in the search
query and the language used in the User Interface; the evidence shows
that the information on the UI language (here used as an a priori
hypothesis) is not significant for languages different from the default one (here
'en'), and is therefore not relevant information for increasing the
performance of corpus based and character model based language guessing strategies.</p>
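      <p>The correlation measure behind Table 2 can be sketched as the agreement rate
between the identified query language and the UI language, split by English vs.
non-English UI; the language pairs below are made-up examples, not the actual
TEL log data.</p>
      <preformat>
```python
from collections import Counter

# Hypothetical (query_language, ui_language) pairs; not the actual TEL data.
pairs = [("en", "en"), ("fr", "en"), ("fr", "fr"), ("de", "en"), ("en", "en")]

agree, total = Counter(), Counter()
for q_lang, ui_lang in pairs:
    group = "English" if ui_lang == "en" else "not English"
    total[group] += 1
    agree[group] += (q_lang == ui_lang)  # True counts as 1

for group in total:
    print(group, agree[group] / total[group])
```
      </preformat>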
      <p>[Table 2: correlation between the query language and the UI language, broken down
by English vs. not English UI; only the row labels survive in the source.]</p>
    </sec>
    <sec id="sec-3">
      <title>3 Conclusion</title>
      <p>The preliminary results are quite encouraging, and in the future we plan to extend this
research in order to include a full-fledged Named Entity recognizer module.</p>
      <p>Acknowledgments. This work has been supported and funded by the
EuropeanaConnect EU project.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. Europeana project http://version1.europeana.eu/web/europeana-project/</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. CACAO project: http://www.cacaoproject.eu/</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>3. MICHAEL project: http://www.michael-culture.eu/</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>W. B.</given-names>
            <surname>Cavnar</surname>
          </string-name>
          and
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Trenkle</surname>
          </string-name>
          ,
          <article-title>“N-Gram-Based Text Categorization”</article-title>
          ,
          <source>Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval</source>
          <year>1994</year>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>B.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cha</surname>
          </string-name>
          ,
          <article-title>"Language Identification from Text Using N-gram Based Cumulative Frequency Addition"</article-title>
          ,
          <source>Proceedings of CSIS</source>
          <year>2004</year>
          , Pace University, May 7th,
          <year>2004</year>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>H.</given-names>
            <surname>Ceylan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <article-title>"Language Identification of Search Engine Queries"</article-title>
          ,
          <source>Proceedings of the 47th Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP</source>
          , pages
          <fpage>1066</fpage>
          -
          <lpage>1074</lpage>
          , Suntec, Singapore, 2-7 August
          <year>2009</year>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>