<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Dynamic Lexica for Query Translation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jussi Karlgren</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Magnus Sahlgren</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Timo Järvinen</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rickard Cöster</string-name>
          <aff>SICS</aff>
        </contrib>
      </contrib-group>
      <abstract>
        <p>Multilingual information access applications, which are driven by modeling lexical correspondences between different human languages, are highly reliant on lexical resources: the quality of the lexicon is the main bottleneck for quality of performance and coverage of service. While automatic text and speech translation have been the main multilingual tasks for most of the history of computational linguistics, the recent awareness within the information access field of the multilingual reality of information sources has made the availability of lexica all the more critical. Machine-readable lexica in general, and machine-readable multilingual lexica in particular, are difficult to come by. Manual approaches to lexicon construction promise high-quality results, but the resulting lexica are time- and labour-consuming to build, costly and complex to maintain, and inherently static: tuning an existing lexicon to a new domain is a complex task that risks compromising existing information and corrupting usefulness for previous application areas. As a specific case, human-readable dictionaries, even if digitized and made available to automatic processing, are not designed for automatic processing. Dictionaries originally designed for human perusal leave much information unsaid, and belabor fine points that may not be of immediate use for the computational task at hand. Automatic lexicon acquisition techniques promise to provide fast, cheap and dynamic alternatives to manual approaches, but have yet to prove their viability. In addition, they typically require sizeable computational resources. This experiment uses a simple and effective approach that applies distributional statistics over parallelized bilingual corpora (text collections of material translated from one language to another) to automatic multilingual lexicon acquisition and query translation.
The approach is efficient, fast and scalable, and is easily adapted to new domains and to new languages. We evaluate the proposed methodology by first extracting a bilingual lexicon from aligned Swedish-French data, translating CLEF topics from Swedish to French, and then retrieving documents from the French section of the CLEF document database using the resulting French queries and a monolingual retrieval system. The results clearly demonstrate the viability of the approach.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        Our approach, Random Indexing, differs from most other distributionally based algorithms
for bilingual lexicon acquisition in that it takes the context (an utterance,
a window of adjacency, or, when necessary, an entire document) as the primary unit. Rather than
building a huge vector space of contexts by lexical item types, we build a vector space which is
large enough to accommodate the occurrence information of tens of thousands of lexical item types
in millions of contexts, yet compact enough to be tractable; constant in size in the face of ever-growing
data sizes; and designed to model association between distributionally similar lexical items without
compilation or explicit dimensionality reduction. Our approach is described in detail in several
publications [
        <xref ref-type="bibr" rid="ref2 ref4 ref7 ref8">2, 4, 7, 8</xref>
        ]; this paper describes experiments made on this year’s CLEF data.
      </p>
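      <p>The idea can be illustrated with a minimal Python sketch; it is not the implementation used in the experiments. It assumes that each aligned text pair counts as one context, that each context is assigned a sparse random index vector of +1 and -1 entries, and that the context vector of a word is the sum of the index vectors of the contexts in which it occurs. The dimensionality (300), the number of non-zero entries (8), the cosine-based lookup, the function names and the toy word pairs are all assumptions made for illustration.</p>
      <preformat>
```python
import math
import random
from collections import defaultdict

DIM = 300      # assumed vector dimensionality (illustrative, not from the paper)
NONZERO = 8    # assumed number of non-zero entries per index vector


def index_vector(rng, dim=DIM, nonzero=NONZERO):
    """Sparse random index vector as {position: sign}, half +1 and half -1."""
    positions = rng.sample(range(dim), nonzero)
    return {pos: (1 if i % 2 == 0 else -1) for i, pos in enumerate(positions)}


def build_context_vectors(aligned_pairs, seed=0):
    """Return (source_vectors, target_vectors) for a list of aligned token lists.

    Each (source_tokens, target_tokens) pair is treated as a single context.
    """
    rng = random.Random(seed)
    src = defaultdict(lambda: [0.0] * DIM)
    tgt = defaultdict(lambda: [0.0] * DIM)
    for src_tokens, tgt_tokens in aligned_pairs:
        ctx = index_vector(rng)  # one index vector shared by both languages
        for vectors, tokens in ((src, src_tokens), (tgt, tgt_tokens)):
            for word in set(tokens):  # count each word once per context
                for pos, sign in ctx.items():
                    vectors[word][pos] += sign
    return dict(src), dict(tgt)


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def translate(word, src, tgt):
    """Return the target word whose context vector is closest to that of word."""
    if word not in src:
        return None
    return max(tgt, key=lambda cand: cosine(src[word], tgt[cand]))


# Hypothetical toy data: four aligned Swedish-French contexts.
pairs = [
    (["fisk", "kvot"], ["poisson", "quota"]),
    (["fisk", "hav"], ["poisson", "mer"]),
    (["budget", "kvot"], ["budget", "quota"]),
    (["budget", "hav"], ["budget", "mer"]),
]
source_vectors, target_vectors = build_context_vectors(pairs)
print(translate("fisk", source_vectors, target_vectors))
```
      </preformat>
      <p>Because both sides of an aligned pair add the same index vector, words that occur in the same contexts across the two languages accumulate similar context vectors, and the length of every vector stays fixed no matter how many contexts are processed.</p>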
    </sec>
    <sec id="sec-2">
      <title>Experiment</title>
      <p>
        We use the document-aligned Europarl corpus [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which consists of parallel texts from the
proceedings of the European Parliament and is freely available in 11 European languages from
http://www.isi.edu/~koehn/europarl. From this multilingual corpus we extracted the
Swedish-French section, which we then lemmatized and normalized using the commercially available FDG
tools from Connexor. The resulting data consist of several tens of thousands of sentence-level
aligned document pairs. These data were used to extract a bilingual Swedish-French lexicon.
      </p>
      <p>The topic texts were lemmatized and normalized using the same morphological analysis tools
from Connexor as were used for the Swedish corpus.</p>
      <p>The queries were translated word-by-word from Swedish to French using the extracted lexicon.</p>
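      <p>Word-by-word translation with an extracted lexicon can be sketched as follows. The dictionary layout (one French lemma per Swedish lemma), the toy entries, the function name <monospace>translate_query</monospace> and the choice to pass untranslated terms through unchanged are assumptions made for illustration; how terms missing from the lexicon were handled in the experiment is not specified here.</p>
      <preformat>
```python
def translate_query(terms, lexicon, keep_unknown=True):
    """Translate query terms one by one.

    Returns (translated_terms, missing_terms), where missing_terms lists
    the source terms that have no entry in the lexicon.
    """
    translated, missing = [], []
    for term in terms:
        target = lexicon.get(term.lower())
        if target is not None:
            translated.append(target)
        else:
            missing.append(term)
            if keep_unknown:
                # Assumption: keep the source term as-is, e.g. for names.
                translated.append(term)
    return translated, missing


# Toy lexicon with hypothetical entries (Swedish lemma to French lemma).
toy_lexicon = {"fisk": "poisson", "kvot": "quota"}
french_terms, untranslated = translate_query(["fisk", "kvot", "Island"], toy_lexicon)
```
      </preformat>
      <p>Returning the untranslated terms separately makes it easy to spot queries whose crucial terms, such as names, have no lexicon entry.</p>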
      <p>
        The text retrieval engine used for our experiments is a system being developed at SICS. The
system is described in more detail in our CLEF paper [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] from last year. The French target
collection was indexed by the system and the translated French queries were used to retrieve texts
from the French collection without manual intervention.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
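      <table-wrap>
        <table>
          <thead>
            <tr>
              <th>Measure</th>
              <th>Best</th>
              <th>On or above median</th>
              <th>Near but below median</th>
              <th>Worst</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>p@100</td>
              <td>12</td>
              <td>26</td>
              <td>9</td>
              <td>8</td>
            </tr>
            <tr>
              <td>p@1000</td>
              <td>26</td>
              <td>34</td>
              <td>4</td>
              <td>4</td>
            </tr>
            <tr>
              <td>ap</td>
              <td>2</td>
              <td>19</td>
              <td>10</td>
              <td>2</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>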
      <p>The results were reasonably good: for “precision at 1000 documents”, a recall-oriented
score, 34 of the fifty queries were on or above the median, of which 26 achieved the top score.
For the two other established scoring schemes (“average precision” and “precision at 100 documents”)
the results were slightly lower, but in each case the majority of queries were on, above or near
the median of submitted scores. A more precision-oriented evaluation scheme, where average precision
is calculated at 5 retrieved documents, gives a satisfying score of 30 per cent.</p>
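      <p>For reference, the kinds of score discussed above can be computed as in the following sketch, which uses the standard textbook definitions of precision at k and uninterpolated average precision. The function names and the toy ranking are hypothetical, and the official CLEF evaluation tooling may differ in detail.</p>
      <preformat>
```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the first k ranks that hold a relevant document.

    The divisor is always k, even if fewer than k documents were retrieved.
    """
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / k


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank holding a relevant document.

    Relevant documents that are never retrieved contribute zero.
    """
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


# Hypothetical ranking for one query; d8 is relevant but never retrieved.
ranking = ["d3", "d7", "d1", "d9", "d4"]
relevant_docs = {"d3", "d1", "d8"}
p_at_5 = precision_at_k(ranking, relevant_docs, 5)
ap = average_precision(ranking, relevant_docs)
```
      </preformat>
      <p>With a cutoff as deep as 1000 documents, precision at k rewards a system mainly for finding most of the relevant documents somewhere in the list, which is why it behaves as a recall-oriented score.</p>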
      <p>Closer analysis of the results is still in progress, but some of the failed queries
retrieved no documents at all. This is typically due to mistranslation or missing translation of
some crucial query term, most often a name.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgements</title>
      <p>The work reported here is partially funded by the European Commission under contract
IST2000-29452 (DUMAS), which is hereby gratefully acknowledged.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Brown</surname>
          </string-name>
          , S. Cocke,
          <string-name>
            <given-names>V.</given-names>
            <surname>Della Pietra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Della Pietra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Jelinek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mercer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Roossin</surname>
          </string-name>
          .
          <article-title>A statistical approach to language translation</article-title>
          .
          <source>In Proceedings of the 12th Annual Conference on Computational Linguistics (COLING 88)</source>
          .
          <publisher-name>International Committee on Computational Linguistics</publisher-name>
          ,
          <year>1988</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Kanerva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kristofersson</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Holst</surname>
          </string-name>
          .
          <article-title>Random indexing of text samples for latent semantic analysis</article-title>
          .
          <source>In Proceedings of the 22nd Annual Conference of the Cognitive Science Society</source>
          , page 1036.
          <publisher-name>Erlbaum</publisher-name>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Hans</given-names>
            <surname>Karlgren</surname>
          </string-name>
          .
          <article-title>Term-tuning, a method for the computer-aided revision of multi-lingual texts</article-title>
          .
          <source>International Forum for Information and Documentation</source>
          ,
          <volume>13</volume>
          (
          <issue>2</issue>
          ):
          <fpage>7</fpage>
          -
          <lpage>13</lpage>
          ,
          <year>1988</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Karlgren</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Sahlgren</surname>
          </string-name>
          .
          <article-title>From words to understanding</article-title>
          . In Y. Uesaka,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kanerva</surname>
          </string-name>
          , and H. Asoh, editors,
          <source>Foundations of Real-World Intelligence</source>
          , pages
          <fpage>294</fpage>
          -
          <lpage>308</lpage>
          . CSLI Publications,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.</given-names>
            <surname>Koehn</surname>
          </string-name>
          .
          <article-title>Europarl: A multilingual corpus for evaluation of machine translation</article-title>
          .
          <source>net resource</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Melamed</surname>
          </string-name>
          .
          <article-title>Models of translational equivalence among words</article-title>
          .
          <source>Computational Linguistics</source>
          ,
          <volume>26</volume>
          (
          <issue>2</issue>
          ):
          <fpage>221</fpage>
          -
          <lpage>249</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sahlgren</surname>
          </string-name>
          .
          <article-title>Automatic bilingual lexicon acquisition using random indexing of aligned bilingual data</article-title>
          .
          <source>In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sahlgren</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Karlgren</surname>
          </string-name>
          .
          <article-title>Automatic bilingual lexicon acquisition using random indexing of parallel corpora</article-title>
          .
          <source>Natural Language Engineering</source>
          , forthcoming.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Magnus</given-names>
            <surname>Sahlgren</surname>
          </string-name>
          , Jussi Karlgren, Rickard Cöster, and Timo Järvinen.
          <article-title>Automatic query expansion using random indexing</article-title>
          .
          <source>In Proceedings of CLEF 2002</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>