<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Cross-Lingual Information Retrieval System for Indian Languages</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jagadeesh Jagarlamudi</string-name>
        </contrib>
        <aff>INDIA</aff>
      </contrib-group>
      <abstract>
        <p>This paper describes our first participation in the Indian language sub-task of the main Adhoc monolingual and bilingual track in the CLEF competition. In this track, the task is to retrieve relevant documents from an English corpus in response to a query expressed in different Indian languages including Hindi, Tamil, Telugu, Bengali and Marathi. Groups participating in this track are required to submit an English to English monolingual run and a Hindi to English bilingual run, with optional runs in the rest of the languages. We submitted a monolingual English run and a Hindi to English cross-lingual run. We used a word alignment table, learnt by a Statistical Machine Translation (SMT) system trained on aligned parallel sentences, to map a query in the source language into an equivalent query in the language of the target document collection. The relevant documents are then retrieved using a Language Modeling based retrieval algorithm. On the CLEF 2007 data set, our official cross-lingual performance was 54.4% of the monolingual performance, and in post-submission experiments we found that it can be improved significantly, up to 73.4%.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The rapidly changing demographics of the internet population [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and the plethora of multilingual
content on the web [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] have attracted the attention of the Information Retrieval (IR) community to
developing methodologies for cross-lingual information access. Over the past decade [
        <xref ref-type="bibr" rid="ref1 ref11 ref14 ref6">1, 6, 11, 14</xref>
        ],
researchers have been looking at ways to retrieve documents in one language in response to a query in
another language. This fundamentally assumes that users can read and understand documents
written in a foreign language but are unable to express their information need in that language. There
are arguments against this assumption as well. For example, [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] argues that it is unlikely that
the information in another language will be useful unless users are already fluent in that language.
However, we argue that in specific cases such methodologies could still be valid. For example,
in India students learn more than one language from their childhood, and more than 30% of the
population can read and understand Hindi apart from their native language [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. This situation
makes systems that can retrieve relevant documents in languages
different from the language in which the information need is expressed particularly useful.
      </p>
      <p>
        Lack of resources is still a major reason for the relatively small number of efforts in the cross-lingual
setting in the Indian subcontinent. Research communities working on Indian languages, especially on
Machine Translation (MT) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], have built some necessary resources, such as morphological analyzers
and bilingual dictionaries, for some languages. Even though these resources were built mainly for
MT, they can still be used as a good starting point to build a Cross-Lingual Information Retrieval
(CLIR) system, as we demonstrate in this paper. More specifically, in this paper we describe
our first attempt at building a CLIR system using the bilingual statistical dictionary that was
learnt automatically during the training phase of an SMT [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] system.
      </p>
      <p>The rest of the paper is organized as follows. We first define the problem in section 2,
followed by the presentation of our approach in section 3. In sections 4.1 &amp; 4.2 we describe
the data set along with the resources used, and in section 4.3 we present the performance of our system
in the CLEF competition. Section 4 also includes some analysis of the results. Section 5 presents our
conclusion and identifies our plans for future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Problem Statement</title>
      <p>The Cross Language Evaluation Forum (CLEF) aims at promoting research in the design of
multilingual, multi-modal retrieval systems by providing an opportunity for the research communities
working in different languages to collaborate and share their experiences. Each year it organizes a
series of evaluation tracks to test different aspects of cross-language information retrieval system
development.</p>
      <p>We participated in the CLEF competition, specifically in the Indian Language sub-task
of the main CLEF 2007 Ad-hoc monolingual and bilingual track. This track tests the performance
of systems in retrieving the relevant documents in response to a query that is either in the same language as the
document set or in a different one. In the Indian language track, documents are provided in
English and queries are specified in different languages including Hindi, Telugu, Bengali, Marathi
and Tamil. The system has to retrieve 1000 relevant documents in response to a query in any of
the above-mentioned languages. All the systems participating in this track are required to submit
an English to English monolingual run and a Hindi to English bilingual run. Runs in the rest of the
languages are optional.</p>
    </sec>
    <sec id="sec-3">
      <title>Approach</title>
      <p>
        Converting the information expressed in different languages to a common representation is
inherent to cross-lingual applications that aim to bridge the language barrier. In CLIR, either the query or the
document or both need to be mapped into the common representation to retrieve the relevant
documents. Translating all documents into the query language is less desirable due to the enormous
resource requirements. Usually the query is translated into the language of the target collection
of documents. Typically three types of resources are exploited for translating the queries:
bilingual machine readable dictionaries, parallel texts and machine translation systems. MT systems
typically produce one candidate translation, so some potential information which could be of
use to the IR system is lost. Researchers [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] have also explored considering more than one possible
translation to avoid the loss of useful information. Another difficulty in using an MT system
comes from the fact that most search queries are very short and lack the syntactic
information required for translation. Hence most approaches use bilingual dictionaries.
      </p>
      <p>
        In our work, we have used statistically learnt Hindi to English word alignments that were
obtained during the training phase of machine translation. The query in Hindi is translated
into English using word by word translation. For a given Hindi word, all English words which
have a translation probability above a certain threshold are selected as candidate translations. Only
the top n of these candidates are selected as final translations, to reduce ambiguity in the translation.
The aligned bilingual dictionary may not contain some of the query words, because either the
word does not occur in the parallel corpus or its translation probabilities are below the
threshold. In such cases, we attempt to transliterate the query word into English. We have used a
noisy channel model based transliteration algorithm [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The phonemic alignments between Hindi
characters and corresponding English characters are learnt automatically from a training corpus
of parallel names in Hindi and English. These alignments, along with their probabilities, are used
during Viterbi decoding to transliterate a new Hindi word into English. As reported, this system
outputs the correct English word within the top 10 results with an accuracy of about
30%, or about 80% when a fuzzy match is accepted. The target language vocabulary, along with approximate string matching algorithms like
Soundex and the edit distance measure [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], was used to separate the correct word from the incorrect
ones among the possible candidate transliterations.
      </p>
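      <p>The selection step described above can be sketched as follows. This is a minimal illustration rather than the system's actual code: the table layout (a nested dictionary), the function names, the tie-breaking rule and the optional transliterate callback are all assumptions made for the example, and the comparison treats a probability equal to the threshold as passing so that a threshold of 0 keeps every aligned word.</p>
      <preformat>
```python
from typing import Callable, Dict, List, Optional

# Hypothetical layout: source word -> {target word: translation probability}.
AlignmentTable = Dict[str, Dict[str, float]]


def translate_word(word: str, table: AlignmentTable,
                   threshold: float = 0.3, top_n: int = 2) -> List[str]:
    """Return at most top_n target words whose probability passes the threshold."""
    candidates = [(prob, target)
                  for target, prob in table.get(word, {}).items()
                  if prob >= threshold]
    # Highest probability first; ties broken alphabetically (an assumption).
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    return [target for _, target in candidates[:top_n]]


def translate_query(words: List[str], table: AlignmentTable,
                    threshold: float = 0.3, top_n: int = 2,
                    transliterate: Optional[Callable[[str], Optional[str]]] = None
                    ) -> List[str]:
    """Translate word by word; fall back to transliteration when nothing is selected."""
    translated: List[str] = []
    for word in words:
        targets = translate_word(word, table, threshold, top_n)
        if not targets and transliterate is not None:
            # Hypothetical hook standing in for the transliteration component.
            fallback = transliterate(word)
            targets = [fallback] if fallback else []
        translated.extend(targets)
    return translated
```
      </preformat>
      <p>With a toy table such as {"pani": {"water": 0.6, "aqua": 0.35, "rain": 0.05}}, a threshold of 0.3 and n = 2 keep "water" and "aqua", while a threshold of 0 and n = 4 keep all three words. The romanized entries are placeholders, not data from the alignment table used in the experiments.</p>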
      <p>
        Once the query is translated into the language of the document collection, standard IR
algorithms can be used to retrieve the relevant documents. We have used Language Modeling [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]
in our experiments. In the Language Modeling framework, both query formulation and retrieval of
relevant documents are treated as simple probability mechanisms. Essentially, each document
is treated as a language sample and the query as a sample drawn from this document. The likelihood of
generating a query from the document, p(q|d), is associated with the relevance of the document
to the query. A document which is more likely to generate the user query is considered to be more
relevant. Since a document, considered as a bag of words, is very small compared to the whole
vocabulary, the resulting document models are usually very sparse. Hence smoothing
of the document distributions is crucial. Many techniques have been explored; because
of its simplicity and effectiveness, we chose the relative frequency of a term in the entire collection
to smooth the document distributions.
      </p>
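      <p>A minimal sketch of query-likelihood scoring with collection-based smoothing is given below. The text states only that the relative frequency of a term in the collection is used for smoothing; the linear interpolation form, the weight lam, and the function name are assumptions chosen for illustration.</p>
      <preformat>
```python
import math
from collections import Counter
from typing import Dict, List


def score_documents(query: List[str], docs: Dict[str, List[str]],
                    lam: float = 0.5) -> Dict[str, float]:
    """Return log p(q|d) for each document, smoothed with the collection model.

    p(t|d) = (1 - lam) * tf(t, d) / |d| + lam * cf(t) / |C|
    The interpolation weight lam is an assumed parameter.
    """
    collection = Counter()
    for terms in docs.values():
        collection.update(terms)
    total = sum(collection.values())
    if total == 0:
        return {doc_id: 0.0 for doc_id in docs}

    scores: Dict[str, float] = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)
        length = len(terms)
        log_p = 0.0
        for term in query:
            p_coll = collection[term] / total
            p_doc = tf[term] / length if length else 0.0
            p = (1.0 - lam) * p_doc + lam * p_coll
            if p == 0.0:
                # Term absent from the whole collection: it cannot change the
                # ranking, so it is skipped (an assumption of this sketch).
                continue
            log_p += math.log(p)
        scores[doc_id] = log_p
    return scores
```
      </preformat>
      <p>Documents are then ranked by decreasing score. Because the collection term is non-zero for any word that occurs somewhere in the collection, a document that lacks a query word is penalized rather than given zero probability.</p>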
      <p>
        In a nutshell, structural query translation [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] is used to translate the query into English. The
relevant documents are then retrieved using a Language Modeling based retrieval algorithm.
The following section describes how we applied this approach in our CLEF 2007 participation, along with some further
experiments to calibrate the quality of our system.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <sec id="sec-4-1">
        <title>CLEF Data set for Adhoc track</title>
        <p>
          In both the Adhoc bilingual 'X' to English track and the Indian language sub-track, the target
document collection consisted of 135,153 English news articles published in the Los Angeles Times in
the year 2002. During the indexing of this document collection, only the text portion (embedded in
&lt;LD&gt; and &lt;TE&gt; tags) was considered. Note that the results reported in this paper do not
make use of other potentially useful information present in the document, such as the document
heading (within the &lt;DH&gt; tag) and the photo caption (in the &lt;CP&gt; tag), even though we believe that
including such information would improve the performance of the system. The resulting 85,994
non-empty documents were then processed to remove stop words, and the remaining words
were reduced to their base form using the Porter stemmer [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ].
        </p>
        <p>Fifty topics, originally created in English and later translated into other languages, were distributed
among the participants. For processing Hindi queries, a list of stop words was formed based on
word frequency in the monolingual corpus formed by the Hindi side of the
parallel data. This list was then used to remove less informative words occurring in the topic
statements. The processed query was then translated into English using the word alignment table.</p>
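        <p>A frequency-based stop word list of this kind could be built roughly as follows. The text does not say how the cutoff was chosen, so the top_k parameter and the function names are assumptions made for the sake of the example.</p>
        <preformat>
```python
from collections import Counter
from typing import Iterable, List, Set


def build_stop_words(sentences: Iterable[List[str]], top_k: int = 100) -> Set[str]:
    """Treat the top_k most frequent corpus words as stop words (assumed cutoff)."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, _ in counts.most_common(top_k)}


def remove_stop_words(query: List[str], stop_words: Set[str]) -> List[str]:
    """Drop stop words from a tokenized query, keeping the original order."""
    return [word for word in query if word not in stop_words]
```
        </preformat>
        <p>For instance, if the tokenized corpus is [["ka", "ghar"], ["ka", "pani"], ["ka", "ghar", "shahar"]] and top_k is 1, the only stop word is "ka", and the query ["ghar", "ka", "pani"] is reduced to ["ghar", "pani"]. The romanized tokens are illustrative placeholders.</p>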
      </sec>
      <sec id="sec-4-2">
        <title>Word Alignment Table as Bilingual Dictionary</title>
        <p>
          We have used the word alignment table that was learnt by the SMT [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] system trained on 100K
Hindi to English parallel sentences to translate Hindi queries. Since these alignments were learnt
primarily for machine translation, they include words that occur in their
inflectional forms. For this reason we have not converted the query words into their base form
during the translation. Table 1 shows statistics about the coverage of the alignment table
at different levels of threshold on the translation probability (column 1). Note that
a threshold value of 0 corresponds to having no threshold at all. Columns 2 and 3 indicate the
coverage of the dictionary in terms of source and target language words. The last column denotes
the average number of English translations for a Hindi word. As expected, the coverage of the dictionary decreases as
the threshold increases. It is also worth noting that, as the
threshold increases, the average number of translations per source word decreases, indicating that
target language words which are related to the source word but not synonymous are being filtered out.
        </p>
        <p>Table 1: Coverage of the word alignment table at different thresholds on the translation probability. Columns: Threshold (first rows: 0, 0.1), Hindi words, English words, Translations per word.</p>
        <sec id="sec-4-2-3">
          <title>Results</title>
          <p>Each of the participating systems was required to submit 1000 relevant documents for each topic.
For each query, a pool of candidate relevant documents is created by combining the documents
submitted by all systems. From this pool of candidate documents, assessors separate the
actually relevant documents from the non-relevant ones. These relevance judgements are then used to
automatically evaluate the quality of cross-lingual retrieval of the participating CLIR systems.</p>
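          <p>The pooling step can be pictured with the short sketch below. It assumes each run is stored as a mapping from topic to a ranked list of document identifiers, and it adds a depth parameter that limits how many documents each run contributes; neither the data layout nor the depth is specified in the text, so both are assumptions.</p>
          <preformat>
```python
from typing import Dict, List, Set


def build_pool(runs: Dict[str, Dict[str, List[str]]], depth: int) -> Dict[str, Set[str]]:
    """Merge the top `depth` documents of every run into one pool per topic.

    runs maps a system name to {topic id: ranked list of document ids}.
    """
    pool: Dict[str, Set[str]] = {}
    for rankings_by_topic in runs.values():
        for topic, ranking in rankings_by_topic.items():
            pool.setdefault(topic, set()).update(ranking[:depth])
    return pool
```
          </preformat>
          <p>Assessors would then judge only the documents in each topic's pool, and the resulting judgements serve as the relevance set for evaluation.</p>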
          <p>In this section we discuss the results of our monolingual English run and Hindi to English
bilingual run. We participated in only one Indian language, Hindi, though
the data was available in 5 Indian languages. For our official submission, with the aim of reducing
noise in the translated query, we used a relatively high threshold of 0.3 for the translation
probability. To avoid ambiguity when there are many possible English translations for a given
Hindi word, we allowed only the 2 best translations according to the statistical alignments
learnt by SMT. Table 2 shows the official results of our submission. We submitted different
runs using title and description (td) and title, description and narration (tdn) as the query. Narration
seems to improve the cross-lingual retrieval performance, in terms of Mean Average Precision
(MAP), more than it does in the monolingual setting. In the rest of the experiments it is assumed
that narration is included as a part of the query unless explicitly mentioned. Figure 1 shows a
comparison of cross-lingual and monolingual submissions in terms of precision at different levels
of interpolated recall.</p>
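          <p>For readers unfamiliar with the measure, the sketch below computes non-interpolated average precision per topic and averages it over topics to obtain MAP. It is an illustrative implementation under the usual definition, not the official evaluation tool, and it assumes that a ranked list contains no duplicate document identifiers. The helper relative_performance is a hypothetical name for the ratio used throughout this paper.</p>
          <preformat>
```python
from typing import Dict, List, Set


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Non-interpolated average precision of one ranked list."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant document
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(rankings: Dict[str, List[str]],
                           qrels: Dict[str, Set[str]]) -> float:
    """Average the per-topic average precision over all judged topics."""
    if not qrels:
        return 0.0
    total = sum(average_precision(rankings.get(topic, []), relevant)
                for topic, relevant in qrels.items())
    return total / len(qrels)


def relative_performance(cross_lingual_map: float, monolingual_map: float) -> float:
    """Cross-lingual MAP expressed as a percentage of monolingual MAP."""
    return 100.0 * cross_lingual_map / monolingual_map
```
          </preformat>
          <p>Applied to the tdn MAP values of the official runs, relative_performance(0.2156, 0.3964) gives about 54.4, which matches the official cross-lingual figure of 54.4% of monolingual performance.</p>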
        </sec>
        <sec id="sec-4-2-4">
          <title>Table 2, monolingual runs. MAP: LM(td) 0.3916, LM(tdn) 0.3964. Second reported measure: LM(td) 0.456, LM(tdn) 0.454</title>
        </sec>
        <sec id="sec-4-2-5">
          <title>Table 2, cross-lingual runs. MAP: LM(td) 0.1994, LM(tdn) 0.2156. Second reported measure: LM(td) 0.216, LM(tdn) 0.294</title>
          <p>In a second set of experiments, we varied the threshold and the
number of translations kept above the threshold, and measured their effect on the MAP score. The results obtained
by the monolingual system and by the cross-lingual system with varying thresholds are plotted in Figure
2. The x-axis represents the number of top words considered when many of the target language
words have translation probabilities above the chosen threshold. Note that the y-axis represents only
a subset (0.15-0.45) of the entire possible range (0-1) in which a MAP score can lie. The rightmost
bar represents the monolingual performance of the system. The figure shows that performance
increases as the threshold decreases, but drops again when too many of the possible translations are included
(note the dip when 10 and 15 translations were considered), perhaps due to a shift in query focus caused by the
inclusion of many less synonymous target language words. For the CLEF data set, we found that
considering the 4 most probable translations without any threshold (leftmost bar) on the translation
probability gave us the best results (73.4% of monolingual IR performance).</p>
          <p>
            As the threshold decreases, two things can happen: words which were not translated
previously can get translated, or new target language words whose translation probability was
below the threshold can now become candidates for the translated query. Table 3 shows the
fraction of query words that were translated at each of these thresholds. Table 3
and Figure 2 show that as the threshold on the translation probability decreases, the fraction of query
words getting translated increases, resulting in an overall increase in system performance. However, the
performance gain from a threshold of 0.1 to no threshold is large compared to the small
fraction of new words that got translated, which suggests that even noisy translations, though they
are not true synonyms, might help CLIR. This is perhaps due to the fact that, for the purposes of
IR, having a list of associated words may be sufficient to identify the context of the query [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ].
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion and Future Work</title>
      <p>This paper describes our first attempt at building a CLIR system with the help of a word alignment
table learned from a parallel corpus, primarily for statistical machine translation. We present
our experience and the results of our participation in the Indian language sub-task of the Adhoc
monolingual and bilingual track of CLEF 2007. In post-submission experiments we found that, on the
CLEF data set, a Hindi to English cross-lingual information retrieval system using a simple word
by word translation of the query with the help of a word alignment table was able to achieve approximately
73% of the performance of the monolingual system. Empirically, we found that considering the 4 most
probable word translations with no threshold on the translation probability gave the best results.</p>
      <p>Since the quality of the dictionary will affect the performance of the system, in future we would
like to explore the effect of the size and quality of the parallel data on the word alignments, and
subsequently on CLIR performance. We would also like to compare the use of statistically learned
word alignments with a hand-crafted dictionary of similar size for the CLIR application.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Lisa</given-names>
            <surname>Ballesteros</surname>
          </string-name>
          and
          <string-name>
            <given-names>W. Bruce</given-names>
            <surname>Croft</surname>
          </string-name>
          .
          <article-title>Dictionary methods for cross-lingual information retrieval</article-title>
          .
          <source>In DEXA '96: Proceedings of the 7th International Conference on Database and Expert Systems Applications</source>
          , pages
          <volume>791</volume>
          –
          <fpage>801</fpage>
          , London, UK,
          <year>1996</year>
          . Springer-Verlag.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Akshar</given-names>
            <surname>Bharati</surname>
          </string-name>
          , Rajeev Sangal,
          <string-name>
            <surname>Dipti M Sharma</surname>
            ,
            <given-names>and Amba P Kulakarni.</given-names>
          </string-name>
          <article-title>Machine Translation activities in India: A survey</article-title>
          .
          <source>In Workshop on survey on Research and Development of Machine Translation in Asian Countries</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bhogal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Macfarlane</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Smith.</surname>
          </string-name>
          <article-title>A review of ontology based query expansion</article-title>
          .
          <source>Inf</source>
          . Process. Manage.,
          <volume>43</volume>
          (
          <issue>4</issue>
          ):
          <volume>866</volume>
          –
          <fpage>886</fpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Grey</surname>
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Burkhart</surname>
          </string-name>
          , Seymour E. Goodman, Arun Mehta, and Larry Press.
          <article-title>The internet in india: better times ahead? Commun</article-title>
          . ACM,
          <volume>41</volume>
          (
          <issue>11</issue>
          ):
          <volume>21</volume>
          –
          <fpage>26</fpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] GlobalReach. http://www.global-reach.biz/globstats/evol.html.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>David</surname>
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Hull</surname>
            and
            <given-names>Gregory</given-names>
          </string-name>
          <string-name>
            <surname>Grefenstette</surname>
          </string-name>
          .
          <article-title>Querying across languages: a dictionary-based approach to multilingual information retrieval</article-title>
          .
          <source>In SIGIR '96: Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval</source>
          , pages
          <volume>49</volume>
          –
          <fpage>57</fpage>
          , New York, NY, USA,
          <year>1996</year>
          . ACM Press.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Internet</surname>
          </string-name>
          . http://www.internetworldstats.com.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumaran</surname>
          </string-name>
          and
          <string-name>
            <given-names>Tobias</given-names>
            <surname>Kellner</surname>
          </string-name>
          .
          <article-title>A generic framework for machine transliteration</article-title>
          .
          <source>In SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval</source>
          , pages
          <volume>721</volume>
          –
          <fpage>722</fpage>
          , New York, NY, USA,
          <year>2007</year>
          . ACM Press.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Kui</given-names>
            <surname>Lam</surname>
          </string-name>
          <string-name>
            <surname>Kwok</surname>
          </string-name>
          , Sora Choi, and
          <string-name>
            <given-names>Norbert</given-names>
            <surname>Dinstl</surname>
          </string-name>
          .
          <article-title>Rich results from poor resources: Ntcir4 monolingual and cross-lingual retrieval of korean texts using chinese and english</article-title>
          .
          <source>ACM Transactions on Asian Language Information Processing (TALIP)</source>
          ,
          <volume>4</volume>
          (
          <issue>2</issue>
          ):
          <volume>136</volume>
          –
          <fpage>162</fpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Vladimir</surname>
            <given-names>I. Levenshtein.</given-names>
          </string-name>
          <article-title>Binary codes capable of correcting deletions, insertions, and reversals</article-title>
          .
          <source>In English translation in Soviet Physics Doklady</source>
          , pages
          <volume>707</volume>
          –
          <fpage>710</fpage>
          ,
          <year>1966</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Paul</given-names>
            <surname>McNamee</surname>
          </string-name>
          and James Mayfield.
          <article-title>Comparing cross-language query expansion techniques by degrading translation resources</article-title>
          .
          <source>In SIGIR '02: Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval</source>
          , pages
          <volume>159</volume>
          –
          <fpage>166</fpage>
          , New York, NY, USA,
          <year>2002</year>
          . ACM Press.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Isabelle</given-names>
            <surname>Moulinier</surname>
          </string-name>
          and
          <string-name>
            <given-names>Frank</given-names>
            <surname>Schilder</surname>
          </string-name>
          .
          <article-title>What is the future of multi-lingual information access?</article-title>
          <source>In SIGIR 2006 Workshop on Multilingual Information Access</source>
          <year>2006</year>
          , Seattle, Washington, USA,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Franz</given-names>
            <surname>Josef</surname>
          </string-name>
          Och and
          <string-name>
            <given-names>Hermann</given-names>
            <surname>Ney</surname>
          </string-name>
          .
          <article-title>A systematic comparison of various statistical alignment models</article-title>
          .
          <source>Comput. Linguist.</source>
          ,
          <volume>29</volume>
          (
          <issue>1</issue>
          ):
          <volume>19</volume>
          –
          <fpage>51</fpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Ari</surname>
            <given-names>Pirkola</given-names>
          </string-name>
          , Turid Hedlund, Heikki Keskustalo, and Kalervo Järvelin.
          <article-title>Dictionary-based crosslanguage information retrieval: Problems, methods</article-title>
          , and research findings. Inf. Retr.,
          <volume>4</volume>
          (
          <issue>3</issue>
          - 4):
          <volume>209</volume>
          –
          <fpage>230</fpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Jay</surname>
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Ponte</surname>
            and
            <given-names>W. Bruce</given-names>
          </string-name>
          <string-name>
            <surname>Croft</surname>
          </string-name>
          .
          <article-title>A Language Modeling Approach to Information Retrieval</article-title>
          .
          <source>In Research and Development in Information Retrieval</source>
          , pages
          <volume>275</volume>
          {
          <fpage>281</fpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>M. F.</given-names>
            <surname>Porter</surname>
          </string-name>
          .
          <article-title>An algorithm for suffix stripping</article-title>
          .
          <source>pages 313–316</source>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>