<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Abolfazl AleAhmad</string-name>
          <email>a.aleahmad@ece.ut.ac.ir</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ehsan Kamalloo</string-name>
          <email>e.kamalloo@ece.ut.ac.ir</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arash Zareh</string-name>
          <email>a.zareh@ece.ut.ac.ir</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Masoud Rahgozar</string-name>
          <email>rahgozar@ut.ac.ir</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <email>oroumchian@acm.org</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Database Research Group</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Electrical and Computer Engineering</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Department of Computer Science, University of Wollongong in Dubai</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2008</year>
      </pub-date>
      <abstract>
        <p>This paper describes our cross-language text retrieval (CLIR) experiments in the Persian ad hoc track at CLEF 2008. Two teams from the University of Tehran took part in the cross-language part of the track, using two different CLIR approaches: query translation and document translation. For query translation we estimated translation probabilities with a method named Combinatorial Translation Probability (CTP). For document translation we used the Shiraz machine translation system to translate the documents into English. We then built a hybrid CLIR system by score-based merging of the results of the two retrieval systems. In addition, we investigated N-grams and a light stemmer in our monolingual experiments.</p>
      </abstract>
      <kwd-group>
        <kwd>Persian English cross language</kwd>
        <kwd>Farsi bilingual text retrieval</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The Persian language is categorized as a branch of the Indo-European languages. It is the official language of Iran, Afghanistan and Tajikistan and is also spoken in some other countries in the Middle East. Morphological analysis of the language is relatively hard because of its grammatical rules. For example, the word “خبر” is an Arabic word that is used in Persian. This word has two plural forms in Persian, “اخبار” and “خبرها”: the first obeys Arabic grammatical rules, and the second is obtained by applying Persian rules.</p>
      <p>After the creation of 50 new bilingual topics and the standardization of the Hamshahri collection according to CLEF standards, we could investigate CLIR on Persian. Persian@CLEF 2008 is our first attempt to evaluate cross-language information retrieval on the language. Our aim is to investigate the two main approaches to cross-language text retrieval on Persian: query translation and document translation.</p>
      <p>We used the Hamshahri collection [7] to evaluate our retrieval methods. The documents of this collection are news articles from the Hamshahri newspaper, published from 1996 to 2002. The collection contains 160,000+ documents on a variety of subjects. Document size varies from short news items (under 1 KB) to rather long articles (e.g. 140 KB), with an average of 1.8 KB. We used Apache Lucene [8] and the Lemur toolkit [5] for indexing and retrieval on the collection.</p>
      <p>The remainder of this paper is organized as follows: section 2 introduces our monolingual experiments, section 3 discusses our query translation method and its results, section 4 presents the experimental results of document translation, and section 5 concludes the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Experiments on Monolingual Persian Text Retrieval</title>
      <p>We had no efficient morphological analyzer for Persian, so in our monolingual experiments we investigated alternative methods such as N-grams. We also used a stop word list in the monolingual part of our experiments to improve retrieval results.</p>
      <p>To create the stop word list, we manually inspected the most frequent words of the collection and extracted the actual stop words. We then added some other words from the Bijankhan Persian corpus [6] that were marked with tags such as preposition and conjunction. The final stop word list contains 796 items.</p>
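      <p>The first step of the stop word list construction, ranking words by collection frequency before manual inspection, could be sketched as follows. This is only an illustration under stated assumptions: the function name <monospace>stop_word_candidates</monospace>, the whitespace tokenisation and the toy English documents are hypothetical and not part of our system, and the manual filtering step is not modelled.</p>
      <preformat><![CDATA[
```python
from collections import Counter


def stop_word_candidates(documents, top_n):
    """Return the top_n most frequent tokens as candidates for manual review."""
    counts = Counter()
    for text in documents:
        # Assumption: plain whitespace tokenisation is good enough for ranking.
        counts.update(text.split())
    return [word for word, _ in counts.most_common(top_n)]


# Hypothetical toy documents (English placeholders, not Hamshahri data).
docs = ["the dam and the river", "the river and a bridge"]
candidates = stop_word_candidates(docs, 2)
```
]]></preformat>
      <p>The returned list is only a candidate list; deciding which frequent words are actual stop words remains a manual step.</p>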
      <p>In our monolingual experiments, we submitted the top 100 retrieved documents of six monolingual runs, which are summarized in table 1 and described below:</p>
      <p>Run #1: Vector space retrieval model using a light stemmer</p>
      <p>Run #2: Term-based vector space model retrieval</p>
      <p>Run #3: Using 3-grams with Language Modeling retrieval</p>
      <p>Run #4: Using 4-grams with Language Modeling retrieval</p>
      <p>Run #5: Using 5-grams with Language Modeling retrieval</p>
      <p>Run #6: Term-based Language Modeling retrieval</p>
      <p>In all of these runs we used only the title part of the 50 Persian topics that were made available at CLEF 2008. In the first run, we used a light Persian stemmer that works like the Porter algorithm, but it could not improve our results because the stemmer's algorithm is too simple. As an example, consider the word “فیلم”, which was a term in topic no. 559. This word is a noun that means ‘film’ in English, but our light stemmer treats the final letter ‘م’ of the word as a suffix and converts the word to ‘فیل’, which means ‘elephant’ in English.</p>
      <p>It is also worth mentioning that we do not cross word boundaries when building N-grams. For example, with our method the 4-grams of the word “ویمبلدون” are “ویمب + یمبل + مبلد + بلدو + لدون”.</p>
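      <p>The N-gram construction described above can be sketched as follows. This is a minimal illustration rather than the code used in our runs: the function names are hypothetical, whitespace tokenisation is assumed, and keeping words shorter than N as a single token is an assumption that the text does not specify.</p>
      <preformat><![CDATA[
```python
def word_ngrams(word, n):
    """Return the character n-grams of a single word, in reading order."""
    if len(word) <= n:
        # Assumption: words no longer than n are kept whole.
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def text_ngrams(text, n):
    """Split on whitespace first, so that no n-gram spans two words."""
    grams = []
    for word in text.split():
        grams.extend(word_ngrams(word, n))
    return grams


# The eight-letter word from the example above yields five 4-grams.
grams = word_ngrams("ویمبلدون", 4)
```
]]></preformat>
      <p>Because Python strings are sequences of Unicode code points in logical order, the same slicing works for Persian and Latin text alike.</p>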
    </sec>
    <sec id="sec-3">
      <title>3. CLIR by Query Translation</title>
      <p>This section describes our query translation experiments in the Persian ad hoc track of CLEF 2008. As the user's query is expressed in English and the documents of the collection are written in Persian, we used an English-Persian dictionary with 50,000+ entries to translate the query terms. In addition, we inserted some proper nouns into the dictionary. The query translation process is carried out as follows.</p>
      <p>Let M be the number of query terms; then we define the user's query as:</p>
      <p><disp-formula><tex-math>Q = \{q_i\} \quad (i = 1, \ldots, M)</tex-math></disp-formula></p>
      <p>We then look up each q<sub>i</sub> in the dictionary and, after finding the translations of q<sub>i</sub>, split them into their constituent tokens. We then eliminate the tokens that are included in our Persian stop word list.</p>
      <p>If we define T as the translation function that returns the set of Persian translations of a given English term q<sub>i</sub> as described above, then we have |T(q<sub>1</sub>)| × |T(q<sub>2</sub>)| × … × |T(q<sub>M</sub>)| different possible translations for the query Q and, as one can expect, |T(q<sub>i</sub>)| &gt; 1 for most query terms. So we need a retrieval model that enables us to take translation probabilities into consideration. This model is briefly introduced in section 3.1, and in section 3.2 we propose our method for translation probability calculation. Our query translation CLIR experimental results are then presented in section 3.3.</p>
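      <p>As an illustration of the dictionary lookup step and of how quickly the number of candidate query translations grows, consider the sketch below. The dictionary entries, the stop word and all names are hypothetical placeholders (Latin tokens stand in for Persian ones); the sketch only assumes that a dictionary maps each English term to a list of translation strings.</p>
      <preformat><![CDATA[
```python
from itertools import product
from math import prod


def candidate_translations(query_terms, dictionary, stop_words):
    """Return T(q_i) for every query term: tokenised, with stop words removed."""
    candidates = []
    for term in query_terms:
        tokens = []
        for translation in dictionary.get(term, []):
            for token in translation.split():
                if token not in stop_words and token not in tokens:
                    tokens.append(token)
        candidates.append(tokens)
    return candidates


# Hypothetical toy dictionary; "of" plays the role of a Persian stop word.
toy_dictionary = {"dam": ["d1", "d2 of d3"], "construction": ["c1", "c2"]}
toy_stop_words = {"of"}

T = candidate_translations(["dam", "construction"], toy_dictionary, toy_stop_words)
n_queries = prod(len(t) for t in T)   # |T(q_1)| x |T(q_2)|
all_queries = list(product(*T))       # every possible translated query
```
]]></preformat>
      <p>With three candidates for the first term and two for the second, six translated queries are possible, which is why translation probabilities are needed to rank them.</p>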
      <sec id="sec-3-1">
        <title>3.1. Probabilistic Structured Query Method</title>
        <p>
          Information retrieval systems rely on two basic statistics: the number of occurrences of a term in a document
(Term Frequency or TF) and the number of documents in which a term appears (Document Frequency or DF). In
case of bilingual text retrieval, when no translation probabilities are known, Pirkola’s “structured queries” have
been repeatedly shown to be among the most effective known approaches when several plausible translations are
known for some query terms [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
        <p>
          The basic idea behind Pirkola’s method is to treat multiple translation alternatives as if they were all instances of
the query term. Darwish and Oard later extended the model to handle the case in which translation probabilities
are available by weighting the TF and DF computations, an approach they called probabilistic structured queries
(PSQ) [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. They found that Pirkola’s structured queries yielded declining retrieval effectiveness with increasing
numbers of translation alternatives, but that the incorporation of translation probabilities in PSQ tended to
mitigate that effect. In our bilingual text retrieval experiments we use the PSQ method [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] in which TF and DF
are calculated as follows:
        </p>
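        <p><disp-formula><label>(1)</label><tex-math>\begin{gathered} TF(e, D_k) = \sum_{f_i} p(f_i \mid e) \times TF(f_i, D_k) \\ DF(e) = \sum_{f_i} p(f_i \mid e) \times DF(f_i) \end{gathered}</tex-math></disp-formula></p>
        <p>where p(f<sub>i</sub>|e) is the estimated probability that e would be properly translated to f<sub>i</sub>. Our method for calculating the translation probability is presented in the next section.</p>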
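      <p>The weighted TF and DF of equation (1) can be computed as in the following sketch. It is a plain Python illustration and not the Indri structured query we actually ran; the function names and the toy statistics are hypothetical, and translations that do not occur in a document or in the collection are assumed to contribute zero.</p>
      <preformat><![CDATA[
```python
def psq_tf(translation_probs, doc_term_freqs):
    """TF(e, D_k) = sum over f_i of p(f_i | e) * TF(f_i, D_k)."""
    return sum(p * doc_term_freqs.get(f, 0) for f, p in translation_probs.items())


def psq_df(translation_probs, doc_freqs):
    """DF(e) = sum over f_i of p(f_i | e) * DF(f_i)."""
    return sum(p * doc_freqs.get(f, 0) for f, p in translation_probs.items())


# Hypothetical example: the English term e has two candidate translations.
probs = {"f1": 0.75, "f2": 0.25}
tf = psq_tf(probs, {"f1": 2, "f3": 5})    # 0.75 * 2 + 0.25 * 0
df = psq_df(probs, {"f1": 40, "f2": 8})   # 0.75 * 40 + 0.25 * 8
```
]]></preformat>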
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Combinatorial Translation Probability</title>
        <p>Translation probabilities are generally estimated from parallel corpus statistics. As no parallel corpus is available for Persian, in this section we introduce a method that estimates English to Persian translation probabilities from the Persian collection itself. As most user queries contain at least two terms (e.g. all queries of the Hamshahri collection have two or more terms), the main idea is to use the co-occurrence probability of terms in the collection to calculate the translation probability of adjacent query terms.</p>
        <p>Let M be the number of terms in the user's query; then we define the user's query as Q = {q<sub>i</sub>} (i = 1, …, M). To translate Q, we look up the members of Q in an English to Persian dictionary to find their Persian equivalents. Considering T as the translation function, we define the set of translations of the members of Q as:</p>
        <p><disp-formula><tex-math>E = \{T(q_1), T(q_2), \ldots, T(q_M)\}</tex-math></disp-formula></p>
        <p>Then the probability that two adjacent query terms q<sub>i</sub> and q<sub>i+1</sub> are translated into E[i,x] and E[i+1,y], respectively, is calculated from the following equation:</p>
        <p><disp-formula><label>(2)</label><tex-math>P(q_i \rightarrow E[i,x] \wedge q_{i+1} \rightarrow E[i+1,y]) = \frac{|D_{q_i} \cap D_{q_{i+1}}|}{c + \mathrm{Min}(|D_{q_i}|, |D_{q_{i+1}}|)} \quad (x = 1 \ldots |T(q_i)|,\; y = 1 \ldots |T(q_{i+1})|)</tex-math></disp-formula></p>
        <p>where D<sub>q<sub>i</sub></sub> is the subset of the collection's documents that contain the term q<sub>i</sub>, and the constant c is a small value that prevents the denominator from becoming zero. In the next step we create a translation probability matrix W<sub>k</sub> for each pair of adjacent query terms:</p>
        <p><disp-formula><tex-math>W_k = \{w_{m,n}\} \quad (m = 1 \ldots |T(q_k)|,\; n = 1 \ldots |T(q_{k+1})|)</tex-math></disp-formula></p>
        <p>where w<sub>m,n</sub> is calculated using equation (2). The Combinatorial Translation Probability (CTP) is then a |T(q<sub>1</sub>)| × |T(q<sub>M</sub>)| matrix that is calculated by multiplying all of the W<sub>k</sub> matrices:</p>
        <p><disp-formula><tex-math>CTP(Q) = W_1 \times \cdots \times W_k \quad (k = 1 \ldots M-1)</tex-math></disp-formula></p>
        <p>In other words, the CTP matrix contains the probabilities of translating the members of Q into their different possible Persian translations. Given the CTP(Q) matrix, the algorithm in table 2 returns the TDimes matrix, which contains the dimensions of the E = {T(q<sub>1</sub>), T(q<sub>2</sub>), …, T(q<sub>M</sub>)} matrix that correspond to the top n most probable translations of the query Q = {q<sub>i</sub>} (i = 1, …, M).</p>
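      <p>The computation of the pairwise matrices and of their product can be sketched as below. This is an illustration under assumptions, not our implementation: the document sets are looked up for the candidate Persian translations E[i,x] and E[i+1,y] (we assume this reading of equation (2), since the collection is Persian), the postings are a hypothetical in-memory mapping from a term to the set of identifiers of documents containing it, and the value of c is arbitrary. The query is assumed to have at least two terms.</p>
      <preformat><![CDATA[
```python
def pair_probability(docs_a, docs_b, c=1e-6):
    """Equation (2): co-occurrence count over c + min of the two set sizes."""
    shared = len(docs_a.intersection(docs_b))
    return shared / (c + min(len(docs_a), len(docs_b)))


def pair_matrix(cands_a, cands_b, postings, c=1e-6):
    """W_k: one row per candidate of q_k, one column per candidate of q_(k+1)."""
    return [
        [pair_probability(postings.get(a, set()), postings.get(b, set()), c)
         for b in cands_b]
        for a in cands_a
    ]


def mat_mul(left, right):
    """Plain matrix product of two lists of rows."""
    return [
        [sum(l * r for l, r in zip(row, col)) for col in zip(*right)]
        for row in left
    ]


def ctp(candidates, postings, c=1e-6):
    """Multiply W_1 ... W_(M-1); the result has |T(q_1)| rows and |T(q_M)| columns."""
    result = pair_matrix(candidates[0], candidates[1], postings, c)
    for k in range(1, len(candidates) - 1):
        next_w = pair_matrix(candidates[k], candidates[k + 1], postings, c)
        result = mat_mul(result, next_w)
    return result


# Hypothetical postings: candidate term -> ids of documents that contain it.
toy_postings = {"a1": {1, 2, 3}, "a2": {4}, "b1": {2, 3, 5}, "c1": {3, 5}, "c2": {9}}
toy_candidates = [["a1", "a2"], ["b1"], ["c1", "c2"]]
matrix = ctp(toy_candidates, toy_postings)
```
]]></preformat>
      <p>In this toy example only the combination (a1, b1, c1) co-occurs in the collection, so it is the only entry of the resulting 2 × 2 matrix that is not zero.</p>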
        <p>[Table 2: algorithm for computing the TDimes matrix; surviving steps only]</p>
        <p>else: Let TDimes[i,j] = the column number of the largest element of the Rth row of W<sub>i-1</sub></p>
        <p>Output the TDimes matrix</p>
        <p>Having the TDimes matrix, we are able to extract the different translations of the user's query from E = {T(q<sub>1</sub>), T(q<sub>2</sub>), …, T(q<sub>M</sub>)} and their weights from CTP. For example, if we consider an English query that has three terms, then the most probable Persian translations of the query terms would be E[1, TDimes[1,1]], E[2, TDimes[1,2]] and E[3, TDimes[1,3]], respectively, and the translated query's weight would be CTP[TopColumns[1], TopRows[1]].</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Query Translation Experimental Results</title>
        <p>We translated the queries through term lookup in an English-Persian dictionary as described before, using the methods of sections 3.1 and 3.2. All of our query translation experiments were run using the title of the English version of the 50 topics, except run #8, in which we used title + description of the topics. In this part of our experiments we had eight runs, which are summarized in table 3 and described below:</p>
        <p>Run #1: In this run we concatenate all meanings of each of the query terms to formulate a Persian
query.</p>
        <p>Run #2: The same as previous run but uses top 5 Persian meanings of each of the query words for
query translation.</p>
        <p>Run #3: The same as previous run but uses the first Persian meaning of each of the query words for
query translation.</p>
        <p>Run #4: Uses all Persian meanings of the query terms for calculating CTP. We then use the PSQ method with the top 10 most probable Persian translations of the query.</p>
        <p>Run #5: In this run we first look up top 5 meanings of query terms in the dictionary and then we
convert them into 4-grams for calculating CTP. Then we use PSQ method with top 10 most
probable Persian translations of the query to run 4-gram based retrieval.</p>
        <p>Run #6: The same as previous run but we use 5-grams instead of 4-grams.</p>
        <p>Run #7: This run is the same as run #3 but in this run we use the Lucene vector space retrieval
model.</p>
        <p>Run #8: This run is the same as run #7 but in this run we use title + description. We eliminate common words such as ‘find’ and ‘information’ from the topic descriptions.</p>
        <p>We used the Lemur toolkit [5] to implement our algorithm for run #1 to run #5. The default retrieval model of Lemur's retrieval engine (Indri) is language modeling. The Indri retrieval engine supports structured queries, so we could easily implement the PSQ method using CTP for translation probability estimation. Run #7 and run #8 are implemented with the Lucene retrieval engine.</p>
        <table-wrap>
          <label>Table 3</label>
          <table>
            <thead>
              <tr><th>Run#</th><th>Run Name</th></tr>
            </thead>
            <tbody>
              <tr><td>1</td><td>UTNLPDB1BA</td></tr>
              <tr><td>2</td><td>UTNLPDB1BT5</td></tr>
              <tr><td>3</td><td>UTNLPDB1BT1</td></tr>
              <tr><td>4</td><td>UTNLPDB1BA10</td></tr>
              <tr><td>5</td><td>UTNLPDB1BT4G</td></tr>
              <tr><td>6</td><td>UTNLPDB1BT5G</td></tr>
              <tr><td>7</td><td>CLQTR</td></tr>
              <tr><td>8</td><td>CLQTDR</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>Figure 1 depicts the precision-recall graph of the eight runs for the top 100 retrieved documents, calculated with the Trec_Eval tool. According to the ‘comparison of median average precision’ figure that was released at Persian@CLEF 2008, this method could outperform monolingual retrieval results for some topics, such as topic no. 570. This is because of the implicit query expansion effect of this method: the topic's title is ‘Iran dam construction’, and after its translation into Persian, the CTP method adds the word ‘آب’, which means ‘water’ in English, to the query.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. CLIR by Document Translation</title>
      <p>To translate the documents of the Hamshahri collection from Persian into English, we used the Shiraz machine translation system, which was developed at New Mexico State University [3]. The Shiraz machine translation system is an open source project written in C [4]. This system uses a bilingual Persian to English dictionary consisting of approximately 50,000 terms, a complete morphological analyzer and a syntactic parser. The machine translation system is mainly targeted at translating news material.</p>
      <p>Document translation is not a popular approach to CLIR because it is not computationally efficient. This was also apparent in our experiments. We ran the Shiraz machine translation system on a PC with 2 GB of RAM and a 3.2 GHz Intel CPU, and it took more than 12 days to translate nearly 80 percent of the collection. In the end we could translate 134165 out of the 166774 documents of the collection, and we skipped the translation of long documents to save time. In our document translation experiments we had one run, named CLDTDR, which is described below:</p>
      <p>Run #9: In this run we use the English version of the 50 topics of Persian@CLEF 2008 and retrieve the translated documents of the collection using the Lucene vector space retrieval engine. This run uses the title + description part of the topics.</p>
      <p>Furthermore, we tried a hybrid CLIR method based on score-based merging of the results of the query translation and document translation methods. For this purpose we merged the results of the CLDTDR and UTNLPDB1BT4G runs. The two runs used different retrieval engines, and hence their retrieval scores were not on the same scale. To address this problem we used the following equation to bring the scores of the two retrieval lists onto the same scale:</p>
      <p><disp-formula><tex-math>Score_i = \frac{x_i - \mathrm{Min}(L_{i,q})}{\mathrm{Max}(L_{i,q}) - \mathrm{Min}(L_{i,q})}</tex-math></disp-formula></p>
      <p>in which x<sub>i</sub> and Score<sub>i</sub> are the old and the normalized scores, and Min(L<sub>i,q</sub>) and Max(L<sub>i,q</sub>) are the minimum and maximum scores in the ith retrieved list for the query q (i = 1, 2 for the two runs). This normalization maps the scores into the range [0, 1]. To obtain the merged results, we then chose the top 100 documents with the highest weight from the two lists.</p>
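      <p>The normalization and merging steps can be sketched as follows. The sketch makes assumptions that the text leaves open: a result list is represented as a hypothetical mapping from document identifier to score, a document retrieved by both runs keeps the higher of its two normalized scores, and a list whose scores are all equal is mapped to 1.0. The example scores are invented.</p>
      <preformat><![CDATA[
```python
def min_max_normalise(scores):
    """Map a {doc_id: score} result list onto the range [0, 1]."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        # Assumption: a list with identical scores is mapped to 1.0.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - low) / (high - low) for doc, s in scores.items()}


def merge_runs(run_a, run_b, top_k=100):
    """Normalise both runs, then keep the top_k documents by normalised score."""
    merged = {}
    for run in (min_max_normalise(run_a), min_max_normalise(run_b)):
        for doc, score in run.items():
            # Assumption: a document found by both runs keeps its higher score.
            merged[doc] = max(score, merged.get(doc, 0.0))
    ranked = sorted(merged.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Invented scores on two different scales, as with two retrieval engines.
run_one = {"d1": 12.0, "d2": 6.0, "d3": 0.0}
run_two = {"d2": -4.0, "d4": -5.0, "d5": -8.0}
top = merge_runs(run_one, run_two, top_k=3)
```
]]></preformat>
      <p>Other ways of combining the two scores of a shared document, such as summing them, would fit the same normalization; the choice of the maximum here is only for illustration.</p>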
    </sec>
    <sec id="sec-5">
      <title>5. Discussion and Future Work</title>
      <p>In the Persian ad hoc track of the ninth CLEF campaign, we evaluated a number of cross-language information retrieval systems in addition to some monolingual retrieval systems. In the monolingual part of our experiments we evaluated N-grams and a light stemmer on the Persian language, and in the cross-language part we evaluated the query translation and document translation approaches to English-Persian cross-language information retrieval. For query translation we used the combinatorial translation probability method, which uses statistics of the target language to estimate translation probabilities. The results of our hybrid cross-language information retrieval experiments also suggest that combining document translation and query translation is useful.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>We would like to thank the CLEF 2008 organizers for their support in the development of the Hamshahri collection.</p>
    </sec>
    <sec id="sec-7">
      <title>References</title>
      <p>[3] Amtrup, Jan W., Hamid Mansouri Rad, Karine Megerdoomian and Rémi Zajac (2000). Persian-English Machine Translation: An Overview of the Shiraz Project. NMSU, CRL, Memoranda in Computer and Cognitive Science (MCCS-00-319).</p>
      <p>[4] Shiraz Project, http://crl.nmsu.edu/Research/Projects/shiraz</p>
      <p>[5] Lemur Toolkit, http://www.lemurproject.org/</p>
      <p>[6] Bijankhan Corpus, http://ece.ut.ac.ir/dbrg/bijankhan/</p>
      <p>[7] Hamshahri Collection, http://ece.ut.ac.ir/dbrg/hamshahri</p>
      <p>[8] Apache Lucene project, http://lucene.apache.org/</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Ari</given-names>
            <surname>Pirkola</surname>
          </string-name>
          .
          <article-title>The effects of query structure and dictionary setups in dictionary-based cross-language information retrieval</article-title>
          .
          <source>In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , pages
          <fpage>55</fpage>
          -
          <lpage>63</lpage>
          . ACM Press,
          <year>August 1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Kareem</given-names>
            <surname>Darwish</surname>
          </string-name>
          and
          <string-name>
            <given-names>Douglas W.</given-names>
            <surname>Oard</surname>
          </string-name>
          .
          <article-title>Probabilistic structured query methods</article-title>
          .
          <source>In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , pages
          <fpage>338</fpage>
          -
          <lpage>344</lpage>
          . ACM Press,
          <year>July 2003</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>