<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Dictionary based Amharic - English Information Retrieval</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Atelach Alemu Argaw</string-name>
          <email>atelach@dsv.su.se</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lars Asker</string-name>
          <email>asker@dsv.su.se</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rickard Cöster</string-name>
          <email>rick@sics.se</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jussi Karlgren</string-name>
          <email>jussi@sics.se</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer and Systems Sciences, Stockholm University/KTH</institution>
          ,
          <country country="SE">Sweden</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Swedish Institute of Computer Science</institution>
          ,
          <country country="SE">Sweden</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We present two approaches to the Amharic - English bilingual track in CLEF 2004. Both experiments use a dictionary-based approach to translate the Amharic queries into English bags of words. One approach removes non-content-bearing words from the Amharic queries based on their IDF value, while the other uses a list of English stop words to perform the same task after translation. The resulting translated (English) terms are then submitted to a retrieval engine that supports the Boolean and vector space models. In our experiments, the second approach (based on a list of English stop words) performs slightly better than the one based on IDF values for the Amharic terms.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2 Method</title>
    </sec>
    <sec id="sec-3">
      <title>2.1 Translation and Transliteration</title>
      <sec id="sec-3-1">
        <p>The English topic sets were translated into Amharic by human translators. Amharic uses its own unique alphabet (Fidel). A number of fonts exist for it, but to date there is no standard for the language.</p>
      </sec>
      <sec id="sec-3-2">
        <p>The Amharic topics were originally represented using a Unicode compliant Ethiopic font called Visual Geez. For ease of use and for compatibility reasons, we transliterated them into an ASCII representation using SERA (System for Ethiopic Representation in ASCII).</p>
        <p>The title and description fields of the original 50 Amharic topics contained 781 terms (493 unique) distributed over 808 words (because a few Amharic terms consisted of more than one word). Out of these 493 unique terms 397 were found in the original Amharic – English Machine Readable Dictionary. This dictionary consists of a little more than 14,600 entries. The remaining 96 terms were included in a manually constructed dictionary consisting of these terms and their translation of the relevant sense. Almost all of the 96 terms in this dictionary were proper names.</p>
        <list list-type="simple">
          <list-item><p>1. Amharic topic set</p></list-item>
          <list-item><p>1a. Transliteration</p></list-item>
          <list-item><p>2. Transliterated Amharic topic set</p></list-item>
          <list-item><p>2a. Semi automatic crude stemming (only prefixes and suffixes)</p></list-item>
          <list-item><p>3. Stemmed Amharic topic set</p></list-item>
          <list-item><p>3a. IDF-based stop word removal</p></list-item>
          <list-item><p>4. Reduced Amharic topic set</p></list-item>
          <list-item><p>4a. Dictionary lookup</p></list-item>
          <list-item><p>5. Topic set (in English) including all possible translations</p></list-item>
          <list-item><p>5a. Manual disambiguation</p></list-item>
          <list-item><p>6. English terms (bag of words)</p></list-item>
          <list-item><p>6a. Retrieval (Indexing, keyword search, ranking)</p></list-item>
          <list-item><p>7. Retrieved Documents</p></list-item>
        </list>
        <p>Fig 1. Flow chart for AmEnI</p>
        <list list-type="simple">
          <list-item><p>1. Amharic topic set</p></list-item>
          <list-item><p>1a. Transliteration</p></list-item>
          <list-item><p>2. Transliterated Amharic topic set</p></list-item>
          <list-item><p>2a. Semi automatic crude stemming (only prefixes and suffixes)</p></list-item>
          <list-item><p>3. Stemmed Amharic topic set</p></list-item>
          <list-item><p>3a. Dictionary lookup</p></list-item>
          <list-item><p>4. Topic set (in English) including all possible translations</p></list-item>
          <list-item><p>4a. Manual disambiguation</p></list-item>
          <list-item><p>5. Translated English terms and phrases</p></list-item>
          <list-item><p>5a. Stop word removal</p></list-item>
          <list-item><p>6. English terms (bag of words)</p></list-item>
          <list-item><p>6a. Retrieval (Indexing, keyword search, ranking)</p></list-item>
          <list-item><p>7. Retrieved Documents</p></list-item>
        </list>
        <p>Fig 2. Flow chart for AmEnA</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>2.2 Stemming</title>
      <sec id="sec-4-1">
        <p>Amharic is a Semitic language that is morphologically complex [<xref ref-type="bibr" rid="ref2">2</xref>]. Words are inflected with prefixes, suffixes and infixes. Once the topic set was transliterated, a semi-automatic crude stemming step stripped the prefixes and suffixes from each word. The machine readable dictionary (MRD) used in the experiments contains entries for words and their derivational variants. Infixed words are represented separately in the dictionary.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>2.3 Dictionary Lookup and Disambiguation</title>
      <sec id="sec-5-1">
        <p>A machine readable dictionary (MRD) consisting of about 14,600 words was used in the experiments to perform the lexical lookup when translating the Amharic queries into English. The dictionary consisted of entries for words and their derivational variants.</p>
      </sec>
      <sec id="sec-5-2">
        <p>The stemmed words in the Amharic query were automatically looked up in the MRD for possible translations. Where there was a match and the word had only one sense, the corresponding English word or phrase in the dictionary was taken as the translation. Where the term had more than one sense, all possible translations were picked out and a manual disambiguation was performed. Most of the proper names had no entry in the MRD, so these terms were added manually.</p>
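        <p>The lookup logic can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name look_up, the dictionary layout (a stem mapped to a list of English translations, one per sense) and the toy entries are invented for the example and are not taken from the actual MRD.</p>
        <preformat><![CDATA[
```python
def look_up(stems, mrd):
    """Split stems by how many senses the dictionary lists for them.

    mrd maps a stem to a list of English translations (one per sense).
    """
    translated = {}  # exactly one sense: translation taken directly
    ambiguous = {}   # several senses: left for manual disambiguation
    missing = []     # no entry: to be added manually
    for stem in stems:
        senses = mrd.get(stem, [])
        if len(senses) == 1:
            translated[stem] = senses[0]
        elif len(senses) > 1:
            ambiguous[stem] = senses
        else:
            missing.append(stem)
    return translated, ambiguous, missing


# Toy entries, invented for illustration; not taken from the actual MRD.
toy_mrd = {"bet": ["house"], "sew": ["person", "man"]}
translated, ambiguous, missing = look_up(["bet", "sew", "stockholm"], toy_mrd)
```
]]></preformat>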
      </sec>
      <sec id="sec-5-3">
        <p>The Amharic query set contained 493 unique terms. Of these, 285 occurred in the dictionary with only one possible translation, 112 occurred in the dictionary with more than one sense (the average number of senses for this group was 2.55), and 96 terms (mostly proper names) did not occur at all.</p>
      </sec>
      <sec id="sec-5-4">
        <p>The 96 terms that did not occur in the MRD were manually added to a separate dictionary.</p>
      </sec>
      <sec id="sec-5-5">
        <p>Some of the translations in the MRD were phrasal, and taking these phrases introduced more words into the query. Some of the Amharic entries were also phrasal (22 in total, 14 unique), which in turn reduced the number of words in the query.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>2.4 Stop Word Removal</title>
      <sec id="sec-6-1">
        <p>The main difference between the two approaches is in the way words that are likely to be less informative are identified and removed from the queries. In the first approach (AmEnI), the number of Amharic words was reduced by removing those with an Inverse Document Frequency (IDF) value below a threshold of 3.00.</p>
      </sec>
      <sec id="sec-6-2">
        <p>The IDF values were calculated from an Amharic news corpus consisting of approximately 2 million words of text.</p>
      </sec>
      <sec id="sec-6-3">
        <p>With a threshold value of 3.00, 123 of the 493 unique Amharic words were removed (25%). The second approach (AmEnA) removed those words from the translated queries that occurred in a list of 517 English stop words. With this approach, 118 unique terms were removed, and the resulting English query set contained 559 words, compared to 547 for the AmEnI approach. Thus the two approaches left approximately the same number of words.</p>
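        <p>The IDF-based filtering of the AmEnI approach can be sketched in Python as follows. The text does not state which IDF variant was used, so the formula idf(t) = ln(N / df(t)), with N the number of documents and df(t) the number of documents containing t, is an assumption, as are the function names and the decision to keep terms that do not occur in the corpus. Only the threshold of 3.00 is taken from the text.</p>
        <preformat><![CDATA[
```python
import math


def idf_values(documents):
    """Return idf(t) = ln(N / df(t)) for every term in the corpus.

    documents is a list of token lists. The natural logarithm is an
    assumption, since the text does not state which IDF variant was used.
    """
    n_docs = len(documents)
    df = {}
    for tokens in documents:
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1
    return {term: math.log(n_docs / count) for term, count in df.items()}


def remove_low_idf(query_terms, idf, threshold=3.00):
    """Keep query terms whose IDF is at or above the threshold.

    Terms unseen in the corpus are kept, on the assumption that rare
    terms (for example proper names) are informative.
    """
    return [t for t in query_terms if idf.get(t, float("inf")) >= threshold]
```
]]></preformat>
        <p>The AmEnA approach needs no corpus statistics: it is a plain membership test of each translated English word against the stop word list.</p>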
      </sec>
    </sec>
    <sec id="sec-7">
      <title>2.5 Retrieval Engine</title>
      <sec id="sec-7-1">
        <p>The underlying retrieval engine is an experimental system developed at SICS (Swedish Institute of Computer Science). The system supports the Boolean and vector space models, as well as structured queries. It is designed to handle large numbers of documents and queries, using effective algorithms for information retrieval as described in, e.g., [<xref ref-type="bibr" rid="ref4">4</xref>]. More information on the retrieval engine can be found in [<xref ref-type="bibr" rid="ref1">1</xref>].</p>
      </sec>
      <sec id="sec-7-2">
        <p>For document scoring, we use Pivoted Unique Normalization [<xref ref-type="bibr" rid="ref3">3</xref>]. The score for a document d given a query with m query terms is defined as</p>
        <disp-formula>
          <tex-math><![CDATA[\mathit{score}(d) = \sum_{i=1}^{m} \frac{\frac{1 + \log(\mathit{tf}_{i,d})}{1 + \log(\mathit{average\_tf}_{d})}}{(1 - \mathit{slope}) \times \mathit{pivot} + \mathit{slope} \times \mathit{no\_of\_unique\_terms}}]]></tex-math>
        </disp-formula>
        <p>where tf<sub>i,d</sub> is the term frequency of query term i in document d, average_tf<sub>d</sub> is the average term frequency in document d, and no_of_unique_terms is the number of unique terms in document d. The slope parameter was set to 0.3, and the pivot to the average number of unique terms in a document, as suggested in [<xref ref-type="bibr" rid="ref3">3</xref>].</p>
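        <p>A minimal Python sketch of this scoring scheme follows, using the formulation of pivoted unique normalization in [<xref ref-type="bibr" rid="ref3">3</xref>]: each query term that occurs in the document contributes (1 + log tf) / (1 + log average_tf), and the sum is divided by the normalization factor (1 − slope) × pivot + slope × no_of_unique_terms. The natural logarithm, the function name and the token-list representation of a document are assumptions made for the example. This is not the code of the SICS engine.</p>
        <preformat><![CDATA[
```python
import math


def pivoted_unique_score(query_terms, doc_tokens, pivot, slope=0.3):
    """Score one document against a query with pivoted unique normalization.

    pivot is the average number of unique terms per document in the
    collection; slope=0.3 follows the setting reported in the text.
    """
    if not doc_tokens:
        return 0.0
    tf = {}
    for token in doc_tokens:
        tf[token] = tf.get(token, 0) + 1
    unique_terms = len(tf)
    average_tf = len(doc_tokens) / unique_terms
    norm = (1.0 - slope) * pivot + slope * unique_terms
    score = 0.0
    for term in query_terms:
        if term in tf:  # query terms absent from the document add nothing
            score += (1.0 + math.log(tf[term])) / (1.0 + math.log(average_tf))
    return score / norm
```
]]></preformat>
        <p>With slope below 1, a document with more unique terms than the pivot is penalized less than it would be under plain normalization by the number of unique terms, and a document with fewer is boosted less.</p>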
      </sec>
    </sec>
    <sec id="sec-8">
      <title>3 Results</title>
      <sec id="sec-8-1">
        <p>We participated in the cross-language Amharic to English track. Two runs were performed on the data set using two sets of queries. In the first run (AmEnI), stop word removal using IDF weights was done before the translation of terms. In the second run (AmEnA), stop word removal was done only after the terms had been translated into English.</p>
      </sec>
      <sec id="sec-8-2">
        <p>The results obtained in both runs are reported in Table 3. The number of relevant documents, the retrieved relevant documents, the non-interpolated average precision, and the precision after R (=num_rel) documents retrieved (R-Precision) are summarized for the two runs.</p>
        <table-wrap id="tab-3">
          <label>Table 3.</label>
          <caption>
            <p>Results from both runs</p>
          </caption>
          <table>
            <thead>
              <tr><th /><th>AmEnI</th><th>AmEnA</th></tr>
            </thead>
            <tbody>
              <tr><td>Avg Precision</td><td>0.3615</td><td>0.4009</td></tr>
              <tr><td>R-Precision</td><td>0.3251</td><td>0.3663</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>4 Conclusions</title>
      <sec id="sec-9-3">
        <p>We have described our experiments in the CLEF 2004 Amharic-English cross-language track. We followed a dictionary-based approach to translate the Amharic queries into English bags of words. One of the reported experiments removes non-content-bearing words from the Amharic queries based on their IDF value, while the other uses a list of English stop words to perform the same task. The resulting translated (English) terms are then submitted to a retrieval engine that supports the Boolean and vector space models.</p>
      </sec>
      <sec id="sec-9-4">
        <p>As can be seen from the results in the previous section, the second approach (based on a list of English stop words) achieves an average precision of 0.4009, while the first approach (based on IDF values for the Amharic terms) achieves 0.3615.</p>
      </sec>
      <sec id="sec-9-5">
        <p>This could be attributed to the fact that, although non-content-bearing words were removed from the Amharic queries in the first approach, many stop words were introduced during dictionary lookup, which added noise. A combination of the two approaches may yield better precision, while query expansion as a means of increasing recall remains open for investigation.</p>
      </sec>
      <sec id="sec-9-6">
        <p>In future experiments we plan to investigate the possibility of automating some of the tasks that were done manually in these experiments (sense disambiguation, addition of proper names to the MRD), using techniques such as statistical co-occurrence for disambiguation and cognate matching for proper names.</p>
      </sec>
      <sec id="sec-9-7">
        <p>We also plan to experiment with different retrieval techniques, compare the performance of the algorithms, and study the effects of various levels of stemming (root, stem, word).</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Cöster</surname>
            ,
            <given-names>Rickard</given-names>
          </string-name>
          .
          <article-title>SICS text retrieval engine in CLEF02</article-title>
          .
          <source>Proceedings of CLEF 2002</source>
          .
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Fissaha</surname>
            ,
            <given-names>Sisay</given-names>
          </string-name>
          and
          <string-name>
            <surname>Haller</surname>
            ,
            <given-names>Johann</given-names>
          </string-name>
          .
          <article-title>Amharic verb lexicon in the context of Machine Translation</article-title>
          .
          <source>In Proceedings of TALN 2003 Workshop on Natural Language Processing of Minority Languages and Small Languages</source>
          , Batz-sur-Mer, France, June
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Singhal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Buckley</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Mitra</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <article-title>Pivoted Document Length Normalization</article-title>
          .
          <source>In Proceedings of the 19th International Conference on Research and Development in Information Retrieval</source>
          , pages
          <fpage>21</fpage>
          -
          <lpage>29</lpage>
          .
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Witten</surname>
            ,
            <given-names>Ian H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moffat</surname>
            ,
            <given-names>Alistair</given-names>
          </string-name>
          and
          <string-name>
            <surname>Bell</surname>
            ,
            <given-names>Timothy C.</given-names>
          </string-name>
          <source>Managing Gigabytes: Compressing and Indexing Documents and Images</source>
          .
          <publisher-name>Morgan Kaufmann Publishing</publisher-name>
          ,
          <edition>2nd edition</edition>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>