SINAI∗ on CLEF 2002: Experiments with Merging Strategies
                      Fernando Martı́nez-Santiago, Maite Martı́n, Alfonso Ureña
                   Department of Computer Science, University of Jaén, Jaén, Spain
                                  {dofer,maite,laurena}@ujaen.es
                   ∗ Sistemas Inteligentes de Acceso a la Información (Intelligent Information Access Systems)



                                                     Abstract
           For our first participation in the CLEF multilingual task, we present a new approach
        to obtaining a single list of relevant documents for CLIR systems based on query
        translation. This new approach, which we call two-step RSV, is based on the re-indexing
        of the retrieved documents according to the query vocabulary, and it performs noticeably
        better than traditional methods.


1      Introduction
A usual approach in CLIR is to translate the query into each language present in the corpus, and
then run a monolingual query in each language. It is then necessary to obtain a single ranking of
documents by merging the individual lists retrieved for each language. The problem is how to
carry out such a merge. This is known as the merging strategies problem, and it is not a minor
one, since the weight assigned to each document (its Retrieval Status Value, RSV) is calculated
not only according to the relevance of the document and the IR model used, but also according
to the rest of the monolingual corpus to which the document belongs [2].
    There are various approaches to standardising the RSV, but even so a large decrease in
precision occurs in the process (depending on the collection, between 20% and 40%) [15, 13].
Perhaps for this reason, CLIR systems based on document translation tend to obtain results
which are noticeably better than those of systems which only translate the query.
    The rest of the paper is organized as follows. Firstly, we present a brief review of the most
widespread merging strategies. Sections 3 and 4 describe our proposed method. In section 5, we
detail the experiments carried out and the results obtained. Finally, we present our conclusions
and future lines of work.


2      A brief review of merging strategies
For N languages, we have N different lists of relevant documents, each obtained independently
of the others, and it is necessary to obtain a single list by merging them all. If we suppose that
every retrieved document in every list has the same probability of being relevant, and that the
similarity values are therefore directly comparable, then an immediate approach would be simply
to order the documents according to their RSV (this method is known as raw scoring) [5, 8].
However, this method is not adequate, since the document scores computed for each language
are not comparable. For example, a document in Spanish that includes the term “información”
can obtain a radically different RSV from a document in English with the same term,
“information”. In general, this is because the different indexing techniques take into account
not only the term frequency in the document (tf ), but also how frequent such a term is in the
rest of the documents, that is, the inverse document frequency (idf ) [12]. Thus, the idf depends
on each particular monolingual collection. A first attempt to
make these values comparable is to standardise in some way the RSV reached by each document:
    • By dividing each RSV by the maximum RSV reached in each collection:

                             RSV'_i = RSV_i / max(RSV ), 1 ≤ i ≤ N

    • A variant of the previous method is to divide each RSV by the difference between the
      maximum and minimum document score values reached in each collection [10]:

                RSV'_i = (RSV_i − min(RSV )) / (max(RSV ) − min(RSV )), 1 ≤ i ≤ N

    in which RSV_i is the original retrieval status value, and max(RSV ) and min(RSV ) are the
maximum and minimum document score values, achieved by the first and last documents respec-
tively. N is the number of documents in the collection.
    However, the problem is only partially solved, since the normalization of the document score
is accomplished independently of the rest of the collections, and therefore the differences in the
RSVs are still large.
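    To make the two normalization schemes concrete, the following Python sketch merges several
monolingual runs either by raw scoring or by the min-max normalized score of [10]; the run data
and all names are illustrative, not part of our actual system:

    # A minimal sketch: each monolingual run is a list of (doc_id, rsv) pairs
    # sorted by descending RSV.

    def normalize_run(run):
        """Rescale RSVs to [0, 1] using the min-max variant."""
        scores = [rsv for _, rsv in run]
        lo, hi = min(scores), max(scores)
        span = (hi - lo) or 1.0          # avoid division by zero on constant runs
        return [(doc, (rsv - lo) / span) for doc, rsv in run]

    def merge_runs(runs, normalize=True):
        """Merge monolingual runs into a single list ordered by score."""
        pooled = []
        for run in runs:
            pooled.extend(normalize_run(run) if normalize else run)
        return sorted(pooled, key=lambda pair: pair[1], reverse=True)

    spanish = [("ES-1", 14.2), ("ES-2", 9.8)]
    english = [("EN-7", 2.1), ("EN-3", 1.7)]
    print(merge_runs([spanish, english]))                    # normalized merge
    print(merge_runs([spanish, english], normalize=False))   # raw scoring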
    Another approach is to apply a round-robin algorithm. In this case, the RSV obtained by
each retrieved document is not taken into account, but rather the relative position reached by
each document in its collection: the merged list is built by interleaving, so that a document
ranked m in its own list appears in the m-th round of the single result list. Thus, for example, if
we have five languages and we retrieve five lists of documents, the first five documents of the
single result list will be the first document of each list; the next five, the second document of
each list; and so on. This approach is not completely satisfactory because the position reached
by each document is calculated exclusively by considering the documents of the monolingual
collection to which it belongs.
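    The round-robin interleaving can be sketched in a few lines of Python (the document
identifiers are hypothetical):

    from itertools import zip_longest

    def round_robin(runs):
        """Interleave ranked lists: the m-th documents of all runs are placed
        before any (m+1)-th document, preserving each run's internal order."""
        merged = []
        for tier in zip_longest(*runs):
            merged.extend(doc for doc in tier if doc is not None)
        return merged

    # round_robin([["ES-1", "ES-2"], ["EN-7", "EN-3"]])
    #   -> ["ES-1", "EN-7", "ES-2", "EN-3"]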
    Finally, another approach, perhaps the most original, is to generate a single index with all
the documents, without taking into account the multilingual nature of the collection [1, 3, 7]. In
this way, a single index is obtained in which the terms from each language are intermixed. In
the same way that all the documents are merged into a single index, we obtain a single query
in which terms in several languages are also intermixed. That is, the query must be translated
into each of the languages present in the multilingual collection. However, we do not generate a
query for each translation; rather, we merge all the translations to form a single query. This
query is then the one which we match against the document collection. As with the approach
based on document translation, the system always returns a single list of documents for each
query. In spite of this, the problem is not eliminated: the ranking of each document still depends
on the language in which it is written. Although a single index is generated, the vocabulary of
each language is practically exclusive, and two different languages rarely share terms. For this
reason, the weight obtained by each term refers only to the language to which it belongs, and
therefore the similarity between documents is only correct with respect to documents expressed
in the same language. Finally, it is necessary to mention that a notable exception is proper
nouns, which are frequently invariable across languages. In such cases, this approach proves very
effective.


3    A useful structure to describe IR models
In this section we present a notation that will be used to describe the proposed model. A large
number of retrieval methods are based on this structure [14]:
                                         < T, Φ, D; f f, df >
    where:
    • D is the document collection to be indexed.
    • Φ is the vocabulary used in the indices generated from D.
    • T is the set of all tokens τ present in the collection D, commonly the words or terms. The
      function
                                            ϕ : T → Φ, τ → ϕ(τ )
      maps the set of all tokens, T , to the indexing vocabulary Φ. The function ϕ can be a simple
      process, such as removing accents, or a more complex one, such as stemming or lemmati-
      zation.
    • f f is the feature frequency and denotes the number of occurrences of ϕi in a document dj :

                             f f (ϕi , dj ) :=| {τ ∈ T | ϕ(τ ) = ϕi ∧ d(τ ) = dj } |

       where d is the function that makes each token τ correspond to its document:

                                            d : T → D, τ → d(τ )

    • df is the document frequency and denotes the number of documents containing the feature
      ϕi at least once:

                          df (ϕi ) :=| {dj ∈ D | ∃τ ∈ T : ϕ(τ ) = ϕi ∧ d(τ ) = dj } |
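
    As an illustration, the following Python sketch instantiates this structure on a toy collection.
The ϕ shown (lowercasing plus accent removal) is just one possible choice of mapping, and all
names and data are illustrative:

    import unicodedata
    from collections import Counter

    def phi(token):
        """Map a token tau to its indexing feature phi(tau): here, lowercase
        the token and strip diacritics."""
        stripped = unicodedata.normalize("NFD", token.lower())
        return "".join(c for c in stripped if unicodedata.category(c) != "Mn")

    docs = {"d1": "Información relevante", "d2": "la informacion buscada"}

    def ff(feature, doc_id):
        """Feature frequency: occurrences of the feature in one document."""
        return Counter(phi(t) for t in docs[doc_id].split())[feature]

    def df(feature):
        """Document frequency: documents containing the feature at least once."""
        return sum(1 for d in docs if ff(feature, d) > 0)

    assert ff("informacion", "d1") == 1 and df("informacion") == 2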


4     Two-Step Retrieval Status Value
The proposed method [6] is a system based on query translation, and it calculates the RSV in
two phases: a pre-selection phase and a re-indexing phase. Although the method is independent
of the translation technique, it is necessary to know how each term is translated.
    1. The document pre-selection phase consists of translating and running the query on each
       monolingual collection, Di , as is usual in CLIR systems based on query translation. This
       phase produces two results:
         • we obtain a single multilingual collection of preselected documents (the D0 collection)
           as a result of joining all the retrieved documents for each language.
         • we obtain, as a result of the translation process, the translation of each term of the
           original query into the other languages. That is, we obtain a T 0 vocabulary, where each
           element τ is called a “concept” and consists of a term together with its corresponding
           translations. Thus, a concept is a set of terms expressed independently of the language.
    2. The re-indexing phase consists of re-indexing the multilingual collection D0 , but considering
       solely the T 0 vocabulary. That is, only the concepts are re-indexed. Finally, a new query
       formed by the concepts in T 0 is generated and run against the new index. Thus, for
       example, if we have two languages, Spanish and English, and the term “casa” is in the
       original query and is translated by “house”, both terms represent exactly the same concept.
       If “casa” occurs a total of 100 times in the Spanish collection, and “house” occurs a total of
       150 times in the English collection, then the frequency of the concept would be 250. From
       a practical point of view, in this second phase each occurrence of “casa” is treated exactly
       like each occurrence of “house” (see the sketch below).
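    The following Python sketch illustrates the re-indexing phase under simplifying assumptions:
the concept table and the pooled documents are toy data, and a real implementation would build
this index inside the IR engine rather than in plain dictionaries:

    from collections import Counter

    # T': each surface term is mapped to a language-independent concept id.
    concepts = {"casa": "C1", "house": "C1", "perro": "C2", "dog": "C2"}

    # D': the documents retrieved in the pre-selection phase, all languages pooled.
    pooled_docs = {
        "ES-1": "la casa blanca y el perro",
        "EN-7": "the dog entered the house near another house",
    }

    def concept_index(docs, concepts):
        """Re-index D' keeping only the query concepts: the frequency of a
        concept sums the occurrences of all its translations."""
        return {
            doc_id: Counter(concepts[t] for t in text.lower().split() if t in concepts)
            for doc_id, text in docs.items()
        }

    index = concept_index(pooled_docs, concepts)
    print(index["EN-7"]["C1"])   # 2: "house" occurs twice
    print(index["ES-1"]["C1"])   # 1: "casa" occurs once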
    Formally, the method can be described as follows:
    For each monolingual collection we begin with the already-known structure:

                                < Ti , Φi , Di , f f, df >, 1 ≤ i ≤ N
    Where N is the number of languages present in the multilingual collection to be indexed. Let
Q = {Qi , 1 ≤ i ≤ N } be the set formed by the original query together with its translations
into the other languages, in such a way that Qi is the query expressed in the same language as
the collection Di . After each translation Qi has been run against its corresponding structure
< Ti , Φi , Di , f f, df >, it is possible to obtain a new, single structure:

                                      < T 0 , Φ0 , D, D0 , f f 0 , df 0 >
   where:
    • D is the complete multilingual document collection: D = {Di , 1 ≤ i ≤ N }.
    • D0 is the set of retrieved multilingual documents, obtained as a consequence of running
      the query Q.
    • T 0 is the set of concepts τj , and denotes the vocabulary of the D0 collection. Since each
      query Qi is a translation of the others, it is possible to align the queries at term level:

                         τj := {τij ∈ Qi , 1 ≤ i ≤ N }, 1 ≤ j ≤ M, M = |Q|

      where τij represents the translations of term j of the query Q into language i. Thus, τj
      denotes concept j of the query Q independently of the language.
    • Φ0 is a new vocabulary to be indexed, such that each ϕj ∈ Φ0 is generated as follows:

                               ϕj := {ϕ(τij ), 1 ≤ i ≤ N }, 1 ≤ j ≤ M

    • The f f 0 and df 0 functions are interpreted as usual:
         – f f 0 (ϕj , dk ) is the number of occurrences of concept j in document dk . That is, the
           sum over the languages i of the occurrences of the translations of query term j:

                         f f 0 (ϕj , dk ) := | {τ ∈ T | ϕ(τ ) = ϕj ∧ d(τ ) = dk } |
                                          := Σi f f (ϕij , dk ), ∀ϕij ∈ ϕj , dk ∈ D0

         – df 0 (ϕj ) is the number of documents in the collection D that contain concept j. That
           is, the sum over the languages i of the number of documents containing a translation
           of query term j:

                           df 0 (ϕj ) := | {dk ∈ D | ∃τ ∈ T : ϕ(τ ) = ϕj ∧ d(τ ) = dk } |
                                      := Σi df (ϕij ), ∀ϕij ∈ ϕj

           where df (ϕij ) is the number of documents in the monolingual collection Di that
           contain concept j.
    Given this structure, a new index is generated at run time, taking into account only the
documents found in D0 . The df function, however, operates on the whole collection D, not only
on the documents retrieved in the first phase, D0 . This is because, in practice, we have found
that the results are slightly better when the whole collection is considered to calculate the idf
factor. Once the indices have been generated in this way, the query Q, formed by concepts
rather than terms, is re-run on the D0 collection.
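    In code, both functions reduce to sums over per-language statistics. The sketch below assumes
hypothetical per-language f f and df tables keyed by concept (and document), which in practice
would be read from the monolingual indices:

    def ff_prime(concept, doc_id, per_language_ff):
        """ff'(phi_j, d_k): occurrences of the concept's translations summed over
        the languages (a document only matches terms of its own language)."""
        return sum(ff_i.get((concept, doc_id), 0) for ff_i in per_language_ff)

    def df_prime(concept, per_language_df):
        """df'(phi_j): document frequencies summed over the whole collection D,
        not just the retrieved set D', as used for the idf factor."""
        return sum(df_i.get(concept, 0) for df_i in per_language_df)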
    In some ways, this method shares ideas with CLIR systems based on corpus translation, but
instead of translating the complete corpus, it translates only the words that appear in the query
and re-indexes only the retrieved documents. These two simplifications make the approach
feasible at query time, since the re-indexing process required in the second phase is
computationally affordable due to the small size of the D0 collection and the small vocabulary
T 0 (approximately, the number of query terms multiplied by the number of languages present
in D0 ).
    Some relevant aspects of two-step RSV are:
    • It is easily scalable to several languages.
    • The system requires the term-level alignment of the original query and the translation of its
      terms. Depending on the approach followed for the translation, this process can prove more
      or less complex.

    • A term and its translations are treated in exactly the same way in the proposed model.
      This is not entirely realistic, since it is usual for the original term and its translations not
      to be equally weighted. For example, it is possible that for a given language i we keep more
      than one translation for a given concept of the original query. Consequently, the concept
      frequency will be artificially increased in the documents expressed in language i. In this
      case, if we know the translation probability of each term, we can weight each term according
      to its translation probability with respect to the original term. This can be modelled as
      follows:

              f f 0 (ϕj , dk ) := Σi f f (ϕij , dk ) · w(τij ), ∀ϕij ∈ ϕj , ϕ(τij ) = ϕij , 1 ≤ i ≤ N

      where w(τij ) is the translation probability of the translation of query term j into language
      i; by default it is 1 (see the sketch below).
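
    Under the same hypothetical per-language tables as in the previous sketch, the weighted
variant might look as follows, with w defaulting to 1 when no translation probability is known:

    def weighted_ff_prime(concept, doc_id, per_language_ff, w):
        """ff'(phi_j, d_k) := sum_i ff(phi_ij, d_k) * w(tau_ij)."""
        return sum(
            ff_i.get((concept, doc_id), 0) * w.get((concept, i), 1.0)
            for i, ff_i in enumerate(per_language_ff)
        )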


5     Description of Experiments and Results
5.1    Multilingual Experiments
The experiments have been carried out for the five languages of the multilingual task. Each
collection has been pre-processed as usual, using the stopword lists and stemming algorithms
available to the participants, except for Spanish, for which we have used the stemming algorithm
provided by the ZPrise system1 . We have added to the stopword lists terms such as “retrieval”,
“documents”, “relevant”, etc. Due to the morphological richness of German, compound words
have been reduced to simple words with the MORPHIX package [9]. Once the collections have
been pre-processed, they are indexed with the ZPrise IR system, using the OKAPI probabilistic
model [11]. The OKAPI model has also been used for the on-line re-indexing process required
by the calculation of the two-step RSV.

                               Table 1: Description of official experiments
        Experiment               Task           Form        Query               Merging Strategy
      UJAMLTDRR              Multilingual    automatic   Title+Description      Round-Robin
      UJAMLTDNORM            Multilingual    automatic   Title+Description      Normalized score
      UJAMLTDRSV2            Multilingual    automatic   Title+Description      2-Step RSV
      UJAMLTDRSV2RR          Multilingual    automatic   Title+Description      2-Step RSV + Round-Robin
      UJABITD{SP,DE,FR,IT}   Bilingual       automatic   Title+Description

    For each query, we have used the Title and Description sections. The query translation
method is very simple: we have used the Babylon2 electronic dictionary to translate the query
terms [4]. For each term, we have taken the first two translations offered by Babylon. Words
not found in the dictionary have not been translated. This approach allows us to carry out the
query alignment at term level easily.


   1 ZPrise, developed by Darrin Dimmick (NIST). Available on demand at
http://www.itl.nist.gov/iaui/894.02/works/papers/zp2/zp2.html
   2 Babylon is available at http://www.babylon.com
                Table 2: Performance using different merging strategies (official runs)
                    Experiment        Avg. prec. R-Precision Overall Recall
                  UJAMLTDRR             0.2038         0.2787          4246/8068
                 UJAMLTDNORM            0.2068         0.2647          4297/8068
                 UJAMLTDRSV2            0.2774         0.3280          4551/8068




                                     Figure 1: 11-Pt precision




    The results obtained show that the calculation of the two-step RSV improves precision by
more than seven points (a 36% relative improvement) with respect to the other approaches
(table 2). This improvement is approximately constant for short, medium and long queries
(table 3).


           Table 3: Average precision using different merging strategies and query lengths
                  Merging strategy      Tit.      Tit.+Desc. Tit.+Desc.+Narr.
                    round-robin        0.1593       0.2038            0.2425
                  normalized score 0.1592           0.2068            0.2554
                     2-step RSV        0.2159       0.2774            0.3209




5.2    Bilingual Experiments
The differences in accuracy between the bilingual experiments may be due to the stemming
algorithms used, the quality of which varies with the language. Thus, the simplest stemming
algorithm is the Italian one: it removes only inflectional suffixes such as singular and plural word
forms or feminine and masculine forms, and it is for this language that the lowest accuracy is
obtained.
    Note that the multilingual document list has been calculated starting from the document lists
obtained in the bilingual experiments. The accuracy obtained in the UJAMLTDRSV2 experiment
is similar to that obtained in the bilingual experiments (table 4), even surpassing the accuracy
for German and Italian, and falling only two points short of that reached for Spanish.
                      Table 4: Bilingual experiments (Title+Description)
                   Experiment        Language           Avg. prec. R-Precision
                   UJABITDSP    English → Spanish         0.2991        0.3141
                   UJABITDDE    English → German          0.2747        0.3077
                   UJABITDFR    English → French          0.3467        0.3365
                   UJABITDIT    English → Italian         0.2438        0.2620


5.3     Merging Several Approaches
Finally, we have carried out an experiment merging several approaches through a simple linear
function. Specifically, we have calculated document relevance with the function:

                  Pos'_i = 0.6 · Pos^{rsv2}_i + 0.4 · Pos^{merge-approach}_i

where Pos'_i is the new position of document i, Pos^{rsv2}_i is the position the document reaches
using two-step RSV, and Pos^{merge-approach}_i is its position using the round-robin or
normalized score approach. As shown in table 5, not only is there no improvement, but the
accuracy even decreases slightly.
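    A sketch of this combination, assuming each run is given as a list of document identifiers
ordered from best to worst (the penalty for documents missing from one of the lists is our own
assumption):

    def combine_positions(rsv2_run, other_run, alpha=0.6):
        """Combine two rankings by a weighted sum of 1-based positions;
        a smaller combined value means a better final position."""
        pos_rsv2 = {doc: r for r, doc in enumerate(rsv2_run, start=1)}
        pos_other = {doc: r for r, doc in enumerate(other_run, start=1)}
        worst = max(len(rsv2_run), len(other_run)) + 1   # absent-document penalty
        def score(doc):
            return alpha * pos_rsv2.get(doc, worst) + (1 - alpha) * pos_other.get(doc, worst)
        return sorted(set(pos_rsv2) | set(pos_other), key=score)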


      Table 5: Merge of two-step RSV and round-robin/normalized score (Title+Description)
               Experiment           Merging strategies      Avg. prec. R-Precision
            UJAMLTDRSV2                   RSV2                0.2774        0.3280
          UJAMLTDRSV2RR            RSV2 and round-robin       0.2758        0.3265
            ujamltdrsv2norm      RSV2 and normalized score    0.2631        0.3162



6     Future work
We have presented a new approach to the problem of merging lists of relevant documents in
CLIR systems. This approach has performed noticeably better than the traditional approaches.
To achieve this performance, it is necessary to align the query and its translations at term level.
Our next efforts are directed towards three aspects:

    • We suspect that with the inclusion of more languages, the proposed method will perform
      better than other approaches. Our objective is therefore to confirm this suspicion.
    • To test the method with other translation strategies. We have a special interest in the
      Multilingual Similarity Thesaurus, since this provides a measure of the semantic proximity
      of two terms. That semantic proximity can be used by our method as the translation
      probability of a term.
    • Finally, we intend to study the effect of pseudo-relevance feedback in the first and second
      phases of the proposed method.


References
 [1] A. Chen. Multilingual Information Retrieval Using English and Chinese Queries. In Carol Pe-
     ters, editor, Proceedings of the CLEF 2001 Cross-Language Text Retrieval System Evaluation
     Campaign. Lecture Notes in Computer Science, pages 44–58. Springer Verlag, 2002.

 [2] S.T. Dumais. Latent Semantic Indexing (LSI) and TREC-2. In NIST, editor, Proceedings of
     TREC’2, volume 500, pages 105–115, Gaithersburg, 1994.
 [3] F. Gey, H. Jiang, A. Chen, and R. Larson. Manual Queries and Machine Translation in
     Cross-language Retrieval and Interactive Retrieval with Cheshire II at TREC-7. In E. M.
     Voorhees and D. K. Harman, editors, Proceedings of the Seventh Text REtrieval Conference
     (TREC-7), pages 527–540, 2000.
 [4] D.A. Hull and G. Grefenstette. Querying across languages: a dictionary-based approach to
     multilingual information retrieval. In Proceedings of the 19th ACM SIGIR Conference on
     Research and Development in Information Retrieval, pages 49–57, 1996.
 [5] K. L. Kwok, L. Grunfeld, and D. D. Lewis. TREC-3 ad-hoc, routing retrieval and thresholding
     experiments using PIRCS. In NIST, editor, Proceedings of TREC’3, volume 500, pages 247–
     255, Gaithersburg, 1995.
 [6] F. Martı́nez-Santiago and L.A. Ureña. Proposal for a Language-Independent CLIR System.
     In JOTRI’2002, pages 141–148, 2002.
 [7] P. McNamee and J. Mayfield. JHU/APL Experiments at CLEF: Translation Resources and
     Score Normalization. In Carol Peters, editor, Proceedings of the CLEF 2001 Cross-Language
     Text Retrieval System Evaluation Campaign. Lecture Notes in Computer Science, pages 193–
     208. Springer-Verlag, 2001.

 [8] A. Moffat and J. Zobel. Information retrieval systems for large document collections. In
     NIST, editor, Proceedings of TREC’3, volume 500, pages 85–93, Gaithersburg, 1995.
 [9] G. Neumann. Morphix software package. http://www.dfki.de/˜neumann/morphix/morphix.html.
[10] A. L. Powell, J. C. French, J. Callan, M. Connell, and C. L. Viles. The impact of database
     selection on distributed searching. In Proceedings of the 23rd International Conference of
     the ACM-SIGIR’2000, pages 232–239, New York, 2000.
[11] S. E. Robertson, S. Walker, and M. Beaulieu. Experimentation as a way of life: Okapi at
     TREC. Information Processing and Management, 1(36):95–108, 2000.
[12] G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill,
     London, U.K., 1983.
[13] J. Savoy. Report on CLEF-2001 Experiments. In Carol Peters, editor, Proceedings of the
     CLEF 2001 Cross-Language Text Retrieval System Evaluation Campaign. Lecture Notes in
     Computer Science, pages 27–43. Springer Verlag, 2001.
[14] P. Sheridan, M. Braschler, and P. Schäuble. Cross-language information retrieval in a
     multilingual legal domain. In Proceedings of the First European Conference on Research
     and Advanced Technology for Digital Libraries, pages 253–268, 1997.
[15] E. Voorhees. The collection fusion problem. In NIST, editor, Proceedings of the Third Text
     REtrieval Conference (TREC-3), volume 500, pages 95–104, Gaithersburg, 1995.