=Paper=
{{Paper
|id=None
|storemode=property
|title=Initial Approaches on Cross-Lingual Information Retrieval Using SMT on User-Queries
|pdfUrl=https://ceur-ws.org/Vol-938/ontobras-most2012_paper2.pdf
|volume=Vol-938
|dblpUrl=https://dblp.org/rec/conf/ontobras/Costa-JussaPW12
}}
==Initial Approaches on Cross-Lingual Information Retrieval Using SMT on User-Queries==
<pdf width="1500px">https://ceur-ws.org/Vol-938/ontobras-most2012_paper2.pdf</pdf>
<pre>
  Initial approaches on Cross-Lingual Information Retrieval
    using Statistical Machine Translation on User Queries
        Marta R. Costa-jussà, Christian Paz-Trillo and Renata Wassermann

                             Computer Science Department
                             1

                        Institute of Mathematics and Statistics
                            University of São Paulo, Brazil
                     Rua do Matão 1010, São Paulo, SP 05508-090
                         {martarcj, cpaz, renata}@ime.usp.br

    Abstract. In this paper we propose a multilingual extension for OnAIR which
    is an ontology-aided information retrieval system applied to retrieve clips from
    a video collection. The multilingual extension basically involves allowing the
    user to search in several languages in a multilingual video collection. Particu-
    larly, the pair of languages we work in this paper are English and Portuguese.
    In order to perform query translation we use a statistical machine translation
    approach. Our experiments show that the multilingual system is capable of
    achieving almost the same quality of that obtained by the monolingual system.

    Resumo. Neste trabalho, propomos uma extensão multilingue para OnAir que
    é um sistema de recuperação de informação auxiliado por uma ontologia. O
    sistema é usado para recuperar clips de uma coleção de vı́deos. A extensão
    multilingue permite ao usuário fazer buscas em duas lı́nguas em uma coleção
    de vı́deo multilingue. Particularmente, o par de lı́nguas que trabalhamos neste
    artigo são Inglês e Português. Para realizar a conversão de consulta, usamos
    uma abordagem estatı́stica de tradução. As nossas experiências mostraram que
    o sistema multilingue é capaz de atingir quase a mesma qualidade do obtido
    pelo sistema monolingue.

1. Introduction
The information society is generating a vast quantity of multilingual information. Re-
cently, there is a growing interest in looking for information in digital videos. Generally,
the user can save time, by avoiding to browse through hours of video in order to find the
information he is looking for. Additionally, these videos may be in a foreign language.
Although he may be able to understand the foreign language, he may not be able to for-
mulate a query. This is the application we are focusing on in this paper in the context
of the OnAIR (Ontology-Aided Information Retrieval) system. OnAIR, started in 2003,
intended to allow users to look for information in video fragments through queries in nat-
ural language. The idea is save the user from the time consuming experience of having to
browse through hours of video in order to find an answer for his questions.
        The main contribution of this paper is the experimentation of concatenating a
state-of-the-art SMT system together with an IR retrieval system that uses ontologies.
This concatenation has been done for the Brazilian-Portuguese/English language pair and
it can be easily be extended to other pair of languages.


                                            25
        The remaining of this paper is organized as follows. Next section briefly explains
the related work in the area of Cross-language Information Retrieval. Section 3 describes
the OnAIR structure and architecture. Then, section 4 is dedicated to the OnAIR cross-
language extension. Finally, experiments and conclusions are reported in sections 5 and
6, respectively.


2. Related Work

The multilingual extension of OnAIR is basically a challenge of cross-language informa-
tion retrieval (CLIR). Given a query in a source language, the aim of CLIR is retrieving
related documents in a target language. (Oard and Diekema 1998) identified four types of
strategies for matching a query with a set of documents in the context of CLIR by: cog-
nate matching, document translation, query translation or interlingua techniques. From
these techniques the most used are the query translation and the interlingua techniques.
         Query translation methods translate user queries to the language that the docu-
ments are written. It is the most popular approach in CLIR experimental systems due
to its tractability and convenience. CLIR through query translation methods has been
mainly faced by using dictionary-based (i.e. using machine-readable dictionaries, MRD),
machine translation (MT) and/or parallel texts techniques (Chen and Bao 2009). Among
the different machine translation techniques, we have the corpus-based techniques such
as statistical or example-based (Way and Gough 2005) and the rule-based techniques
(Forcada 2006). In this paper we are using one of the most popular approaches nowa-
days which is the standard phrase-based statistical machine translation (SMT) approach
(Koehn et al. 2007a).
        Interlingua methods translate both documents and queries into a third repre-
sentation. The approach aims at associating related textual contents among different
languages by means of language-independent semantic representations. The conven-
tional interlingua-based CLIR approach uses latent semantic indexing (LSI) for con-
structing a multilingual vector-space representation of a given parallel document collec-
tion (Deerwester et al. 1990; Dumais et al. 1996; Chew and Abdelali 2007). Such a rep-
resentation is known to be noisy and sparse. That is why in order to obtain more efficient
vector-space representations, space reduction techniques such as latent semantic index-
ing and probabilistic latent semantic indexing (Hofmann 1999) are applied. The new
reduced-space dimensions are supposed to capture semantic relations among the words
and the documents in the collection. Recent approaches have achieved interesting results
by using regression canonical correlation analysis (an extension of canonical correlation
analysis) where one of the dimensions is fixed and demonstrate how it can be solved
efficiently (Rupnik and Shawe-Taylor 2008).


3. The OnAir system

OnAIR is in essence an information retrieval system which has been described in detail
in previous studies such as (Paz-Trillo et al. 2005). In this section we briefly describe the
most relevant characteristics of the system. First, we show how the information retrieval
is done and, second, we show how a monolingual ontology is used for query expansion.


                                             26
3.1. Information Retrieval
OnAIR relies on the vector space model (Baeza-Yates and Ribeiro-Neto 1999)for infor-
mation retrieval. It was built to receive videos and keywords or their transcriptions, with
timeline markers, as input, and to allow the users to query for video excerpts using natural
language. When a user query is presented, OnAIR returns a list of video excerpts that best
answer the user query.
        The video transcriptions are pre-processed, using traditional IR techniques: stem-
ming and stopword removal, then the vector space model is used for indexing and retriev-
ing. As usual in traditional IR systems, some additional techniques are needed to avoid
natural language difficulties like Polysemy and Synonymy.

3.2. Ontology description
Ontologies are defined in general as an explicit specification for a conceptualization
(Gruber 1993). As mainly used for Information Retrieval it can be seen as a set of con-
cepts related by hierarchies and other kind of properties in a specific domain (Ding 2001).
Ontologies have been commonly used in IR through query expansion and conceptual dis-
tance measures (Paz-Trillo et al. 2005).
       A domain ontology related to the topics from the videos is needed to be able to do
the query expansion. By definition, query expansion is the process of reformulating a seed
query to improve retrieval performance in information retrieval operations. In particular,
the domain ontology is used to measure the conceptual distance among seed query terms
and new ones.

4. Cross-lingual extension
In general, a statistical machine translation system relies on the translation of a source
language sentence s into a target language sentence t̂. Among all possible target language
sentences t we choose the one with the highest probability, as show in equation (1):


                              t̂ = arg max [P (t|s)]                                    (1)
                                           t
                                 = arg max [P (t) P (s|t)]                              (2)
                                           t


        The probability decomposition shown in equation (2) is based on Bayes’ theo-
rem and it is known as the noisy channel approach to statistical machine translation
(Brown et al. 1990). It allows to model independently the target language model P (t)
and the source translation model P (s|t). The basic idea of this approach is to segment
the given source sentence s into segments of one or more words, then each source seg-
ment is translated and the target sentence is composed from these segment translations.
On the one hand, the translation model weights how likely words in the foreign language
are translation of words in the source language; the language model, on the other hand,
measures the fluency of hypothesis t. The search process is represented as the arg max
operation.
       The translation model in the phrase-based approach (Koehn et al. 2003) is com-
posed of phrases. A phrase is a pair of m source words and n target words extracted from


                                               27
a parallel sentence that belongs to a bilingual corpus. The parallel sentences have previ-
ously been aligned at the word level (Brown et al. 1993). Then, given a parallel sentence
aligned at the word level, phrases are extracted following the next criteria: we consider
the words that are consecutive in both source and target sides and which are consistent
with the word alignment. We consider a phrase is consistent with the word alignment if
no word inside the phrase is aligned with one word outside the phrase. Finally, phrase
translation probabilities are estimated as relative frequencies (Zens et al. 2002).
        A language model assigns a probability to each target sentence. Standard lan-
guage models are computed following the n-gram strategy, which considers sequences of
n words. In order to compute the probability of an n-gram, it is assumed that the proba-
bility of observing the ith word in the context history of the preceding i-1 words can be
approximated by the probability of observing it in the shortened context history of the
preceding n-1 words. The main problem with this modeling is that it assigns probability
zero to strings that have never seen before. One way to solve this problem is assigning
non-zero probabilities to sentences they have never seen before by means of smoothing
techniques (Kneser and Ney 1995).
       A variation of the so-called noisy channel approach is the log-linear model
(Och and Ney 2002). It allows using several models or so-called features and to weight
them independently as can be seen in equation (3):

                                            "M                   #
                                             X
                             t̂ = arg max         λm hm (s, t)                         (3)
                                       t
                                            m=1


       This equation should be interpreted as a maximum-entropy framework and as a
generalization of equation (2) (Zens et al. 2002).
         Most common additional features that are used in the maximum-entropy frame-
word (in addition to the standard translation and language model) are the lexical models,
the word bonus and the reordering model. The lexical models are particularly useful in
cases where the translation model may be sparse. For example, for phrases which may
have appeared few times the translation model probability may not be well estimated.
Then, the lexical models provide a probability among words (Brown et al. 1993) and they
can be computed in both directions source-to-target and target-to-source. The word bonus
is used to compensate the language model which benefits shorter outputs. The reorder-
ing model is used to provide reordering between phrases. For example, the lexicalized
reordering model (Tillman 2004) classifies phrases by the movement they made relative
to the previous used phrase, i.e., for each phrase the model learns how likely it is fol-
lowed by the previous phrase (monotonous), swapped with it (swap) or not connected at
all (discontinuous).
        The different features or models are optimized in the decoder following the min-
imum error rate procedure (Och 2003). This algorithm searches for weights minimizing
a given error measure, or, equivalently, maximizing a given translation metric. This algo-
rithm enables the weights to be optimized so that the decoder produces the best transla-
tions (according to some automatic metric and one or more references) on a development
set of parallel sentences.


                                            28
5. Evaluation Framework
This section introduces the details of the evaluation framework. We report the translation
and the information retrieval system details including corpus statistics, a description of
how we built the systems and the evaluation details.

5.1. SMT data
The parallel corpus used to train the SMT system is taken from the Brazilian-Portuguese-
English bilingual collections of the online issue of the scientific news Brazilian magazine
REVISTA PESQUISA FAPESP (Aziz and Specia 2011). See statistics in Table 1.

                                                     PT-BR EN
                           Train   Sentences          160k 160k
                                     Words            4,1M 4,3M
                                   Vocabulary        99,5k 74.7k
                       Development Sentences          1375 1375
                                     Words           34.3k 37.6k
                                   Vocabulary         6.8k  5.7k
                          Test     Sentences          1608 1608
                                     Words           36.8k 38.3k
                                   Vocabulary         7.3k  6.2k

               Table 1. Basic characteristics of the SMT experimental dataset.


5.2. IR data
For testing the information retrieval system in Portuguese-Brazilian we used a video col-
lection compiled from interviews with Ana Teixeira, a Brazilian artist. The interviews
were made by Paula P. Braga, the domain expert and there have been used in previous
studies as (Paz-Trillo et al. 2005). The interview was developed in the domain of contem-
porary art and the system uses a domain ontology to expand queries with related terms.
To test the system, a battery of queries was synthesized both for English and Brazilian-
Portuguese. Statistics of these queries and the corresponding documents for retrieving are
shown in Table 2.

                                                     PT-BR EN
                            Query   Number             50   50
                                     Words            349  435
                                   Vocabulary         155  145
                         Documents  Number             48    -
                                     Words            8.2k   -
                                   Vocabulary         2.4k   -

    Table 2. Basic characteristics of the query and documents dataset for the Ana
    Teixerira videos.


                                             29
5.3. Translation system

In this paper, we use a system that combines the translation and the language model
together with the following additional feature functions: the word and the phrase bonus
and the source-to-target and target-to-source lexicon model and the reordering model. All
these features have been described in section 4.
        Our translation system was built using M OSES (Koehn et al. 2007b). We used the
default M OSES parameters. Word alignment (built with the standard software GIZA++
(Och and Ney 2003)) was performed in both direction source-to-target and target-to-
source. These word alignments were merged by using the so-called symmetrization of
the grow-diagonal-final-and which is a sophisticated extension of the standard union op-
eration (Koehn et al. 2005). For the translation model, we used phrases up to length 10.
Phrase probability is estimated including relative frequencies in both directions (source-
to-target and target-to-source), lexical weights and phrase bonus. The lexicalized reorder-
ing (Tillman 2004) is used to provide reordering accross sentences. The language model
used a 5-gram with Kneser-Ney smoothing. Finally, the word bonus was used to compen-
sate the preference of the language model for shorter outputs. All these different features
were combined in equation (3) and the optimization was done using MERT software
(Och 2003).
        In order to evaluate the translation quality, we used BLEU (Bilingual Evaluation
Understudy) (Papineni et al. 2001) which is one of the most popular SMT automatic eval-
uation metrics. BLEU uses a modified form of precision to compare a candidate transla-
tion against multiple reference translations. BLEU’s output is a number between 0 and
1. This value indicates how similar the candidate translation and reference texts are, with
values closer to 1 representing more similar texts.
        We evaluated the SMT quality using in-domain and out-domain tests. The former
is the one corresponding to the REVISTA PESQUISA FAPESP as shown in Table 1.
The out-domain test corresponds to the queries used to test the complete CLIR system as
shown in Table 2. Table 3 shows the results in terms of BLEU of the translation system
when evaluated in-domain and out-domain.

                                  Test    EN->PT-BR
                               In-domain    0.3649
                               Out-domain   0.1506

             Table 3. Evaluation of the translation system in terms of BLEU.


        Coherently      with    international   evaluations   such     as    WMT
(Callison-Burch et al. 2011), the out-domain test set has a lower performance than
the in-domain test set.

5.4. Comparing IR and CLIR system’s performance

We performed the following experiments: two experiments using a monolingual informa-
tion retrieval, recovered from previous publications (Paz-Trillo et al. 2005), and one using
a cross-lingual information system. We describe the corresponding systems as follows:


                                            30
    1. IR system: the original system analyzed was the system described in section
       3, with two configurations: mono-keywords, which uses only the keywords
       for retrieval and; mono-kw-fulltext-05 which uses the results of retrieval using
       keywords and transcriptions, the best configuration for OnAIR as described in
       (Paz-Trillo et al. 2005)
    2. CLIR system (smt-kw-fulltext-05): this system is the concatenation of the statisti-
       cal machine translation system described in the previous section and the informa-
       tion retrieval system from the point above in this list.


                     Figure 1. F-measure for the systems analyzed.


        Figure 1 shows the results of the f-measure run over the 50 queries analyzed in
our experiments in the three configurations presented above and the BLEU measure for
the translation of each query.
        Surprisingly, experiments show that the CLIR system, for specific queries, is ca-
pable of outperforming the IR system. For these queries, the translation system uses a
more adequate word, which means that it would be possible to use machine translation to
perform query expansion. It would be interesting to built the CLIR system with the n-best
translations.
        Figure 2 shows the f-measure in average for all systems that we experimented.
Here, we observe that the f-measure of with respect to the CLIR system (smt-kw-fulltext-
05) is slightly worst than its comparable IR system (mono-kw-fulltext-05). However, in


                                           31
                  Figure 2. Average f-measure for the systems analyzed.


average, the f-measure using SMT is not highly affected when compared to the best mono-
lingual result.
        Finally, Figure 3 shows some translation examples. It shows the input to the CLIR
system (smt-kw-fulltext-05), the corresponding translation and the corresponding refer-
ence (i.e. the input of the IR system). The two first examples report cases where the CLIR
system performs worse than the IR system (mono-kw-fulltext-05) in terms of f-measure.
The second two examples report cases where the CLIR system performs better than the
IR system in terms of f-measure. Coherently, in the first case, the translation shows a
poorer quality than in the second case.

6. Conclusions and future work
This paper has shown an ongoing work that generates a cross-lingual extension for the
OnAIR system, which is in essence an information retrieval system using ontologies to
expand queries. The cross-lingual extension has been done using a state-of-the-art statis-
tical machine translation system. Experiments show that the best configuration for the IR
system uses the results of retrieval using keywords and transcriptions. For the CLIR sys-
tem, we can get competitive results using a state-of-the-art statistical machine translation
system.
       As further work, we want to explore different linguistic and statistical techniques
(focusing on morphology and semantics) to be introduced in the state-of-the-art statistical
MT system in order to correctly translate queries which are out-of-domain of the training
corpus. Also it would be interesting to use MT as a query expansion method.


                                            32
 I NPUT: How did you become an artist?
 T RANSLATION: Como o senhor se um artista?
 R EFERENCE: Como você virou artista
 I NPUT: Do you make only interventions or also paintings, sculpture, etc?
 T RANSLATION: O senhor faz apenas intervenções ou também pinturas, escultura etc?
 R EFERENCE: Você só faz intervenções ou faz também pintura, escultura, etc?
 I NPUT: I loved his work.
 T RANSLATION: Adorei seu trabalho.
 R EFERENCE: Adorei seu trabalho.
 I NPUT: Have you ever exposed abroad?
 T RANSLATION: O senhor já exposta no exterior?
 R EFERENCE: Você já expôs no exterior?

                              Figure 3. Translation examples.


7. Acknowledgements
This work has been supported by FAPESP through the OnAir project (2010/19111-9) and
the visiting researcher program (2012/02131-2), and by the Spanish Ministry of Economy
and Competitiveness through the BUCEADOR project (TEC2009-14094-C04-01) and
the Juan de la Cierva fellowship program.

References
[Aziz and Specia 2011] Aziz, W. and Specia, L. (2011). Fully automatic compilation of
  a Portuguese-English parallel corpus for statistical machine translation. In STIL 2011,
  Cuiabá, MT.
[Baeza-Yates and Ribeiro-Neto 1999] Baeza-Yates, R. and Ribeiro-Neto, B. (1999).
  Modern Information Retrieval. Addison Wesley Longman.
[Brown et al. 1990] Brown, P. F., Cocke, J., Della Pietra, S. A., Della Pietra, V. J., Jelinek,
  F., Lafferty, J. D., Mercer, R. L., and Roossin, P. S. (1990). A Statistical Approach to
  Machine Translation. Computational Linguistics, 16(2):79–85.
[Brown et al. 1993] Brown, P. F., Della Pietra, S. A., Della Pietra, V. J., and Mercer,
  R. L. (1993). The mathematics of statistical machine translation: Parameter estimation.
  Computational Linguistics, 19(2):263–311.
[Callison-Burch et al. 2011] Callison-Burch, C., Koehn, P., Monz, C., and Zaidan, O.
  (2011). Findings of the 2011 workshop on statistical machine translation. In Pro-
  ceedings of the Sixth Workshop on Statistical Machine Translation, pages 22–64, Ed-
  inburgh, Scotland.
[Chen and Bao 2009] Chen, J. and Bao, Y. (2009). Cross-language search: The case of
  google language tools. First Monday, 14(3-2).
[Chew and Abdelali 2007] Chew, P. and Abdelali, A. (2007). Benefits of the passively
  parallel rosetta stone? Cross-Language information retrieval with over 30 languages.
  In Proc of the 45th Annual Meeting of the Association for Computational Linguistics,
  volume 45, page 872.
[Deerwester et al. 1990] Deerwester, S., Dumais, S., Furnas, G. W., Landauer, T. K., and
  Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American
  Society for Information Science, 41(6):391–407.


                                              33
[Ding 2001] Ding, Y. (2001). Ir and ai: The role of ontology. In International Conference
   of Asian Digital Libraries.
[Dumais et al. 1996] Dumais, S. T., Landauer, T. K., and Littman, M. L. (1996). Auto-
   matic cross-linguistic information retrieval using latent semantic indexing. In SIGIR96
   Workshop on Cross-Linguistic Information Retrieval.
[Forcada 2006] Forcada, M. L. (2006). Open-source machine translation: an opportunity
   for minor languages. In Strategies for developing machine translation for minority
   languages (5th SALTMIL workshop on Minority Languages).
[Gruber 1993] Gruber, T. R. (1993). A translation approach to portable ontologies.
   Knowledge Acquisition, 5(2):199–220.
[Hofmann 1999] Hofmann, T. (1999). Probabilistic latent semantic analysis. In Proceed-
   ings of Uncertainty in Artificial Intelligence, UAI99, pages 289–296.
[Kneser and Ney 1995] Kneser, R. and Ney, H. (1995). Improved backing-off for n-gram
   language modeling. In IEEE Inte. Conf. on Acoustics, Speech and Signal Processing,
   pages 49–52, Detroit, MI.
[Koehn et al. 2005] Koehn, P., Axelrod, A., Mayne, A. B., Callison-Burch, C., Osborne,
   M., and Talbot, D. (2005). Edinburgh system description for the 2005 IWSLT speech
   translation evaluation. In Proceedings of the Int. Workshop on Spoken Language Trans-
   lation (IWSLT’05), Pittsburg, USA.
[Koehn et al. 2007a] Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M.,
   Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Con-
   stantin, A., and Herbst, E. (2007a). Moses: Open source toolkit for statistical machine
   translation. In Proceedings of the 45th Annual Meeting of the Association for Compu-
   tational Linguistics (ACL’07), pages 177–180, Prague, Czech Republic.
[Koehn et al. 2007b] Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M.,
   Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Con-
   stantin, A., and Herbst, E. (2007b). Moses: Open source toolkit for statistical machine
   translation. In Proc. of the ACL, pages 177–180, Prague, Czech Republic.
[Koehn et al. 2003] Koehn, P., Och, F., and Marcu, D. (2003). Statistical Phrase-Based
   Translation. In Proc. of the 41th Annual Meeting of the Association for Computational
   Linguistics.
[Oard and Diekema 1998] Oard, D. W. and Diekema, A. R. (1998). Cross-Language
   information retrieval. Annual Review of Information Science and Technology (ARIST),
   33:223–256.
[Och 2003] Och, F. (2003). Minimum Error Rate Training In Statistical Machine Trans-
   lation. In Proc. of the 41th Annual Meeting of the Association for Computational
   Linguistics, pages 160–167.
[Och and Ney 2002] Och, F. and Ney, H. (2002). Dicriminative training and maximum
   entropy models for statistical machine translation. In Proc. of the 40th Annual Meeting
   of the Association for Computational Linguistics, pages 295–302, Philadelphia, PA.
[Och and Ney 2003] Och, F. J. and Ney, H. (2003). A systematic comparison of various
   statistical alignment models. Computational Linguistics, 29(1):19–51.
[Papineni et al. 2001] Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2001). BLEU:
   A Method for Automatic Evaluation of Machine Translation. IBM Research Report,
   RC22176.
[Paz-Trillo et al. 2005] Paz-Trillo, C., Wassermann, R., and Braga, P. P. (2005). An in-
   formation retrieval application using ontologies. J. Braz. Comp. Soc., 11(2):17–31.


                                           34
[Rupnik and Shawe-Taylor 2008] Rupnik, J. and Shawe-Taylor, J. (2008). Multi-
   view canonical correlation analysis and cross-lingual information retrieval. In
   http://videolectures.net/lms08 rupnik rcca/.
[Tillman 2004] Tillman, C. (2004). A Block Orientation Model for Statistical Machine
   Translation. In HLT-NAACL.
[Way and Gough 2005] Way, A. and Gough, N. (2005). Comparing example-based and
   statistical machine translation. Natural Language Engineering, 11(3):295–309.
[Zens et al. 2002] Zens, R., Och, F., and Ney, H. (2002). Phrase-based statistical machine
   translation. In Verlag, S., editor, Proc. German Conference on Artificial Intelligence
   (KI).


                                           35

</pre>