=Paper=
{{Paper
|id=None
|storemode=property
|title=Pivot Strategies as an Alternative for Statistical Machine Translation Tasks Involving Iberian Languages
|pdfUrl=https://ceur-ws.org/Vol-824/paper3.pdf
|volume=Vol-824
}}
==Pivot Strategies as an Alternative for Statistical Machine Translation Tasks Involving Iberian Languages==
Proceedings of the Workshop on Iberian Cross-Language Natural Language Processing Tasks (ICL 2011)
Pivot strategies as an alternative for statistical machine translation tasks involving Iberian languages∗
Estrategias pivote como alternativa a las tareas de traducción automática estadística entre idiomas ibéricos
Carlos Henríquez†, Marta R. Costa-jussà⋆, Rafael E. Banchs‡, Lluis Formiga† and José B. Mariño†
† Universitat Politècnica de Catalunya-TALP
C/Jordi Girona, 08034, Barcelona
{carlos.henriquez,lluis.formiga,jose.marino}@upc.edu
⋆ Barcelona Media Innovation Center
Av Diagonal, 177, 9th floor, 08018 Barcelona, Spain
marta.ruiz@barcelonamedia.org
‡ Institute for Infocomm Research
1 Fusionopolis Way 21-01, Singapore 138632
rembanchs@i2r.a-star.edu.sg
Resumen: Este artículo describe diferentes aproximaciones para construir sistemas de traducción automática estadísticas (SMT por sus siglas en inglés) entre idiomas de escasos recursos paralelos. La estrategia es especialmente interesante para España, un país con tres idiomas oficiales (catalán, vasco y gallego) aparte del castellano, en donde es difícil conseguir corpus paralelo entre cualquiera de los tres primeros pero es comparativamente fácil hacerlo entre castellano y cualquiera de ellos. Tal particularidad nos permite aprovechar el castellano como puente o pivote para construir sistemas que traduzcan entre catalán e inglés, por ejemplo. Estos sistemas son de gran utilidad para los idiomas minoritarios pues ayudan a darles una presencia global y a promover su uso. Como caso de uso, se describe un sistema catalán-inglés siguiendo la estrategia pivote de corpus sintético, la comparamos con una aproximación de cascada y comentamos sobre mejoras adicionales que pudieran implementarse para este par de idiomas en particular.
Palabras clave: idioma pivote, traducción automática estadística, corpus paralelo escaso, cascada, pseudo-corpus, modelos de traducción, frases, n-gramas
Abstract: This paper describes different pivot approaches to build SMT systems for language pairs with scarce parallel resources. The strategy is particularly interesting for Spain, a country with three official languages (Catalan, Basque, and Galician) besides Spanish, where it is difficult to find parallel corpora between any two of the first three languages but relatively easy to collect them between Spanish and any of them. This characteristic allows us to develop machine translation systems from major languages like English into Catalan, for instance, using Spanish as pivot. Such systems help these minority languages by giving them global presence and promoting their use in content collaboration. We describe an English-Catalan baseline system built following the synthetic approach, compare it with the transfer approach and comment on future enhancements that could be implemented for this language pair.
Keywords: pivot language, statistical machine translation, scarce parallel corpora, cascade, pseudo-corpus, phrase-based, ngram-based, translation models
1. Motivation

Spain is a multilingual country with four official languages: Catalan, Euskera (Basque), Galician and Spanish. Catalan is spoken by 11.5 million people, Euskera by 1.2 million people, Galician by 3.2 million people and Spanish by 400 million people. Given the high number of Spanish speakers compared to the other languages, Spanish has far more linguistic and data resources.

The quantity of resources is relevant in statistical machine translation: the more parallel text we have, the better the translation quality. To address the lack of resources, many research works study pivot approaches, which consist of using a pivot language to perform a source-to-target translation (Bertoldi et al., 2008a; Costa-jussà, Henríquez, and Banchs, 2011). For example, in order to translate from Galician to Catalan, we could use Spanish as the pivot language, since there are many more resources for Galician-Spanish and Spanish-Catalan than for Galician-Catalan directly. The same applies when translating Catalan, Euskera or Galician into English. In this work, we introduce a state-of-the-art English-Catalan translation system recently built for the free web translator N-II¹.

The main differences with the Catalan-English SMT system presented in (de Gispert and Mariño, 2006) are that in this paper we use an extended corpus and we propose to build a hybrid system which uses an Ngram-based system for Catalan-Spanish and a phrase-based system for Spanish-English. The Ngram-based system outperforms the phrase-based system in Catalan-Spanish (Farrús et al., 2009), while the opposite occurs for Spanish-English (Costa-Jussà and Fonollosa, 2009). Additionally, for the Catalan-Spanish system we use an even more competitive system that uses rules and statistical features (Farrús et al., 2011).

The remainder of this paper is organized as follows. Section 2 briefly describes the phrase-based and Ngram-based translation approaches. Section 3 presents the pivot approaches used in this paper. Section 4 describes the English-Catalan SMT system. Section 5 compares the pivot strategies in terms of translation quality and Section 6 presents the most relevant conclusions.

∗ The research leading to these results has received funding from the European Community’s Seventh Framework Programme (FP7/2007-2013) under grant agreement 247762 (FAUST) and from the Spanish Ministry of Science and Innovation through the Juan de la Cierva research program and the Buceador project (TEC2009-14094-C04-01).
¹ Available at http://www.n-ii.org

2. Statistical Machine Translation approaches

As mentioned in the previous section, we work with two SMT systems: the phrase-based (Koehn, Och, and Marcu, 2003) and the Ngram-based (Mariño et al., 2006; Casacuberta and Vidal, 2004) systems, which are briefly described below.

2.1. Phrase-based

This approach to SMT performs the translation by splitting the source sentence into segments and assigning to each segment a bilingual phrase from a phrase table. Bilingual phrases are translation units that contain source words and target words, e.g. < unidad de traducción | translation unit >, and have different scores associated with them. These bilingual phrases are then selected to maximize a linear combination of feature functions. This strategy is known as the log-linear model (Och and Ney, 2002) and it is formally defined as:

$$\hat{e} = \arg\max_{e} \left[ \sum_{m=1}^{M} \lambda_m h_m(e, f) \right] \tag{1}$$

where $h_m$ are different feature functions with weights $\lambda_m$. The two main feature functions are the translation model (TM) and the target language model (LM). Additional models include POS target language models, lexical weights, word penalty and reordering models, among others.

Moses (Koehn et al., 2007) was used to build the phrase-based system.

2.2. Ngram-based

The basis of the Ngram approach is the concept of tuple. Tuples are bilingual units with consecutive words both on the source and target side that are consistent with the word alignment. They must provide a unique monotonic segmentation of the sentence pair and they cannot be inside another tuple
in the same sentence. This unique segmentation allows us to see the translation model as a language model, where the language is composed of tuples instead of words. That way, the context used in the translation model is bilingual and implicitly works as a language model with bilingual context as well. In fact, while a language model is required in phrase-based and hierarchical phrase-based systems, in Ngram-based systems it is considered just an additional feature.

This alternative approach to a translation model defines the probability as:

$$P(f, e) = \prod_{n=1}^{N} P\big((f, e)_n \mid (f, e)_{n-1}, \ldots, (f, e)_1\big) \tag{2}$$

where $(f, e)_n$ is the n-th tuple of hypothesis $e$ for the source sentence $f$.

As additional features, we used a Part-Of-Speech (POS) language model for the target side and a target word bonus model.

We used the open source decoder MARIE (Crego, de Gispert, and Mariño, 2005) to build the Ngram-based system.

3. Pivot Approaches

The best approaches to build an SMT system through a pivot language are the cascade system, also known as the transfer approach, and the pseudo-corpus or synthetic approach. Other pivot approaches do not outperform these two (Wu and Wang, 2007; Cohn and Lapata, 2007). The cascade and the pseudo-corpus approaches have been evaluated and compared in works such as (de Gispert and Mariño, 2006; Bertoldi et al., 2008a; Bertoldi et al., 2008b). Consistently, these works have shown that the pseudo-corpus approach is the best performing strategy.

3.1. Cascade or transfer method

This approach considers the language pairs source-pivot and pivot-target independently. It consists of training and tuning two different SMT systems and combining them in a two-step process: first, we translate a source sentence using the source-pivot system; then, we use the resulting sentence as input for the pivot-target translation. A common variation of this strategy, presented in (Khalilov et al., 2008), considers an n-best output instead of the single best during the first translation and then produces an m-best translation in the last step. At the end, m×n-best hypotheses are produced, which are reranked using Minimum Bayes Risk (MBR) (Kumar and Byrne, 2004), allowing the introduction of additional features such as new language models.

3.2. Pseudo-corpus or synthetic approach

Instead of considering the two language pairs independently, this approach produces a single source-target SMT system. Assuming we have a source-pivot and a pivot-target parallel corpus, we build and tune a pivot-target SMT system and use it to translate the pivot part of the source-pivot corpus. This results in a source-target synthetic corpus (hence the name), which is finally used to build the source-target SMT system. For the tuning process, we could also use a synthetic development corpus, but an actual source-target corpus is preferred, if possible. A simple variation of this approach is to build a pivot-source SMT system in order to translate the pivot part of the pivot-target corpus, and use the resulting source-target synthetic corpus to build the final system.

4. Building an English-Catalan SMT using Spanish as pivot

We present an English-Catalan SMT baseline system, using Spanish as the pivot language. In this case, the parallel corpus available for the Catalan-Spanish language pair was provided by the bilingual newspaper “El Periódico”² and the English-Spanish corpus corresponds to the training corpora provided for the WMT 2010 translation task³, i.e. Europarl and News Commentary. We followed the synthetic approach described above to build the final system. Therefore, the Spanish part of the WMT corpus was translated into Catalan and an English-Catalan phrase-based SMT system was built using the resulting synthetic corpus. Table 1 shows a summary of the statistics of both corpora. We also used the Catalan-Spanish baseline together with the Spanish-English baseline system presented at WMT 2010 (Henríquez Q. et al., 2010) to build the other direction and compare the different approaches in it.

² http://www.elperiodico.es
³ http://www.statmt.org/wmt10/translation-task.html
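
The two pivot strategies of Section 3 can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the word-by-word `make_toy_translator` function, its lexicons and all other names are hypothetical stand-ins for trained SMT systems, assumed here only to make the data flow of each strategy explicit.

```python
from typing import Callable, Dict, List, Tuple

# A "translator" here is any function mapping a sentence to a sentence.
Translator = Callable[[str], str]
ParallelCorpus = List[Tuple[str, str]]


def make_toy_translator(lexicon: Dict[str, str]) -> Translator:
    """Hypothetical stand-in for a trained SMT system: word-by-word lookup."""
    def translate(sentence: str) -> str:
        # Unknown words are copied unchanged.
        return " ".join(lexicon.get(word, word) for word in sentence.split())
    return translate


def cascade(source_to_pivot: Translator, pivot_to_target: Translator) -> Translator:
    """Cascade (transfer): translate source->pivot, then pivot->target."""
    def translate(sentence: str) -> str:
        return pivot_to_target(source_to_pivot(sentence))
    return translate


def build_pseudo_corpus(source_pivot: ParallelCorpus,
                        pivot_to_target: Translator) -> ParallelCorpus:
    """Pseudo-corpus (synthetic): translate the pivot side of a source-pivot
    corpus, yielding source-target pairs for training a direct system."""
    return [(src, pivot_to_target(piv)) for src, piv in source_pivot]


# Toy English-Spanish and Spanish-Catalan "systems" (illustrative lexicons only).
en_to_es = make_toy_translator({"good": "buen", "day": "día"})
es_to_ca = make_toy_translator({"buen": "bon", "día": "dia"})

# Cascade: two systems chained at translation time.
en_to_ca_cascade = cascade(en_to_es, es_to_ca)
cascade_output = en_to_ca_cascade("good day")

# Pseudo-corpus: the Spanish side of an English-Spanish corpus becomes Catalan.
english_spanish_corpus = [("good day", "buen día")]
synthetic_english_catalan = build_pseudo_corpus(english_spanish_corpus, es_to_ca)
```

In the cascade, two systems must run for every input sentence, whereas the pseudo-corpus strategy pays the pivot translation cost once, at training time, and then decodes with a single source-target system.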
{| class="wikitable"
|+ Table 1: Catalan-Spanish and Spanish-English corpora (M stands for millions)
! Corpora !! Catalan !! Spanish
|-
| Training sents. || 4.6M || 4.6M
|-
| Running words || 96.94M || 96.86M
|-
| Vocabulary || 1.28M || 1.23M
|-
| Development sents. || 1966 || 1966
|-
| Running words || 46765 || 44667
|-
| Vocabulary || 9132 || 9426
|-
! Corpora !! Spanish !! English
|-
| Training sents. || 1.18M || 1.18M
|-
| Running words || 26.45M || 25.29M
|-
| Vocabulary || 118073 || 89248
|-
| Development sents. || 1729 || 1729
|-
| Running words || 37092 || 34774
|-
| Vocabulary || 7025 || 6199
|-
| Test sents. || 2525 || 2525
|-
| Running words || 69565 || 65595
|-
| Vocabulary || 10539 || 8907
|}

4.1. Spanish-Catalan baseline system

As mentioned before, the Spanish-Catalan SMT system (named N-II) is based on the corpus provided by the bilingual newspaper “El Periódico”. It is an Ngram-based SMT system that includes several improvements specific to the language pair: homonym disambiguation for the Catalan verb ‘soler’ and Catalan possessives, special consideration for pronominal clitics, upper-case words and the Catalan apostrophe, gender concordance, numbers and time categorization, and text processing for common mistakes found when writing in Catalan. The full description can be found in (Farrús et al., 2011).

4.2. English-Catalan system description

Once the Catalan translation of the Spanish section of the WMT corpus was obtained, a phrase-based SMT system was built using Moses as the decoder. Apart from the baseline pipeline, the system also includes a POS target language model computed with TnT (Brants, 2000) and numbers and time categorization similar to N-II. In addition, the parallel corpus was aligned considering the Catalan lemmas computed with Freeling (Padró et al., 2010) and the English stems obtained with Snowball⁴.

⁴ http://snowball.tartarus.org

5. Results

Table 2 shows the BLEU score of the cascade and pseudo-corpus approaches in both directions. The test set was the one provided as internal test set during the WMT translation task. It is also important to mention that the score was computed using one reference.

{| class="wikitable"
|+ Table 2: English-Catalan results
! Pivot approach !! Direction !! BLEU
|-
| Cascade || cat-eng || 21.63
|-
| Cascade || eng-cat || 24.29
|-
| Pseudo-corpus || cat-eng || 23.19
|-
| Pseudo-corpus || eng-cat || 26.97
|}

The final quality of the Catalan-English system is determined by the quality of the Spanish-English corpus, whose baseline has a BLEU around 24 (Henríquez Q. et al., 2010). The Catalan-Spanish baseline has a BLEU around 80 (Farrús et al., 2009). There is also a negative effect due to the difference in domain between the Catalan-Spanish corpus (a regional newspaper) and the Spanish-English corpus (Europarl).

Using paired bootstrap resampling (Koehn, 2004), we can see that for these systems the pseudo-corpus approach is better than the cascade approach with 95% statistical significance.

6. Conclusions and further work

We have presented an English-Catalan SMT system built using Spanish as pivot language, given the scarce resources for English-Catalan.

Similarly to previous research work, we have seen here that, in the particular translation task under consideration, the pseudo-corpus approach constitutes the best strategy for pivot translation. Although the cascade approach clearly performs worse than the pseudo-corpus approach, it could also be beneficial to consider a system combination between these two strategies to further boost the quality of the translations.

Further work should focus on building Spanish-pivot systems between all the official languages and English, as well as among them. The similarities between the languages (except Basque) and the availability of parallel corpora between Spanish and the others encourage the approach.
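
The paired bootstrap resampling test mentioned in Section 5 can be sketched as follows. This is a simplified illustration rather than the evaluation code behind Table 2: it assumes each system has one precomputed quality score per test sentence and compares the sums over each resampled test set, whereas BLEU is a corpus-level metric that would have to be recomputed from n-gram counts on every sample. The function name and parameters are hypothetical.

```python
import random
from typing import Sequence


def paired_bootstrap(scores_a: Sequence[float],
                     scores_b: Sequence[float],
                     n_samples: int = 1000,
                     seed: int = 0) -> float:
    """Return the fraction of resampled test sets on which system A beats B.

    scores_a[i] and scores_b[i] are the (assumed) per-sentence scores of the
    two systems on the same test sentence i.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same test set")
    if not scores_a:
        raise ValueError("the test set must not be empty")

    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_samples):
        # Draw n sentence indices with replacement; the same indices are used
        # for both systems, which is what makes the test "paired".
        sample = [rng.randrange(n) for _ in range(n)]
        total_a = sum(scores_a[i] for i in sample)
        total_b = sum(scores_b[i] for i in sample)
        if total_a > total_b:
            wins += 1
    return wins / n_samples
```

Under these assumptions, a returned value of at least 0.95 would correspond to concluding that system A is better than system B with 95% statistical significance.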
References

Bertoldi, N., R. Cattoni, M. Federico, and M. Barbaiani. 2008a. FBK @ IWSLT-2008. In Proc. of the International Workshop on Spoken Language Translation, pages 34–38, Hawaii, USA.

Bertoldi, Nicola, Madalina Barbaiani, Marcello Federico, and Roldano Cattoni. 2008b. Phrase-Based Statistical Machine Translation with Pivot Languages. In Proceedings of IWSLT.

Brants, T. 2000. TnT – a statistical part-of-speech tagger. In Proc. of the Sixth Applied Natural Language Processing (ANLP-2000), Seattle, WA.

Casacuberta, F. and E. Vidal. 2004. Machine translation with inferred stochastic finite-state transducers. Computational Linguistics, 30(2):205–225.

Cohn, T. and M. Lapata. 2007. Machine Translation by Triangulation: Making Effective Use of Multi-Parallel Corpora. In Proc. of the ACL.

Costa-Jussà, M. R. and J. A. R. Fonollosa. 2009. Phrase and ngram-based statistical machine translation system combination. Applied Artificial Intelligence: An International Journal, 23(7):694–711, August.

Costa-jussà, M.R., C. Henríquez, and R. Banchs. 2011. Evaluación de estrategias para la traducción automática estadística de chino a castellano con el inglés como lengua pivote. In Proc. of the SEPLN, Huelva.

Crego, J.M., A. de Gispert, and J.B. Mariño. 2005. An Ngram-based Statistical Machine Translation Decoder. In Proceedings of 9th European Conference on Speech Communication and Technology (Interspeech).

de Gispert, A. and J.B. Mariño. 2006. Catalan-English Statistical Machine Translation without Parallel Corpus: Bridging through Spanish. In Proc. of LREC 5th Workshop on Strategies for developing Machine Translation for Minority Languages (SALTMIL’06), pages 65–68, Genova.

Farrús, M., M. R. Costa-jussà, J. B. Mariño, M. Poch, A. Hernández, C. Henríquez, and J. A. R. Fonollosa. 2011. Overcoming statistical machine translation limitations: error analysis and proposed solutions for the Catalan-Spanish language pair. Language Resources and Evaluation, 45(2):181–208.

Farrús, M., M. R. Costa-jussà, M. Poch, A. Hernández, and J. B. Mariño. 2009. Improving a Catalan-Spanish statistical translation system using morphosyntactic knowledge. In Proceedings of European Association for Machine Translation 2009.

Henríquez Q., C. A., M.R. Costa-jussà, V. Daudaravicius, R. E. Banchs, and J. B. Mariño. 2010. Using collocation segmentation to augment the phrase table. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 104–108, Uppsala, Sweden, July.

Khalilov, M., M. R. Costa-Jussà, C. A. Henríquez, J. A. R. Fonollosa, A. Hernández, J. B. Mariño, R. E. Banchs, B. Chen, M. Zhang, A. Aw, and H. Li. 2008. The TALP & I2R SMT Systems for IWSLT 2008. In Proc. of the International Workshop on Spoken Language Translation, pages 116–123, Hawaii, USA.

Koehn, P. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of EMNLP, volume 4, pages 388–395.

Koehn, P., H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In ACL ’07: Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180, Morristown, NJ, USA.

Koehn, P., F.J. Och, and D. Marcu. 2003. Statistical phrase-based translation. In Proc. of the 41st Annual Meeting of the Association for Computational Linguistics.

Kumar, S. and W. Byrne. 2004. Minimum bayes-risk decoding for statistical machine translation. In Proceedings of the Human Language Technology and North American Association for Computational Linguistics Conference (HLT/NAACL’04), pages 169–176, Boston, USA, May.

Mariño, José B., Rafael E. Banchs, Josep M. Crego, Adrià de Gispert, Patrik Lambert, José A. R. Fonollosa, and Marta R. Costa-jussà. 2006. Ngram-based Machine Translation. Computational Linguistics, 32(4):527–549.

Och, F. J. and H. Ney. 2002. Discriminative Training and Maximum Entropy Models for Statistical Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL).

Padró, Ll., M. Collado, S. Reese, M. Lloberes, and I. Castellón. 2010. FreeLing 2.1: Five Years of Open-Source Language Processing Tools. In Proceedings of 7th Language Resources and Evaluation Conference (LREC 2010), La Valleta, Malta, May.

Wu, H. and H. Wang. 2007. Pivot Language Approach for Phrase-Based Statistical Machine Translation. In Proc. of the ACL, pages 856–863, Prague.