=Paper= {{Paper |id=Vol-1885/201 |storemode=property |title=MonoTrans: Statistical Machine Translation from Monolingual Data |pdfUrl=https://ceur-ws.org/Vol-1885/201.pdf |volume=Vol-1885 |authors=Rudolf Rosa |dblpUrl=https://dblp.org/rec/conf/itat/Rosa17 }} ==MonoTrans: Statistical Machine Translation from Monolingual Data== https://ceur-ws.org/Vol-1885/201.pdf

J. Hlaváčová (Ed.): ITAT 2017 Proceedings, pp. 201–208
CEUR Workshop Proceedings Vol. 1885, ISSN 1613-0073, c 2017 R. Rosa

MonoTrans: Statistical Machine Translation from Monolingual Data

Rudolf Rosa

Charles University, Faculty of Mathematics and Physics,
Institute of Formal and Applied Linguistics,
Malostranské náměstí 25, 118 00 Prague, Czech Republic
rosa@ufal.mff.cuni.cz

Abstract: We present MonoTrans, a statistical machine ity. Typically, the systems treat the source language words
translation system which only uses monolingual source and the target language words completely independently,
language and target language data, without using any par- usually by representing each word with a unique identi-
allel corpora or language-specific rules. It translates each fier, with source word identifiers and target word identi-
source word by the most similar target word, according to fiers belonging to different domains. Only in case of out-
a combination of a string similarity measure and a word of-vocabulary source words (OOVs), which are not part
frequency similarity measure. It is designed for trans- of the available source language vocabulary and therefore
lation between very close languages, such as Czech and cannot be translated by the system, approaches that do ac-
Slovak or Danish and Norwegian. It provides a low- knowledge potential interlingual word similarity are some-
quality translation in resource-poor scenarios where par- times applied, such as transliteration by a character-based
allel data, required for training a high-quality translation translation model [4]; although, in most cases, OOVs are
system, may be scarce or unavailable. This is useful e.g. simply left untranslated.2
for cross-lingual NLP, where a trained model may be trans- In our work, we create a data-driven MT system for very
ferred from a resource-rich source language to a resource- close languages, based on utilizing only monolingual cor-
poor target language via machine translation. We evaluate pora and a set of heuristics with a high level of language
MonoTrans both intrinsically, using BLEU, and extrinsi- independence. In this way, we target languages which are
cally, applying it to cross-lingual tagger and parser trans- very low on available resources: the only resource we re-
fer. Although it achieves low scores, it does surpass the quire for both the source and the target language is a plain-
baselines by respectable margins. text monolingual corpus, i.e. any text written in that lan-
guage (even a short one). Arguably, this is the lowest pos-
sible requirement to perform any text-based processing of
1 Introduction a language: at least a textual input must be available, oth-
erwise there is nothing to process. The key assumption be-
In machine translation (MT), the most common and most hind our approach is that corresponding words often have
successful approach is to train a translation model from the following two properties:
parallel text corpora, i.e. from a set of bilingual sentence
pairs with corresponding meanings. This approach was pi- • They are similar on the character level, i.e. the string
oneered by the IBM models [2], which led to the develop- similarity of the source word and the corresponding
ment of many phrase-based MT systems, with Moses [8] target word is often high.
being the most well-known and wide-spread one. In re-
cent years, Neural MT [1] is taking the lead, with one of • They appear in the language with a similar frequency,
the main representants being Nematus [15]. Still, all of i.e. the frequency of the source word in a source lan-
these systems rely on parallel corpora as the key resource. guage corpus and the frequency of the target word in
Fortunately, parallel text corpora are a naturally occur- a target language corpus is usually similar.
ring resource. They can be mined from film subtitles, book While these assumptions obviously do not hold in general,
translations, documents published by international institu- we believe that they are mostly valid in case of very close
tions, software localization data, etc. Probably the largest languages (such as Czech and Slovak or Danish and Nor-
freely available collection of parallel data is Opus [18], wegian, which we use in our evaluation).
providing parallel corpora for roughly 100 languages for Our general approach to translating a given source lan-
download,1 comprising many smaller preexisting collec- guage word is to look through all of the target words
tions. However, rough estimates of the number of world’s present in our corpus, and to return the most similar one
languages are in thousands, which means that for the vast
2 This can also be understood as a very rough way of acknowledg-
majority of existing languages, parallel data are not avail-
ing interlingual word similarity, in the sense that it is implicitly assumed
able easily, or not available at all. that the unknown source word may happen to be also used in the target
A common feature of language that is not usually ex- language in an identical form. This assumption may often be true, espe-
ploited in main-stream MT, is interlingual word similar- cially in case of named entities, which constitute a major share of OOVs.
Still, even in such cases, transliteration or similar transformations may
1 http://opus.lingfil.uu.se/ be necessary to obtain the correct target word form.
202 R. Rosa

as the most probable translation, based on the two dimen- a large number of predictors, including both orographic
sions of similarity noted above and described in detail in similarity and frequency similarity as we do,4 and also us-
Section 3. ing contextual similarity, temporal similarity, topic simi-
In practice, an exhaustive search over the full target vo- larity, etc. However, most of these predictors rely on at
cabulary is not viable. Therefore, we introduce a number least small amounts of bilingual data, in the form of paral-
of heuristics to speed up the search, and describe the whole lel corpora, bilingual dictionaries, and/or comparable cor-
translation process, in Section 4. pora; some also require other meta data, such as segmen-
In Section 5.1, we evaluate MonoTrans instrinsically tation of the data into documents, or a time stamp marking
with BLEU [12]. As could be expected, our method deliv- the date of creation of the text. In our work, we omit the
ers a very low quality translation, far beyond the best re- predictors which require such additional data, and focus on
ported translation scores for the evaluated language pairs. fine-tuning the two most resource-light predictors instead
However, our focus is on scenarios where none of the – the string similarity and the frequency similarity.
better-performing approaches are applicable, as neither There is also a handful of older work attempting to con-
parallel data nor a rule-based translation system are avail- struct a bilingual lexicon and/or to perform machine trans-
able. In that regard, the only baseline for us is to leave the lation without parallel corpora, most notably by Koehn and
text untranslated, which we surpass by large margins. Knight [9], Persman and Padó [13], Ravi and Knight [14],
The quality of MonoTrans translation is too low to be and Vulić and Moens [21].
useful when targeting humans; moreover, for a speaker of
the target language, a similar close language is typically
partially intelligible even without translation. In fact, it 3 Interlingual Word Similarity
is exactly this partial cross-lingual intelligibility of similar
languages, common with humans but generally inacces- The key component of MonoTrans is a word similarity
sible to machines, that we want to simulate with Mono- measure, composed of a string similarity simstr and a fre-
Trans. We focus on the task of cross-lingual transfer of quency similarity sim f ; the string similarity is itself com-
trained NLP tools, namely part-of-speech (POS) taggers posed of a Jaro-Winkler-based similarity sim jw∗ and a
and dependency parsers, where even a low-quality trans- length-based similarity siml :
lation can provide the tools with a partial understanding
sim(wsrc , wtgt ) = simstr (wsrc , wtgt ) · sim f (wsrc , wtgt )
of an unknown language and allow them to be applied to
that language, even if their performance will inevitably be simstr (wsrc , wtgt ) = sim jw∗ (wsrc , wtgt ) · siml (wsrc , wtgt )
low. We evaluate MonoTrans extrinsically in this setup in (1)
Section 5.2.
where wsrc and wtgt are the source and target word, respec-
tively. The following subsections provide detailed descrip-
2 Related Work tions of each of these components.

While being a stranger in data-driven machine translation, 3.1 Jaro-Winkler-Based Similarity
interlingual word similarity has often been utilized in rule-
based MT, in particular when translating between similar Our string similarity measure is based on the Jaro-
languages. While rule-based MT has generally been su- Winkler (JW) similarity [22], which has an interesting
perseded by statistical MT, a number of fully rule-based property of giving more importance to the beginnings of
or hybrid3 machine translation systems do exist, such as the strings than to their ends. This nicely suits our setting,
Apertium [5]; for an overview of MT systems for related as in flective languages, most of the inflection usually hap-
languages, see e.g. [20]. Moreover, when focusing on spe- pens at the end of the word, while the beginning of the
cial classes of words, such as technical terminology, sys- word tends to carry more of the lexical meaning. Thus, we
tematic interlingual word similarity can be exploited even expect the JW similarity to give more weight to the simi-
across very distant languages, such as Czech and English, larity of the meanings of the words than to the particular
as investigated already in [7]. However, these systems still inflected forms in which they appear.5
require sets of language-specific rules, large bilingual dic- However, JW similarity does not account for a number
tionaries, and/or parallel corpora, to perform the end-to- phenomena that are common in interlingual word similar-
end translation. To the best of our knowledge, devising a ity. We believe the following two to be of the highest im-
machine translation system for such a low-resource setting portance:
is rather unique.
The somewhat solitary work of Irvine and Callison-
Burch [6] does go extraordinarily far in a similar direction 4 Interestingly, the authors seem to use a measure of frequency sim-

to ours, estimating the correspondence of words based on ilarity very close to ours, although the provided formula (4) seems to be
inverted by mistake, measuring frequency dissimilarity instead.
3 A hybrid MT system is a system which combines rule-based and 5 From another perspective, we could say that JW similarity implic-

statistical components. itly performs a simple soft stemming of its arguments.
MonoTrans: Statistical Machine Translation from Monolingual Data 203

• diacritical marks tend to be cross-lingually inconsis- pair of words is often too high. Therefore, we introduce
tent, and languages are usually intelligible even when an additional penalty for words that differ in length:
diacritics are stripped from the text,
1
• consonants tend to carry more meaning than vowels sim0l (wsrc , wtgt ) = (5)
1 + L · |len(wsrc ) − len(wtgt )|
and tend to be more cross-lingually consistent.
Therefore, we introduce two preprocessing steps that where len(w) is the length of word w. The length impor-
can be employed to simplify the word forms before the tance L is used to put less weight on length similarity than
computation of the JW similarity: transliteration to ASCII, on the other string similarity components; we use L = 0.2.
and devowelling. The length similarity is computed both on the original
The transliteration to ASCII, provided by the words as well as on their devowelled variants:
unidecode Python module,6 maps all characters into
ASCII, trying to replace each non-ASCII character by one siml = sim0l (wsrc , wtgt ) · sim0l (D(wsrc ), D(wtgt )) (6)
“near what a human with a US keyboard would choose”.
However, it does not handle non-alphabetic scripts, such as 3.3 Frequency Similarity
Chinese or Japanese. We denote transliteration of a word
w by T (w). We expect corresponding words to appear with a similar
The devowelling strips all vowel characters, i.e. all frequency in the monolingual corpora. Of course, one of
characters that, after transliteration to ASCII, belong to the them may be several times more frequent than the other,
following group: a, e, i, o, u, y. We denote devowelling of but their frequencies should be similar at least in orders
a word w by D(w). of magnitude. Thus, we compare the logarithms of the
The JW-based similarity sim jw∗ is then computed as a frequencies to calculate the similarity:
multiplication of several components:
1
sim jw∗ (wsrc , wtgt ) = ∏ simJ (wsrc , wtgt ) sim0f (wsrc , wtgt ) =
1 + |log( fwsrc ) − log( fwtgt )|
(7)
J∈{ jw, jwT, jwD, jwDT }
(2)
The first component, sim jw , is the JW similarity without The occurrence frequencies are computed from the mono-
any preprocessing. However, as it is undefined for empty lingual corpora:
words (ε), we modify it slightly:
countcorpus (w) + S
 1 fwcorpus = (8)

 1+len(wsrc ) if wtgt = ε sizecorpus
1
sim jw (wsrc , wtgt ) = 1+len(wtgt ) if wsrc = ε (3)

 0 using a smoothing factor S that allows us to output, with a
sim jw (wsrc , wtgt ) otherwise low probability, even unknown words; we use S = 0.1.
where len(w) is the number of characters in word w, Such a measure seems to be appropriate for corpora of
and sim0jw is the Jaro-Winkler similarity provided by the similar sizes. However, if one of the corpora is signif-
icantly smaller than the other, the frequencies of words
pyjarowinkler Python module.78
in the smaller corpus are somewhat boosted due to the
The following components are the JW similarity of
smaller number of word types appearing in the corpus
transliterated words (sim jwT ), the JW similarity of devow-
among which the total mass is distributed. Therefore, we
elled words (sim jwD ), and the JW similarity of transliter-
downscale the frequencies computed on the smaller corpus
ated and devowelled words (sim jwDT ):
by upscaling its size used in (8):
sim jwT (wsrc , wtgt ) = sim jw (T (wsrc ), T (wtgt )) ( q
sim jwD (wsrc , wtgt ) = sim jw (D(wsrc ), D(wtgt )) |A| · |B| if |A| < |B|
sizeA = |A| (9)
sim jwDT (wsrc , wtgt ) = sim jw (D(T (wsrc )), D(T (wtgt ))) |A| otherwise
(4)
where |X| denotes the number of words in the corpus X.
3.2 Length-Based Similarity We found that the definition of frequency similarity
in (7) does a good job in removing many bad target lan-
A target word that is significantly shorter or longer than a guage candidates; usually these are very infrequent words
given source word is unlikely to be its translation; how- that are by chance string-wise similar to the source word.
ever, we found that the Jaro-Winkler similarity for such a However, we also found that it inappropriately boosts tar-
6 https://pypi.python.org/pypi/Unidecode get words with a low similarity to the source word that
7 https://pypi.python.org/pypi/pyjarowinkler
8 Interestingly, the module method get_jaro_distance does not
by chance have an extremely similar frequency. Thus, we
need to keep the similarity harsh for low values, but soften
provide the Jaro-Winkler distance d jw , but the Jaro-Winkler similarity
1 − d jw ; i.e., a value of 1 corresponds to identical strings, and the value it for high values. Therefore, if the value of sim0f is higher
of 0 to completely dissimilar strings. than a threshold T f , we push the part of it which is above
204 R. Rosa

the threshold down, by multiplying it by a decay fac- source word by the most similar target word from the tar-
tor D f : get word list, based on the similarity measure described
( in the previous section. The translation of each word is
T f + D f · (sim0f − T f ) if sim0f > T f performed independently.
sim f = (10)
sim0f otherwise

We use T f = 0.5 and D f = 0.1. 4.1 Computational Efficiency

In theory, for each input source word, the translation com-
3.4 Discussion ponent could always go through all target words in the tar-
While the similarity measure we use is intended to be get language word list, measure the similarity of the source
language-independent, we do acknowledge that it was word and each candidate target word, and then emit the
hand-tuned particularly on the Czech-Slovak language most similar target word as the translation.
pair, as the authors have a strong knowledge of both of However, this is only feasible in cases where the tar-
these languages, and may not be fully adequate for all lan- get word list is very small, containing hundreds or at most
guage pairs. In particular, we expect it to work best for thousands of words, allowing us to translate each source
flective languages with a preference for word-final inflec- word in a matter of seconds at most. Once the target
tion. For an optimal performance, it should be hand-tuned word list goes into tens of thousands of words and beyond
or machine-tuned on a more diverse set of languages. (which it usually does in our experiments), the translation
The transliteration component is useless for languages times become far too long for an exhaustive search to be
that only use ASCII characters. Moreover, due to its im- practical.
plementation, it cannot handle non-alphabetic languages, Therefore, we introduce a range of heuristics and tech-
such as Chinese or Japanese. nical measures, both lossless and lossy, to sufficiently
Also, we fail to identify systematic differences in the speed up the translation process while trying to keep the
languages, such as “w” in Polish consistently correspond- translation quality as high as possible. We describe the
ing to “v” in Czech or Slovak. We believe that the method most important two of them in the following subsec-
would highly benefit from being able to find such cor- tions; we also use other less interesting measures, such as
respondences automatically in the monolingual data, e.g. caching of method calls.
by exploring the distributions of character unigrams or n-
grams, and/or by employing an EM-like approach to find 4.2 Word List Partitioning
a likely mapping.
Finally, the measure completely ignores the fact that The main speedup comes from a hard partitioning of the
words with similar meanings can be expected to appear in word lists, which is the only lossy procedure we employ.
similar contexts. While reflecting on that fact would most Following our observations in Section 3, we assume that
probably make the computation of the similarity much for a pair of corresponding source and target words:
more complicated and slower, it could allow the method
to detect even corresponding words that are dissimilar ac- • the lengths of the devowelled transliterated words dif-
cording to the string similarity measures. A more viable fer by at most 1,
approach could be to at least account for the fact that a
given target word is likely to appear in similar target con- • the first two characters of the devowelled transliter-
texts, as mediated e.g. by a language model. ated words are identical.

None of these assumptions hold universally, but we be-
4 The Translation System lieve that they do hold for a vast majority of words that
can be translated by our system (i.e. words that are suffi-
The MonoTrans translation system consists of two compo- ciently similar to their target counterparts). Most impor-
nents: a training component, and a translation component. tantly, they let us only go through a tiny part of the target
The training component creates a pair of word fre- word list when translating a source word, bringing a key
quency lists, based on source and target monolingual cor- speedup to the translation system.
pora. Any monolingual corpora can be used for the train- Thus, instead of using a flat word list, the training
ing, with larger corpora leading to better results. The fre- component stores each word in a specific partition, ad-
quency similarity measure works more reliably when the dressed by a compound key, consisting of the first two
source and target corpora are of similar sizes, at least in or- transliterated consonants of the word and the length of the
ders of magnitude. However, if the source language is very transliterated devowelled word. The translation compo-
low on available resources, the input text to be translated nent then only traverses three of these partitions, corre-
can by itself also serve as the only source corpus. sponding to the first two transliterated consonants of the
The translation component then performs a word- source word and the length of the transliterated devow-
based 1:1 monotone translation, trying to translate each elled source word, increased by +1, 0, and -1.
MonoTrans: Statistical Machine Translation from Monolingual Data 205

4.3 Frequency-Based Early Stopping
Table 1: Evaluation of MonoTrans with BLEU.
Even after the partitioning, many of the partitions are too Langs SrcLex MonoTrans Rel.diff.
large to be traversed exhaustively for each matching source cs-sk 10.1 13.1 30%
word. We can deal with that issue thanks to the following sk-cs 10.1 14.8 46%
two observations: da-no 16.8 18.3 8%
• word frequency similarity is a powerful component sv-no 7.7 14.7 90%
of our similarity measure, no-da 16.6 17.7 7%
no-sv 7.7 12.5 63%
• words in a language have a Zipf-like distribution, ca-es 5.5 8.3 51%
with a small number of frequent words and a high es-ca 5.4 7.9 46%
number of rare words. AVG 10.0 13.4 43%

Therefore, we can achieve a significant speedup by or-
dering the words in each partition descendingly by fre- Table 2: Evaluation of MonoTrans with BLEU, using large
quency, and introducing the following early-stopping cri- monolingual corpora for training.
terion: once we reach a target word so infrequent that Langs SrcLex MonoTrans Rel.diff.
its frequency-based similarity to the source word alone is cs-sk 10.1 15.6 55%
lower than the total similarity of the most similar target sk-cs 10.1 17.1 70%
word found so far, we can stop processing the current par- AVG 10.1 16.4 62%
tition, as none of the remaining target words would be able
to surpass the currently best candidate; i.e., we stop once: use the first 10,000 target sentences and the last 10,000
fwtgt 0 ∗
0 < f wsrc ∧ sim f (wsrc , wtgt ) < sim(wsrc , wtgt ) (11) source sentences for training, and then evaluate the source-
to-target translation quality on the last 10,000 sentences;
0 is the current target word candidate, and w∗ is
where wtgt tgt i.e. the source side of the evaluation data is used for train-
the best target word found so far. ing, but the target side, which serves as the reference trans-
As we, by definition, encounter frequent words more lation, is not.
frequently than rare words, this allows us to often skip the Table 1 shows the BLEU scores achieved by Mono-
processing of the long tail of infrequent words; it only gets Trans, compared to the SrcLex baseline, i.e. to performing
processed if the source word is a rare one, or if it is not no translation at all; thanks to the high similarity of the
sufficiently similar to any of the frequent target words. languages, even the baseline achieves a non-trivial trans-
lation score. The BLEU scores are rather low, reach-
5 Evaluation ing 13.4 on average, whereas a state-of-the-art MT sys-
tem trained on large amounts of parallel data could easily
In this section, we evaluate MonoTrans both intrinsically reach scores around 30 BLEU points (or probably even
and extrinsically. However, for the real under-resourced more, provided that the source and target languages are
languages with no parallel data and no annotated corpora very similar). However, we can see a large and consistent
available, there is no way for us to perform an automatic improvement over the baseline of 3.4 BLEU points in av-
evaluation, and a manual evaluation would be difficult to erage. We also report the relative improvement over the
obtain for us. Therefore, as is usual in these scenarios, we baseline which, thanks to the very low scores achieved by
simulate the under-resourced setting by evaluating on pairs the baseline, is very high, reaching up to 90% (for sv-no)
of similar but resource-rich languages, allowing us to use and 43% on average.
standard automatic evaluation measures. Specifically, we Generally, we do not expect very large corpora to be
use the following language groups in our experiments: available for under-resourced languages. Still, to mea-
sure the scaling potential of our method, we also evaluated
• Czech (cs) and Slovak (sk),
MonoTrans trained on significantly larger Czech and Slo-
• Danish (da), Norwegian (no) and Swedish (sv),
vak monolingual corpora. For this experiment, we used
• Catalan (ca) and Spanish (es).
large web corpora, namely CWC for Czech [16] and sk-
Please note that we hand-tuned our method partially based TenTen for Slovak [3]. We used the first 100 millions of
on brief manual inspections of the results on the sk-cs pair. words from each of the corpora for training, i.e. roughly
a thousand times larger datasets, and then evaluated the
5.1 Instrinsic Evaluation translation system on the same OpenSubtitles data as in
the previous experiments.
We first evaluate the quality of the MonoTrans translation The results in Table 2 show that increasing the data size
itself with BLEU [12]. We use the OpenSubtitles2016 sub- improves the translation quality, with the improvement
corpus of the Opus collection [10], which contains trans- over the SrcLex baseline being nearly doubled. However,
lated movie subtitles from the OpenSubtitles website.9 We considering the factor by which we increased the training
9 http://www.opensubtitles.org/ data size, we find the improvement to be rather moderate.
206 R. Rosa

Table 3: Evaluation of MonoTrans with BLEU, using Table 5: Evaluation of MonoTrans in cross-lingual POS
smaller corpora for training; “0” corresponds to SrcLex, tagger transfer, using tagging accuracy.
“10,000” is identical to Table 1.
Langs SrcLex MonoTrans Supervised Err.red.
Training sentences cs-sk sk-cs cs-sk 78.0 82.7 94.1 29%
0 10.1 10.1 sk-cs 70.9 76.7 98.3 21%
10 10.0 10.0 da-no 76.8 78.3 97.0 8%
100 11.3 11.4 sv-no 64.7 72.8 97.0 25%
1,000 12.4 12.3 no-da 78.7 80.4 95.5 10%
10,000 13.1 14.8 no-sv 56.0 72.6 95.1 42%
ca-es 76.7 78.1 96.2 7%
es-ca 69.9 81.1 98.0 40%
Table 4: Examples of outputs of the Czech–Slovak and AVG 71.5 77.8 96.4 23%
Slovak–Czech translation.
Language Text
sk source Mal ho len vystrašit’ aby ho udržal mlčat’.
Table 6: Evaluation of MonoTrans in cross-lingual parser
cs transl. Měl ho loni vystrašit aby ho udržel mlčet. transfer, using LAS.
cs correct Měl ho jen vystrašit aby ho udržel mlčet. Langs SrcLex MonoTrans Supervised Err.red.
sk source Počúvaj, nemám rád prípady, ako je tento, cs-sk 46.6 56.3 68.7 44%
ale som vd’ačný za to, že ste ho chytili. sk-cs 35.9 42.9 73.1 19%
cs transl. Počkají, nemám rád případy, ako je tento, da-no 45.8 49.0 79.4 9%
ale sám vděčný za to, že set ho chytili. sv-no 30.2 40.0 79.4 20%
cs correct Poslouchej, nemám rád případy, jako je tento, no-da 46.3 48.7 71.4 10%
ale jsem vděčný za to, že jste ho chytili. no-sv 24.0 41.7 69.4 39%
cs source Když zemřela, neměla jsem chut’ o tom mluvit. ca-es 44.4 48.2 77.4 11%
sk transl. Kde zomrela, nemala jsem chut’ o tom milovat’. es-ca 39.5 51.3 80.3 29%
sk correct Ked’ zomrela, nemala som chut’ o tom hovorit’. AVG 39.1 47.3 74.9 23%

Next, we tried to downscale the training data instead, The task constitutes of using an annotated corpus of a
training and evaluating the system identically to Table 1 resource-rich source language to train an NLP model in
but using only a part of the training data. Table 3 shows such a way that it can be applied to analyzing a different
that already with 100 monolingual non-corresponding sen- but very similar resource-poor target language.
tences for each of the languages, i.e. very modest data, We loosely follow the approach of Tiedemann et
improvements of over +1 BLEU over the non-translation al. [19], proceeding in the following steps:
baseline can be achieved.
Finally, in Table 4, we show several examples of the in- • Translate the words in an annotated source corpus
puts and outputs of the MonoTrans system, taken from the into the target language by an MT system.
evaluation datasets translated by the systems trained on the
large datasets; the “correct” translation is not the reference • Train a lexicalized model on the resulting corpus.
translation, but a corrected version of the MonoTrans out-
put, and errors are highlighted. We can see many correctly • Apply the model to target language data.
translated words in the outputs, as well as many words cor-
Specifically, we employ the Universal Dependencies (UD)
rectly left untranslated. Moreover, many of the errors can
v1.4 treebanks [11] as the annotated data, MonoTrans as
be easily accounted to the word list partitioning that we
the translation tool, and the UDPipe tagger and parser [17]
employ, making it impossible for MonoTrans to perform
as the models to be trained.
translations such as “len–jen”, “som–jsem”, or “počúvaj–
The MonoTrans system is trained using the word forms
poslouchej”. This suggests that many of the errors are ac-
from the training part of the source treebank and the devel-
tually search errors, not scoring errors, and could be elim-
opment part of the target treebank, and applied to translate
inated if we had a better way of efficiently searching for
the training part of the source treebank into the target lan-
candidate target translations.
guage. Then, the UDPipe tagger and parser are trained
on the resulting corpus; the tagger is trained to predict the
5.2 Extrinsic Evaluation Universal POS tag (UPOS) based on the word form, and
the parser is trained to predict the labelled dependency tree
We also evaluated MonoTrans extrinsically, in the task based on the word form and the UPOS tag predicted by the
of cross-lingual transfer of trained NLP models across tagger. Finally, both the tagger and the parser are applied
closely related languages, inspired by the cross-lingual to the development part of the target language treebank,
parsing shared task of the VarDial 2017 workshop [23]. and evaluated against its gold-standard annotation.
MonoTrans: Statistical Machine Translation from Monolingual Data 207

We report the tagger accuracy in Table 5, and the parser only negligible improvements around +0.1 BLEU. We be-
LAS10 in Table 6. As a baseline, we also include SrcLex, lieve that this is mainly due to the fact that our approach in
i.e. using a tagger and parser trained on an untranslated these preliminary experiments was too rough and simplis-
source treebank, and as an upper bound, we include a su- tic, and that with proper tuning and a more sophisticated
pervised tagger and parser, trained on the training part of implementation, clear improvements may be gained.
the target treebank; as the languages are very similar, the
baselines are quite strong. This allows us to also compute 6.2 Better Searching for Candidate Translations
the error reduction, i.e. the proportion of the gap between
the baseline and the upper bound filled by our method. Based on inspection of the translation outputs, as well as
The taggers reach an average accuracy of 77.8% and the from the examples in Table 4, it is clear that the word
parsers an average LAS of 47.3%, which is not much in ab- list partitioning is way too crude, preventing the system
solute terms – when large parallel data are available, LAS from generating the correct translation in many cases, even
scores around 60% can be reached. However, in relative though its similarity to the source word is sufficiently high.
terms, the scores are rather impressive, obtaining an aver- On the other hand, it is completely impossible for the sys-
age 23% error reduction in both the tagging and the pars- tem to search through all possible translations, and some
ing, and reaching average absolute improvements of +6.3 kind of harsh pruning of the search space is vital.
in tagging accuracy and +8.2 in parsing LAS. Given our As a quick remedy, we tried to use the trigram lan-
setting, we find the results to be wonderful, provided that guage model to generate additional translation candidates.
we only used small monolingual corpora to train the MT Specifically, for each source word we also investigated N
system; in fact, in the target language, we only used the candidate translations taken from N words that are, ac-
evaluation input data, which is probably the lowest imag- cording to the language model, the most likely to follow
inable data requirement. the words selected as translations of the previous words.
With N = 20, we observed a promising improvement of
+0.6 BLEU for cs-sk, while the translation times remained
6 Further Possible Improvements competitive (they doubled). With N = 1000, the improve-
ment for sk-cs further jumped to +1.3 BLEU; however, at
6.1 Language Model Scoring this point, the translation became about 50 times slower
(taking 10 hours to translate 10,000 sentences), showing
As we mentioned in Section 3.4, a clear shortcoming of
that this approach is somewhat promising in terms of trans-
our method is the fact that the translation is performed
lation quality but too computationally demanding. For sk-
in a context-independent way. Employing an n-gram lan-
cs, negligible or no improvements were observed.
guage model is a standard way of getting better machine
An interesting possibility of clustering the search space
translation outputs, only plaintext target-language data are
which was suggested to us is to use a standard clustering
needed to create one, and there already exists a plethora
algorithm, such as k-means or hierarchical k-means, with
of state-of-the-art ready-to-use language modelling tools.
the word similarity used as the distance of the target words.
Therefore, it may seem straightforward to employ a lan-
This is expected to be permissibly fast to compute as well
guage model in MonoTrans as well.
as to allow a sufficiently fast search for translation candi-
However, there is a range of technical issues that need to
dates; however, due to time constraints, we have not been
be overcome. If we were able to generate a translation lex-
able to design an experiment to test that.
icon, we could then even easily plug it into a full-fledged
MT system, such as Moses, easily combining it with a lan-
guage model; however, generating the lexicon would be 7 Conclusion
computationally prohibitive in our case, for reasons men-
tioned in Section 4.1. At best, we could potentially try to We presented MonoTrans, a data-driven translation system
generate a translation lexicon only for the words that ap- trained only on plaintext monolingual corpora, intended
pear in the test data. Moreover, even using a beam search for low-quality machine translation between very similar
in MonoTrans decoding is too costly for us, as it prohibits languages in a low-resource scenario.
the employment of the early stopping mechanisms. We showed that even with very small training corpora
So far, we have only managed to perform a set of pre- available, the system shows respectable performance ac-
liminary experiments, adding a simple trigram language cording to both intrinsic and extrinsic evaluation, consis-
model and using its score as an additional scoring compo- tently surpassing the no-translation baseline by large mar-
nent; as we found that using the score directly had a too gins. Moreover, we showed that the system performance
strong and negative effect on the translations, we weakend scales with larger training data, even though rather slowly.
it by taking its fourth root. When evaluated on the large In particular, when evaluated extrinsically as a compo-
Czech and Slovak corpora in both directions, we observed nent of cross-lingual tagger and parser transfer, employ-
ing MonoTrans leads to high improvements in both tag-
10 Labelled Attachment Score, i.e. the number of correctly predicted ging accuracy and parser LAS with respect to the base-
labelled dependency relations in the output tree. lines, achieving an average 23% error reduction in both of
208 R. Rosa

the tasks when supervised models are taken as the upper [10] Pierre Lison and Jörg Tiedemann. Opensubtitles2016: Ex-
bounds. tracting large parallel corpora from movie and tv subtitles.
In LREC, 2016.
[11] Joakim Nivre et al. Universal dependencies 1.4, 2016. LIN-
Acknowledgments DAT/CLARIN digital library at the Institute of Formal and
Applied Linguistics, Charles University.
This work has been supported by the grant
[12] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing
No. DG16P02B048 of the Ministry of Cul- Zhu. Bleu: a method for automatic evaluation of machine
ture of the Czech Republic, the grant No. translation. In ACL, pages 311–318. Association for Com-
CZ.02.1.01/0.0/0.0/16_013/0001781 of the Ministry putational Linguistics, 2002.
of Education, Youth and Sports of the Czech Republic, [13] Yves Peirsman and Sebastian Padó. Cross-lingual induc-
and the SVV 260 453 grant. This work has been using tion of selectional preferences with bilingual vector spaces.
language resources and tools developed, stored and dis- In HLT-NAACL, HLT ’10, pages 921–929, Stroudsburg,
tributed by the LINDAT/CLARIN project of the Ministry PA, USA, 2010. Association for Computational Linguis-
of Education, Youth and Sports of the Czech Republic tics.
(project LM2015071). We would also like to thank the [14] Sujith Ravi and Kevin Knight. Deciphering foreign lan-
anonymous reviewers and our colleagues from the ÚFAL guage. In ACL-HLT, HLT ’11, pages 12–21, Stroudsburg,
MT group (especially Jindřich Libovický) for helpful PA, USA, 2011. Association for Computational Linguis-
comments and suggestions for improvement. tics.
[15] Rico Sennrich, Orhan Firat, Kyunghyun Cho, Alexandra
Birch, Barry Haddow, Julian Hitschler, Marcin Junczys-
References Dowmunt, Samuel Läubli, Antonio Valerio Miceli Barone,
Jozef Mokry, et al. Nematus: A toolkit for neural machine
[1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. translation. EACL 2017, page 65, 2017.
Neural machine translation by jointly learning to align and [16] Johanka Spoustová and Miroslav Spousta. A high-
translate. arXiv preprint arXiv:1409.0473, 2014. quality web corpus of Czech. In Nicoletta Calzo-
[2] Peter Brown, John Cocke, S Della Pietra, V Della Pietra, lari (Conference Chair), Khalid Choukri, Thierry Declerck,
Frederick Jelinek, Robert Mercer, and Paul Roossin. A Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani,
statistical approach to language translation. In COLING, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, edi-
pages 71–76. Association for Computational Linguistics, tors, LREC, Istanbul, Turkey, may 2012. European Lan-
1988. guage Resources Association (ELRA).
[3] Masaryk University NLP Centre. skTenTen, 2011. LIN- [17] Milan Straka, Jan Hajič, and Jana Straková. UDPipe: train-
DAT/CLARIN digital library at the Institute of Formal and able pipeline for processing CoNLL-U files performing to-
Applied Linguistics, Charles University. kenization, morphological analysis, pos tagging and pars-
[4] Nadir Durrani, Hassan Sajjad, Hieu Hoang, and Philipp ing. In LREC, Paris, France, May 2016. European Lan-
Koehn. Integrating an unsupervised transliteration model guage Resources Association (ELRA).
into statistical machine translation. In EACL, volume 14, [18] Jörg Tiedemann. Parallel data, tools and interfaces in
pages 148–153, 2014. OPUS. In LREC, volume 2012, pages 2214–2218, 2012.
[5] Mikel L Forcada, Mireia Ginestí-Rosell, Jacob Nordfalk, [19] Jörg Tiedemann, Željko Agić, and Joakim Nivre. Treebank
Jim O’Regan, Sergio Ortiz-Rojas, Juan Antonio Pérez- translation for cross-lingual parser induction. In CoNLL,
Ortiz, Felipe Sánchez-Martínez, Gema Ramírez-Sánchez, 2014.
and Francis M Tyers. Apertium: a free/open-source plat- [20] Jernej Vičič, Vladislav Kuboň, and Petr Homola. Česílko
form for rule-based machine translation. Machine transla- goes open-source. The Prague Bulletin of Mathematical
tion, 25(2):127–144, 2011. Linguistics, 107(1):57–66, 2017.
[6] Ann Irvine and Chris Callison-Burch. End-to-end statistical [21] Ivan Vulic and Marie-Francine Moens. A study on boot-
machine translation with zero or small parallel texts. Jour- strapping bilingual vector spaces from non-parallel data
nal of Natural Language Engineering, 22:517–548, 2016. (and nothing else). In EMNLP, pages 1613–1624. ACL,
[7] Zdeněk Kirschner. On a device in dictionary operations 2013.
in machine translation. In COLING, COLING ’82, pages [22] William E. Winkler. String comparator metrics and en-
157–160, Czechoslovakia, 1982. Academia Praha. hanced decision rules in the fellegi-sunter model of record
[8] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris linkage. In Proceedings of the Section on Survey Research
Callison-Burch, Marcello Federico, Nicola Bertoldi, Methods (American Statistical Association), pages 354–
Brooke Cowan, Wade Shen, Christine Moran, Richard 359, 1990.
Zens, et al. Moses: Open source toolkit for statistical ma- [23] Marcos Zampieri, Shervin Malmasi, Nikola Ljubešić,
chine translation. In ACL, pages 177–180. Association for Preslav Nakov, Ahmed Ali, Jörg Tiedemann, Yves Scher-
Computational Linguistics, 2007. rer, and Noëmi Aepli. Findings of the VarDial evaluation
[9] Philipp Koehn and Kevin Knight. Learning a translation campaign 2017. In VarDial, Valencia, Spain, 2017.
lexicon from monolingual corpora. In ULA, ULA ’02,
pages 9–16, Stroudsburg, PA, USA, 2002. Association for
Computational Linguistics.