Improving Sinhala-Tamil Translation through Deep Learning Techniques

A. Arukgoda1 [0000-0001-5953-9332], A. R. Weerasinghe2 [0000-0002-1392-7791], and R. Pushpananda3 [0000-0001-9082-1280]

1 University of Colombo School of Computing, Colombo, Sri Lanka. anupama.arukgoda@gmail.com
2 University of Colombo School of Computing, Colombo, Sri Lanka. arw@ucsc.cmb.ac.lk
3 University of Colombo School of Computing, Colombo, Sri Lanka. rpn@ucsc.cmb.ac.lk

Abstract. Neural Machine Translation (NMT) is currently the most promising approach to machine translation. However, because NMT is data hungry, many low-resourced language pairs still struggle to apply it and produce intelligible translations. The lack of a large parallel corpus becomes an even more significant barrier when the language pair is morphologically rich and the corpora are multi-domain: morphologically rich languages inherently have large vocabularies, and inducing a model over such a vocabulary requires many more example parallel sentences to learn from. In this research we investigated translation from and into Sinhala and Tamil, a morphologically rich and low-resourced language pair, and explored how well different techniques proposed in the literature suit this pair. Through the course of our experiments we obtained a statistically significant improvement of approximately 11 BLEU points for Tamil to Sinhala translation and 7 BLEU points for Sinhala to Tamil translation over our baseline systems. In the process we also designed a new language-independent technique that performs well even when the amount of monolingual data is limited and that, given two languages, lets the improvement in one translation direction support the other direction.

Keywords: Neural Machine Translation (NMT) · Low-resource translation · Sinhala · Tamil.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Neural Machine Translation (NMT) represents a significant step forward over the basic statistical approach and is therefore considered the state of the art for machine translation. NMT provides a direct translation mechanism from the source language to the target language with stronger generalization power, whereas in Statistical Machine Translation (SMT) the translation pipeline consists of two main components, a translation model and a language model, which are trained separately and combined later [3]. Moreover, NMT systems are capable of modeling longer dependencies thanks to the recurrent neural network (RNN) encoder-decoder model. One of the inherent drawbacks of NMT, however, is that it requires large parallel corpora, which limits its applicability to under-resourced languages.

Sinhala and Tamil are the national languages of Sri Lanka. Both are morphologically rich, both are low resourced (only a limited number of parallel corpora are available), and both have limited or no publicly available linguistic resources, such as POS taggers and morphological analyzers, that could support translation. These properties make the task of translating between Sinhala and Tamil more challenging.
In early research on SMT for Sinhala and Tamil translation [18], it was shown that, due to the co-evolution of the Sinhalese and Tamil communities in Sri Lanka, the linguistic distance between Sinhala and Tamil is smaller than that between Sinhala and English, making translation between Sinhala and Tamil theoretically easier than between Sinhala and English. In addition, Sinhala and Tamil are syntactically similar, which also provides flexibility in word order. These commonalities increase the feasibility of translating between them. The best currently available open-domain Sinhala-Tamil translator is based on SMT [8] (hereafter referred to as the SMT study); that work covers only the Tamil to Sinhala translation direction. We employed exactly the same parallel corpus used in the SMT study so that the performance of SMT and NMT could be compared fairly on the same data.

This paper makes two main contributions. First, having identified the two main challenges of translating this language pair, we explored suitable techniques to address them and compared the observed results with the accuracy reported by the SMT study on the same parallel corpus; to the best of our knowledge, such a detailed analysis of improving open-domain Sinhala-Tamil translation with NMT has not been conducted before. Second, we consistently observed that Tamil to Sinhala translation performs better than Sinhala to Tamil translation, which prompted us to propose a more general-purpose, language-independent method that uses the accuracy improvement in one translation direction to improve the other direction, and which can be explored and adopted by other low-resourced language pairs.

2 Literature Review

2.1 Neural Machine Translation

Machine translation is a sequence prediction problem: not only are both input and output sequences, they are sequences of different lengths, which makes the task more challenging. The pioneering work of [14] presented an end-to-end sequence learning approach that makes minimal assumptions about sequence length and structure, outperforming traditional phrase-based translation systems. The most popular architecture for NMT is the encoder-decoder architecture. This model predicts each target word based on the context vectors associated with the source positions and all previously generated target words. This is the traditional NMT architecture, and since its introduction much research has been conducted to improve it [1].

2.2 Translating Morphologically Rich Languages

A morphologically rich language (MRL) is one in which grammatical relations such as tense, number, predicate, gender and age [15] are indicated by changes to the word forms rather than by relative position or the addition of particles. Dealing with morphologically rich languages is an open problem in language processing, as the complexity of the word forms inherent to these languages makes translation difficult. Most machine translation systems are trained with a fixed vocabulary, yet translation is an open-vocabulary problem, so dealing with out-of-vocabulary (OOV) and rare words is unavoidable.
If the languages being translated are also low resourced (the parallel corpora are small) and the corpora are multi-domain, this problem worsens because the vocabulary grows and the number of word senses increases. Hence, translation mechanisms that go below the word level, such as Byte-Pair Encoding (BPE) [12] and Morfessor [13], have been proposed.

2.3 Translating Low-Resourced Languages

As with many other deep learning tasks, the success of NMT depends strongly on the availability of large parallel corpora. Since this is a luxury many languages (especially minority languages) do not have, many techniques have been proposed over the years to address it. One such technique is to incorporate monolingual corpora: monolingual sentences are translated with a translator trained in the backward direction, creating synthetic parallel sentences and thereby enlarging the overall parallel corpus [21, 10]. The intuition is that, even though it is difficult to obtain a large parallel corpus for two languages, it is much easier to obtain large monolingual corpora for each language separately. This technique has been applied to back-translation of both source-side monolingual corpora [21] and target-side monolingual corpora [10]. While this has paved the way to improving the translation quality of low-resourced languages by making maximum use of both parallel and monolingual corpora, it has also been shown empirically that such models tend to 'forget' the correct semantics of translation if trained on much more synthetic than authentic parallel data [7], which imposes a constraint on the amount of monolingual data that can be used.

Another reason for the popularity of back-translation is that it requires no changes to the network architecture. Techniques have therefore been introduced to improve the quality of the back-translator, since it is itself an imperfect MT system. Imankulova et al. [5] propose a filtering technique that keeps only the back-translated synthetic sentences of the highest quality, which improved the final translation quality and led to higher BLEU scores.

2.4 Sinhala-Tamil Translation

The best Sinhala-Tamil translator to date was produced by the most recent research on this morphologically rich language pair [8] and is based on statistical machine translation. To overcome the issues related to morphological richness, the authors integrated Morfessor, the unsupervised morphological segmentation approach suggested in earlier research on Sinhala morphological analysis [19]. This resulted in dramatic improvements in the quality and reliability of Sinhala-Tamil translation.

3 Methodology

In this research we first treated the first challenge of translating between Sinhala and Tamil, the morphological richness of both languages, by exploring the impact of two sub-word segmentation techniques; this reduced the vocabulary size of the corpus and mitigated the OOV problem. Next, we explored the suitability of two back-translation techniques that use our open-domain monolingual corpora to increase the corpus size, thereby treating the second main challenge and improving translation accuracy further. We constrained ourselves mainly to back-translation techniques because they make maximum use of our open-domain parallel and monolingual corpora.
Since our baseline SMT study used the same corpora, this also allowed a fair comparison between SMT and NMT in the context of Sinhala and Tamil. Finally, we proposed a novel technique called Incrementally Filtered Back-Translation and explored its impact on the translation of Sinhala and Tamil.

4 Experimental Setup

4.1 Data-set Details

For our experiments we used a parallel corpus of approximately 25,000 sentences with sentence lengths between 8 and 12 words, collected in the SMT study [8]. Figure 1 shows an example parallel sentence pair from this corpus. For the back-translation experiments, a 10-million-word Sinhala monolingual corpus [16] and, on the Tamil side, a 4.3-million-word Sri Lankan Tamil monolingual corpus [17] were used. From these original monolingual corpora, sentences of 8 to 12 words were extracted for our work. Both corpora are suitable for open-domain translation, as they were collected from different domains such as newspaper articles, technical writing and creative writing. The statistics of the parallel and monolingual corpora are provided in Tables 1 and 2 respectively.

Fig. 1. An example Sinhala and Tamil parallel sentence pair

Table 1. Parallel Corpus Statistics

  Corpus Statistics            Sinhala    Tamil
  Sentence Pairs                    26,187
  Vocabulary Size (V)           38,203    54,543
  Total number of words (T)    262,082   227,486
  V/T %                          14.58     23.98

Table 2. Monolingual Corpus Statistics

  Corpus Statistics            Sinhala      Tamil
  Number of Sentences          180,793     40,453
  Vocabulary Size              154,782     65,228
  Total number of words      1,577,921    352,813

4.2 Pre-Processing

We first obtained three different representations of our original corpora. The first is the original full word-form corpus (i.e., the corpus as it is, without any pre-processing). The second is the corpora segmented into morpheme-like units. For this segmentation we used the tool Morfessor 2.0, which provides an unsupervised morpheme segmentation algorithm that aims to generate the most probable segmentation of words into prefix, suffix and stem, relying on the Minimum Description Length principle and using only the words of an un-annotated (raw) corpus. Figure 2 shows a few example outputs from Morfessor 2.0.

Fig. 2. Examples of unsupervised morphological decomposition with Morfessor 2.0

To make post-processing easier after the sub-words are translated, we introduced the special marker '@@' between the sub-words so that word boundaries are preserved. After translation, the sub-words can be restored to the original word forms by merging each sub-word carrying the marker with the sub-word immediately following it.

The third representation was obtained by pre-processing the full-word corpora with the Byte-Pair-Encoding (BPE) algorithm. This algorithm requires tuning the number of merge operations, an input parameter that depends solely on the language and the data set. We empirically chose 750 merge operations for the Sinhala to Tamil direction and 1,000 for the Tamil to Sinhala direction.

Figure 3 shows an example Sinhala sentence pre-processed with each technique.

Fig. 3. The three pre-processing techniques
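The '@@' handling described above is easy to reproduce. The sketch below is a minimal illustration (not the authors' exact scripts) of the convention: every non-final sub-word of a word carries the marker, so the original word forms can be restored after translation with a single merge step. The toy strings, file names and the toolkit commands in the comments are assumptions about typical usage, not taken from the paper.

```python
# Minimal sketch of the '@@' sub-word convention used above (illustrative only).
# The segmented corpora themselves would come from Morfessor 2.0 or BPE, e.g.
#   subword-nmt learn-bpe -s 1000 < train.si > bpe.codes.si
#   subword-nmt apply-bpe -c bpe.codes.si < train.si > train.bpe.si
# (hypothetical file names; commands shown as an assumption of typical usage).

def desegment(line: str) -> str:
    """Merge '@@'-marked sub-words back into full word forms after translation."""
    return line.replace("@@ ", "")

# Toy example with Latin placeholders instead of Sinhala/Tamil script:
segmented = "un@@ believ@@ able translation@@ s"
print(desegment(segmented))   # -> "unbelievable translations"
```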
4.3 Baseline Model

To compare our results, we established two benchmarks. The first benchmark is the translation accuracy reported for Tamil to Sinhala translation with SMT in [8]. We used exactly the same parallel corpus as the SMT study in order to compare the performance of NMT against SMT on the same data. The SMT study was conducted only in the Tamil to Sinhala direction and reported a translation accuracy of 13.11 BLEU points. Since we are interested in both translation directions, we used as a second baseline a network with the architecture described under System Setup, trained on our 25,000-sentence full word-form parallel corpus.

4.4 System Setup

We used the OpenNMT framework [6] for the experiments. We initiated our experiments with a 2-layer Bidirectional Recurrent Neural Network (BRNN) with 500 hidden units on both the encoder and the decoder. To speed up training, a GeForce GTX 1080 Ti GPU with 16 GB of GPU memory was used. The sole measure of translation accuracy throughout the experiments was the Bilingual Evaluation Understudy (BLEU) score. The reported BLEU scores were obtained with 3-fold cross-validation to reduce bias.

4.5 Manipulating the Network

Simplifying the network. After translating the three differently pre-processed parallel corpora, we chose the best pre-processing technique among them. Next, we simplified the network by using a Google Neural Machine Translation (GNMT) encoder [20] instead of the BRNN encoder used in the experiments so far. A BRNN encoder has bidirectional connections (processing each sentence from left to right as well as from right to left) between the neurons in each layer, whereas in the GNMT encoder only the first layer is bidirectional and the remaining layers are unidirectional RNN layers. The bidirectional states of the first layer are concatenated and fed, with residual connections, to the following unidirectional layers.

Checkpoint Smoothing. We went a step further to improve the BLEU scores by using an ensemble technique. Traditionally, ensemble methods are learning algorithms that combine multiple individual models to create a learner better than any of its parts. Checkpoint smoothing is an ensemble technique that uses only a single training process [4]: rather than using the model from the final epoch, the parameters of the models from multiple epochs are averaged and translation is performed with the averaged model. Such an averaged model is expected to produce better translations.

4.6 Back-Translation

Naive Back-Translation. As proposed in [10], we trained a back-translator using the authentic parallel sentences and iteratively added synthetic parallel sentences, obtained by translating randomly selected target-side monolingual sentences, until the ratio of authentic to synthetic parallel sentences reached 1:3 for Tamil to Sinhala translation and 1:2 for Sinhala to Tamil translation.
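The naive back-translation step can be summarised in a few lines of Python. The sketch below is a simplified illustration under assumptions that are not in the paper: `back_translate` stands in for the reverse-direction NMT model (in practice a trained OpenNMT model invoked through its own tooling), and the sampling and ratio handling are reduced to their essentials.

```python
import random

def naive_back_translation(auth_src, auth_tgt, mono_tgt, back_translate, ratio=3, seed=1):
    """Augment an authentic parallel corpus with synthetic pairs so that the
    authentic:synthetic ratio becomes 1:`ratio` (a rough sketch, not the
    authors' exact pipeline).

    auth_src, auth_tgt : lists of authentic source/target sentences
    mono_tgt           : target-side monolingual sentences
    back_translate     : callable translating a target sentence into the source language
    """
    random.seed(seed)
    n_synth = min(len(mono_tgt), ratio * len(auth_tgt))
    sample = random.sample(mono_tgt, n_synth)               # randomly selected monolingual sentences
    synth_src = [back_translate(sent) for sent in sample]   # synthetic source side
    return auth_src + synth_src, auth_tgt + sample          # enlarged training corpus
```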
Filtered Back-Translation. As proposed by Imankulova et al. [5], we filtered the synthetic parallel sentences obtained by translating the target-side monolingual corpus, using the sentence-level BLEU score as the similarity metric, and iteratively added the synthetic parallel sentences with the best BLEU scores until the ratio of authentic to synthetic parallel sentences reached 1:3 for Tamil to Sinhala translation and 1:2 for Sinhala to Tamil translation.

Throughout the experiments we observed that the Tamil to Sinhala direction performed significantly better than the Sinhala to Tamil direction. This encouraged us to design a technique that, given two languages, can use the improvement in one translation direction to improve the accuracy of the other direction, and vice versa. As a solution we introduce the algorithm presented as Incrementally Filtered Back-Translation in Algorithm 1, with its filtering step given in Algorithm 2.

Algorithm 1: Incrementally Filtered Back-Translation
  Input: authentic parallel sentences (auth-parallel), monolingual sentences of
         language-1 (mono-lang1), monolingual sentences of language-2 (mono-lang2), k = 1
   1  Let src = language-1
   2  Let tgt = language-2
   3  Let θ→ = model trained from src to tgt with auth-parallel
   4  Let θ← = model trained from tgt to src with auth-parallel
   5  Let D = auth-parallel
   6  repeat
   7      filtered-synthetic-parallel = Filter(θ→, θ←)       // call the Filter step given in Algorithm 2
   8      D = D ∪ filtered-synthetic-parallel                 // combine the filtered synthetic sentences with the authentic corpus
   9      θ→ = θ←
  10      θnew = model trained on D from src to tgt
  11      θ← = θnew
  12      src = language-2
  13      tgt = language-1
  14  until convergence-condition ∨ |mono-tgt| = 0
  15  return the newly updated model θnew

Algorithm 2: Filter(θ→, θ←)
  Input: all variables are shared with Algorithm 1
   1  Obtain synthetic source sentences (synth-src) by translating mono-tgt with θ←
   2  Obtain synthetic target sentences (synth-tgt) by translating synth-src with θ→
   3  Compute BLEU(mono-tgt, synth-tgt)                       // sentence-level BLEU of synth-tgt against mono-tgt
   4  Sort mono-tgt in descending order of the BLEU score
   5  Choose the first x sentences of mono-tgt (x-mono-tgt) and the corresponding synthetic
      source sentences (x-synth-src) such that the ratio of authentic to synthetic parallel
      sentences is 1:k
   6  Create a pseudo-parallel corpus S = {x-synth-src, x-mono-tgt}
   7  mono-tgt = mono-tgt − x-mono-tgt                        // remove the chosen top-x sentences from the pool
   8  k = k + 1
   9  return S

4.7 Incrementally Filtered Back-Translation

With this new algorithm, translation in the two directions is improved in parallel. The first step is similar to filtered back-translation. However, unlike filtered back-translation, which uses two fixed models trained in each direction on the authentic parallel corpus, this technique uses the models updated in each iteration to create and filter the synthetic parallel sentences, so that the synthetic sentences added in each iteration are of increasingly better quality. The accuracy improvement obtained in one translation direction is thus used to improve the translation in the other direction. The algorithm runs until the improvement in BLEU score on the same test set between two consecutive iterations becomes insignificant (the convergence condition), or until all sentences of the target-language monolingual corpus have been consumed.
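To make the filtering step concrete, the following is a rough Python sketch of Algorithm 2 under a few assumptions that are not part of the paper: `backward` and `forward` are plain callables wrapping the current target-to-source and source-to-target models, and sacrebleu's `sentence_bleu` is used as one possible sentence-level BLEU implementation. The outer loop of Algorithm 1 would call this function, retrain on the enlarged corpus, swap the two directions and increment k until convergence.

```python
from sacrebleu import sentence_bleu  # one possible sentence-level BLEU implementation

def filter_step(mono_tgt, backward, forward, n_authentic, k):
    """Sketch of Filter(θ→, θ←) from Algorithm 2 (simplified, illustrative only).

    mono_tgt    : remaining target-language monolingual sentences
    backward    : callable, target -> source translation with θ←
    forward     : callable, source -> target translation with θ→
    n_authentic : number of authentic parallel sentences
    k           : current authentic:synthetic ratio (1:k)
    """
    synth_src = [backward(t) for t in mono_tgt]       # step 1: back-translate mono-tgt
    round_trip = [forward(s) for s in synth_src]      # step 2: translate back to the target language
    scored = sorted(                                   # steps 3-4: rank by round-trip sentence BLEU
        zip(mono_tgt, synth_src, round_trip),
        key=lambda x: sentence_bleu(x[2], [x[0]]).score,
        reverse=True,
    )
    x = min(len(scored), k * n_authentic)              # step 5: keep the 1:k ratio (simplified)
    chosen, rest = scored[:x], scored[x:]
    pseudo_parallel = [(s, t) for t, s, _ in chosen]   # step 6: pseudo-parallel corpus S
    remaining = [t for t, _, _ in rest]                # step 7: shrink the monolingual pool
    return pseudo_parallel, remaining                  # (k is incremented by the caller, step 8)
```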
5 Evaluation

The BLEU scores obtained with the three representation forms are shown in Table 3.

Table 3. BLEU scores of the three pre-processing techniques

  Representation                 Tamil-Sinhala   Sinhala-Tamil
  SMT (Baseline 1)                   13.11            -
  Full-word form (Baseline 2)         5.41           2.47
  Morfessor                           9.17           6.06
  BPE                                10.01           6.41

The initial results with the full word-form representation were 5.41 BLEU for Tamil to Sinhala and 2.47 for Sinhala to Tamil, which was discouraging. The translations did not preserve the semantics of the reference sentences: only a few words in each sentence were translated correctly, and these did not contribute to the underlying meaning of the sentence.

As can be seen in Table 3, both sub-word segmentations performed drastically better than the full word-form baseline. With Morfessor the improvement is 3.76 BLEU points for Tamil to Sinhala and 3.59 for Sinhala to Tamil; BPE improves over the full word form by 4.6 BLEU points for Tamil to Sinhala and 3.94 for Sinhala to Tamil. To explain this improvement we analysed the OOV% of the parallel corpus under each representation; the analysis is given in Table 4. The OOV% drops drastically with the sub-word segmentation techniques, meaning that sub-word segmentation increases the coverage of the model, i.e., the number of words the model effectively sees during training. This positively affects translation and results in higher BLEU scores.

Table 4. OOV% Analysis

  Representation     OOV% (Tamil-Sinhala)   OOV% (Sinhala-Tamil)
  Full-word form            34.46                  24.54
  Morfessor                  6.27                   2.60
  BPE                        0                      0

Another observation from Table 3 is that BPE performs better than Morfessor for Sinhala and Tamil. One reason could be that, as Table 4 shows, BPE decreases the OOV% much more than Morfessor. We also observed that, when pre-processed with Morfessor, more words were segmented into proper linguistic morphemes than with BPE (as can also be seen in Figure 3). Similar observations were made by Banerjee et al. [2] for the Bengali-Hindi language pair; they show empirically that for linguistically close languages BPE performs better than Morfessor. This leads to a further conclusion: since linguistically similar languages like Sinhala and Tamil benefit even though the sub-units are not segmented into proper morphemes, a morphological analyzer is not needed for NMT translation tasks. We can therefore focus future efforts on improving the quality of the newly generated words produced after BPE pre-processing, for example by incorporating a language model.
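The OOV figures in Table 4 can be reproduced with a simple count. The sketch below shows one plausible reading of OOV% (the percentage of evaluation-set tokens whose surface form never occurs in the training data); the paper does not spell out its exact formula, and the file names are assumptions.

```python
def oov_rate(train_path: str, test_path: str) -> float:
    """Percentage of test tokens unseen in the training data
    (one plausible definition of the OOV% reported in Table 4)."""
    with open(train_path, encoding="utf-8") as f:
        train_vocab = {tok for line in f for tok in line.split()}
    with open(test_path, encoding="utf-8") as f:
        test_tokens = [tok for line in f for tok in line.split()]
    oov = sum(1 for tok in test_tokens if tok not in train_vocab)
    return 100.0 * oov / max(1, len(test_tokens))

# Applied to the BPE-segmented files, this should approach 0%, since every word
# decomposes into sub-word units already present in the training vocabulary.
```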
The subsequent experiments were continued using BPE pre-processing. When we simplified the network by replacing the BRNN encoder with a GNMT encoder, the translation accuracy improved further, by 0.56 BLEU points for Tamil to Sinhala and 0.53 for Sinhala to Tamil, as shown in Table 5.

Table 5. BLEU scores after manipulating the network

  Technique                Tamil-Sinhala   Sinhala-Tamil
  GNMT Encoder                 10.57            6.94
  +Checkpoint Smoothing        11.76            7.51

We noticed that the model with the BRNN encoder had almost twice as many parameters (22,683,128) as the model with the GNMT encoder (11,104,741). When the data set is small, we cannot afford to fit models with too high a degree of freedom (too many parameters); the move to a simpler model is what led to the improvement in translation quality. Furthermore, the ensemble of models from multiple epochs used to create the averaged model increased the generalization power of the models, resulting in a significant further improvement of 1.19 BLEU points for Tamil to Sinhala and 0.57 for Sinhala to Tamil.

The experiments so far used only the parallel corpus of approximately 25,000 sentences. In the following experiments we attempted to use our monolingual corpora to increase the effective parallel corpus size through back-translation. The results of the naive and filtered back-translation techniques are presented in Tables 6 and 7 respectively. As back-translators we used the best models obtained from the 25,000 authentic parallel sentences after checkpoint smoothing (Table 5).

Table 6. BLEU scores from Naive Back-Translation

  Authentic : Synthetic   Tamil-Sinhala   Sinhala-Tamil
  1:1                         12.16            7.34
  1:2                         14.17            7.37
  1:3                         15.35            -

Table 7. BLEU scores from Filtered Back-Translation

  Authentic : Synthetic   Tamil-Sinhala   Sinhala-Tamil
  1:1                         14.04            7.23
  1:2                         14.75            7.58
  1:3                         15.93            -

The Tamil to Sinhala direction improved continuously as the ratio of authentic to synthetic parallel sentences was increased. As expected, filtered back-translation performs better than naive back-translation, since filtering ensures that even the synthetic parallel sentences are of good quality. These observations conform to the common wisdom of "more data is better data" in deep learning. However, the expected improvement was not observed for the Sinhala to Tamil direction, which questions the applicability of these techniques across languages.

One consistent observation in our work and in previous work [11, 9] is that, given two languages, translation in one direction performs better than in the other; this distinction is more prominent when one language is morphologically richer than the other. This prompted us to design an algorithm that benefits from this fact to improve the quality of both translation directions. Our algorithm, Incrementally Filtered Back-Translation, helps the translations reach high accuracy with a minimal amount of monolingual sentences; this is an original contribution to the body of knowledge. The results of the newly proposed technique are given in Table 8 and, as expected, it produced better translation accuracy in both translation directions.
Table 8. BLEU scores from Incrementally Filtered Back-Translation

  Authentic : Synthetic   Tamil-Sinhala   Sinhala-Tamil
  1:1                         14.04            -
  1:2                         -                9.41
  1:3                         15.39            -
  1:4                         -                9.71
  1:5                         16.02            -

The Sinhala to Tamil direction increased its BLEU score by approximately 2 points over the accuracy obtained with only the parallel corpus, which is a significant improvement. The importance of the technique is that the improvement in BLEU appears at the earlier stages: when the ratio of authentic to synthetic parallel sentences was 1:1, the Sinhala to Tamil direction obtained a BLEU score of 9.41 with the newly proposed Incrementally Filtered Back-Translation, whereas the same figure was only 7.23 with filtered back-translation. This makes the technique ideal when even the amount of monolingual data available for the two languages is limited, as it makes maximum use of whatever monolingual sentences exist. Furthermore, since the technique is language independent, it can be adopted and explored for any such low-resourced language pair.

The final BLEU scores achieved by our study are compared against the benchmarks in Table 9. We were also able to exceed the benchmark based on the SMT study for the Tamil to Sinhala direction by approximately 3 BLEU points. While the same parallel corpus was used in both studies, the SMT study used 850,000 Sinhala monolingual sentences for its language model, whereas the Sinhala monolingual corpus used in our work contained only 180,793 sentences. NMT therefore exceeded the SMT translation accuracy while using fewer resources.

Table 9. Comparison of final BLEU scores against the baseline models

  Model                          Tamil-Sinhala   Sinhala-Tamil
  SMT (Baseline 1)                   13.11            -
  Full-word form (Baseline 2)         5.41           2.47
  Incrementally Filtered BT          16.02           9.71

An observation made throughout the experiments was that Tamil to Sinhala translation performed better than Sinhala to Tamil translation. Considering the characteristics of the Sinhala and Tamil parallel data sets in Table 1, the vocabulary-to-total-words ratio of the Tamil data set is 23.98%, almost twice the corresponding value for the Sinhala data set (14.58%), indicating that Tamil is morphologically richer than Sinhala within our corpora. Table 1 also shows that Sinhala has a higher total word count than Tamil; since the sentences of the two languages carry exactly the same meaning, Sinhala requires more words than Tamil to convey the same message. In a morphologically richer language, inflectional morphemes add information about tense, number and so on, so fewer words are needed to convey a message than in a relatively less morphologically rich language. We therefore conclude that, within the context of our corpora, Tamil behaves as the morphologically richer language.
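The vocabulary-to-total-words ratio used above as a rough proxy for morphological richness is straightforward to compute. The sketch below is illustrative only; the file names are assumptions, and the corpus is expected to be whitespace-tokenised.

```python
from collections import Counter

def vt_ratio(path: str) -> float:
    """Vocabulary size divided by total token count, as a percentage (V/T %),
    the morphological-richness proxy reported in Table 1."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())
    return 100.0 * len(counts) / sum(counts.values())

# For the corpus in Table 1 this would give roughly 23.98 for the Tamil side
# and 14.58 for the Sinhala side (file names such as "parallel.ta" are assumed).
```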
NMT is an end-to-end translation approach. In an encoder-decoder architecture, the encoder encodes a source sentence into an almost language-independent representation which is later decoded on the decoder side. When the source side is morphologically richer than the target side, the encoder tends to encode more information about the sentence, leading to better decoding by the decoder. When the source language is less morphologically rich than the target, the encoded sentence does not contain enough information for the decoder to deduce a good translation. This explains why the Tamil to Sinhala direction produces better translations than the Sinhala to Tamil direction.

6 Conclusion

Our main goal in this research was to develop an NMT system that improves translation between the morphologically rich and low-resourced language pair Sinhala and Tamil. By identifying the challenging properties of Sinhala and Tamil and treating them appropriately through a course of experiments, we improved the NMT baseline by 11 BLEU points for the Tamil to Sinhala direction and 7 BLEU points for the Sinhala to Tamil direction. Using the same parallel corpus as the SMT study, we exceeded the SMT benchmark for the Tamil to Sinhala direction by approximately 3 BLEU points.

This research paves the way for the newly proposed Incrementally Filtered Back-Translation technique to be explored on other low-resourced languages and for its validity across languages to be established. Furthermore, the techniques found most suitable for Sinhala and Tamil translation can be explored and adopted by other such agglutinative languages. We hope this research contributes to better information exchange between the Sinhala and Tamil communities; it also addresses a gap in the body of knowledge, as research on open-domain Sinhala-Tamil translation using NMT had not been attempted before.

References

1. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473 (2014), http://arxiv.org/abs/1409.0473
2. Banerjee, T., Bhattacharyya, P.: Meaningless yet meaningful: Morphology grounded subword-level NMT. In: Proceedings of the Second Workshop on Subword/Character LEvel Models, pp. 55-60. Association for Computational Linguistics (2018). https://doi.org/10.18653/v1/W18-1207, http://aclweb.org/anthology/W18-1207
3. Bentivogli, L., Bisazza, A., Cettolo, M., Federico, M.: Neural versus phrase-based machine translation quality: a case study. CoRR abs/1608.04631 (2016), http://arxiv.org/abs/1608.04631
4. Chen, H., Lundberg, S., Lee, S.: Checkpoint ensembles: Ensemble methods from a single training process. CoRR abs/1710.03282 (2017)
5. Imankulova, A., Sato, T., Komachi, M.: Improving low-resource neural machine translation with filtered pseudo-parallel corpus. In: WAT@IJCNLP, pp. 70-78. Asian Federation of Natural Language Processing (2017)
6. Klein, G., Kim, Y., Deng, Y., Nguyen, V., Senellart, J., Rush, A.M.: OpenNMT: Neural machine translation toolkit. In: AMTA (1), pp. 177-184. Association for Machine Translation in the Americas (2018)
7. Poncelas, A., Shterionov, D., Way, A., de Buy Wenniger, G.M., Passban, P.: Investigating backtranslation in neural machine translation (2018)
8. Pushpananda, R., Weerasinghe, R., Niranjan, M.: Statistical machine translation from and into morphologically rich and low resourced languages (2015)
9. Sennrich, R., Birch, A., Currey, A., Germann, U., Haddow, B., Heafield, K., Barone, A.V.M., Williams, P.: The University of Edinburgh's neural MT systems for WMT17. CoRR abs/1708.00726 (2017)
10. Sennrich, R., Haddow, B., Birch, A.: Improving neural machine translation models with monolingual data (2015)
11. Sennrich, R., Haddow, B., Birch, A.: Edinburgh neural machine translation systems for WMT 16. CoRR abs/1606.02891 (2016)
12. Sennrich, R., Haddow, B., Birch, A.: Neural machine translation of rare words with subword units. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), Berlin, Germany, Volume 1: Long Papers. Association for Computational Linguistics (2016), http://aclweb.org/anthology/P/P16/P16-1162.pdf
13. Smit, P., Virpioja, S., Grönroos, S., Kurimo, M.: Morfessor 2.0: Toolkit for statistical morphological segmentation. In: Bouma, G., Parmentier, Y. (eds.) Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2014), Gothenburg, Sweden, pp. 21-24. Association for Computational Linguistics (2014), http://aclweb.org/anthology/E/E14/E14-2006.pdf
14. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Advances in Neural Information Processing Systems, pp. 3104-3112 (2014), https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf
15. Vylomova, E., Cohn, T., He, X., Haffari, G.: Word representation models for morphologically rich languages in neural machine translation. In: Proceedings of the First Workshop on Subword and Character Level Models in NLP, pp. 103-108. Association for Computational Linguistics, Copenhagen, Denmark (2017). https://doi.org/10.18653/v1/W17-4115, https://www.aclweb.org/anthology/W17-4115
16. Weerasinghe, R., Herath, D., Welgama, V., Medagoda, N., Wasala, A., Jayalatharachchi, E.: UCSC Sinhala corpus - PAN Localization project, phase I (2007)
17. Weerasinghe, R., Pushpananda, R., Udalamatta, N.: Sri Lankan Tamil corpus. Technical report, University of Colombo School of Computing, funded by the ICT Agency, Sri Lanka (2013)
18. Weerasinghe, R.: A statistical machine translation approach to Sinhala-Tamil language translation. In: SCALLA 2004 (2004)
19. Welgama, V., Weerasinghe, R., Niranjan, M.: Evaluating a machine learning approach to Sinhala morphological analysis (2013)
20. Wu, Y., Schuster, M., Chen, Z., Le, Q.V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., Klingner, J., Shah, A., Johnson, M., Liu, X., Kaiser, L., Gouws, S., Kato, Y., Kudo, T., Kazawa, H., Stevens, K., Kurian, G., Patil, N., Wang, W., Young, C., Smith, J., Riesa, J., Rudnick, A., Vinyals, O., Corrado, G., Hughes, M., Dean, J.: Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR abs/1609.08144 (2016)
21. Zhang, J., Zong, C.: Exploiting source-side monolingual data in neural machine translation. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1535-1545. Association for Computational Linguistics (2016). https://doi.org/10.18653/v1/D16-1160, http://aclweb.org/anthology/D16-1160