Deep Learning Approach to English-Tamil and Hindi-Tamil Verb Phrase Translations

D. Thenmozhi, B. Senthil Kumar and Chandrabose Aravindan
Department of CSE, SSN College of Engineering, Chennai
{theni_d,senthil,aravindanc}@ssn.edu.in

Abstract. Verb phrase (VP) translation focuses on translating all forms of verbs and thereby supports the machine translation (MT) task. It has several applications, such as cross-lingual information retrieval (CLIR), speech synthesis, and natural language understanding and generation. VP translation is challenging because languages differ in their characteristics, structure and families. Developing a language-independent methodology for VP translation is therefore an interesting problem. In this paper, we present a deep learning methodology for English-Tamil and Hindi-Tamil VP translations. We adopted a neural machine translation model to implement our methodology. Our approach was evaluated using the data set provided by the VPT-IL@FIRE2018 shared task.

Keywords: Verb Phrase Translation · Machine Translation · Text mining · Deep Learning · Indian Languages · Tamil Language.

1 Introduction

Verb phrase (VP) translation is the part of the machine translation (MT) task that focuses on translating all forms of verbs, such as main verbs, auxiliary verbs, finite verbs, non-finite verbs and negation verbs. It has several applications, such as MT [10, 3], cross-lingual information retrieval (CLIR) [12, 13], speech synthesis, sentence simplification [5], and natural language understanding and generation. VPs carry several kinds of information, such as tense, modality and person-number-gender (PNG). VP translation is challenging because these characteristics vary from language to language. Some languages, such as Tamil, Hindi and Telugu, have subject-verb agreement, whereas other languages, such as English and Malayalam, may not.
For example, in the Tamil sentences “avan vanthaan” and “avaL vanthaaL”, the verb form “vanthaan” or “vanthaaL” is decided by the subject “avan” or “avaL”. In English, however, “came” is the common verb for both “he” and “she”. VP translation is also challenging because languages differ in structure, namely subject-verb-object (SVO) or subject-object-verb (SOV).

Several studies have been reported [4, 3, 5, 14, 9, 10, 6] that apply rule-based, phrase-based, statistical, machine learning and hybrid techniques to machine translation. The Government of India released Sampark, a tool for machine translation among Indian languages (https://sampark.iiit.ac.in/sampark/web/index.php/content). Recently, Microsoft claimed that deep neural networks bring higher accuracy to Indian language translation (https://news.microsoft.com/en-in/features/indian-language-translation-using-deep-neural-networks-announcement/). Developing a methodology that performs VP translation between different language families, such as Indo-Aryan, Indo-European and Dravidian, is a difficult task.

VPT-IL@FIRE2018 is a shared task on Verb Phrase Translation in English and Indian languages, co-located with the Forum for Information Retrieval Evaluation (FIRE-2018). It focuses on VP translations between different language families, and its goal is to research and develop techniques for English-Tamil and Hindi-Tamil VP translations. This paper focuses on developing a methodology that does not require any linguistic knowledge and that can translate VPs between any two languages of different families.

2 Proposed Methodology

A Sequence to Sequence (Seq2Seq) [11, 2] deep neural network is used in our approach for English-Tamil and Hindi-Tamil verb phrase translations. The steps used in our approach are given below.

– Extract English / Hindi VP sequences and Tamil VP input sequences from the given training data (English / Hindi and Tamil sentences) using the VP mapping information.
– Split the English / Hindi VP sequences and Tamil VP input sequences into training and development sets.
– Determine the vocabulary from both the English / Hindi VP input sequences and the Tamil VP input sequences.
– Build a deep neural network using the Seq2Seq model with an embedding layer, an encoding-decoding layer and a projection layer with an attention wrapper.
– Extract English / Hindi VP sequences from the English / Hindi sentences of the test data.
– Predict the Tamil VP output sequences for the English / Hindi VP sequences.
– Convert the Tamil VP output sequences into the required output format.

The steps are detailed below.

2.1 Extraction of VP Sequences

The given text consists of parallel sentences in English and Tamil for Task 1 and parallel sentences in Hindi and Tamil for Task 2. The input sentences are tagged with sentence id and language information. Figure 1 shows example parallel sentences for English and Tamil, and Figure 2 shows parallel sentences for Hindi and Tamil.

Fig. 1. English and Tamil Parallel Sentences.

Fig. 2. Hindi and Tamil Parallel Sentences.

We prepared the data so that a Seq2Seq deep learning algorithm can be applied. The English / Hindi VP input sequences and the Tamil VP input sequences are constructed separately by extracting verb phrases from the English / Hindi and Tamil sentences based on the VP mapping. The VP mapping consists of the sentence id, source language, target language, VP id, VP source information and VP target information. The VP source and target information each consist of VP start position and length fields. The format of the VP mapping is given in Figures 3 and 4.

Fig. 3. English-Tamil VP Mapping.

The VP start position and length fields are used to extract the verb phrases present in the sentences.
For the above examples, the verb phrases are extracted as shown in Figures 5 and 6.

Fig. 4. Hindi-Tamil VP Mapping.

Fig. 5. English and Tamil Verb Phrase.

Fig. 6. Hindi and Tamil Verb Phrases.

2.2 Model Building using Seq2Seq Model

We adopted the Neural Machine Translation (NMT) framework [8, 7], which is based on the Seq2Seq model, for the VP translation task. Figure 7 shows the layers used in the deep neural network to build the model for VP translation.

Fig. 7. System Architecture for Verb Phrase Translation.

The verb phrases extracted in the previous step are given to the deep neural network. A sequence of layers, namely an embedding layer, an encoder-decoder layer and a projection layer, is employed in the neural network to obtain the Tamil VPs. We determined the vocabulary for both the English / Hindi VP input sequences (source input sequences) and the Tamil VP input sequences (target input sequences). The source input sequences and the target input sequences are split into training sets and development sets. The English / Hindi VP input sequences with m words x1, x2, ..., xm and the Tamil VP input sequences with n words y1, y2, ..., yn, where m need not be equal to n, are given to the embedding layer. The embedding layer learns weight vectors from the source and target input sequences based on their vocabularies. These vectors are given to a multi-layer LSTM that performs the encoding and decoding operations. We used an attention mechanism [1, 7] to obtain an overall word alignment between the source and target sequences. The main idea of the attention mechanism is to establish a direct connection between the source and the target by paying attention to the relevant source words (English / Hindi) while translating into the Tamil phrase. A projection layer that uses the Softmax activation function is used to obtain the Tamil VP output sequences.
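The attention step described above can be illustrated with a small sketch of additive (Bahdanau-style) scoring [1], in which each encoder state h_j is scored against the current decoder state s as v · tanh(W_enc h_j + W_dec s), the scores are normalised with a softmax, and the context vector is the weighted sum of the encoder states. This is only a toy illustration in plain Python and not the code of the tensorflow/nmt tutorial; the function names, the weight matrices and all numeric values are hypothetical and chosen for readability.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def additive_attention(encoder_states, decoder_state, W_enc, W_dec, v):
    """Return (attention weights, context vector) for one decoder step.

    score_j = v . tanh(W_enc h_j + W_dec s)
    """
    dec_part = matvec(W_dec, decoder_state)
    scores = []
    for h in encoder_states:
        enc_part = matvec(W_enc, h)
        hidden = [math.tanh(a + b) for a, b in zip(enc_part, dec_part)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # The context vector is the attention-weighted sum of encoder states.
    context = [
        sum(w * h[k] for w, h in zip(weights, encoder_states))
        for k in range(dim)
    ]
    return weights, context


# Hypothetical toy values: three source words, hidden size 2.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
decoder_state = [0.2, -0.1]
W_enc = [[0.5, -0.3], [0.1, 0.8]]
W_dec = [[0.4, 0.2], [-0.6, 0.3]]
v = [1.0, -1.0]

weights, context = additive_attention(
    encoder_states, decoder_state, W_enc, W_dec, v
)
print("attention weights:", weights)
print("context vector:", context)
```

In a trained model the matrices and the vector v are learned jointly with the encoder and decoder; here they are fixed by hand only to show the data flow.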
2.3 Prediction

The model built using the deep neural network is used to predict Tamil VP output sequences for the given English / Hindi verb phrases of the test data. For this, we extracted the English / Hindi verb phrases from the English / Hindi sentences of the test data using the VP mapping information. Sample sentences for English / Hindi and the corresponding VP mapping information for the test data are shown in Figures 8 and 9.

Fig. 8. English and Hindi Sentences.

Fig. 9. VP Mapping for Test Data.

The Tamil VP output sequences are obtained for the extracted English / Hindi VP input sequences using the deep neural model based on sequence mapping. We converted the Tamil VP output sequences for the test data into the required output format, which is shown in Figure 10.

Fig. 10. VP Translation Output.

3 Implementation

We used Python to extract VPs from the English, Hindi and Tamil sentences, and TensorFlow to implement the deep neural network. We used the data set provided by VPT-IL@FIRE2018 to evaluate our methodology. The data set consists of a training set and a test set, provided separately for English-Tamil and Hindi-Tamil. The training data contains parallel sentences in English and Tamil for Task 1 and parallel sentences in Hindi and Tamil for Task 2. VP mapping information is provided separately for both tasks and consists of the attributes sentence id, source language, target language, VP id, VP source information and VP target information. The VP source and target information values each have two parts, namely the VP start position and the length of the VP. The details of the VPT-IL@FIRE2018 data are given in Table 1.

Table 1. Data Set for VPT-IL Task

Tasks          Training                      Testing
               No. of Sentences  No. of VPs  No. of Sentences  No. of VPs
English-Tamil  1443              2275        1096              1865
Hindi-Tamil    1992              2617        1000              1384

We extracted the textual part of the input sentences by removing the tags. For example, we extracted the text “ENG:The General of the Chozha forces in Lanka at that time was Kodumbalur Poodhi Vikrama Kesari .” from the tagged input by removing the opening sentence tag and the closing </Sent> tag. We used the VP start position and length fields of the VP source and target information from the VP mapping to extract the verb phrases present in the source and target languages. For example, for the English sentence with sentId 1, the VP mapping directs us to extract the 3 characters starting at position 59; the resulting English VP input sequence for vpId 1 is “was”. For some sentences, the tokens of the verb phrase are not contiguous. For example, the VP mapping conveys that the English VP sequence is present at two positions, 41 and 56, with lengths 7 and 14 respectively, in the sentence “ENG:He opened his eyes and found the cat rubbing itself affectionately against him .”. We extracted the VP parts at the two positions as “rubbing” and “affectionately”, and concatenated them into a single VP, “rubbing affectionately”, for vpId 13 in English.

The extracted English / Hindi VP input sequences and Tamil VP input sequences are split into a training set and a development set to feed into the deep neural network. The details of the splits are given in Table 2.

Table 2. Number of Sequences for Model Building

Tasks          Training  Development
English-Tamil  1700      575
Hindi-Tamil    1817      800

To implement our deep learning approach for VP translations, we used TensorFlow code based on the Neural Machine Translation tutorial code (https://github.com/tensorflow/nmt) [7], which was developed from Sequence-to-Sequence (Seq2Seq) models [11, 1, 8].
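The offset-based extraction described in this section can be sketched in a few lines of Python. The sketch below is illustrative rather than the code used for the submission: the helper name extract_vp is hypothetical, and it assumes that start positions are 0-based character offsets counted over the text including the “ENG:” prefix. That reading is an inference from the two worked examples above, which it reproduces, and not something the task description states.

```python
def extract_vp(sentence, spans):
    """Join the substrings selected by (start, length) pairs into one VP.

    `spans` holds one pair for a contiguous verb phrase and several pairs
    when the verb phrase is split across the sentence.
    """
    pieces = [sentence[start:start + length] for start, length in spans]
    return " ".join(pieces)


# The two example sentences from the text, after tag removal.
sentence_1 = (
    "ENG:The General of the Chozha forces in Lanka at that time was "
    "Kodumbalur Poodhi Vikrama Kesari ."
)
sentence_13 = (
    "ENG:He opened his eyes and found the cat rubbing itself "
    "affectionately against him ."
)

# vpId 1: a single span starting at position 59 with length 3.
vp_1 = extract_vp(sentence_1, [(59, 3)])

# vpId 13: two spans (41, 7) and (56, 14), concatenated into one VP.
vp_13 = extract_vp(sentence_13, [(41, 7), (56, 14)])

print(vp_1)
print(vp_13)
```

Representing every verb phrase as a list of spans keeps the contiguous and non-contiguous cases on the same code path.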
We implemented the Seq2Seq model with the following parameters.

– Recurrent unit: LSTM
– Direction: Bi-directional
– No. of layers: 8
– Dropout: 0.2
– Batch size: 128
– Attention: Bahdanau
– Number of training steps: 50000

We extracted the English / Hindi VP input sequences from the test data in the same way as for the training data. The Tamil VP output sequences are inferred from the English / Hindi VP input sequences of the given test instances using our bi-LSTM model. Finally, we converted the obtained Tamil VP output sequences into the required output format for submission by adding the attribute “translatedVP”. The output format is shown in Figure 10.

4 Results

We evaluated our models for English-Tamil and Hindi-Tamil VP translations using the data set provided by the VPT-IL@FIRE2018 shared task. Table 3 shows the precision and recall values obtained on the test data using our models.

Table 3. Test Data Performance

Tasks          Precision (%)  Recall (%)
English-Tamil  10.06          16.53
Hindi-Tamil    16.84          18.21

Table 3 shows that our models did not achieve high performance on either task. We attribute this to the small size of the data set.

5 Conclusions

We have presented a deep learning approach based on the Seq2Seq model for English-Tamil and Hindi-Tamil VP translations. We used the data set provided by the VPT-IL@FIRE2018 shared task. We extracted English / Hindi VP sequences (source sequences) and Tamil VP input sequences (target sequences) from the given training data, namely the English / Hindi sentences and the Tamil sentences respectively, using the verb phrase start position and length fields of the source and target information present in the VP mapping file. These source and target input sequences are given to the deep neural network.
The network consists of an embedding layer, an encoding-decoding layer with an 8-layer LSTM and a projection layer to translate the verb phrases from English / Hindi to Tamil. The embedding layer converts the source VP sequences and the target VP input sequences into their vector representations based on the vocabulary of the source and target languages respectively. We adopted the Neural Machine Translation model for this task. The weight vectors learnt by the embedding layer for the training data are given to the 8-layer LSTM, where encoding and decoding are performed. We used a Bahdanau attention wrapper to obtain an overall word alignment between the source and target input sequences. A projection layer that uses the Softmax activation function is used to obtain the Tamil verb phrase output sequences. This model is used to infer the Tamil VP output sequences for the English / Hindi verb phrases of the test data. Finally, the translated Tamil VP output sequences are converted to the required output format for submission. We obtained precision and recall values of 10.06% and 16.53% respectively for English-Tamil verb phrase translations. For Hindi-Tamil verb phrase translations, we obtained precision and recall values of 16.84% and 18.21% respectively. The performance may be improved further by using a larger data set, incorporating more hidden layers, trying different attention mechanisms and increasing the number of training steps.

References

1. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)
2. Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014)
3. Devi, S.L., Pralayankar, P., Kavitha, V., Menaka, S.: Translation of Hindi se to Tamil in a MT system. In: Information Systems for Indian Languages, pp. 246–249. Springer (2011)
4. Devi, S.L., Pralayankar, P., Menaka, S., Bakiyavathi, T., Ram, R.V.S., Kavitha, V.: Verb transfer in a Tamil to Hindi machine translation system. In: Asian Language Processing (IALP), 2010 International Conference on. pp. 261–264. IEEE (2010)
5. Hasler, E., de Gispert, A., Stahlberg, F., Waite, A., Byrne, B.: Source sentence simplification for statistical machine translation. Computer Speech & Language 45, 221–235 (2017)
6. Jadoon Khan, N., Anwar, W., Durrani, N.: Machine translation approaches and survey for Indian languages. arXiv preprint arXiv:1701.04290 (2017)
7. Luong, M., Brevdo, E., Zhao, R.: Neural machine translation (seq2seq) tutorial. https://github.com/tensorflow/nmt (2017)
8. Luong, M.T., Pham, H., Manning, C.D.: Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025 (2015)
9. Macketanz, V., Avramidis, E., Burchardt, A., Helcl, J., Srivastava, A.: Machine translation: Phrase-based, rule-based and neural approaches with linguistic evaluation. Cybernetics and Information Technologies 17(2), 28–43 (2017)
10. Sridhar, R., Sethuraman, P., Krishnakumar, K.: English to Tamil machine translation system using universal networking language. Sādhanā 41(6), 607–620 (2016)
11. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Advances in Neural Information Processing Systems. pp. 3104–3112 (2014)
12. Thenmozhi, D., Aravindan, C.: Tamil-English cross lingual information retrieval system for agriculture society. In: Tamil Internet Conference (TIC2009), 9th International Conference on. pp. 173–178 (2009)
13. Thenmozhi, D., Aravindan, C.: Ontology-based Tamil–English cross-lingual information retrieval system. Sādhanā 43(10), 157:1–14 (2018)
14. Wang, X., Tu, Z., Xiong, D., Zhang, M.: Translating phrases in neural machine translation. arXiv preprint arXiv:1708.01980 (2017)