Fine-tuning based Domain Adaptation for Machine Translation of Low Resource Indic Languages

Amulya Ratna Dash, Harpreet Singh Anand and Yashvardhan Sharma
Department of Computer Science and Information Systems, Birla Institute of Technology and Science, Pilani, Jhunjhunu, Rajasthan, India, 333031

Forum for Information Retrieval Evaluation, December 15-18, 2023, India
p20200105@pilani.bits-pilani.ac.in (A. R. Dash); f20212416@pilani.bits-pilani.ac.in (H. S. Anand); yash@pilani.bits-pilani.ac.in (Y. Sharma)

Abstract
This paper describes the proposed system for the machine translation of the Indic language pairs Odia-Hindi and Hindi-Odia for the General Translation and Domain Specific Translation tasks proposed by the Forum for Information Retrieval Evaluation (FIRE) in 2023. For the general task, the proposed system uses an ensemble of two pre-trained models; for the domain-specific task, it uses a pre-trained model fine-tuned on domain-specific training data filtered from open-source datasets.

Keywords
Low resource Machine Translation, NLLB, BART, IndicTrans, Sentence Similarity

1. Introduction
The importance of language as a mode of communication in today's globalized society cannot be overstated. In an increasingly interconnected world, overcoming linguistic boundaries is essential for cultivating understanding, collaboration, and progress. The Indic languages are widely spoken by a large population in the Indian subcontinent as well as among diasporic communities. With the fast-growing number of mobile phone and Internet users, there is an immediate need for automatic machine translation systems from and to English, as well as across Indian languages. Although digital content in Indian languages has grown considerably in the last few years, it is not yet comparable to that in English. Despite the extensive diversity of the Indic linguistic domain, the incorporation of Indic languages into Natural Language Processing (NLP) has progressed gradually and remains limited.

The 'Machine Translation for Indian Languages' track at FIRE 2023 [1][2] consists of two tasks, namely the General Translation Task (Task 1) and the Domain Specific Translation Task (Task 2). Task 1 requires building machine translation models for 12 language pairs, whereas Task 2 requires building machine translation models for the Governance and Healthcare domains for 8 language pairs. This paper describes the machine translation system developed for the Hindi-Odia and Odia-Hindi language pairs for Task 1 and Task 2.

2. Related Work
The emergence of encoder-decoder models, particularly the Transformer neural network architecture proposed by Vaswani et al. [3] in 2017, was a notable breakthrough in Natural Language Processing (NLP). Transformers use attention mechanisms [4] to process sequences of words in parallel, enabling translations that are more contextually relevant and coherent. Transformer-based models have been shown to outperform encoder-decoder models based on RNNs and LSTMs [5].
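The attention mechanism referred to above can be summarized by the scaled dot-product formulation of Vaswani et al. [3], reproduced here only for reference (it is the standard formulation from the original paper, not something specific to the proposed system):

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
\]

where Q, K and V are the query, key and value matrices and d_k is the dimensionality of the keys.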
The Transformer model uses multi-head self-attention and position-wise feed-forward networks. Recent literature indicates that Transformer models pre-trained on large corpora acquire universal language representations that aid subsequent tasks. The models are pre-trained on a variety of self-supervised tasks, such as predicting a masked word from its context. Once a model has been pre-trained, it can be fine-tuned on downstream datasets rather than being trained from scratch. GPT [6][7], BERT [8] and BART [9] are examples of Transformer-based pre-trained language models that have had tremendous success in NLP because of their ability to learn universal language representations from large volumes of unlabeled text and transfer this knowledge to downstream tasks. Yin et al. [10] proposed using pre-trained Natural Language Inference (NLI) models as ready-made zero-shot sequence classifiers: the sequence to be classified is posed as the NLI premise, and a hypothesis is constructed from each candidate label.

IndicBART [11] is a pre-trained BART model for Indic languages, trained specifically for Assamese, Bengali, Gujarati, Hindi, Marathi, Odia, Punjabi, Kannada, Malayalam, Tamil, Telugu and English. More recently, Transformer-based models specialized for machine translation of Indic languages, such as IndicTrans [12] and IndicTrans2 [13], have become available; they are trained on the largest available Indic-language parallel corpora, Samanantar and BPCC respectively. IndicTrans was trained for 11 Indic languages, whereas IndicTrans2 was trained for all 22 scheduled Indian languages. NLLB (No Language Left Behind) [14], a massively multilingual machine translation model, has proven to be a breakthrough in high-quality translation across roughly 200 languages. MuRIL (Multilingual Representations for Indian Languages) [15] is a multilingual language model built specifically for Indic languages, supporting around 17 languages; it outperforms multilingual BERT (mBERT) on the evaluated NLP tasks.

3. Dataset
The dataset used for training is extracted from the Bharat Parallel Corpus Collection (BPCC) [13], released by AI4Bharat. BPCC comprises two parts, BPCC-Mined and BPCC-Human, totalling approximately 230 million bitext pairs. BPCC-Mined contains about 228 million pairs, with nearly 126 million pairs newly added as part of that work. BPCC-Human consists of 2.2 million gold-standard English-Indic pairs, with an additional 644K bitext pairs from English Wikipedia sentences (the BPCC-H-Wiki subset) and 139K sentences covering everyday use cases (the BPCC-H-Daily subset). However, the dataset pairs text in a particular Indian language only with its English translation. Thus, to build Indic-Indic training data, we used English as a pivot language, translating Indic to English and then English to the other Indic language. A direct Indic-Indic parallel dataset, if available, may help build a better machine translation model than a dataset created via pivoting.

4. Proposed Technique
The proposed technique uses corpus filtering methods, pre-trained models, and fine-tuned multilingual models to develop general and domain-specific machine translation systems.

4.1. General Translation Task
The proposed system translates the test set provided by the task organizers using both the NLLB and IndicTrans models, which yields two different versions of the translated output for Odia → Hindi and Hindi → Odia. For Odia → Hindi, we generate sentence embeddings of the Odia test sentences and of their NLLB (https://huggingface.co/facebook/nllb-200-distilled-600M) and IndicTrans (https://github.com/AI4Bharat/indicTrans) Hindi translations using the MuRIL (https://huggingface.co/google/muril-base-cased) model. Similarly, for Hindi → Odia, we generate sentence embeddings of the Hindi test sentences and of both versions of the Odia translations. We then compute the cosine similarity between the embedding of each translation and the embedding of the corresponding source sentence, and accept the version with the higher cross-lingual semantic similarity. This gives, for each Hindi and Odia test sentence, its most appropriate Odia or Hindi translation respectively. A minimal sketch of this selection step is given below.

Figure 1: Proposed technique for General Task
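The selection step can be sketched as follows, assuming mean-pooled MuRIL token embeddings as sentence representations and the Hugging Face transformers API; the pooling strategy and the function names are illustrative assumptions rather than the exact implementation used above.

```python
# Sketch: for each sentence, keep the candidate translation whose MuRIL
# embedding is most similar to the embedding of the source sentence.
# Mean pooling over token embeddings is an assumed choice.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/muril-base-cased")
encoder = AutoModel.from_pretrained("google/muril-base-cased")
encoder.eval()

def embed(sentences):
    """Mean-pooled MuRIL sentence embeddings."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state        # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()    # (B, T, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)     # (B, H)

def select_translations(sources, nllb_hyps, indictrans_hyps):
    """Keep, per sentence, the candidate with the higher cross-lingual
    cosine similarity to the source sentence."""
    src, a, b = embed(sources), embed(nllb_hyps), embed(indictrans_hyps)
    sim_a = F.cosine_similarity(src, a, dim=-1)
    sim_b = F.cosine_similarity(src, b, dim=-1)
    return [x if sa >= sb else y
            for x, y, sa, sb in zip(nllb_hyps, indictrans_hyps, sim_a, sim_b)]
```

The comparison relies on MuRIL providing a shared multilingual embedding space for Hindi and Odia, so that a source sentence and a faithful translation are expected to lie close together.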
4.2. Domain Specific Translation Task
The domain-specific task requires translation models specialized for the governance and healthcare domains.

4.2.1. Domain-specific Dataset
We classified the English sentences from the English-Hindi (625K Hindi sentences) and English-Odia (661K Odia sentences) portions of BPCC using the BART-MNLI model (https://huggingface.co/facebook/bart-large-mnli) via zero-shot classification.

Table 1
No. of sentences classified for each category

Language   Governance-related   Healthcare-related
Hindi      125587               42445
Odia       109937               74413

The classified Hindi and Odia sentences are then translated into Odia and Hindi respectively using the IndicTrans model. After translation, the resulting Hindi-Odia and Odia-Hindi synthetic training data are split into governance-specific and healthcare-specific datasets for fine-tuning the NLLB model.

4.2.2. Fine-tuning of NLLB
The AutoTokenizer associated with the NLLB model was used to tokenize the inputs. The NLLB model was fine-tuned on the domain-specific datasets in batches of 32, for 5 epochs, with a learning rate of 2e-5. Using the same training parameters, we trained four fine-tuned models: governance-specific Hindi-Odia and Odia-Hindi models, and healthcare-specific Hindi-Odia and Odia-Hindi models. A sketch of this two-step pipeline is given below.

Figure 2: Proposed technique for Domain Specific Task
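The two steps of this pipeline, zero-shot domain classification and NLLB fine-tuning, can be sketched as follows using the Hugging Face transformers API. The candidate label set, dataset column names and placeholder values are illustrative assumptions; only the model names, batch size (32), number of epochs (5) and learning rate (2e-5) come from the description above.

```python
# Sketch of the domain-specific pipeline: (1) zero-shot domain classification
# with BART-MNLI, (2) fine-tuning NLLB on the resulting parallel data.
# Label names, column names and placeholder data are illustrative assumptions.
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments, pipeline)
from datasets import Dataset

# (1) Zero-shot classification of English source sentences into domains.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
labels = ["governance", "healthcare", "other"]  # assumed label set

def domain_of(english_sentence):
    result = classifier(english_sentence, candidate_labels=labels)
    return result["labels"][0]  # highest-scoring label

# (2) Fine-tuning NLLB, illustrated for the Hindi -> Odia governance subset.
model_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name,
                                          src_lang="hin_Deva", tgt_lang="ory_Orya")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def preprocess(batch):
    # "src" / "tgt" are assumed column names for the parallel sentences.
    return tokenizer(batch["src"], text_target=batch["tgt"],
                     truncation=True, max_length=256)

train_data = Dataset.from_dict({"src": ["..."], "tgt": ["..."]})  # placeholder
train_data = train_data.map(preprocess, batched=True, remove_columns=["src", "tgt"])

args = Seq2SeqTrainingArguments(
    output_dir="nllb-hi-or-governance",
    per_device_train_batch_size=32,   # batch size stated above
    num_train_epochs=5,               # epochs stated above
    learning_rate=2e-5,               # learning rate stated above
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_data,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```

The same recipe, with the language codes and domain subset swapped, would be run once per domain and translation direction to obtain the four fine-tuned models.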
5. Results
Table 2 and Table 3 show the official results of our proposed system for the General Translation Task and the Domain Specific Translation Task respectively.

Table 2
Results for General Task

Model        BLEU     CHRF     CHRF+    TER      COMET
Hindi-Odia   20.057   56.389   51.836   63.967   0.842
Odia-Hindi   29.374   55.572   53.309   56.188   0.804

Table 3
Results for Domain Specific Task

Model                      BLEU     CHRF     CHRF+    TER      COMET
Hindi-Odia (Governance)    23.039   60.327   55.885   61.224   0.867
Odia-Hindi (Governance)    20.031   42.329   40.916   65.476   0.822
Hindi-Odia (Healthcare)    15.225   53.323   48.381   69.468   0.823
Odia-Hindi (Healthcare)    31.931   55.342   53.620   53.791   0.739

6. Conclusion and Future Work
In this paper, we describe our proposed system for machine translation of the low-resource Indic language pairs Hindi → Odia and Odia → Hindi, which achieved the second rank (by chrF score) for both the general and domain-specific translation tasks in the MTIL track. The proposed system obtained COMET scores greater than 0.8 on 5 out of 6 sub-tasks, indicating that the generated translations were largely accurate and fluent. In the future, we plan to increase the size of the domain-specific training data further by exploring other available datasets and data augmentation techniques. We also plan to validate our system for machine translation of other Indic language pairs.

References
[1] S. Gangopadhyay, G. Epili, P. Majumder, B. Gain, R. Appicharla, A. Ekbal, A. Ahsan, D. Sharma, Overview of the MTIL track at FIRE 2023: Machine translation for Indian languages, in: Proceedings of the 15th Annual Meeting of the Forum for Information Retrieval Evaluation, FIRE 2023, Goa, India, December 15-18, 2023, ACM, 2023.
[2] S. Gangopadhyay, G. Epili, P. Majumder, B. Gain, R. Appicharla, A. Ekbal, A. Ahsan, D. Sharma, Overview of the MTIL track at FIRE 2023: Machine translation for Indian languages, in: K. Ghosh, T. Mandl, P. Majumder, M. Mitra (Eds.), Working Notes of FIRE 2023 - Forum for Information Retrieval Evaluation, Goa, India, December 15-18, 2023, CEUR Workshop Proceedings, CEUR-WS.org, 2023.
[3] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).
[4] I. Sutskever, O. Vinyals, Q. V. Le, Sequence to sequence learning with neural networks, Advances in Neural Information Processing Systems 27 (2014).
[5] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, arXiv preprint arXiv:1409.0473 (2014).
[6] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, et al., Improving language understanding by generative pre-training (2018).
[7] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are unsupervised multitask learners, OpenAI Blog 1 (2019) 9.
[8] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[9] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, L. Zettlemoyer, BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, arXiv preprint arXiv:1910.13461 (2019).
[10] W. Yin, J. Hay, D. Roth, Benchmarking zero-shot text classification: Datasets, evaluation and entailment approach, arXiv preprint arXiv:1909.00161 (2019).
[11] R. Dabre, H. Shrotriya, A. Kunchukuttan, R. Puduppully, M. M. Khapra, P. Kumar, IndicBART: A pre-trained model for Indic natural language generation, arXiv preprint arXiv:2109.02903 (2021).
[12] G. Ramesh, S. Doddapaneni, A. Bheemaraj, M. Jobanputra, R. Ak, A. Sharma, S. Sahoo, H. Diddee, D. Kakwani, N. Kumar, et al., Samanantar: The largest publicly available parallel corpora collection for 11 Indic languages, Transactions of the Association for Computational Linguistics 10 (2022) 145–162.
[13] J. Gala, P. A. Chitale, R. AK, S. Doddapaneni, V. Gumma, A. Kumar, J. Nawale, A. Sujatha, R. Puduppully, V. Raghavan, et al., IndicTrans2: Towards high-quality and accessible machine translation models for all 22 scheduled Indian languages, arXiv preprint arXiv:2305.16307 (2023).
[14] M. R. Costa-jussà, J. Cross, O. Çelebi, M. Elbayad, K. Heafield, K. Heffernan, E. Kalbassi, J. Lam, D. Licht, J. Maillard, et al., No language left behind: Scaling human-centered machine translation, arXiv preprint arXiv:2207.04672 (2022).
[15] S. Khanuja, D. Bansal, S. Mehtani, S. Khosla, A. Dey, B. Gopalan, D. K. Margam, P. Aggarwal, R. T. Nagipogu, S. Dave, et al., MuRIL: Multilingual representations for Indian languages, arXiv preprint arXiv:2103.10730 (2021).