Bidirectional Hindi-Punjabi Machine Translation

Mukund K Roy 1, Karunesh K Arora 1 and Sunita Arora 1
1 SNLP Lab, CDAC, Noida, Uttar Pradesh, India

Forum for Information Retrieval and Evaluation, 15-18 December 2023, Panjim, India
EMAIL: mukundkumarroy@cdac.in (M. Roy); karunesharora@cdac.in (K. Arora); sunitaarora@cdac.in (S. Arora)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

Abstract
This paper presents the development and comprehensive assessment of a Hindi-Punjabi machine translation system tailored specifically for the MTIL (Machine Translation for Indian Languages) track of FIRE 2023. Leveraging neural machine translation techniques, we developed a robust translation model to facilitate seamless communication between Hindi and Punjabi, two prominent Indian languages, despite the limited availability of parallel resources. The methodology involved fine-tuning a pretrained NLLB-1.3B model to adapt it to the Hindi-Punjabi translation task. To evaluate the efficacy of the translation system, we conducted comprehensive experiments using standard evaluation metrics on the FLORES test set as well as on our own test set. Our results demonstrate promising performance for the Punjabi-Hindi language pair, which achieved the highest BLEU, chrF and TER scores across all domain-specific translation tasks in the track. Our Hindi-Punjabi pair likewise scored the highest in all domains except the Governance domain, where our chrF and COMET scores were marginally second highest, although BLEU and TER were still the highest. These findings underscore the viability and potential of the developed machine translation system, contributing to the advancement of translation technology for Indian languages in diverse applications.

Keywords
Machine Translation, Hindi-Punjabi, Transformer based NMT, NLLB-200, Finetuning

1. Introduction

India, renowned for its linguistic diversity, houses languages like Hindi and Punjabi that wield significant cultural and regional importance. However, despite their prevalence, effective translation between Hindi and Punjabi remains a considerable challenge due to inherent linguistic intricacies and the lack of large, good-quality parallel resources. Though Hindi and Punjabi belong to the same Indo-Aryan family, they diverge significantly in script and vocabulary. Hindi uses the Devanagari script, while Punjabi is predominantly written in the Gurmukhi script. The dissimilarities in script and the lack of parallel resources pose challenges for machine translation between the language pair and necessitate careful handling during the translation process to ensure accurate conversion without loss of semantic or contextual meaning. The vocabulary presents another hurdle, with both languages exhibiting distinct lexical items, idiomatic expressions, and dialectal variations. The challenge lies in accurately capturing the essence of these linguistic intricacies during translation to ensure natural and contextually relevant output. In light of these complexities, our work aims to address these challenges by leveraging advanced Neural Machine Translation techniques to facilitate accurate, contextually relevant, and fluent translations between Hindi and Punjabi.

Neural machine translation (NMT) has transformed the field of machine translation (MT) in recent years, achieving significant improvements in translation quality compared to traditional statistical MT (SMT) approaches. NMT is based on artificial neural networks (ANNs), which are capable of learning complex relationships between languages from large amounts of training data.
One of the key advancements in NMT was the development of the encoder-decoder architecture (Sutskever et al., 2014) [1], wherein the encoder takes a source-language sentence as input and generates a representation of its meaning, after which the decoder takes this representation and generates a target-language sentence that is equivalent in meaning to the source sentence. Another major advancement was the introduction of attention mechanisms (Bahdanau et al., 2014) [2]. Attention mechanisms allow the decoder to focus on different parts of the source sentence when generating the target sentence, which improves translation accuracy by letting the decoder attend to the most relevant information in the source sentence. The development of transformer-based NMT models (Vaswani et al., 2017) [3] was another significant breakthrough. Transformer models are based on a self-attention mechanism that allows them to process all parts of the source sentence simultaneously, without the need for recurrent neural networks (RNNs). This makes transformer models more efficient and scalable than RNN-based NMT models. Subsequently, multilingual models that can translate between multiple languages were developed (Tang et al., 2020; Firat et al., 2016; Johnson et al., 2017) [4][5][6]. These models are trained on large amounts of data in multiple languages, which allows them to learn the relationships between different languages more effectively.

One notable example of a multilingual NMT model is NLLB-200, developed by Meta AI (Costa-jussà et al., 2022) [7]. NLLB-200 is a single AI model that can translate across 200 different languages, including many low-resource languages, and has been shown to achieve state-of-the-art results on a variety of benchmark datasets.

This paper outlines the building of a high-performance translation model for the Hindi-Punjabi language pair, which poses a significant challenge due to the scarcity of parallel training data. To address this limitation, we employed NLLB-200-3.3B, a state-of-the-art multilingual neural machine translation model designed to excel in low-resource settings. NLLB-200's ability to effectively utilize data from multiple languages, including Hindi and Punjabi, made it an ideal starting point for our translation model. By fine-tuning NLLB-200 on a carefully curated dataset of parallel Hindi-Punjabi sentences, we were able to achieve significant improvements in translation accuracy compared to training a transformer model from scratch on the available corpus. The resulting translation model demonstrates the potential of NLLB-200 for low-resource machine translation tasks and its ability to bridge the communication gap between speakers of Hindi and Punjabi.

Our team participated in the FIRE 2023 MTIL challenge for both the Punjabi to Hindi and Hindi to Punjabi language pairs. We employed NLLB-200, a cutting-edge machine translation model optimized for resource-constrained environments, as the foundation for our submissions.
We further enhanced the model's performance by training it on general domain, governance domain, and healthcare domain data, as described in the next section. The effectiveness of our model was evaluated using chrF (character-level F-score) [8], the official metric for the task. Our model achieved the highest chrF scores for Punjabi to Hindi in all domains. For the Hindi to Punjabi direction, our model achieved the highest chrF score among all participating teams in all domains except the Governance domain.

2. Dataset

The dataset for this task consisted of a parallel corpus of Hindi-Punjabi sentence pairs collected from diverse domains, including General, Agriculture, Tourism, Education, Science & Technology, Governance, Health and News articles. Overall, 140K parallel sentences were collected and curated for this task. The majority of the corpus, though human translated, needed some vetting, cleaning and preprocessing before being used to train the translation model. The dataset was split into training, development, and test sets to facilitate model training and evaluation. In addition, the FLORES test set [9], which contains 1012 sentence pairs, was also used to evaluate the system.

3. Methodology

To build the Hindi-Punjabi and Punjabi-Hindi translation models, we fine-tuned the NLLB-200-1.3B pretrained model with our corpus. For evaluating both models, we used two datasets: our own (CDACN) test set containing 1000 sentences of mixed domains, and the publicly available FLORES test set. This was necessary to maintain fairness and avoid bias. For training, we used the OpenNMT-py toolkit, which provides different configurations for building NMT models. In this work, we built a Transformer model from scratch and a fine-tuned model using the same toolkit. We used the following methodology to train our models:

3.1. Data preprocessing

Preprocessing plays a pivotal role in training Neural Machine Translation (NMT) models within the OpenNMT-py toolkit [10], serving as the foundational step in transforming raw textual data into a format suitable for effective model learning and translation. The preprocessing workflow primarily encompasses cleaning, tokenization, normalization, subword segmentation, and vocabulary construction.

Tokenization, the initial step in preprocessing, involves breaking the text into smaller linguistic units, typically words or subword units, facilitating the model's understanding of the input. Subword tokenization, often implemented using Byte Pair Encoding (BPE) or SentencePiece, is widely preferred for its ability to handle out-of-vocabulary (OOV) words by splitting them into subword units, promoting better generalization and handling of unseen vocabulary during translation. Subword segmentation, using BPE or SentencePiece, further refines the tokenization process by breaking down words into smaller subword units based on their frequency of occurrence within the dataset. Normalization follows tokenization and involves standardizing the text by resolving issues such as punctuation, casing, and other linguistic variations.

Vocabulary construction is another pivotal aspect of preprocessing in OpenNMT-py. It involves building a vocabulary set that comprises the most frequent tokens or subword units from the training data. Careful selection of the vocabulary size is crucial, as it directly impacts the model's ability to generalize while also influencing the computational requirements. The vocabulary size must strike a balance between coverage of commonly occurring tokens and efficiency in model training. In OpenNMT-py, preprocessing is streamlined using the 'preprocess.py' script. This script takes the raw text data and performs the necessary preprocessing steps, generating vocabulary files and training and validation datasets in a format compatible with the NMT model.
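To make the subword step concrete, the following is a minimal sketch of training and applying a SentencePiece model [11] with the sentencepiece Python package. The file names, vocabulary size and model type here are illustrative assumptions only, not the exact settings of our pipeline.

```python
import sentencepiece as spm

# Train a shared subword model on the parallel training text.
# File names and vocab_size are placeholders for illustration.
spm.SentencePieceTrainer.train(
    input="train.hi-pa.txt",      # concatenated Hindi and Punjabi training sentences (assumed file)
    model_prefix="spm_hi_pa",     # produces spm_hi_pa.model and spm_hi_pa.vocab
    vocab_size=32000,             # trades coverage against model and memory size
    character_coverage=1.0,       # keep full Devanagari and Gurmukhi character coverage
    model_type="bpe",
)

# Apply the trained model to segment a sentence into subword pieces.
sp = spm.SentencePieceProcessor(model_file="spm_hi_pa.model")
pieces = sp.encode("भारत एक विविधताओं से भरा देश है।", out_type=str)
print(pieces)  # the actual segmentation depends on the training data
```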
3.2. Model Training

As stated earlier, we fine-tuned the NLLB-200 pretrained model with our training dataset. We used NLLB-200-3.3B, which is essentially a transformer-based encoder-decoder architecture with 3.3B parameters. It is trained on over 2TB of text data covering 1220 language pairs across 202 languages. The model is mainly intended for research in MT, primarily for low-resource languages, and performs single-sentence translation between 200 languages. Because of this dependency, we customized the SentencePiece model [11] to work with the OpenNMT toolkit and used it as the tokenization method. The architecture of the training model was also modified accordingly by incorporating a 24-layer Transformer encoder-decoder. The feed-forward network (FFN) now had 8192 hidden units, and the word vector size was doubled to 1024. Similarly, the optimization method was modified to use the standard gradient descent method.

Model training begins with feeding the preprocessed parallel training data into the fine-tuning framework, which splits the data into batches for efficient training. During the forward pass, the input sentence in the source language is passed through the encoder of the NLLB-200 model, which generates a representation of the input sentence's meaning. The attention mechanism of this architecture allows the decoder to concentrate only on relevant parts of the encoder's representation while generating the output in the target language. The decoder generates a sequence of words in the target language, one word at a time, based on the encoder's representation and the attention mechanism. The predicted output sequence is compared to the actual target sequence to calculate the loss, which represents the model's error. The loss is propagated backward through the model to update the weights of the encoder and decoder. The optimizer adjusts the model's weights to minimize the loss, gradually improving the model's ability to translate sentences accurately.
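For readers who want a concrete picture of this fine-tuning loop, the sketch below uses the Hugging Face transformers library with a small public NLLB-200 checkpoint, rather than our actual OpenNMT-py configuration. The checkpoint name, hyperparameters and toy data are assumptions made only for illustration.

```python
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer, DataCollatorForSeq2Seq,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

# Illustrative small checkpoint; the paper fine-tunes a larger NLLB-200 model via OpenNMT-py.
checkpoint = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, src_lang="hin_Deva", tgt_lang="pan_Guru")
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Toy parallel data; in practice this would be the curated Hindi-Punjabi corpus.
pairs = {"hi": ["यह एक उदाहरण वाक्य है।"], "pa": ["ਇਹ ਇੱਕ ਉਦਾਹਰਨ ਵਾਕ ਹੈ।"]}
raw = Dataset.from_dict(pairs)

def tokenize(batch):
    # Encode the Hindi source; labels are built from the Punjabi target via text_target.
    return tokenizer(batch["hi"], text_target=batch["pa"], truncation=True, max_length=256)

train_ds = raw.map(tokenize, batched=True, remove_columns=["hi", "pa"])

args = Seq2SeqTrainingArguments(
    output_dir="nllb-hi-pa-finetuned",   # assumed output directory
    per_device_train_batch_size=8,
    learning_rate=1e-5,
    num_train_epochs=3,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()  # per batch: forward pass, loss computation, backpropagation, weight update
```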
4. Evaluation

In this section, we discuss the evaluation of our fine-tuned models using the BLEU (Bilingual Evaluation Understudy) [12], chrF (character-level F-score) [8], COMET (Crosslingual Optimized Metric for Evaluation of Translation) [13] and TER (Translation Edit Rate) [14] metrics. BLEU is a popular precision-based metric for evaluating machine translation (MT) systems; it calculates the proportion of n-grams (sequences of n words) in the translated output that also occur in the reference translation. chrF is based on character-level n-gram overlap between the translated output and the reference translation. chrF scores range from 0 to 1 (commonly reported on a 0-100 scale), with 1 being a perfect score; chrF is less sensitive to word order than BLEU and is more forgiving of errors in morphology and syntax. TER is based on the number of edits (insertions, deletions, and substitutions) that need to be made to the translated output to convert it into the reference translation. TER scores range from 0 to 1 (likewise often scaled to 0-100), with 0 being a perfect score.
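For illustration, corpus-level BLEU, chrF and TER scores of the kind reported below can be computed with the sacreBLEU library as sketched here. This is an assumed, representative recipe rather than the official MTIL evaluation script, and the sentences are placeholders.

```python
import sacrebleu

# System outputs and the corresponding references (placeholder sentences).
# The outer list of references holds one reference stream per available reference set.
hypotheses = ["ਇਹ ਇੱਕ ਉਦਾਹਰਨ ਅਨੁਵਾਦ ਹੈ।"]
references = [["ਇਹ ਇੱਕ ਉਦਾਹਰਨ ਅਨੁਵਾਦ ਹੈ।"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)   # n-gram precision with brevity penalty
chrf = sacrebleu.corpus_chrf(hypotheses, references)   # character n-gram F-score
ter = sacrebleu.corpus_ter(hypotheses, references)     # edit rate against the reference

print(f"BLEU: {bleu.score:.2f}  chrF: {chrf.score:.2f}  TER: {ter.score:.2f}")
```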
In Table 1, the metric scores on the two test sets are given. We also present our performance in the FIRE 2023 MTIL track, which aims to create a strong machine translation system for converting text from one Indian language to another. There are two main tasks in this track. Task 1 involves building a translation model for the general domain, working across 12 different Indian language pairs. Task 2, which is more specific, requires translation models focused on the Governance and Healthcare domains.

Table 1
Evaluation scores on the CDACN and FLORES test sets

Test set   Language pair   BLEU   chrF   TER
CDACN      Punjabi-Hindi   50.3   69.8   31.8
CDACN      Hindi-Punjabi   38.3   62.7   40.1
FLORES     Punjabi-Hindi   28.2   53.9   59.3
FLORES     Hindi-Punjabi   21.4   48.1   64.7

Table 2
FIRE 2023 MTIL track: official evaluation scores of the two translation tasks on different metrics

Domain       Language Pair   BLEU      chrF      chrF++    TER       COMET
General      Punjabi-Hindi   62.1954   77.4556   76.6006   22.2312   0.8366
General      Hindi-Punjabi   50.9394   69.7897   68.1843   38.0883   0.8454
Governance   Punjabi-Hindi   33.1194   56.1692   54.6360   51.5544   0.8180
Governance   Hindi-Punjabi   56.8942   73.6951   72.8590   25.7565   0.8169
Health       Punjabi-Hindi   37.5176   60.8540   59.5213   42.0599   0.8379
Health       Hindi-Punjabi   65.0554   79.5775   78.8151   20.4537   0.8520

5. Results

Upon analyzing the evaluation scores in Table 1 and Table 2, it can be observed that the Punjabi to Hindi translation system performs better than the Hindi to Punjabi system, although the same dataset is used for both directions. One of the main reasons is that Punjabi is a more inflected language than Hindi, which means there are more cues for the translation system to use when translating from Punjabi to Hindi. On our internal CDACN and FLORES test sets, the BLEU scores for all translation tasks range from 21.4 to 50.3, suggesting that both translation systems produce translations of reasonable quality. The chrF scores for all translation tasks range from 48.1 to 69.8, suggesting that the systems produce translations that are fluent and natural-sounding. The TER scores for all translation tasks range from 31.8 to 64.7, suggesting that the systems produce translations that are relatively accurate.

In the MTIL challenge, chrF is the official metric of evaluation. Here our systems scored the highest among all participating teams, with chrF reaching as high as 79.5775 and the lowest being 60.8540 across the domain-specific tasks. The BLEU and TER scores likewise reflect the models' capacity to translate domain-specific language with proficiency, showcasing their robustness and adaptability.

6. Conclusion

In this paper, we presented our work on building a bidirectional Hindi-Punjabi translation model using a fine-tuning methodology. Our system utilized the NLLB-200-3.3B pre-trained model to translate between Hindi and Punjabi across the General, Governance, and Healthcare domains. Our models achieved promising results in the MTIL track challenge at FIRE 2023, highlighting the efficacy of the methodology applied to these machine translation models. These empirical findings also establish a foundation for future work on further advancements and exploration in the realm of domain-specific machine translation.

7. Acknowledgements

We are sincerely thankful to the Ministry of Electronics and Information Technology (MeitY) for funding the NLTM-ILTM. We also express our thanks to Shri Vivek Khaneja, Executive Director, CDAC Noida, for his constant support and motivation. Finally, we are thankful to NPSF-AIRAWAT for providing the GPU compute infrastructure.

8. References

[1] I. Sutskever, O. Vinyals, Q. V. Le, Sequence to sequence learning with neural networks, in: Advances in Neural Information Processing Systems (NeurIPS), 2014, pp. 3104-3112.
[2] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, arXiv preprint arXiv:1409.0473 (2014).
[3] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, arXiv:1706.03762 (2017).
[4] Y. Tang, C. Tran, X. Li, P.-J. Chen, N. Goyal, V. Chaudhary, J. Gu, A. Fan, Multilingual translation with extensible multilingual pretraining and finetuning, arXiv:2008.00401 (2020). doi:10.48550/arXiv.2008.00401.
[5] O. Firat, K. Cho, Y. Bengio, Multi-way, multilingual neural machine translation with a shared attention mechanism, Association for Computational Linguistics, 2016, pp. 866-875. doi:10.18653/v1/N16-1101.
[6] M. Johnson, M. Schuster, Q. V. Le, M. Krikun, Y. Wu, Z. Chen, N. Thorat, F. Viégas, M. Wattenberg, G. Corrado, M. Hughes, J. Dean, Google's multilingual neural machine translation system: Enabling zero-shot translation, Transactions of the Association for Computational Linguistics 5 (2017) 339-351. doi:10.1162/tacl_a_00065.
[7] NLLB Team, M. R. Costa-jussà, J. Cross, O. Çelebi, M. Elbayad, K. Heafield, K. Heffernan, E. Kalbassi, J. Lam, D. Licht, J. Maillard, A. Sun, S. Wang, G. Wenzek, A. Youngblood, B. Akula, L. Barrault, G. M. Gonzalez, P. Hansanti, J. Hoffman, S. Jarrett, K. R. Sadagopan, D. Rowe, S. Spruit, C. Tran, P. Andrews, N. F. Ayan, S. Bhosale, S. Edunov, A. Fan, C. Gao, V. Goswami, F. Guzmán, P. Koehn, A. Mourachko, C. Ropers, S. Saleem, H. Schwenk, J. Wang, No language left behind: Scaling human-centered machine translation, arXiv:2207.04672 (2022).
[8] M. Popović, chrF: character n-gram F-score for automatic MT evaluation, Association for Computational Linguistics, 2015, pp. 392-395. doi:10.18653/v1/W15-3049.
[9] F. Guzmán, P.-J. Chen, M. Ott, J. Pino, G. Lample, P. Koehn, V. Chaudhary, M. Ranzato, The FLORES evaluation datasets for low-resource machine translation: Nepali-English and Sinhala-English, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 2019, pp. 6098-6111. Association for Computational Linguistics.
[10] OpenNMT-py, https://github.com/OpenNMT/OpenNMT-py
[11] T. Kudo, J. Richardson, SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing, in: Proceedings of EMNLP: System Demonstrations, 2018, pp. 66-71.
[12] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, BLEU: a method for automatic evaluation of machine translation, Association for Computational Linguistics, 2001, p. 311. doi:10.3115/1073083.1073135.
[13] R. Rei, C. Stewart, A. C. Farinha, A. Lavie, COMET: A neural framework for MT evaluation, Association for Computational Linguistics, 2020, pp. 2685-2702. doi:10.18653/v1/2020.emnlp-main.213.
[14] M. Snover, B. Dorr, R. Schwartz, L. Micciulla, J. Makhoul, A study of translation edit rate with targeted human annotation, in: Proceedings of the 7th Conference of the Association for Machine Translation in the Americas, 2006, pp. 223-231.
[15] B. Zoph, D. Yuret, J. May, K. Knight, Transfer learning for low-resource neural machine translation, Association for Computational Linguistics, 2016, pp. 1568-1575. doi:10.18653/v1/D16-1163.
[16] S. Gangopadhyay, G. Epili, P. Majumder, B. Gain, R. Appicharla, A. Ekbal, D. Sharma, Overview of MTIL track at FIRE 2023: Machine translation for Indian languages, in: Proceedings of the 15th Annual Meeting of the Forum for Information Retrieval Evaluation (FIRE 2023), 2023.
[17] S. Gangopadhyay, G. Epili, P. Majumder, B. Gain, R. Appicharla, A. Ekbal, D. Sharma, Overview of MTIL track at FIRE 2023: Machine translation for Indian languages, in: Working Notes of FIRE 2023, 2023.