Text-To-Picto Using Lexical Simplification
Notebook for the ImageCLEF Lab at CLEF 2024

Abbhinav Elliah, Ananth Narayanan P, Bhuvan S and P Mirunalini
Department of Computer Science and Engineering, Sri Sivasubramaniya Nadar College of Engineering, Tamil Nadu, India

Abstract
Augmentative and Alternative Communication (AAC) provides a lifeline for people with language impairments by employing pictograms to convey messages precisely. This study fine-tunes a pre-trained translation model for text-to-picto conversion, utilizing tokenization and lexical simplification. The model aids individuals with language impairments caused by genetic diseases or aphasia, showing its potential to simplify complex text for effective communication. Two models, GPT-2 and Helsinki-BERT, are fine-tuned on the given dataset. The Helsinki-NLP model demonstrated superior performance with a Picto-term Error Rate (PictoER) of 18.51. In contrast, the GPT-2 model had a much higher PictoER of 170.81 and was prone to producing extraneous terms. These results indicate that the Helsinki-NLP model is more effective at producing accurate, contextually relevant text aligned with pictogram keywords.

Keywords
Lexical simplification, Language-specific fine-tuning, GPT-2 model, Helsinki BERT, NLP tokenizer, Keyword mapping, Picto-term Error Rate

1. Introduction
AAC provides a solution for those with language problems brought on by conditions such as aphasia. These systems communicate through pictograms; however, translating text or spoken language into comprehensible pictogram sequences remains a barrier. In the ToPicto subtask of ImageCLEF 2024 [1], the proposed system fine-tunes an existing model on the given dataset for pictogram generation via lexical simplification.
This task [2] introduces two new challenges whose objective is to translate natural language, either (i) text or (ii) speech, into pictograms understandable by the users, in this case people with language impairments.

2. Related Works
Radford et al. [3] offer a thorough analysis of the capabilities and performance of the GPT-2 language model across a range of natural language processing tasks, including in-depth assessments on datasets such as CoQA and the CNN/Daily Mail dataset, as well as summarization and translation tasks. Pre-trained models for natural language processing (NLP) are built on large general-purpose corpora and are therefore less effective on classification and prediction tasks over custom datasets. Houlsby et al. [4] experimented with parameter-efficient transfer learning for NLP to improve accuracy on such tasks. This idea was further extended to other pre-trained models, such as RoBERTa by Liu et al. [5], where a larger dataset was used along with calibrated hyperparameters, and LexFit by Vulić et al. [6], where lexical fine-tuning is implemented. Recent progress in natural language processing has been driven by advances in both model architecture and model pre-training. Wolf et al. [7] introduced a library of Transformer architectures covering higher-capacity models, together with a highly optimized tokenization library built in Rust; this was extended by the release of the open-source Transformers library in Python. Qiang et al. [8] proposed LSBert, based on the pre-trained representation model BERT, which uses a dataset for fine-tuning during simplification and substitutes candidates via complex word identification, substitute generation, filtering, and substitute ranking.

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
abbhinav2210396@ssn.edu.in (A. Elliah); ananthnarayanan2210384@ssn.edu.in (A. N. P); bhuvan2210511@ssn.edu.in (B. S); miruna@ssn.edu.in (P. Mirunalini)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

3. Approach
In this work, two distinct methods were explored for lexical simplification using pre-trained models. The first method uses the GPT-2 architecture, while the second leverages the Helsinki-NLP/opus-mt-ROMANCE-en model.

In the initial approach, a sentence compression model is developed using a fine-tuned GPT-2 architecture with an additional linear layer for output generation. The process begins by reading source and target sentences and creating a dataset in which each entry pairs a source sentence with its compressed counterpart. One class manages the data, while another extends the pre-trained GPT2LMHeadModel¹ by adding a linear layer that maps GPT-2's hidden states to the vocabulary size. The model is trained for 10 epochs with a batch size of 16 using the Adam optimizer. During training, sentences are tokenized with padding to ensure uniform sequence lengths, and the CrossEntropyLoss function computes the loss between the predicted and target sequences. The model and optimizer states are saved and reloaded for further training.

The second approach utilizes the Helsinki-NLP/opus-mt-ROMANCE-en model², a pre-trained translation model originally designed for the Romance languages of Europe, such as French, Spanish, and Italian. Although its primary focus is translation, the model is adapted for lexical simplification within the context of the ToPicto task. The goal is to convert complex French utterances into simplified sequences of terms linked to pictograms, thereby enhancing communication for individuals with language impairments.
By fine-tuning the model on a specialized dataset, the proposed system explores its efficacy in simplifying text while preserving semantic integrity.

3.1. Data Preprocessing
The dataset for this task is provided by the ImageCLEF 2024 organizers. It is structured in JSON format and comprises training, validation, and test sets. Each entry in the dataset includes:
- id: a unique identifier for each utterance.
- src: the source text, an oral transcription in French.
- tgt: the target sequence of simplified pictogram terms.
- pictos: a list of pictogram identifiers corresponding to each term in the target sequence.

Figure 1: Data Description

¹ GPT-2 pre-trained model documentation: https://huggingface.co/docs/transformers/en/model_doc/gpt2
² Helsinki-NLP repository: https://clarifai.com/helsinkinlp/translation/models/text-translation-romance-lang-english

Preprocessing involves loading the dataset, extracting the relevant fields (src and tgt), and tokenizing the text using the Helsinki-NLP tokenizer. Tokenization converts the text into a format suitable for model processing while preserving linguistic nuances and syntax.

3.2. Proposed Model
The Helsinki-NLP/opus-mt-ROMANCE-en model is part of the OPUS-MT project at the University of Helsinki. It is built on the MarianMT framework, a highly optimized neural machine translation (NMT) system developed by the Marian NMT group. The model is pre-trained on a vast multilingual corpus focused on the Romance languages (such as French, Spanish, Italian, Portuguese, and Romanian). It uses a transformer architecture, renowned for its effectiveness on sequence-to-sequence tasks thanks to its self-attention mechanisms and parallel processing capabilities.

3.2.1. Encoder-Decoder Framework
The model employs a standard transformer architecture with an encoder-decoder structure.
The encoder processes the input sequence and generates contextual embeddings, which the decoder then uses to produce the output sequence.

3.2.2. Self-Attention Mechanism
Both the encoder and decoder utilize self-attention layers, allowing the model to dynamically weigh the importance of different tokens in the sequence. This mechanism helps capture long-range dependencies and contextual information. The model is fine-tuned on the ToPicto dataset, adjusting its parameters to learn the mapping from complex source texts to simplified target sequences. Fine-tuning leverages the model's pre-existing linguistic knowledge, adapting it to the specific requirements of the lexical simplification task.

3.3. Methodology
The hardware specifications of the system used for model training are as follows:
CPU: 12th Gen Intel(R) Core(TM) i7-12700H
GPU: NVIDIA GeForce RTX 3060

The training setup is built on the Hugging Face Transformers library, which provides specialized tools and classes for sequence-to-sequence tasks. The fine-tuning process begins by loading the training datasets from the JSON files and extracting the source ("src") and target ("tgt") fields. The source and target texts are then tokenized with the Helsinki-NLP tokenizer so that they are formatted correctly for fine-tuning. Training arguments, chosen after validating the data, specify parameters such as the output directory, batch size (here, 4), number of epochs (here, 3), and logging frequency (here, 100) to allow comprehensive monitoring and control of the training process. Fine-tuning optimizes the model's parameters on the training dataset, using backpropagation and gradient descent to minimize the loss function and improve the model's accuracy in generating the desired sequences.
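The loading and pair-extraction step described above can be sketched as follows. This is a minimal illustration only: the helper name load_pairs and the assumption that each JSON file is a top-level list of entries are ours, inferred from the data description in Section 3.1.

```python
import json

def load_pairs(path):
    """Load a ToPicto-style JSON file and return (src, tgt) training pairs."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    # Each entry carries an id, the French source transcription (src),
    # the simplified pictogram-term sequence (tgt), and pictogram ids (pictos);
    # only src and tgt are needed for fine-tuning.
    return [(entry["src"], entry["tgt"]) for entry in data]
```

The resulting (src, tgt) pairs are then passed through the tokenizer before being handed to the trainer.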
During this phase, the model is trained iteratively over multiple epochs, with periodic evaluations on the validation dataset to prevent overfitting and ensure generalizability. The fine-tuned model is then saved so that the trained parameters are preserved for deployment. During inference, the model generates hypotheses (hyp) for the test set, which are post-processed to ensure conformity with the expected output format and semantic coherence.

4. Results and Discussion
We experimented with the two models discussed above: GPT-2 and Helsinki-NLP/opus-mt-ROMANCE-en. The picto images are provided by the ImageCLEF 2024 organizers, and the image sequences are generated using the script file provided for the ToPicto task. Consider, for example, the source text "ils ont un accent eux aussi euh".

Figure 2: Helsinki-NLP/opus-mt-ROMANCE-en model
Generated Sequence: ils avoir un dire eux

Figure 3: GPT-2 model
Generated Sequence: ils ont un accent eux aussi euh

The Helsinki-NLP model produces a more accurate output for the given source text, aligned with its meaning, whereas the GPT-2 model gives a comparatively less accurate result, largely because GPT-2 is predominantly trained on English data. The performance of the proposed architecture was evaluated using the Picto-term Error Rate (PictoER) [9], BLEU score [10], and METEOR [11] metrics. Based on these metrics, the performance of the two runs reveals noteworthy insights.
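PictoER is reported here as a word-level error rate over pictogram terms. Assuming it follows the standard WER-style edit-distance formulation (substitutions, insertions, and deletions divided by the number of reference terms), a minimal sketch is:

```python
def picto_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level error rate, as a percentage: (substitutions + insertions
    + deletions) / number of reference terms."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over terms, computed by dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)
```

Because insertions are counted against the reference length, the rate can exceed 100 when the hypothesis contains many extraneous terms; this is consistent with GPT-2's PictoER of 170.81 reported below.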
Table 1
Performance Comparison of Helsinki BERT and GPT-2 Models

Model           PictoER   BLEU    METEOR
Helsinki BERT   18.51     68.96   83.55
GPT-2           170.81    3.93    25.57

The Helsinki BERT model demonstrates superior performance in generating French text that aligns closely with the picto keywords, evidenced by its high BLEU score of 68.96, METEOR score of 83.55, and low PictoER of 18.51. These results indicate the model's effectiveness in producing fluent, contextually accurate text with minimal error in keyword mapping. The strong BLEU and METEOR scores highlight the model's ability to preserve n-gram overlaps and to account for synonymy, stemming, and paraphrase matching, making it well suited to tasks requiring precise linguistic and semantic accuracy. It is worth noting that, while the Helsinki BERT model is trained on a diverse set of languages including French, this multilingual training could contribute to a slight reduction in its BLEU score owing to the broad scope of its training data.

Conversely, the GPT-2 model is significantly weaker at producing cohesive and contextually appropriate French, as evidenced by its BLEU score of 3.93, METEOR score of 25.57, and high PictoER of 170.81. The high PictoER indicates considerable keyword-alignment problems, while the poor BLEU and METEOR scores show the model's inability to maintain contextual relevance and fluency. The significant difference between the two demonstrates that models must be tailored to the target language and application area; here, the Helsinki BERT model's specialized approach outperforms GPT-2's generalist capabilities. The predominantly English datasets used to train GPT-2 limit its ability to perform well on French-language tasks, although the model did show some ability to correctly predict numbers and nouns.
5. Conclusion
In conclusion, the Helsinki-NLP model performs better than the GPT-2 model at producing appropriate pictos for a given text. Both models can predict the keywords of a given phrase with high accuracy; however, the former predicts pronouns and composes a meaningful picto combination with a higher degree of confidence, since it is pre-trained on French data rather than, as with GPT-2, predominantly English data.

6. Future Work
Future work may involve enhancing the existing models, for example by further fine-tuning GPT-2, or creating novel models to improve their efficacy on similar language-specific tasks. Error rates can be improved through hyperparameter tuning or by removing erroneous words, which can be achieved with more training data and by experimenting with other pre-trained models.

References
[1] B. Ionescu, H. Müller, A. Drăgulinescu, J. Rückert, A. Ben Abacha, A. García Seco de Herrera, L. Bloch, R. Brüngel, A. Idrissi-Yaghir, H. Schäfer, C. S. Schmidt, T. M. Pakull, H. Damm, B. Bracke, C. M. Friedrich, A. Andrei, Y. Prokopchuk, D. Karpenka, A. Radzhabov, V. Kovalev, C. Macaire, D. Schwab, B. Lecouteux, E. Esperança-Rodier, W. Yim, Y. Fu, Z. Sun, M. Yetisgen, F. Xia, S. A. Hicks, M. A. Riegler, V. Thambawita, A. Storås, P. Halvorsen, M. Heinrich, J. Kiesel, M. Potthast, B. Stein, Overview of ImageCLEF 2024: Multimedia retrieval in medical applications, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction, Proceedings of the 15th International Conference of the CLEF Association (CLEF 2024), Springer Lecture Notes in Computer Science (LNCS), Grenoble, France, 2024.
[2] C. Macaire, E. Esperança-Rodier, B. Lecouteux, D. Schwab, Overview of ImageCLEF 2024: Investigating the translation of natural language into pictograms, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction, https://ceur-ws.org/, 2024.
[3] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are unsupervised multitask learners, OpenAI blog 1 (2019) 9.
[4] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, S. Gelly, Parameter-efficient transfer learning for NLP, in: Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, 2019, pp. 2790–2799.
[5] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019).
[6] I. Vulić, E. M. Ponti, A. Korhonen, G. Glavaš, LexFit: Lexical fine-tuning of pretrained language models, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021, pp. 5269–5283.
[7] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, et al., Transformers: State-of-the-art natural language processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2020, pp. 38–45.
[8] J. Qiang, Y. Li, Y. Zhu, Y. Yuan, X. Wu, LSBert: A simple framework for lexical simplification, arXiv preprint arXiv:2006.14939 (2020).
[9] J. Woodard, J. Nelson, An information theoretic measure of speech recognition performance, in: Workshop on Standardisation for Speech I/O Technology, Naval Air Development Center, Warminster, PA, 1982.
[10] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, BLEU: A method for automatic evaluation of machine translation, in: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), 2002, pp. 311–318.
[11] S. Banerjee, A. Lavie, METEOR: An automatic metric for MT evaluation with improved correlation with human judgments, in: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005, pp. 65–72.