A Transformer Based Approach for Text-to-Picto Generation

Avaneesh Koushik

Jithu Morrison S

P Mirunalini

Jothir Aditya R K

Department of Computer Science and Engineering, Sri Sivasubramaniya Nadar College of Engineering, Tamil Nadu

This study develops a Text-to-Pictogram translation system that converts French text into its corresponding pictogram terms. The proposed system demonstrates the effectiveness of a transformer-based model in translating French text into meaningful pictogram sentences. Google-T5 is fine-tuned on a custom dataset of French text to predict the corresponding pictogram terms in French. The model was fine-tuned over multiple epochs to optimize performance, and the trained model was then iteratively fine-tuned further to enhance its translation capabilities. The PictoER, BLEU and METEOR scores were used to assess the model's performance. The proposed model achieved a PictoER score of 13.9, a BLEU score of 74.3 and a METEOR score of 87.0.

Keywords: Text to Pictogram, French text generation, Transformers model, Google-T5

1. Introduction

those with language impairments, thereby contributing to the advancement of computational methods in assistive technologies and enhancing the quality of life for individuals with aphasia.

2. Background

Text-to-pictogram translation converts natural language text into a sequence of words for which appropriate pictograms are available. A unified approach to transfer learning in NLP can be achieved by treating every text processing problem as a “text-to-text” problem, i.e. taking text as input and producing new text as output [2]. This approach was adopted for the proposed model because the dataset for this task comprises input and output texts. T5 models pretrained on language-specific data have been reported to perform significantly better than the original T5 models [3].

A shallow linguistic analysis approach can be used for text-to-picto conversion [4]. Shallow linguistic analysis involves basic processing steps such as tokenization and POS tagging, without deep semantic analysis. Transformers are powerful tools that help in building more complex and effective models for sequence-based tasks. Transformer architectures have facilitated building higher-capacity models, and pre-training has made it possible to effectively utilize this capacity for a wide variety of tasks [5]. The original text-to-picto system, aimed at people with an intellectual disability, can be extended to various other applications [6].

Encoder-decoder models perform better than other models on textual similarity tasks [7]. Google-T5 was found to perform better than approaches such as LSTM and CNN on hate speech detection [8], and it was found to be suitable for question-answer generation [9]; both tasks involve pattern detection in text.

3. Approach

Google T5, or Text-to-Text Transfer Transformer, is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised and supervised tasks. It is known to work well out of the box on a variety of tasks. This task involves converting French text from various everyday contexts into simpler words that each have a corresponding pictogram available. The main objective of this approach is to develop a Text-to-Picto translation system using the T5 model.

3.1. Dataset

The dataset was built from the TCOF corpus and is stored in JSON format. TCOF contains interactions between adults, between adults and children, and between children, covering a wide range of topics including debates, everyday situations, and medical consultations. This type of text is representative of the interactions observed between caregivers (families, medical staff) and individuals who rely on pictograms due to language impairments [10]. Each entry in the dataset contains four fields: "id", a unique identifier for the source-target pair; "src", the source text, which is an oral transcription of a sentence spoken in French; "tgt", the target sequence of simplified pictogram terms; and "pictos", a list that assigns a pictogram identifier to each term in the target sequence.
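The entry layout described above can be illustrated with a short Python sketch. The example entry is hypothetical: the "id", the French strings, and the pictogram identifiers are made up for illustration, and the assumption that the file holds a top-level list of entries is ours rather than something stated in the dataset description.

```python
import json

# Hypothetical entry in the described layout. The id, the French text and
# the pictogram identifiers are made up; only the field names come from
# the dataset description. A top-level JSON list is also an assumption.
raw = """
[
  {
    "id": "example_0001",
    "src": "je vais à l'école",
    "tgt": "je aller à école",
    "pictos": [101, 102, 103, 104]
  }
]
"""

def load_pairs(json_text):
    """Return (src, tgt) pairs, checking that each target term has a picto id."""
    entries = json.loads(json_text)
    pairs = []
    for entry in entries:
        terms = entry["tgt"].split()
        if len(terms) != len(entry["pictos"]):
            raise ValueError(
                f"entry {entry['id']}: {len(terms)} terms but "
                f"{len(entry['pictos'])} pictogram ids"
            )
        pairs.append((entry["src"], entry["tgt"]))
    return pairs

pairs = load_pairs(raw)
```

The length check reflects the statement that "pictos" assigns one identifier to each term of "tgt"; it would need to be relaxed if the real data does not follow that one-to-one layout.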

3.2. Data Preprocessing

Analysis of the dataset showed that it contains 24270 lines of French text, each with a corresponding target text. The average length of the source lines is 54.6 words and the average length of the target texts is 53.8 pictogram terms.

3.2.1. Tokenization

Tokenization prepares the data for model training. Here, both the source French text and the target pictogram sequence are tokenized using the pre-trained tokenizer of the Google-T5 model. The text is split into individual tokens, which are converted to a numerical representation. Sequences are padded or truncated so that each has a maximum length of 256 tokens.
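The padding and truncation step can be sketched in plain Python as follows. This is an illustration of the idea only, not the tokenizer's actual implementation: it operates on token ids that are already available, uses 0 as the padding id (the value used by the T5 vocabulary), and ignores the end-of-sequence token. If the Hugging Face tokenizer is used (an assumption, since the text does not name the library), the same effect is obtained through its `padding`, `truncation` and `max_length` arguments.

```python
MAX_LENGTH = 256  # maximum sequence length stated in the text
PAD_ID = 0        # padding id; T5's vocabulary uses 0 for padding

def pad_or_truncate(token_ids, max_length=MAX_LENGTH, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), each exactly max_length long."""
    kept = list(token_ids[:max_length])      # truncate long sequences
    n_pad = max_length - len(kept)           # number of padding slots needed
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask

# A short sequence is padded; a 300-token sequence is cut to 256 tokens.
short_ids, short_mask = pad_or_truncate([5, 6, 7])
long_ids, long_mask = pad_or_truncate(list(range(1, 301)))
```

The attention mask lets the model ignore padded positions, so that padding does not influence the attention scores.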

3.3. Model Selection

The T5 model comes pre-trained on a large data corpus and can be fine-tuned for particular tasks. Here, the model is fine-tuned to produce simpler pictogram terms from oral transcriptions in French. Because the model is inherently trained to solve text-to-text tasks, it is well suited to this task. The "t5-base" variant of Google’s T5 model is used. t5-base is the T5 checkpoint with 220 million parameters; this moderate size makes it flexible and relatively easy to fine-tune on the given French text data.

3.3.1. Self-Attention mechanism

The self-attention mechanism enables the model to identify long-range links and dependencies in the input sequence. More precisely, the T5 model calculates attention scores between every pair of tokens in the input sequence using self-attention layers. By assigning a different weight (attention score) to each token according to its relevance to the token being processed, self-attention allows every token in the sequence to attend to every other token.
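As a rough illustration of this pairwise weighting, the following Python sketch implements generic scaled dot-product self-attention on toy two-dimensional vectors. It is a simplification under stated assumptions: the same vectors serve as queries, keys and values, whereas the actual T5 layers add learned projections, multiple heads and relative position information.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def self_attention(queries, keys, values):
    """Return (outputs, weights) for generic scaled dot-product attention."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # One score per (query token, key token) pair.
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)  # attention weights for this token; they sum to 1
        weights.append(w)
        # Output is the attention-weighted average of the value vectors.
        outputs.append([
            sum(w_j * v[i] for w_j, v in zip(w, values))
            for i in range(len(values[0]))
        ])
    return outputs, weights

# Toy "token embeddings" (made up); queries = keys = values for simplicity.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
```

Each row of `weights` holds the attention one token pays to every token in the sequence, including itself.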

3.3.2. Encoder-decoder mechanism

The T5 model is based on the traditional transformer architecture, comprising an encoder-decoder structure. The input sequence is processed by the encoder, which outputs contextual embeddings. The output sequence is subsequently generated by the decoder using these embeddings.

3.4. Methodology

The dataset is loaded from the file, and the source text (src) and target sequence (tgt) of every entry are extracted, stored as lists, and tokenized. The tokenized data is then converted to key-value pairs and fed as input to the model. Training is controlled by several arguments. The batch size, which is the number of training instances processed in each iteration, is set to 16. The number of training epochs is set to 3, 5, and 6, where the 6-epoch setting consists of an initial 3 epochs of training followed by 3 further epochs on the already trained model. The save steps are set to 1,000, so that a model checkpoint is saved every 1,000 steps to record training progress. The learning rate of the optimizer (such as Adam) is chosen to update the model parameters effectively in light of the training data. The goal of this optimization is to increase model performance by minimizing the loss function. The model is trained using the hyperparameters listed in Table 1.
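To make the interaction between batch size, epochs and save steps concrete, the following Python sketch computes the number of optimizer steps and checkpoints for a given epoch count. It rests on assumptions not stated in the text: that all 24270 lines reported in the Data Preprocessing section are used for training, that training runs on a single device without gradient accumulation, and that the last partial batch of each epoch is kept.

```python
import math

NUM_EXAMPLES = 24270  # dataset size reported earlier (assumed to be all training data)
BATCH_SIZE = 16       # batch size stated in the text
SAVE_STEPS = 1000     # checkpoint interval stated in the text

def training_schedule(num_epochs, num_examples=NUM_EXAMPLES,
                      batch_size=BATCH_SIZE, save_steps=SAVE_STEPS):
    """Return (steps_per_epoch, total_steps, checkpoint_steps)."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * num_epochs
    # A checkpoint is written at every multiple of save_steps.
    checkpoint_steps = list(range(save_steps, total_steps + 1, save_steps))
    return steps_per_epoch, total_steps, checkpoint_steps

for epochs in (3, 5):
    per_epoch, total, checkpoints = training_schedule(epochs)
    print(f"{epochs} epochs: {per_epoch} steps per epoch, "
          f"{total} steps in total, {len(checkpoints)} checkpoints")
```

Under these assumptions one epoch takes 1517 steps, so a 3-epoch run takes 4551 steps and writes 4 checkpoints, while a 5-epoch run takes 7585 steps and writes 7.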

The sequence of pictogram terms generated by the proposed model is converted to the corresponding pictos sequence using the resources provided by the task organisers.

For example: If the input source text is: "il y a un moment donné elle nous avait dit essayez de pas dire de mots français pendant le truc.", the model generates the following sequence of pictogram terms: "il_y_a un instant donner passé elle nous dire essayer de dire non de mot français pendant le truc". Figure 2 shows the pictos sequence corresponding to the above generated sequence of pictogram terms.
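The conversion from generated terms to a pictos sequence amounts to a lexicon lookup, which can be sketched in Python as below. The lexicon here is hypothetical: the real term-to-pictogram mapping comes from the resources provided by the task organisers, and the identifiers and the handling of unknown terms are assumptions made for illustration.

```python
# Hypothetical lexicon. The real mapping comes from the organisers'
# resources; these pictogram identifiers are made up.
LEXICON = {
    "il_y_a": 1001,
    "un": 1002,
    "instant": 1003,
}

def terms_to_pictos(term_sequence, lexicon, unknown_id=None):
    """Map each whitespace-separated term to its pictogram id.

    Terms missing from the lexicon are mapped to unknown_id.
    """
    return [lexicon.get(term, unknown_id) for term in term_sequence.split()]

picto_ids = terms_to_pictos("il_y_a un instant truc", LEXICON)
```

Multi-word expressions such as "il_y_a" are joined by underscores in the generated sequence, so splitting on whitespace keeps each of them as a single term.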

3.4.1. Resources Used

Pandas is used in the project to manipulate data, including loading data from JSON files and structuring it into dataframes. PyTorch, the main deep learning framework, is used to apply and train the T5 model, which generates pictogram sequences from French text. Google Colab offers a cloud-based Jupyter notebook environment with GPU-accelerated resources for quicker model training and inference. Together, these tools improve productivity and efficiency by streamlining the development and experimentation process.

4. Results and Discussion

The metrics used for evaluating the model are the Picto-term Error Rate (PictoER) [11], BLEU [12] and METEOR [13]. The model was trained for different numbers of epochs to test the improvement in learning. The results of the proposed model for the different numbers of epochs are tabulated in Table 2.
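For intuition about the error-rate metric, the following Python sketch computes a term-level error rate as the edit distance between the reference and generated term sequences, divided by the reference length. It assumes that PictoER is computed in the same way as word error rate over pictogram terms, which is consistent with the cited reference but is not spelled out in the text; it is not the official scoring script.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two lists of terms."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_term in enumerate(reference, start=1):
        current = [i]
        for j, hyp_term in enumerate(hypothesis, start=1):
            cost = 0 if ref_term == hyp_term else 1
            current.append(min(
                previous[j] + 1,         # deletion of a reference term
                current[j - 1] + 1,      # insertion of a hypothesis term
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]

def term_error_rate(reference_text, hypothesis_text):
    """Error rate in percent, assuming a WER-style definition."""
    reference = reference_text.split()
    hypothesis = hypothesis_text.split()
    if not reference:
        raise ValueError("reference must contain at least one term")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)

# One substitution and one deletion against a five-term reference.
score = term_error_rate("il_y_a un instant donner passé",
                        "il_y_a un moment donner")
```

With this definition lower is better, which matches the way the PictoER results are interpreted in this section, whereas higher BLEU and METEOR scores indicate better output.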

Comparing the outcomes obtained with different numbers of training epochs yields insightful observations. The error rate decreased from 18 to 17 when training for 5 epochs, suggesting some improvement in performance. When the model was initially trained for 3 epochs and then fine-tuned for an additional 3 epochs, the PictoER score dropped significantly from 17.5 to 13.9, suggesting improved generalization and performance on unseen data. This is also reflected in the BLEU score, which measures n-gram precision, and the METEOR score, which combines unigram precision and recall with a penalty for word-order differences; both show considerable improvement. This improvement underscores the effectiveness of fine-tuning in refining the model parameters and enhancing the model's ability to capture underlying patterns in the data.

The results suggest that the model may not have reached its maximum potential during the first training period. Rather, the model’s representations were gradually improved through the repeated training process, which improved generalization and decreased the error rate. These findings highlight the importance of iterative training strategies and the need for careful experimentation to achieve optimal results. They indicate that iterative fine-tuning can continue to enhance the model’s performance and improve its adaptability to a variety of datasets and applications.

5. Conclusion

In conclusion, this research demonstrates the effectiveness of advanced transformer models, specifically Google-T5, for the task of translating French text into pictogram sequences. Through iterative fine-tuning, the model consistently improved in accuracy, demonstrating its ability to handle intricate aspects of language. This was evaluated using the PictoER, BLEU, and METEOR scores. This research emphasizes how transformer-based techniques can improve accessibility and communication for people who use augmentative and alternative forms of communication. Iterative fine-tuning may further enhance the model’s performance and improve its adaptability to a variety of datasets. To further advance the fields of assistive technology and natural language processing, future studies could explore extending this strategy to other languages and improving the model’s adaptability to diverse linguistic contexts.

[1] Ionescu, Müller, Drăgulinescu, Rückert, A. Ben Abacha, García Seco de Herrera, L. Bloch, Brüngel, Idrissi-Yaghir, Schäfer, C. S. Schmidt, T. M. Pakull, Damm, Bracke, C. M. Friedrich, Andrei, Prokopchuk, Karpenka, Radzhabov, Kovalev, Macaire, Schwab, Lecouteux, Esperança-Rodier, Yim, Fu, Sun, Yetisgen, Xia, S. A. Hicks, M. A. Riegler, Thambawita, Storås, Halvorsen, Heinrich, Kiesel, Potthast, Stein, Overview of ImageCLEF 2024: Multimedia retrieval in medical applications, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction, Proceedings of the 15th International Conference of the CLEF Association (CLEF 2024), Springer Lecture Notes in Computer Science LNCS, Grenoble, France, 2024.

[2] Raffel, Shazeer, Roberts, Lee, Narang, Matena, Zhou, Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, Journal of machine learning research 21 (2020) 1-67.

[3] Carmo, Piau, I. Campiotti, Nogueira, Lotufo, Ptt5: Pretraining and validating the t5 model on brazilian portuguese data, arXiv preprint arXiv:2008.09144 (2020).

[4] Vandeghinste, I. S. L. Sevens, F. Van Eynde, Translating text into pictographs, Natural Language Engineering 23 (2017) 217-244.

[5] Wolf, Debut, Sanh, Chaumond, Delangue, Moi, Cistac, Rault, Louf, Funtowicz, et al., Transformers: State-of-the-art natural language processing, in: Proceedings of the 2020 conference on empirical methods in natural language processing: system demonstrations, 2020, pp. 38-45.

[6] Norré, Vandeghinste, Bouillon, T. François, Extending a text-to-pictograph system to french and to arasaac, in: Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), 2021, pp. 1050-1059.

[7] Ni, G. H. Abrego, Constant, J. Ma, K. B. Hall, D. Cer, Y. Yang, Sentence-t5: Scalable sentence encoders from pre-trained text-to-text models, arXiv preprint arXiv:2108.08877 (2021).

[8] Adewumi, S. S. Sabry, Abid, Liwicki, M. Liwicki, T5 for hate speech, augmented data, and ensemble, Sci 5 (2023). URL: https://www.mdpi.com/2413-4155/5/4/37. doi:10.3390/sci5040037.

[9] Kumar, Chauhan, P. Kumar C., Learning enhancement using question-answer generation for e-book using contrastive fine-tuned t5, in: P. P. Roy, A. Agarwal, T. Li, P. Krishna Reddy, R. Uday Kiran (Eds.), Big Data Analytics, Springer Nature Switzerland, Cham, 2022, pp. 68-87.

[10] André, E. Canut, Mise à disposition de corpus oraux interactifs : le projet tcof (traitement de corpus oraux en français), Pratiques. Linguistique, littérature, didactique 147-148 (2010) 35-51.

[11] J. P. Woodard, J. T. Nelson, An information theoretic measure of speech recognition performance, in: Workshop on standardisation for speech I/O technology, Naval Air Development Center, Warminster, PA, 1982.

[12] Papineni, Roukos, Ward, W.-J. Zhu, Bleu: a method for automatic evaluation of machine translation, in: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, Association for Computational Linguistics, 2002, pp. 311-318.

[13] Banerjee, Lavie, Meteor: An automatic metric for mt evaluation with improved correlation with human judgments, in: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Association for Computational Linguistics, 2005, pp. 65-72.