Use of SimpleT5 for the CLEF workshop JokeR: Automatic Pun and Humor Translation

Loïc Glemarec
Université de Bretagne Occidentale (UBO)
loic.glemarec3010@gmail.com, https://gitlab.com/loicgle

CLEF 2022: Conference and Labs of the Evaluation Forum, September 5–8, 2022, Bologna, Italy

Abstract
In this paper, we present our work in the JokeR workshop. JokeR is a project that aims to study localization strategies for humor and puns and to create a multilingual parallel corpus and evaluation metrics, as a step towards automatic pun and humor translation. Three tasks were proposed; to predict the requested fields for each of them, we trained the T5 model, which specializes in natural language processing tasks. We then applied the trained models to the test datasets, generating the runs as requested. In doing so, we propose a simple method to study the translatability of puns, and we suggest another way of implementing our method to obtain more varied results. The overall goal is to study the multilingual character of wordplay.

Keywords
Computational Humor, CLEF, JokeR, T5 models, SimpleT5

1. Introduction

In the context of the CLEF workshop JokeR [1], three tasks were proposed: 1) classification of wordplay instances, 2) translation of single-word wordplay, and 3) translation of phrases containing wordplay. We propose to use the T5 model [2]. T5 models (T5, mT5 or byT5) are mainly used for natural language processing tasks and make use of Transformers [3]. To train them, we use the SimpleT5 library [4]. SimpleT5 is built on PyTorch Lightning [5], a high-level Python library for PyTorch [6], a machine learning framework. SimpleT5 provides an easy way to use pre-trained models and makes it possible to train our T5 models with a few lines of code.

We first explain our method, how to train the model, and how to use it. We then present our results, explain why they are not so good, and propose a way to obtain better results while keeping our process.

2. Method

For Tasks 2 and 3, the objective is to translate from English to French; for Task 1, it is to predict classification fields for a given sentence. The method is the same for each task: we first train the T5 model to predict what we want, then we generate the runs with a script that applies the model to the input dataset.

2.1. Training scripts

The training scripts are divided into two parts: the first is the reading of the dataset and its formatting, then comes the training of the model.

For Tasks 2 and 3, the input dataset is composed of three fields: the identifiers, the words or sentences in English, and the same words or sentences in French. Before doing anything else, we clean the data, that is to say we go through all the existing fields and check both types and content. A column that must contain character strings must not contain anything other than strings; likewise, no null values may appear anywhere in the dataset. We then split the data, held as data frames, into two parts, a train part and a test part, and shuffle both. We finish by removing the identifier field and renaming the "en" and "fr" columns to "source_text" and "target_text" respectively (cf. Table 1).

Table 1. Example of a "source_text"/"target_text" pair (Task 3)

"source_text": My boyfriend and I started to date after he backed his car into mine. We met by accident.
"target_text": Mon petit ami et moi avons commencé à sortir ensemble après qu'il ait reculé sa voiture contre la mienne. Nous nous sommes rencontrés par accident.

Model training can then begin using the SimpleT5 library: we first load a pre-trained T5 model, then run its training method.
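As an illustration of this preparation and training step, the sketch below shows how it can be done with pandas and SimpleT5. It is a minimal sketch: the file name, the "id"/"en"/"fr" column names, the base model (t5-base) and the hyper-parameters are assumptions made for the example, not a record of the exact settings we used.

```python
import pandas as pd
from simplet5 import SimpleT5

# Read the Task 3 training corpus (hypothetical file and column names).
df = pd.read_csv("task3_train.csv")

# Cleaning: keep only rows where both texts are non-null strings.
df = df.dropna()
df = df[df["en"].apply(lambda x: isinstance(x, str)) &
        df["fr"].apply(lambda x: isinstance(x, str))]

# Shuffle, then split into a train part and a test part.
df = df.sample(frac=1, random_state=42)
split = int(0.8 * len(df))
train_df, test_df = df.iloc[:split], df.iloc[split:]

# Drop the identifiers and rename the columns as SimpleT5 expects.
rename = {"en": "source_text", "fr": "target_text"}
train_df = train_df.drop(columns=["id"]).rename(columns=rename)
test_df = test_df.drop(columns=["id"]).rename(columns=rename)

# Load a pre-trained T5 model and fine-tune it on the pairs.
model = SimpleT5()
model.from_pretrained(model_type="t5", model_name="t5-base")
model.train(train_df=train_df,
            eval_df=test_df,
            source_max_token_len=128,
            target_max_token_len=128,
            batch_size=8,
            max_epochs=3,
            use_gpu=True,
            outputdir="outputs")
```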
Once the T5 model is trained, it is ready to predict the "target_text" field from the "source_text" given to it. Note that we have given the example of Task 3 because it is the simplest and most condensed to present; for Task 1 the process is the same. The only difference is the number of fields to predict, so we run one training per field to predict. The shape of the data frames is the same.

2.2. Generalizing script

Now that we have a trained model, we can launch our runs. Of course, just as during training, we must clean the input dataset to make sure it contains no erroneous values. For this, we just need a data frame with a column containing values similar to the preceding "source_text" fields. We then load the trained model and, using the apply method of the pandas library, predict one by one all the desired corresponding values ("target_text"). We finish by formatting all the results in the way requested by JOKER.
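A sketch of this run-generation step, under the same assumptions as above (the checkpoint directory, file names and output column names are placeholders chosen for the example, not the official JOKER format):

```python
import pandas as pd
from simplet5 import SimpleT5

# Load the fine-tuned checkpoint saved by the training step
# (the directory name below is a placeholder).
model = SimpleT5()
model.load_model("t5", "outputs/simplet5-last-checkpoint", use_gpu=True)

# Read and clean the input dataset, then predict each "target_text"
# one by one with pandas' apply.
test = pd.read_csv("task3_test.csv").dropna()
test["source_text"] = test["en"].astype(str)
test["target_text"] = test["source_text"].apply(lambda s: model.predict(s)[0])

# Write the results out; the columns kept here are illustrative,
# the actual run files follow the format requested by JOKER.
test[["id", "target_text"]].to_csv("run_task3.csv", index=False)
```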
2.3. Method summarization

We have seen that, to predict each desired field in Tasks 1, 2, and 3, it was necessary each time to organize and format the input data in order to train the T5 model. We also saw that, once a model was trained, it could then be applied to a given input dataset, while being rigorous about the formatting of the output data, which must correspond to the requested format.

3. Results

3.1. Task 1

Task 1 is quite special because of its number of fields to predict. We therefore chose to train one model for each of these fields; it would have been quite possible, and even better, to couple certain fields during prediction. Here are the different elements we have been able to observe (cf. Appendix A):

• Location: the location field predictions are globally accurate (80 %). Two cases can be distinguished:
  – Short wordplays: they contain only 2 or 3 words and the pun rests on these few words, so the model predicts that the location is on these words.
  – Sentences containing wordplay: in these cases, the location of the pun is often at the end of the sentence, and the model easily locates it.
  In both cases, however, it is impossible to say whether the model is really precise or whether it simply relies on regularities of the input data. Since, for short puns as well as for sentences, the location is very often the same, the model could very well rely on this to predict the locations; it would then not be predicting the location of the pun but the usual location it is supposed to have.

• Interpretation: for this field we observe different behaviors, but none is really correct:
  – Sometimes the interpretation prediction is simply bad and does not make sense: the model takes a word and repeats it twice.
  – Other times the prediction just copies the two-word wordplay, which gives no explanation of the wordplay.
  – Otherwise, the prediction takes only one word from the original text, which does not give an explanation either.
  – There are other behaviors as well, but they never precisely explain the wordplay.

• Manipulation level: all rows are predicted as "Sound", which is not always true nor false. And even when the "Sound" level is found and correct, the interpretation is not able to find the initial word (probably because both columns should be trained together).

• Horizontal and vertical: this field seems to be exclusively predicted as "vertical".

• Manipulation type: this seems to be the most accurately predicted field, but again it is hard to tell whether the model is really accurate or whether it simply reflects the distribution of the training values, which, statistically, would allow it to be precise by chance, in a way.

Speaking of statistics (cf. Table 2), despite high accuracy rates on certain fields, it remains difficult to know whether our model is accurate or whether it relies on a certain regularity of the data. This regularity, repeating itself in our test data, most likely allows our model to appear accurate.

Table 2. Precision of each predicted field of Task 1

Field            Accuracy rate   Matches   Mismatches   Shapes   Errors
location         0.804819        334       81           415      0
interpretation   0.0578313       24        391          415      0
hor/ver          0.985542        409       6            415      0
Manip_type       0.584541        242       172          414      0
Manip_level      0.99759         414       1            415      0
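A per-field comparison of this kind (accuracy rate, matches, mismatches) can be obtained with a few lines of pandas; the sketch below shows one possible way to compute it, assuming two aligned data frames of gold and predicted values whose names and columns are hypothetical.

```python
import pandas as pd

def field_accuracy(gold: pd.Series, pred: pd.Series) -> dict:
    """Exact-match counts for one Task 1 field (illustrative helper)."""
    matches = int((gold.astype(str).str.strip() == pred.astype(str).str.strip()).sum())
    total = len(gold)
    return {"accuracy_rate": matches / total,
            "matches": matches,
            "mismatches": total - matches,
            "shape": total}

# Hypothetical usage, with aligned gold and predicted data frames:
# print(field_accuracy(gold_df["location"], pred_df["location"]))
```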
3.2. Task 2

For this task, we want to translate term-based wordplay, e.g. Pokémon names, from English to French (cf. Appendix B). The model has translation difficulties, especially because we had to use the T5 model and not the mT5 model, which is more efficient for translation. The results are mostly the same word as the source; sometimes there are changes to the letters, accents appear, and some double letters become single.

3.3. Task 3

For this last task, the goal was to translate sentences containing puns (cf. Appendix C). All in all, the translation is not bad if a few small errors are disregarded. The subject of the sentences is often kept, but the humorous effect is not always present, or is harder to understand after translation. Nevertheless, some sentences work very well in French and have been translated well (cf. Table 3).

Table 3. Example of a sentence with wordplay translated from English to French while retaining its meaning

English: "Some burglars are always looking for windows of opportunity."
French: "Certains cambrioliers cherchent toujours des fenêtres d'opportunité."

4. Discussion

Based on our results, it is clear that our model needs much better training, which in our case was complicated to do because of time and hardware constraints. Nevertheless, our results let us notice several things from the point of view of the training data, especially for Task 1: certain classification elements should be more varied, to avoid the models relying on regularities that they can detect.

• For example, the classification fields "manipulation_level" and "horizontal/vertical" are not varied enough.

Still concerning Task 1, it would surely be interesting to dissociate short puns from sentences containing puns, always with the aim of improving the models. Speaking of models, we used T5, but mT5 is the multilingual version of T5. Unfortunately we were not able to use it (again for hardware performance reasons); this multilingual version would surely have been better for translation.

Now let us discuss the translation part: our results were analyzed, and we can therefore compare them to other methods and models. Table 4 shows this comparison: the lines represent the runs that were submitted to JokeR, ours being "CECILIA", and the columns give quantitative numerical indications of what was observed. The color code shows who has the best results (green) and who has the worst (red) for each category.

Table 4. Analyses made on the different runs submitted to JokeR; the color gradient makes it possible to identify which runs are the best in opposition to each other.

• Total: total number of rows submitted.
• Valid: number of valid submissions.
• not-translated: number of rows that have not been translated.
• nonsens: number of rows that make no sense.
• syntax_problem: number of rows that present syntax errors.
• lexical_problem: number of rows that present lexical errors.
• lexical_field_preservation: number of rows that preserve the lexical field.
• sens_preservation: number of translations that preserve the sense from English to French.
• comprehensible_terms: number of rows whose terms remain comprehensible.
• wordplay_form: number of rows that preserve the wordplay form.
• identifiable_wordplay: number of rows that present identifiable wordplay.
• over-translation: number of rows that are over-translated.
• style_shift: number of rows that present a shift in style.
• hilariousness_shift: number of rows that present a shift in hilariousness.

From this table (cf. Table 4), it is much easier to give an opinion on our method and on the use of T5. Indeed, we opted for a simple and quick implementation approach, which may not be the case for the other proposals. We can then assess the ratio between efficiency and translation accuracy, not only from the translation point of view as such but also from the point of view of preserving the hilariousness of the puns.

If we take the time to compare our results to those of the other proposals, we observe several things. First, our method can translate the majority of sentences; indeed, we present a small number of untranslated fields. However, regarding the presence of syntactic and lexical errors, we are at the opposite end, among the worst; regardless, this remains a low ratio given the total number of translations. Concerning the lexical-field and meaning preservation fields, we place ourselves this time at the forefront, in second place for both, while remaining close to the best results. The number of comprehensible terms is not to our advantage, as we place 4th out of 5, but we are still on the same order of magnitude as the first. The number of wordplay forms is not the highest, but it remains of the same magnitude as the best results; the same goes for identifiable wordplay. Note that although we show the highest number of over-translations, this concerns only about ten cases.

Concerning the translation of puns, this evaluation allows us to show our strong and weak points and to situate ourselves with respect to the other submissions. What is clear is that our translation is globally good: our results always follow the norm, with no figures standing out, whether for better or for worse. However, one figure does stand out: the more than 1400 hilariousness shifts. We are certainly the only ones to produce correct translations while having a significantly higher number of translations that have lost their funny character. Our method therefore seems to be a good translator, one that does not translate the second degree but translates puns literally (which can be likened to a form of humor, can't it?).

5. Conclusion

In the context of the JOKER workshop, we decided to use existing models and to train them to predict the requested fields. In summary, we do not have any significantly good results.
But our method works, and there are clear avenues for improvement, mainly around the models and their training. For Task 1, our approach of one model per field to predict works, but it is in our opinion neither the only possible one nor the most efficient. It is nevertheless simple to implement because it follows the same procedure as all the other models we were able to use. We think that coupling several fields during model training is a concrete avenue of improvement for the predictions, or at least a good way to find out which approach works best. The next step would be an improvement in hardware capabilities to train our models better. Also, the choice of model is not fixed, and other models such as Jurassic [7] can be used for our objectives.

6. Online Resources

The sources for the three JokeR tasks are available via
• GitHub.

Acknowledgments
I would like to thank Anne-Gwenn Bosser for the opportunity she gave me by offering me an internship, and for her help through her feedback and advice on my work. Thanks also to Liana Ermakova, who provided me with a lot of technical help and examples in a context where time was tight.

References
[1] L. Ermakova, T. Miller, O. Puchalski, F. Regattin, Mathurin, S. Araújo, A.-G. Bosser, C. Borg, M. Bokiniec, G. L. Corre, B. Jeanjean, R. Hannachi, Mallia, G. Matas, M. Saki, CLEF Workshop JOKER: Automatic Wordplay and Humour Translation, in: M. Hagen, S. Verberne, C. Macdonald, C. Seifert, K. Balog, K. Nørvåg, V. Setty (Eds.), Advances in Information Retrieval, Lecture Notes in Computer Science, Springer International Publishing, Cham, 2022, pp. 355–363. doi:10.1007/978-3-030-99739-7_45.
[2] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, 2020. URL: http://arxiv.org/abs/1910.10683, arXiv:1910.10683 [cs, stat].
[3] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, A. M. Rush, Transformers: State-of-the-Art Natural Language Processing, 2020. URL: https://www.aclweb.org/anthology/2020.emnlp-demos.6, pp. 38–45.
[4] simpleT5, https://github.com/Shivanandroy/simpleT5, 2021.
[5] W. Falcon, The PyTorch Lightning team, PyTorch Lightning, 2019. URL: https://github.com/PyTorchLightning/pytorch-lightning. doi:10.5281/zenodo.3828935.
[6] PyTorch, 2022. URL: https://github.com/pytorch/pytorch/blob/88fca3be5924dd089235c72e651f3709e18f76b8/CITATION.
[7] O. Lieber, O. Sharir, B. Lentz, Y. Shoham, Jurassic-1: Technical Details and Evaluation, 2021.

Appendices

Appendix A: Output examples from Task 1
Appendix B: Output examples from Task 2
Appendix C: Output examples from Task 3