Use of SimpleT5 for the CLEF workshop JokeR: Automatic Pun and Humor Translation

Loïc Glemarec
Université de Bretagne Occidentale (UBO)
loic.glemarec3010@gmail.com, https://gitlab.com/loicgle

CLEF 2022: Conference and Labs of the Evaluation Forum, September 5–8, 2022, Bologna, Italy

Abstract
In this paper, we present our work in the JokeR workshop. JokeR is a project that aims to study localization strategies for humor and puns and to create a multilingual parallel corpus and evaluation metrics, as a step towards automatic pun and humor translation. Three tasks were proposed; to predict the requested fields for each of them, we trained the T5 model, which specializes in natural language processing tasks. We then applied the trained models to the test datasets, generating the runs as requested. In doing so, we propose a simple method to study the translatability of puns, and we suggest another way of implementing our method to obtain more varied results. The overall goal is to study the multilingual character of wordplay.

Keywords
Computational Humor, CLEF, JokeR, T5 models, SimpleT5

1. Introduction

In the context of the CLEF workshop JokeR [1], three tasks were proposed: 1) classification of wordplay instances, 2) translation of single-word wordplay, and 3) translation of phrases containing wordplay. We propose to use the T5 model [2]. T5 models (T5, mT5 or byT5) are mainly used for natural language processing tasks and make use of Transformers [3]. To train them, we use the SimpleT5 library [4]. SimpleT5 is built on PyTorch Lightning [5], a high-level Python library for PyTorch [6], a machine learning framework. SimpleT5 provides an easy way to use pre-trained models and makes it possible to train our T5 models with a few lines of code.

We first explain our method, how to train the model, and how to use it. We then present our results, explain why they are not so good, and propose a way to obtain better results while keeping our process.

2. Method

For Tasks 2 and 3, the objective is to translate from English to French; for Task 1, it is to predict classification fields for a given sentence. The method is the same for each task: we first train the T5 model to predict what we want, then we generate the runs with a script that applies the model to the input dataset.

2.1. Training scripts

The training scripts are divided into two parts: the first is the reading of the dataset and its formatting, then comes the training of the model.

For Tasks 2 and 3, the input dataset is composed of three fields: the identifiers, the words or sentences in English, and the same words or sentences in French. Before doing anything else, we clean the data, that is to say we go through all the existing fields and check both types and content. A column that must contain character strings must not contain anything other than strings; likewise, no null values may appear anywhere in the dataset. We then split the data, held as data frames, into two parts, a train part and a test part, and shuffle both. We finish by removing the identifier field and renaming the "en" and "fr" columns to "source_text" and "target_text" respectively (cf. Table 1).

Table 1. Example of a "source_text"/"target_text" pair (Task 3)

"source_text": My boyfriend and I started to date after he backed his car into mine. We met by accident.
"target_text": Mon petit ami et moi avons commencé à sortir ensemble après qu'il ait reculé sa voiture contre la mienne. Nous nous sommes rencontrés par accident.

Model training can then begin using the SimpleT5 library: we first load a pre-trained T5 model, then run its training method.
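As an illustration of this preparation and training step, the sketch below shows how it can be done with pandas and SimpleT5. It is a minimal sketch: the file name, the "id"/"en"/"fr" column names, the base model (t5-base) and the hyper-parameters are assumptions made for the example, not a record of the exact settings we used.

```python
import pandas as pd
from simplet5 import SimpleT5

# Read the Task 3 training corpus (hypothetical file and column names).
df = pd.read_csv("task3_train.csv")

# Cleaning: keep only rows where both texts are non-null strings.
df = df.dropna()
df = df[df["en"].apply(lambda x: isinstance(x, str)) &
        df["fr"].apply(lambda x: isinstance(x, str))]

# Shuffle, then split into a train part and a test part.
df = df.sample(frac=1, random_state=42)
split = int(0.8 * len(df))
train_df, test_df = df.iloc[:split], df.iloc[split:]

# Drop the identifiers and rename the columns as SimpleT5 expects.
rename = {"en": "source_text", "fr": "target_text"}
train_df = train_df.drop(columns=["id"]).rename(columns=rename)
test_df = test_df.drop(columns=["id"]).rename(columns=rename)

# Load a pre-trained T5 model and fine-tune it on the pairs.
model = SimpleT5()
model.from_pretrained(model_type="t5", model_name="t5-base")
model.train(train_df=train_df,
            eval_df=test_df,
            source_max_token_len=128,
            target_max_token_len=128,
            batch_size=8,
            max_epochs=3,
            use_gpu=True,
            outputdir="outputs")
```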
Once the T5 model is trained, it is ready to predict the "target_text" field from the "source_text" given to it. Note that we have given the example of Task 3 because it is the simplest and most condensed to present; for Task 1 the process is the same. The only difference is the number of fields to predict, so we run one training per field to predict. The shape of the data frames is the same.

2.2. Generalizing script

Now that we have a trained model, we can launch our runs. Of course, just as during training, we must clean the input dataset to make sure it contains no erroneous values. For this, we just need a data frame with a column containing values similar to the preceding "source_text" fields. We then load the trained model and, using the apply method of the pandas library, predict one by one all the desired corresponding values ("target_text"). We finish by formatting all the results in the way requested by JOKER.
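A sketch of this run-generation step, under the same assumptions as above (the checkpoint directory, file names and output column names are placeholders chosen for the example, not the official JOKER format):

```python
import pandas as pd
from simplet5 import SimpleT5

# Load the fine-tuned checkpoint saved by the training step
# (the directory name below is a placeholder).
model = SimpleT5()
model.load_model("t5", "outputs/simplet5-last-checkpoint", use_gpu=True)

# Read and clean the input dataset, then predict each "target_text"
# one by one with pandas' apply.
test = pd.read_csv("task3_test.csv").dropna()
test["source_text"] = test["en"].astype(str)
test["target_text"] = test["source_text"].apply(lambda s: model.predict(s)[0])

# Write the results out; the columns kept here are illustrative,
# the actual run files follow the format requested by JOKER.
test[["id", "target_text"]].to_csv("run_task3.csv", index=False)
```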
2.3. Method summarization

We have seen that, to predict each desired field in Tasks 1, 2, and 3, it was necessary each time to organize and format the input data in order to train the T5 model. We also saw that, once a model was trained, it could then be applied to a given input dataset, while being rigorous about the formatting of the output data, which must correspond to the requested format.

3. Results

3.1. Task 1

Task 1 is quite special because of its number of fields to predict. We therefore chose to train one model for each of these fields; it would have been quite possible, and even better, to couple certain fields during prediction. Here are the different elements we have been able to observe (cf. Appendix A):

• Location: the location field predictions are globally accurate (80 %). Two cases can be distinguished:
  – Short wordplays: they contain only 2 or 3 words and the pun rests on these few words, so the model predicts that the location is on these words.
  – Sentences containing wordplay: in these cases, the location of the pun is often at the end of the sentence, and the model easily locates it.
  In both cases, however, it is impossible to say whether the model is really precise or whether it simply relies on regularities of the input data. Since, for short puns as well as for sentences, the location is very often the same, the model could very well rely on this to predict the locations; it would then not be predicting the location of the pun but the usual location it is supposed to have.

• Interpretation: for this field we observe different behaviors, but none is really correct:
  – Sometimes the interpretation prediction is simply bad and does not make sense: the model takes a word and repeats it twice.
  – Other times the prediction just copies the two-word wordplay, which gives no explanation of the wordplay.
  – Otherwise, the prediction takes only one word from the original text, which does not give an explanation either.
  – There are other behaviors as well, but they never precisely explain the wordplay.

• Manipulation level: all rows are predicted as "Sound", which is not always true nor false. And even when the "Sound" level is found and correct, the interpretation is not able to find the initial word (probably because both columns should be trained together).

• Horizontal and vertical: this field seems to be exclusively predicted as "vertical".

• Manipulation type: this seems to be the most accurately predicted field, but again it is hard to tell whether the model is really accurate or whether it simply reflects the distribution of the training values, which, statistically, would allow it to be precise by chance, in a way.

Speaking of statistics (cf. Table 2), despite high accuracy rates on certain fields, it remains difficult to know whether our model is accurate or whether it relies on a certain regularity of the data. This regularity, repeating itself in our test data, most likely allows our model to appear accurate.

Table 2. Precision of each predicted field of Task 1

Field            Accuracy rate   Matches   Mismatches   Shapes   Errors
location         0.804819        334       81           415      0
interpretation   0.0578313       24        391          415      0
hor/ver          0.985542        409       6            415      0
Manip_type       0.584541        242       172          414      0
Manip_level      0.99759         414       1            415      0
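A per-field comparison of this kind (accuracy rate, matches, mismatches) can be obtained with a few lines of pandas; the sketch below shows one possible way to compute it, assuming two aligned data frames of gold and predicted values whose names and columns are hypothetical.

```python
import pandas as pd

def field_accuracy(gold: pd.Series, pred: pd.Series) -> dict:
    """Exact-match counts for one Task 1 field (illustrative helper)."""
    matches = int((gold.astype(str).str.strip() == pred.astype(str).str.strip()).sum())
    total = len(gold)
    return {"accuracy_rate": matches / total,
            "matches": matches,
            "mismatches": total - matches,
            "shape": total}

# Hypothetical usage, with aligned gold and predicted data frames:
# print(field_accuracy(gold_df["location"], pred_df["location"]))
```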
3.2. Task 2

For this task, we want to translate term-based wordplay, e.g. Pokémon names, from English to French (cf. Appendix B). The model has translation difficulties, especially because we had to use the T5 model and not the mT5 model, which is more efficient for translation. The results are mostly the same word as the source; sometimes there are changes to the letters, accents appear, and some double letters become single.

3.3. Task 3

For this last task, the goal was to translate sentences containing puns (cf. Appendix C). All in all, the translation is not bad if a few small errors are disregarded. The subject of the sentences is often kept, but the humorous effect is not always present, or is harder to understand after translation. Nevertheless, some sentences work very well in French and have been translated well (cf. Table 3).

Table 3. Example of a sentence with wordplay translated from English to French while retaining its meaning

English: "Some burglars are always looking for windows of opportunity."
French: "Certains cambrioliers cherchent toujours des fenêtres d'opportunité."

4. Discussion

Based on our results, it is clear that our model needs much better training, which in our case was complicated to do because of time and hardware constraints. Nevertheless, our results let us notice several things from the point of view of the training data, especially for Task 1: certain classification elements should be more varied, to avoid the models relying on regularities that they can detect.

• For example, the classification fields "manipulation_level" and "horizontal/vertical" are not varied enough.

Still concerning Task 1, it would surely be interesting to dissociate short puns from sentences containing puns, always with the aim of improving the models. Speaking of models, we used T5, but mT5 is the multilingual version of T5. Unfortunately we were not able to use it (again for hardware performance reasons); this multilingual version would surely have been better for translation.

Now let us discuss the translation part: our results were analyzed, and we can therefore compare them to other methods and models. Table 4 shows this comparison: the lines represent the runs that were submitted to JokeR, ours being "CECILIA", and the columns give quantitative numerical indications of what was observed. The color code shows who has the best results (green) and who has the worst (red) for each category.

Table 4. Analyses made on the different runs submitted to JokeR; the color gradient makes it possible to identify which runs are the best in opposition to each other.

• Total: total number of rows submitted.
• Valid: number of valid submissions.
• not-translated: number of rows that have not been translated.
• nonsens: number of rows that make no sense.
• syntax_problem: number of rows that present syntax errors.
• lexical_problem: number of rows that present lexical errors.
• lexical_field_preservation: number of rows that preserve the lexical field.
• sens_preservation: number of translations that preserve the sense from English to French.
• comprehensible_terms: number of rows whose terms remain comprehensible.
• wordplay_form: number of rows that preserve the wordplay form.
• identifiable_wordplay: number of rows that present identifiable wordplay.
• over-translation: number of rows that are over-translated.
• style_shift: number of rows that present a shift in style.
• hilariousness_shift: number of rows that present a shift in hilariousness.

From this table (cf. Table 4), it is much easier to give an opinion on our method and on the use of T5. Indeed, we opted for a simple and quick implementation approach, which may not be the case for the other proposals. We can then assess the ratio between efficiency and translation accuracy, not only from the translation point of view as such but also from the point of view of preserving the hilariousness of the puns.

If we take the time to compare our results to those of the other proposals, we observe several things. First, our method can translate the majority of sentences; indeed, we present a small number of untranslated fields. However, regarding the presence of syntactic and lexical errors, we are at the opposite end, among the worst; regardless, this remains a low ratio given the total number of translations. Concerning the lexical-field and meaning preservation fields, we place ourselves this time at the forefront, in second place for both, while remaining close to the best results. The number of comprehensible terms is not to our advantage, as we place 4th out of 5, but we are still on the same order of magnitude as the first. The number of wordplay forms is not the highest, but it remains of the same magnitude as the best results; the same goes for identifiable wordplay. Note that although we show the highest number of over-translations, this concerns only about ten cases.

Concerning the translation of puns, this evaluation allows us to show our strong and weak points and to situate ourselves with respect to the other submissions. What is clear is that our translation is globally good: our results always follow the norm, with no figures standing out, whether for better or for worse. However, one figure does stand out: the more than 1400 hilariousness shifts. We are certainly the only ones to produce correct translations while having a significantly higher number of translations that have lost their funny character. Our method therefore seems to be a good translator, one that does not translate the second degree but translates puns literally (which can be likened to a form of humor, can't it?).

5. Conclusion

In the context of the JOKER workshop, we decided to use existing models and to train them to predict the requested fields. In summary, we do not have any significantly good results.
But our method works, and there are clear avenues for improvement, mainly around the models and their training. For Task 1, our approach of one model per field to predict works, but it is in our opinion neither the only possible one nor the most efficient. It is nevertheless simple to implement because it follows the same procedure as all the other models we were able to use. We think that coupling several fields during model training is a concrete avenue of improvement for the predictions, or at least a good way to find out which approach works best. The next step would be an improvement in hardware capabilities to train our models better. Also, the choice of model is not fixed, and other models such as Jurassic [7] can be used for our objectives.

6. Online Resources

The sources for the three JokeR tasks are available via
• GitHub.

Acknowledgments
I would like to thank Anne-Gwenn Bosser for the opportunity she gave me by offering me an internship, and for her help through her feedback and advice on my work. Thanks also to Liana Ermakova, who provided me with a lot of technical help and examples in a context where time was tight.

References
[1] L. Ermakova, T. Miller, O. Puchalski, F. Regattin, Mathurin, S. Araújo, A.-G. Bosser, C. Borg, M. Bokiniec, G. L. Corre, B. Jeanjean, R. Hannachi, Mallia, G. Matas, M. Saki, CLEF Workshop JOKER: Automatic Wordplay and Humour Translation, in: M. Hagen, S. Verberne, C. Macdonald, C. Seifert, K. Balog, K. Nørvåg, V. Setty (Eds.), Advances in Information Retrieval, Lecture Notes in Computer Science, Springer International Publishing, Cham, 2022, pp. 355–363. doi:10.1007/978-3-030-99739-7_45.
[2] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, 2020. URL: http://arxiv.org/abs/1910.10683, arXiv:1910.10683 [cs, stat].
[3] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, A. M. Rush, Transformers: State-of-the-Art Natural Language Processing, 2020. URL: https://www.aclweb.org/anthology/2020.emnlp-demos.6, pp. 38–45.
[4] simpleT5, https://github.com/Shivanandroy/simpleT5, 2021.
[5] W. Falcon, The PyTorch Lightning team, PyTorch Lightning, 2019. URL: https://github.com/PyTorchLightning/pytorch-lightning. doi:10.5281/zenodo.3828935.
[6] PyTorch, 2022. URL: https://github.com/pytorch/pytorch/blob/88fca3be5924dd089235c72e651f3709e18f76b8/CITATION.
[7] O. Lieber, O. Sharir, B. Lentz, Y. Shoham, Jurassic-1: Technical Details and Evaluation, 2021.

Appendices

Appendix A: Output examples from Task 1
Appendix B: Output examples from Task 2
Appendix C: Output examples from Task 3