<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CLEF 2023 JOKER Tasks 2 and 3: Using NLP Models for Pun Location, Interpretation and Translation1</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Felix Ohnesorge</string-name>
          <email>felix-eutin@gmx.de</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mari Ángeles Gutiérrez</string-name>
          <email>mariangeles.gutierrez@uca.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Julia Plichta</string-name>
          <email>j.plichta.162@studms.ug.edu.pl</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Cádiz</institution>
          ,
          <addr-line>11003, Cádiz</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Gdańsk</institution>
          ,
          <addr-line>80-309 Gdańsk</addr-line>
          ,
          <country country="PL">Poland</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Kiel</institution>
          ,
          <addr-line>24118 Kiel</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The translation of puns presents a significant challenge for translators and has garnered considerable attention in the field of translation studies. While translation technology aims to automate the translation process, little focus has been placed on the translation of wordplay. Addressing this gap, the CLEF2023 JOKER track aims to develop a multilingual corpus of wordplay and evaluation metrics to advance the automation of creative-language translation. This paper provides an overview of the track's Pilot Task 3, which specifically focuses on the translation of entire phrases containing wordplay, particularly puns.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Wordplay</kwd>
        <kwd>computation humour</kwd>
        <kwd>pun</kwd>
        <kwd>machine translation</kwd>
        <kwd>deep learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        As Ermakova et al. explain [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], humorous wordplay comprehension and translation often require
recognizing implicit cultural references, understanding word formation processes, and discerning
double meanings. These challenges are encountered by both humans and computers alike. Introducing
the CLEF 2023 JOKER track, this paper adopts an interdisciplinary approach to facilitate the creation
of reusable test collections, evaluation metrics, and methods for automatic wordplay processing. The
track comprises interconnected shared tasks that encompass the detection, location, interpretation, and
translation of puns. Furthermore, associated datasets and evaluation methodologies are described, and
contributions leveraging this data are invited for further research and development.
      </p>
      <p>Through the collaborative efforts of the CLEF JOKER track, advancements in automating the
translation of wordplay are expected, paving the way for improved translation technology and
enhancing cross-cultural communication in creative language contexts.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
    </sec>
    <sec id="sec-3">
      <title>2.1. Task 1: Pun Detection</title>
      <p>A pun is a form of wordplay that exploits multiple meanings or similar sounds of words to create
a humorous or clever effect. Puns can be found in various forms, including plays on words, double
entendres, homophones, and wordplay involving idioms or metaphors, which implies nuanced
linguistic and cultural knowledge, making it difficult for algorithms to capture the full range of
punning possibilities.</p>
      <p>However, the methods based on natural language processing (NLP) and machine learning
techniques examine factors such as word similarity, part-of-speech tags, syntactic patterns, and
context, so, by comparing the different senses or meanings of words and their surrounding context,
algorithms can identify potential puns.</p>
      <p>In our case, we have been provided with the JOKER dataset, which contains around 5000 sample
sentences, which are labeled with either a “yes” or a “no”. This is not a big dataset when considering
the high dimensionality we have in text classification. Normally we would use a cross-validation
technique but that requires a lot of time and computational power, which google colab could not
provide enough of. So, we used a 60/20/20 split on our dataset for training, testing and evaluation.</p>
      <p>
        The pun detection is a binary classification task, for which we used several different methods. The
methods (some of them quite simple and only as a reference) will be explained in detail later. In
addition, all statistics can be found in Ermakova et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
2.1.1. Random
      </p>
      <p>This Method yielded an accuracy of 49%, which is expected because our evaluation data contained
50% of “yes” samples and 50% of “no” samples.</p>
    </sec>
    <sec id="sec-4">
      <title>2.1.2. Fasttext</title>
      <p>This is an open-source library for text classification. It comes pretrained in multiple languages,
which should help us because we only need to fine tune the model.</p>
      <p>The trained model achieved a 70% accuracy on our evaluation data, with an equally distributed
confusion matrix.</p>
      <p>2.1.3. Ridge</p>
      <p>Ridge is a classifier that works by regression and is most commonly used for multiclass
classifications but can of cause also be used for binary classification.</p>
      <p>This is not a pretrained model, so we had to train it with our available data exclusively. The model
still achieved a slightly higher accuracy on our evaluation dataset, 76%.</p>
      <p>2.1.4. Bayes</p>
      <p>Bayes is a well-known classifier for multimodal classification. It is basically a simple probability
calculation. Here an interesting thing happened. We achieved an accuracy of 62%, but the confusion
matrix showed zero true positives and zero false positives. This indicates an underfitting.</p>
      <p>2.1.5. MLP</p>
      <p>We used the MLPClassifier from the sklearn.neural_network library. This is a multi-layer
perceptron that optimizes the loss function by stochastic gradient descent. This is a very powerful
method, but it has many parameters that can be adjusted. For example, the number of neurons per
layer, the number of layers. As a solver we used Adam, as it is the most used one. As an activation
function we used ReLU which is also widely used.</p>
      <p>This multi-layer perceptron achieved an accuracy of 75%, scoring slightly behind Ridge.</p>
    </sec>
    <sec id="sec-5">
      <title>2.1.6. SimpleTransformersT5</title>
      <p>SimpleT5 models can be used for many tasks such as summarization, translation, text generation,
and more. We used it for a binary classification which we did not expect to perform very well because
it is not the intended task and is way too powerful. This showed i n training, which took an
exceedingly long time and the model started overfitting after the second epoch. The only
parameter we adjusted was the target_max_token_length, because we only wanted it to answer “yes”
and “no” or 0 and 1, respectively.</p>
      <p>As expected, the model only had a 70% accuracy on our evaluation data, which is slightly lower
than most of the other methods.</p>
    </sec>
    <sec id="sec-6">
      <title>2.1.7. SimpleTransformersRoberta</title>
      <p>This is a transformer model with a binary text classification model on top of it. We expect this
model to perform better than the others because it seems best fit to this problem.
The trained model achieved a 75% accuracy on our evaluation dataset.</p>
    </sec>
    <sec id="sec-7">
      <title>2.2. Task 2: Pun Location</title>
      <p>For the pun location, the main purpose is to identify which words carry the double meaning in a
text known a priori to contain a pun. We have used the following methods, which will be explained in
more detail later:
The metrics used for evaluation are:
• Precision: it measures the proportion of correctly predicted positive instances (true
positives) out of all instances predicted as positive (true positives + false positives).
• Recall: it measures the proportion of correctly predicted positive instances (true
positives) out of all actual positive instances (true positives + false negatives).
• F1-score: it is the harmonic mean of precision and recall and provides a balanced measure
of the model's performance. It takes both precision and recall into account and is
particularly useful when dealing with imbalanced datasets.
• Support: The support column indicates the number of instances (samples) in the dataset
that belong to each class.</p>
      <p>A very big problem was the size of available data. For the sexism detection task, we had 7000
labeled tweets and for the pun detection around 5000 sample sentences. When training a neural
network that has not been pretrained, this is not enough due to the high feature dimensionality of this
task. Here it may be interesting to use different vectorization methods and analyze their impact on the
result.</p>
      <p>Normally when dealing with samples for training/evaluation a good technique would be the cross
validation where all available data is split into multiple batches. We would then train a model on all
batches but one and use the left-out batch for evaluation. This would be repeated for every batch.
Unfortunately, this requires a lot of computing power and time, both of which we did not have. So,
we used an 80/20 split on the training data for training/evaluation. For the ML models we split the
training data into a training and test set, again in an 80/20 relation.</p>
      <p>The effectiveness of Bayes, as a method for multimodal classification, is limited in the pun location
task due to an excessive number of modalities or an absence of a finite set of modalities. Bayes is a
naïve classification method based on probabilities with multiple classes. In the pun detection setting
we have a two-class decision problem with can be realized by the bayes classifier. Additionally,
Fasttext and Ridge, both commonly employed in classification problems, are anticipated to exhibit
suboptimal performance in this context.</p>
      <p>Instead, pretrained models like SimpleT5 are expected to demonstrate superior performance in this
task, which is why we have used them.</p>
    </sec>
    <sec id="sec-8">
      <title>2.2.1. Results for SimpleT5</title>
      <p>From the extensive list of classified terms, it can be said that the majority have reached 1.0
accuracy, which underlines the point that we did not have enough data. So, in summary, the SimpleT5
model achieved an accuracy of 87% on the classification task against our evaluation data set. The
macro average and weighted average metrics indicate the overall performance across classes, with
values of 0.77 and 0.87, respectively.</p>
    </sec>
    <sec id="sec-9">
      <title>2.2.2. Results for SimpleTransformersT5</title>
      <p>In the same line as the previous method, it achieved an accuracy of 83% on the classification task.
The macro average and weighted average metrics indicate the overall performance across classes,
with values of 0.71 and 0.70 for precision, recall, and F1- score. However, without the specific values
of the confusion matrix, we cannot further analyze the model's performance for each class or discern
the distribution of the predicted and true labels.</p>
    </sec>
    <sec id="sec-10">
      <title>2.3. Task 3: Pun translation</title>
      <p>In the context of pun translation, several models were employed, including those pretrained on
translation tasks, as well as FastText and SimpleT5. Notably, FastText exhibited subpar performance
due to its classification-oriented nature, which may not be ideally suited for the intricacies of pun
translation. Similarly, SimpleT5 did not yield notably satisfactory results. However, it is worth
mentioning that specific quantitative performance metrics are unavailable, precluding a more detailed
assessment of their respective performances.</p>
      <p>Regarding to the dataset with which we have worked, translations from English into French and
Spanish have been obtained using two NPL models: OpusMT and M2M100.</p>
      <p>From 1000 sentences provided, 30 have been selected to check the quality of the resulting
translation. Taking said pre-selection as a reference, 96.7% of these sentences had obtained a good
result in the Spanish translation. The sentences translated by OpusMT were more accurate, while
M2M100 tends to make free translations, therefore meaningless occasionally.</p>
      <p>In general, we have obtained very literal translations, so that is why 60% failed to recreate the pun
in the target language. Following the “wordplay translation strategies” referenced by Ermakova et al
[4], in those sentences where the pun also works in the target language, they have performed an
isomorphic pun (that is, they have used the direct translation of the words because it coincides with
the main meaning that gives sense to the pun). For example, in the sentence “When you're wearing a
watch on an airplane, time flies” the translation is very direct, so in Spanish it retains the same
meaning when the result shows: “Cuando llevas un reloj en un avión, el tiempo vuela”.</p>
      <p>This implies that puns based on homophonic sounds are not detected or translated (ST: “You can't
trust a tiger. You never know when he might be lion”; TT: “No puedes confiar en un tigre. Nunca se
sabe cuándo podría ser león”), just like puns based on polysemous terms (ST: “If there's one person
you don't want to interrupt in the middle of a sentence it's a judge”; TT: “Si hay una persona que
no quieres interrumpir en medio de una sentencia es un juez” where the word ‘sentence’ loses one
of its meanings, in this case ‘expression’ or ‘group of words’ besides ‘a punishment given by a judge’.</p>
      <p>However, it is very interesting to see how OpusMT recognizes the wordplay’ semantic field in
order to make a coherent translation: ST: “The designer wondered why his pirate room wasn't perfect,
and the judge told him he went a little overboard”; TT: “El diseñador se preguntó por qué su
habitación pirata no era perfecta, y el juez le dijo que se fue un poco por la borda” maintaining
references to vocabulary related to the sea, while</p>
      <p>M2M100 made a neutral translation: “El diseñador se preguntó por qué su sala de piratas no era
perfecta, y el juez le dijo que iba un poco por encima”.</p>
      <p>It is also curious how OpusMT is able to distinguish a formal context: the second person 'you'
has two translations in Spanish: 'tú' for informal situations (under certain pragmalinguistic
conditions) and 'usted' for formal situations. In the following example, this model has been able to
recognize a situation that requires polite language: ST: “Waiter, there are pennies in my soup!'' Well,
sir, you said you'd stop eating here if there wasn't some change in the food”; TT: “ ‘¡Camarero, hay
centavos en mi sopa!’ Bueno, señor, usted dijo que dejaría de comer aquí si no había algún cambio
en la comida”.</p>
      <p>On the other hand, in a significantly lower proportion, we have found a few examples of pun’s
omissions (“pun to zero”, according to Delabastita lists, cited in Ermakova et al. [3]): ST: “OLD
POLICEMEN never die they just cop out”; TT: “Los viejos policías nunca mueren”.</p>
    </sec>
    <sec id="sec-11">
      <title>3. Conclusion</title>
      <p>In conclusion, this paper focused on the interdisciplinary approach taken in the CLEF 2023 JOKER
track to address the challenges of wordplay comprehension, detection, location, interpretation, and
translation. We have described the creation of reusable test collections, evaluation metrics, and
methods for automatic wordplay processing.</p>
      <p>For the task of pun detection, various methods were employed, including Random, Fasttext, Ridge,
Bayes, MLP, SimpleTransformersT5, and SimpleTransformersRoberta. Each method was evaluated
using accuracy (F1) as the performance metric, with Ridge and MLP achieving the highest accuracies
of 76% and 75%, respectively. These results demonstrated the effectiveness of machine learning and
neural network-based approaches in identifying puns.</p>
      <p>In the pun location task, we used methods such as SimpleT5, SimpleTransformersT5, Ridge,
Bayes, and Fasttext. The evaluation metrics used included precision, recall, F1-score, and support.
SimpleT5 achieved the highest accuracy of 87%, indicating its efficacy in identifying the words
carrying double meanings in puns.</p>
      <p>Regarding pun translation from English into Spanish, different models were utilized, including
OpusMT, M2M100, FastText, and SimpleT5. The translations obtained were mostly literal, resulting
in a 60% failure rate in recreating the puns in the target language. The translation models struggled to
capture homophonic sounds and polysemous terms, leading to the loss of puns based on those
linguistic elements. OpusMT showcased the ability to recognize the semantic field of wordplay, while
M2M100 provided more neutral translations.</p>
      <p>Overall, the CLEF JOKER track and the methods employed in this paper showcased promising
advancements in automating wordplay processing. The results highlighted the potential of machine
learning, neural networks, and pretrained models in tackling the complexities of pun detection,
location, and translation. Further research and development in this area can contribute to improved
language technology and enhanced cross-cultural communication in creative language contexts.</p>
    </sec>
    <sec id="sec-12">
      <title>4. Acknowledgements</title>
      <p>At this point we want to thank the Erasmus program, the European University of the Seas
(SEAEU) and the Instituto de Investigación en Estudios del Mundo Hispánico (In-EMHis) for giving us
the opportunity to take this course that promises to be innovative in our academic careers. We also
want to acknowledge all the work done by the whole team of the CLEF 2023 project, especially Liana
Ermakova, who hosted the Artificial Intelligence’s course and Séverine Allain and Ludmila Guillotin
for the excellent coordination and the warm welcome from the Université de Bretagne Occidentale.
Grigori Sidorov, and Adam Jatowt. 2023. Overview of JOKER - CLEF-2023 track on
Automatic Wordplay Analysis. In Avi Arampatzis, Evangelos Kanoulas, Theodora Tsikrika,
Stefanos Vrochidis, Anastasia Giachanou, Dan Li, Mohammad Aliannejadi, Michalis
Vlachos, Guglielmo Faggioli, Nicola Ferro (Eds.) Experimental IR Meets Multilinguality,
Multimodality, and Interaction. Proceedings of the Fourteenth International Conference of
the CLEF Association (CLEF 2023).
[3] L. Ermakova, T. Miller, F. Regattin, A. G. Bosser, C. Borg, E. Mathurin, G. Le Corre, S.</p>
      <p>Araújo, R. Hannachi, J. Boccou, A. Digue, A. Damoy and B. Jeanjean, Overview of
JOKER@CLEF 2022: Automatic Wordplay and Humor Translation Workshop, in:
Experimental IR Meets Multilinguality, Multimodality, and Interaction: 13th International
Conference of the CLEF Association, CLEF 2022, Bologna, Italy, September 5-8, 2022,
Proceedings, Sprinter-Verlag, Berlin, Heidelberg, 2022, pp. 447-469. doi:
10.1007/978-3031-13643-6_27.
[4] Liana Ermakova, Fabio Regattin, Tristan Miller, Anne-Gwenn Bosser, Claudine Borg,
Benoît Jeanjean, Élise Mathurin, Gaëlle Le Corre, Radia Hannachi, Sílvia Araújo, Julien
Boccou, Albin Digue, and Aurianne Damoy. “Overview of the CLEF 2022 JOKER Task 3:
Pun Translation from English into French”. In Guglielmo Faggioli, Nicola Ferro, Allan
Hanbury, and Martin Potthast, editors, Proceedings of the Working Notes of CLEF 2022
-Conference and Labs of the Evaluation Forum, Bologna, Italy, September 5th to 8th, 2022,
volume 3180 of CEUR Workshop Proceedings (2022): 1681-1700.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Liana</given-names>
            <surname>Ermakova</surname>
          </string-name>
          , Fabio Regattin,
          <string-name>
            <given-names>Tristan</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <surname>Anne-Gwenn</surname>
            <given-names>Bosser</given-names>
          </string-name>
          , Sílvia Araújo, Claudine Borg, Gaëlle Le Corre, Julien Boccou, Albin Digue, Aurianne Damoy, Paul Campen, and Orlane Puchalski. “
          <article-title>Overview of the CLEF 2022 JOKER Task 1: Classify and explain instances of wordplay”</article-title>
          . In Guglielmo Faggioli, Nicola Ferro, Allan Hanbury, and Martin Potthast, editors,
          <source>Proceedings of the Working Notes of CLEF 2022 -- Conference and Labs of the Evaluation Forum</source>
          , Bologna, Italy,
          <source>September 5th to 8th</source>
          ,
          <year>2022</year>
          , volume
          <volume>3180</volume>
          <source>of CEUR Workshop Proceedings</source>
          (
          <year>2022</year>
          ):
          <fpage>1641</fpage>
          -
          <lpage>1665</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Liana</given-names>
            <surname>Ermakova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Tristan</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <surname>Anne-Gwenn</surname>
            <given-names>Bosser</given-names>
          </string-name>
          , Victor Manuel Palma Preciado,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>