=Paper=
{{Paper
|id=Vol-2624/paper3
|storemode=property
|title=Supervised Pun Detection and Location with Feature Engineering and Logistic Regression
|pdfUrl=https://ceur-ws.org/Vol-2624/paper3.pdf
|volume=Vol-2624
|authors=Jingyuan Feng,Özge Sevgili,Steffen Remus,Eugen Ruppert,Chris Biemann
|dblpUrl=https://dblp.org/rec/conf/swisstext/FengSRRB20
}}
==Supervised Pun Detection and Location with Feature Engineering and Logistic Regression==
Jingyuan Feng (Technische Universität Hamburg, Hamburg, Germany)
Özge Sevgili, Steffen Remus, Eugen Ruppert, and Chris Biemann (Universität Hamburg, Hamburg, Germany)
jingyuan.feng@tuhh.de
{sevgili,remus,ruppert,biemann}@informatik.uni-hamburg.de
Abstract

Puns, by exploiting ambiguities, are commonly used in literature to achieve a humorous or rhetorical effect. Previous approaches mainly focus on machine learning models or rule-based methods; however, they have not addressed how and why a pun is detected or located. Focusing on this, we propose a system for recognizing and locating English puns. Given the limited training data and the aim of measuring how relevant a predictor is and in which direction it is associated, we compile a dataset, explore different feature sets as input for logistic regression, and measure their influence in terms of the assigned weights. To the best of our knowledge, our system achieves better results than state-of-the-art systems on three subtasks for different types of puns.

Copyright (c) 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Puns are a type of wordplay that deliberately exploits two or more different meanings of the same or similar words in a sentence. Puns utilizing the same word with ambiguous senses are known as homographic: in "I used to be a banker but I lost interest", the trigger word "interest" could mean "curiosity" or "a fixed charge for borrowing money" (WordNet, http://wordnetweb.princeton.edu/perl/webwn). Puns using different words with similar sounds are called heterographic: in "Are evil wildebeests bad gnus?", "gnus" and "news" (/nu:z/, https://dictionary.cambridge.org/dictionary/english/) have the same pronunciation. With such ambiguities, puns can usually achieve a humorous or rhetorical effect. They can be seen not only as jokes but are also widely used in literature, and can be traced back as early as the Roman playwright Plautus (Pollack, 2011).

Puns can be a challenge to appreciate, even for humans: it requires not only sufficient associations but also rich English and background knowledge. A well-functioning system in machine translation may help non-English users better understand literary criticism and analysis. Besides, it may also enhance the experience of human-computer interaction (Hempelmann, 2008).

This paper focuses on the detection and location of homographic and heterographic puns. The state-of-the-art systems mostly deployed rule-based strategies, and a few proposed complex machine learning models. Their experimental results reached 83 % to 90 % F1 in pun detection and 80 % in the pun location identification task on the SemEval-2017 Task 7 dataset (Miller et al., 2017). Our contributions are: (1) accumulating a dataset for pun detection; (2) utilizing a logistic regression model to show the relations with straightforward features on both the sentence level and the word level; (3) unveiling how puns may work according to their type.

2 Related Work

Most previous work focuses on pun generation and modelling. Starting in 2004, Taylor and Mazlack (2004) used N-grams to recognize and locate wordplay. In 2015, Miller and Gurevych (2015) adapted knowledge-based Word Sense Disambiguation (WSD) to "disambiguate" the different meanings of puns. By processing sentences, Kao et al. (2016) built an information-theory-based computational model for interpreting puns with "ambiguity" and "distinctiveness".

In the pun detection part, Sevgili et al. (2017) computed PMI scores for every pair of words and looked for strong associations; Pedersen (2017) applied different settings of WSD approaches to vote for puns; Doogan et al. (2017) calculated phonetic distances for heterographic puns. Several papers also proposed supervised methods: Indurthi and Oota (2017) differentiated puns from non-puns using a bi-directional Recurrent Neural Network (RNN) with word embeddings as features. Xiu et al. (2017) also trained a classifier, but on a self-collected training set, with features based on WordNet (Miller, 1995) and word2vec (Mikolov et al., 2013) embeddings. Diao et al. (2019) created the PSUGA model for heterographic puns, which applies a hierarchical attention mechanism to learn phoneme and spelling relations. For pun location identification, Doogan et al. (2017) selected words whose two senses have higher similarity scores with two different content words. Vechtomova (2017) developed eleven features as rules, including position information, PMI, TF-IDF, etc., to score candidate words. Zou and Lu (2019) jointly detected and located puns with tags from an LSTM (Long Short-Term Memory) network and CRFs (Conditional Random Fields). Cai et al. (2018) also applied a BiLSTM, but based on sense-aware models.

Two works by Mao et al. (2020) and Zhou et al. (2020) were reviewed and published concurrently. Mao et al. (2020) captured long-distance and short-distance semantic relations between words; Zhou et al. (2020) combined contextualized word embeddings and pronunciation embeddings with a self-attentive encoder, reaching increases of 2 %, 2 %, 13 % and 7 % in F-score on the four tasks, respectively.

Previous studies focused on machine learning models or rule-based methods, e.g., Diao et al. (2019); however, they are not able to measure how strongly a predictor is associated with the prediction, nor the direction of that association. Instead of using rules that are mainly based on belief, or deploying neural networks which, due to their intrinsic complexity, give no clear clue about the relationship between input and output, we use logistic regression with combinations of widely-used terms. This gives us the best result so far and also provides a valuable by-product, i.e. it helps us to uncover hidden relationships in the puns.

3 Methods

The influence of each feature can be traced based on the results. These terms explore the statistical characteristics of a pun as well as its semantic properties. In general, they can be categorized into the following four types.

Part-of-speech (POS) tag: According to the statistics of the dataset, nouns, verbs, adjectives and proper nouns make up about 98 % of all pun words. Besides, a verb-type pun word is almost certain to appear at the end.

Representation of the entire sentence: Pre-trained doc2vec (Le and Mikolov, 2014) and BERT (Devlin et al., 2019) language models are used to get a representation of the sentence as the contextual background for disambiguation.

Sentence separation: Many researchers believe that the pun word is often located in the latter half of a sentence. However, Sevgili et al. (2017) and Oele and Evang (2017) lost this structure when using PMI and WSD, respectively, and Vechtomova (2017) failed on most complex sentences by splitting at certain keywords. Instead, we use dependency parsing to extract the largest strict sub-tree of the sentence structure as the second part, leaving the rest as the first part (see Figure 1). This separates sentences and preserves the structure regardless of the sentence type.

[Figure 1: Example sentence for dependency parsing, e.g., "They hid from the gunman in a sauna where they could sweat it out". After sentence separation: "where they could sweat it out" (important part) and "They hid from the gunman in a sauna" (the rest part).]
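A minimal sketch of this separation step is shown below. It assumes spaCy with its en_core_web_sm model as the dependency parser and reads "largest strict sub-tree" as the largest sub-tree headed by a non-root verb; the paper does not specify the parser or the exact sub-tree criterion, so this is only an illustration.

# Sketch of the sentence-separation step (assumptions: spaCy, en_core_web_sm,
# candidates restricted to sub-trees headed by a non-root verb).
import spacy

nlp = spacy.load("en_core_web_sm")

def separate(sentence):
    """Return (first_part, second_part); the second part is the chosen sub-tree."""
    doc = nlp(sentence)
    sent = next(doc.sents)                  # assume a single sentence
    root = sent.root
    candidates = [list(tok.subtree) for tok in sent
                  if tok.pos_ == "VERB" and tok is not root]
    if not candidates:                      # no clausal sub-tree found
        return sentence, ""
    largest = max(candidates, key=len)
    inside = {tok.i for tok in largest}
    first_part = " ".join(tok.text for tok in sent if tok.i not in inside)
    second_part = " ".join(tok.text for tok in largest)
    return first_part, second_part

print(separate("They hid from the gunman in a sauna where they could sweat it out."))
# Expected to behave roughly like Figure 1; the exact split depends on the parser.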
Word embedding or meaning: We use GloVe (Pennington et al., 2014) to derive word embeddings, and other approaches such as path distances between word senses in WordNet to get meanings for word pairs.
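The overall detection pipeline, i.e., building a feature vector per sentence, fitting a logistic regression and reading the direction and strength of each predictor off the signed weights, could be sketched as follows. scikit-learn and the toy features and labels below are assumptions made only so the snippet runs; the real system uses the features of Table 3.

# Self-contained toy sketch: feature vector per sentence -> logistic regression
# -> signed weights as a measure of each predictor's direction and strength.
import numpy as np
from sklearn.linear_model import LogisticRegression

def build_features(sentence):
    # Stand-ins for the real features (Table 3), chosen only for illustration.
    tokens = sentence.split()
    return np.array([
        len(tokens) / 30.0,                         # a length-style feature
        float("," in sentence),                     # a structural feature
        float(tokens[-1].rstrip(".?!").isalpha()),  # a word-level feature
    ])

sentences = [
    "I used to be a banker but I lost interest .",
    "Are evil wildebeests bad gnus ?",
    "The meeting starts at nine tomorrow .",
    "Please close the window before you leave .",
]
labels = np.array([1, 1, 0, 0])                     # toy labels: 1 = pun, 0 = non-pun

X = np.vstack([build_features(s) for s in sentences])
clf = LogisticRegression().fit(X, labels)

# Sign = direction of the association, magnitude = strength of the predictor.
for name, w in zip(["length", "comma", "alpha_end"], clf.coef_[0]):
    print(f"{name:>10}: {w:+.3f}")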
Table 1: Homographic pun detection results with the top three teams from the competition and the two best recent studies (upper part of the table). Features f4, f5, f6, f8, f9, f10 and f12 are used in both settings: 5-fold CV (first setting) and training on our own data (second setting).

System                        P     R     A     F1
Zhou et al. (2020) *          .942  .957  -     .949
Zou and Lu (2019) *           .912  .933  -     .922
Indurthi and Oota (2017) **   .902  .897  .853  .900
Sevgili et al. (2017)         .755  .933  .736  .835
Pedersen (2017)               .783  .872  .736  .825
First setting                 .924  .937  .900  .930
Second setting                .828  .928  .811  .875

* Both teams used 10-fold cross-validation.
** Trained on part of the dataset, evaluated on 675 of the 2250 homographic contexts according to the task organizers.

Table 2: Heterographic pun detection results with the top two teams from the competition and the three best recent studies; the same features as for the homographic type.

System                        P     R     A     F1
Zhou et al. (2020)            .948  .956  -     .952
Zou and Lu (2019)             .867  .931  -     .898
Diao et al. (2019) ***        .879  .851  .829  .865
Sevgili et al. (2017)         .773  .930  .755  .844
Doogan et al. (2017)          .871  .819  .784  .844
First setting                 .921  .939  .899  .930
Second setting                .831  .938  .820  .881

*** 5-fold cross-validation on the original dataset and our compiled corpus from Pun of the Day.

Table 3: Feature list for pun detection.

Feature  Description
f1   The number of words per POS tag.
f2   The distance to the last appearance of each POS tag, normalized to sentence length.
f3   Individual sums of all found PMI values according to POS tags.
f4   doc2vec sentence representation.
f5   doc2vec dot product of the separated parts.
f6   doc2vec representations of both parts of the sentence.
f7   doc2vec cosine similarities of word pairs from the separated parts of the sentence, in descending order; the first 10 values are taken.
f8   Three elements: whether the sentence contains a similar idiomatic expression; whether it differs in exactly one word; how much they have in common.
f9   Word similarity based on the shortest path in WordNet, evaluated with path similarity (the NLTK path similarity score denotes how similar two senses are, based on the shortest path connecting them in the is-a (hypernym/hyponym) taxonomy; http://www.nltk.org/howto/wordnet.html). For each sentence, all word pairs from the two sub-sentences are evaluated, and the 4 largest results are chosen.
f10  The number of associated words in the first part of the sentence: for each word ω from the second part that exists in the Free Association corpus (a collection of word pairs that people tend to think of first when given the other word; http://w3.usf.edu/FreeAssociation/AppendixC/), we count how many content words from the first part are listed as associative words of ω.
f11  The number of words that are predicted differently if one word ω is masked, using BERT.
f12  Sentence representation using BERT.
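As an illustration of how a feature such as f9 might be computed, the sketch below uses NLTK's WordNet interface and its path similarity measure (the WordNet corpus must be available via nltk.download('wordnet')); the enumeration of word pairs and sense combinations is a simplification, not necessarily the exact procedure of the system.

# Sketch of an f9-style feature: best path similarity over all cross-part word
# pairs, keeping the k largest scores.
from itertools import product
from nltk.corpus import wordnet as wn

def pair_similarity(word_a, word_b):
    """Best path similarity over all sense combinations of the two words."""
    scores = [s1.path_similarity(s2) or 0.0
              for s1, s2 in product(wn.synsets(word_a), wn.synsets(word_b))]
    return max(scores, default=0.0)

def f9_like(first_part, second_part, k=4):
    """Evaluate all cross-part word pairs and keep the k largest scores."""
    scores = sorted((pair_similarity(a, b)
                     for a, b in product(first_part, second_part)), reverse=True)
    return (scores + [0.0] * k)[:k]

print(f9_like(["banker", "money"], ["lost", "interest"]))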
4 Experiments

4.1 Subtask 1: Pun Detection

Pun detection is a binary classification problem with a sentence as the input and the decision of whether it is punning as the output.

Data: The published dataset (Miller et al., 2017) contains 2250 contexts for the homographic and 1780 for the heterographic type. We also generated a corpus from "Pun of the Day" (http://www.punoftheday.com/), which contains mixed types of puns. After removing duplicates, we found 843 puns (disregarding pun types) with a significantly larger standard deviation of sentence length compared to the given corpus. We fitted the dataset by limiting the range of word counts and ended up with 707 puns and a variety of negative samples made of non-punning jokes, famous sayings and other short collections. The compiled dataset can be made available upon request.

Setting: In this subtask, we ran experiments in two different settings: 5-fold cross-validation and training purely on the collected data. For the first setting, we cross-validated on the official dataset, since it is not split by the provider. To make it comparable to previous research, we tested all folds independently and calculated the macro score in the end, so the final result covers all benchmark data with 5 sub-experiments. For the second setting, we trained on the self-collected data and evaluated on the official dataset. Both settings are evaluated with standard precision, recall, accuracy and F-score.

Features: Table 3 lists the features; among them, f4, f5, f6, f8, f9, f10 and f12 have a positive influence on the final result, while the remaining features have a negative one. To unify, we choose the same feature set for both the homographic and the heterographic data.
Results: Tables 1 and 2 provide the experimental results for homographic and heterographic pun detection, respectively. The 5-fold CV setting utilizes all data provided in the task; the second setting does not use any data from the task for training.

From both results, our system with the first setting (5-fold cross-validation) achieves the best scores compared with all the teams which used part of the official data for training (all teams listed above except Sevgili et al. (2017) and Doogan et al. (2017)). Compared with the other teams participating in this subtask, our second setting (with separate training data) also outperforms them by about 4 %.

Besides, there is a performance drop of about 5 % when moving from 5-fold CV to training on the self-collected corpus. This may have multiple reasons: for instance, the self-collected corpus does not categorize the punning type; the non-punning samples may contain some hidden puns; and the two corpora vary in terms of their properties.

Ablation test: In the ablation test, we found that the major factors are the sentence representation (f4, f12), the relation between the two parts (f5, f6, f7 and f10) and word meaning (f8, f9). While the sentence representation offers the basis, the relation between the parts also helps, both in general and for individual word pairs.

Furthermore, since the features have different dimensions, f12 occupies more than 2/3 of the feature space, and most of the top 15 % most important components come from it. The third element of f8, which calculates the maximum ratio of overlapping words to the length (in words) of the found idiomatic expression, always has the largest influence, followed by f4 and sometimes f6.
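The kind of weight-based inspection behind this discussion could be done roughly as sketched below, given a fitted logistic regression and a mapping from feature names to index ranges in the concatenated vector. Both the random weights and the block layout here are hypothetical stand-ins; only the inspection logic is meant to be illustrative.

# Sketch: share of absolute weight mass per feature block, and how much of each
# block falls into the top 15 % of components (cf. the role of f12 above).
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=1073)      # stand-in for clf.coef_[0] of a fitted model
blocks = {"f4": slice(0, 300), "f5": slice(300, 301),
          "f9": slice(301, 305), "f12": slice(305, 1073)}   # hypothetical layout

mag = np.abs(weights)
cutoff = np.quantile(mag, 0.85)      # threshold for the top 15 % of components
for name, sl in blocks.items():
    share = mag[sl].sum() / mag.sum()
    in_top = (mag[sl] >= cutoff).mean()
    print(f"{name}: {share:.1%} of the weight mass, {in_top:.1%} of its dimensions in the top 15 %")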
Table 4: Homographic pun location results with the top three teams from the competition and the best recent studies; all features except f2 and f4 are used (C denotes coverage).

System                        C     P     R     F1
Zhou et al. (2020)            -     .904  .875  .889
Mao et al. (2020)             -     .850  .813  .831
Zou and Lu (2019)             -     .835  .771  .802
Cai et al. (2018)             -     .815  .747  .780
Doogan et al. (2017)          .999  .664  .662  .663
Vechtomova (2017)             .999  .653  .652  .652
Indurthi and Oota (2017)      1.00  .522  .522  .522
Our system                    1.00  .762  .762  .762

Table 5: Heterographic pun location results with the three best teams from the competition and the three best recent studies; all features except f7 are used.

System                        C     P     R     F1
Zhou et al. (2020)            -     .942  .904  .923
Mao et al. (2020)             -     .888  .858  .873
Zou and Lu (2019)             -     .814  .775  .794
Vechtomova (2017)             .998  .797  .795  .796
Doogan et al. (2017)          1.00  .685  .685  .685
Sevgili et al. (2017)         .988  .659  .652  .655
Our system                    1.00  .849  .849  .849

4.2 Subtask 2: Pun Location

Pun location is the task of finding out which word in the content is punning, given a pun-containing sentence.

Data: The dataset provided by the organizers includes 1271 homographic puns and 1780 heterographic ones.

Setting: In this subtask, we used 5-fold cross-validation for testing. As in Subtask 1, the official dataset was randomly split into 5 folds; the predictions from each fold were then accumulated and scored jointly. Scores were computed using the standard coverage, precision, recall and F-score measures.

Features: Table 6 lists all the features used in this subtask. All of them are word-based, and each assigns a vector or value to a word. f2 and f4 are only used for the heterographic type since they contain homonymic information. We concatenate all vectors from the chosen features; after logistic regression, the word with the highest score is presumed to be the pun location.
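This word-level scoring step can be sketched as follows: every word receives a concatenated feature vector, a logistic regression scores all words, and the highest-scoring word is predicted as the pun. scikit-learn and the toy position/length features below are assumptions for illustration only; the real system uses the features of Table 6.

# Sketch of per-word scoring for pun location with toy features.
import numpy as np
from sklearn.linear_model import LogisticRegression

def word_features(words, i):
    # Toy stand-ins: position flags (in the spirit of f6) and word length.
    second_half = float(i >= len(words) / 2)
    last_quarter = float(i >= 3 * len(words) / 4)
    return np.array([second_half, second_half + last_quarter, len(words[i]) / 10.0])

def train(sentences, pun_index):
    X, y = [], []
    for words, gold in zip(sentences, pun_index):
        for i in range(len(words)):
            X.append(word_features(words, i))
            y.append(int(i == gold))            # 1 only for the annotated pun word
    return LogisticRegression().fit(np.vstack(X), np.array(y))

def locate(clf, words):
    scores = clf.predict_proba(np.vstack([word_features(words, i)
                                          for i in range(len(words))]))[:, 1]
    return words[int(np.argmax(scores))]        # highest-scoring word = predicted pun

sents = [["I", "used", "to", "be", "a", "banker", "but", "I", "lost", "interest"],
         ["are", "evil", "wildebeests", "bad", "gnus"]]
clf = train(sents, pun_index=[9, 4])
print(locate(clf, sents[0]))                    # expected to pick "interest" with these toy features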
Results: Our system achieves competitive results compared to the teams from the competition. For location on homographic puns, our model's performance is lower than Zou and Lu (2019) and Cai et al. (2018), which use LSTMs (see Table 4). With the added pronunciation features (f2 and f4), our system (using all features except f7) exceeds the state-of-the-art results by around 5 % for heterographic puns (see Table 5).

Ablation test: Both f6 and f7 lead to a significant increase in the result. They give words, especially content words, that appear later in the sentence more weight. f3 uses doc2vec to extract relations between the separated parts, while f9 finds the "surprise" within a pun using masked language modelling (MaskLM). Together, they contribute around a 2 % improvement. These two features correspond to our hypothesis and tend to focus on the differences between the double meanings of the trigger word and the two parts of the sentence. f10 concatenates the GloVe vector to represent the word itself, which results in an approximately 7 % boost. Homonymic information (f2) helps, but still leaves much to explore. First, the data is heterographic rather than homophonic (e.g., "orifice" and "office"). Second, variants of the word are not considered (e.g., "knowingly" and "no"). Third, puns may exploit names or compounds (e.g., "Clarence" and "clearance").
This problem is remedied by adding the word frequency feature (f5) instead of giving special attention to particular names or structure patterns. Although this feature has a significant effect for heterographic puns, it barely influences the homographic ones: unlike heterographic puns, the homographic case needs a word with two senses that are widely known, so a rare word can hardly be used in that situation.

Table 6: Pun location features and descriptions.

Feature  Description
f1   Assign value 1 to the last content word in the sentence; i.e., if the feature is used, the last content word is concatenated with the vector [1], while all other words get [0].
f2   Whether there is a word with the same pronunciation in the CMU Pronouncing Dictionary (an open-source pronunciation dictionary: http://www.speech.cs.cmu.edu/cgi-bin/cmudict).
f3   Maximum doc2vec cosine similarity of word pairs from the separated parts of the sentence.
f4   The number of context words that have a lower doc2vec cosine similarity with word ω than with any of ω's homonyms.
f5   Assign value 0, 1 or 2 to word ω according to its frequency in the Brown corpus (a general English-language corpus of roughly one million words: https://archive.org/details/BrownCorpus). The rarer the word, the higher the assigned value.
f6   The position of word ω in the sentence: assign 1 if it is in the second half, plus an additional 1 if it also lies in the last quarter.
f7   Mark the last noun, verb, adjective and proper noun with a vector at their position (from Subtask 1). For example, if word ω is the last verb in the sentence, its vector for this feature is [0,1,0,0].
f8   doc2vec representation of the whole sentence (from Subtask 1).
f9   The number of context words that are predicted differently by BERT if word ω is masked.
f10  GloVe vector of the word.
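A sketch of the pronunciation feature f2 is given below, assuming NLTK's copy of the CMU Pronouncing Dictionary (requires nltk.download('cmudict')); it simply checks whether some other dictionary entry shares a pronunciation with the given word, which is one plausible reading of the feature description.

# Sketch of f2: does another dictionary word share a pronunciation with this one?
from nltk.corpus import cmudict

pron = cmudict.dict()                  # word -> list of pronunciations (phoneme lists)

def has_homophone(word):
    target = {tuple(p) for p in pron.get(word.lower(), [])}
    if not target:
        return False
    return any(w != word.lower() and any(tuple(p) in target for p in ps)
               for w, ps in pron.items())

print(has_homophone("eight"))          # should be True, e.g. via "ate"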
5 Conclusion and Future Work

We provide a dataset for pun detection and built a model that achieves state-of-the-art results on three subtasks for different types of puns. We found that three things affect a pun: the general interpretation of the content, the relation between the two parts, and word meaning. If we know as a prior that a sentence is punning, we can utilize, e.g., word position or "surprise", according to the pun type, to locate the punning word.

We deployed homophones for the heterographic tasks; this could be an interesting topic for future work, as well as a test of higher-order associations between word pairs. Nevertheless, with the improved results of pun detection and interpretation, our system provides a step towards further understanding and interpretation, and may be assembled into machine translation in the future.

Acknowledgments

We thank the anonymous reviewers for suggestions on the submission; the paper was partially supported by the German Academic Exchange Service and partially supported by base.camp at Universität Hamburg.

References

Yitao Cai, Yin Li, and Xiaojun Wan. 2018. Sense-aware neural models for pun location in texts. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 546–551, Melbourne, Australia.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, MN, USA.

Yufeng Diao, Hongfei Lin, Liang Yang, Xiaochao Fan, Di Wu, Dongyu Zhang, and Kan Xu. 2019. Heterographic pun recognition via pronunciation and spelling understanding gated attention network. In The World Wide Web Conference, pages 363–371, San Francisco, CA, USA.

Samuel Doogan, Aniruddha Ghosh, Hanyang Chen, and Tony Veale. 2017. Idiom Savant at SemEval-2017 Task 7: Detection and interpretation of English puns. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 103–108, Vancouver, Canada.

Christian F Hempelmann. 2008. Computational humor: Beyond the pun. The Primer of Humor Research. Humor Research, 8:333–360.

Vijayasaradhi Indurthi and Subba Reddy Oota. 2017. Fermi at SemEval-2017 Task 7: Detection and interpretation of homographic puns in English language. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 457–460, Vancouver, Canada.

Justine T Kao, Roger Levy, and Noah D Goodman. 2016. A computational model of linguistic humor in puns. Cognitive Science, 40(5):1270–1285.

Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In International Conference on Machine Learning, pages 1188–1196, Beijing, China.
Junyu Mao, Rongbo Wang, Xiaoxi Huang, and Zhiqun Chen. 2020. Compositional semantics network with multi-task learning for pun location. IEEE Access, 8:44976–44982.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, Lake Tahoe, NV, USA.

George A Miller. 1995. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41.

Tristan Miller and Iryna Gurevych. 2015. Automatic disambiguation of English puns. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 719–729, Beijing, China.

Tristan Miller, Christian Hempelmann, and Iryna Gurevych. 2017. SemEval-2017 Task 7: Detection and interpretation of English puns. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 58–68, Vancouver, Canada.

Dieke Oele and Kilian Evang. 2017. BuzzSaw at SemEval-2017 Task 7: Global vs. local context for interpreting and locating homographic English puns with sense embeddings. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 444–448, Vancouver, Canada.

Ted Pedersen. 2017. Duluth at SemEval-2017 Task 7: Puns upon a midnight dreary, lexical semantics for the weak and weary. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 416–420, Vancouver, Canada.

Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar.

John Pollack. 2011. The Pun Also Rises: How the Humble Pun Revolutionized Language, Changed History, and Made Wordplay More Than Some Antics. Penguin, New York, NY, USA.

Özge Sevgili, Nima Ghotbi, and Selma Tekir. 2017. N-Hance at SemEval-2017 Task 7: A computational approach using word association for puns. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 436–439, Vancouver, Canada.

Julia M Taylor and Lawrence J Mazlack. 2004. Computationally recognizing wordplay in jokes. In Proceedings of the Annual Meeting of the Cognitive Science Society, 26, Chicago, IL, USA.

Olga Vechtomova. 2017. UWaterloo at SemEval-2017 Task 7: Locating the pun using syntactic characteristics and corpus-based metrics. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 421–425, Vancouver, Canada.

Yuhuan Xiu, Man Lan, and Yuanbin Wu. 2017. ECNU at SemEval-2017 Task 7: Using supervised and unsupervised methods to detect and locate English puns. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 453–456, Vancouver, Canada.

Yichao Zhou, Jyun-Yu Jiang, Jieyu Zhao, Kai-Wei Chang, and Wei Wang. 2020. "The Boating Store Had Its Best Sail Ever": Pronunciation-attentive contextualized pun recognition. arXiv preprint arXiv:2004.14457.

Yanyan Zou and Wei Lu. 2019. Joint detection and location of English puns. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2117–2123, Stroudsburg, PA, USA.