OFAI–UKP at HAHA@IberLEF2019: Predicting the Humorousness of Tweets Using Gaussian Process Preference Learning

Tristan Miller¹ [0000-0002-0749-1100], Erik-Lân Do Dinh² [0000-0002-1536-3854], Edwin Simpson² [0000-0002-6447-1552], and Iryna Gurevych² [0000-0003-2187-7621]

¹ Austrian Research Institute for Artificial Intelligence (OFAI), Freyung 6, 1010 Vienna, Austria, http://www.ofai.at/
² Ubiquitous Knowledge Processing Lab (UKP-TUDA), Department of Computer Science, Technische Universität Darmstadt, Hochschulstraße 10, 64289 Darmstadt, Germany, https://www.ukp.tu-darmstadt.de/

Abstract. Most humour processing systems to date make at best discrete, coarse-grained distinctions between the comical and the conventional, yet such notions are better conceptualized as a broad spectrum. In this paper, we present a probabilistic approach, a variant of Gaussian process preference learning (GPPL), that learns to rank and rate the humorousness of short texts by exploiting human preference judgments and automatically sourced linguistic annotations. We apply our system, which had previously shown good performance on English-language one-liners annotated with pairwise humorousness annotations, to the Spanish-language data set of the HAHA@IberLEF2019 evaluation campaign. We report system performance for the campaign's two subtasks, humour detection and funniness score prediction, and discuss some issues arising from the conversion between the numeric scores used in the HAHA@IberLEF2019 data and the pairwise judgment annotations required for our method.

Keywords: Computational humour · Humour · Gaussian process preference learning · GPPL · Best–worst scaling

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). IberLEF 2019, 24 September 2019, Bilbao, Spain.

1 Introduction

Humour is an essential part of everyday communication, particularly in social media [21,33], yet it remains a challenge for computational methods. Unlike conventional language, humour requires complex linguistic and background knowledge to understand, which are difficult to integrate with NLP methods [19].

An important step in the automatic processing of humour is to recognize its presence in a piece of text. However, humour may be present, or be perceived, to varying degrees by its human audience [5]. This level of appreciation (i.e., humorousness or, equivalently, funniness) can vary according to the text's content and structural features, such as nonsense or disparagement [7] or, in the case of puns, contextual coherence [25] and the cognitive effort required to recover the target word [18, pp. 123–124].

While previous work has considered mainly binary classification approaches to humorousness, the HAHA@IberLEF2019 shared task [10] also focuses on its gradation. This latter task is important for downstream applications such as conversational agents or machine translation, which must choose the correct tone in response to humour, or find appropriate jokes and wordplay in a target language. The degree of creativeness may also inform an application whether the semantics of a joke can be inferred from similar examples.

This paper describes the OFAI–UKP system that participated in both subtasks of the HAHA@IberLEF2019 evaluation campaign: binary classification of tweets as humorous or not humorous, and the quantification of humour in those tweets.
Our system employs a Bayesian approach: a variant of Gaussian process preference learning (GPPL) that infers humorousness scores or rankings on the basis of manually annotated pairwise preference judgments and automatically annotated linguistic features. In the following sections, we describe and discuss the background and methodology of our system, our means of adapting the HAHA@IberLEF2019 data to work with our system, and the results of our system evaluation on this data.

2 Background

Pairwise comparisons can be used to infer rankings or ratings by assuming a random utility model [37], meaning that the annotator chooses an instance from a pair with a probability that is a function of the instance's utility. Therefore, when the instances in a pair have similar utilities, the annotator selects either one with a probability close to 0.5, while for instances with very different utilities, the instance with the higher utility is chosen consistently. The random utility model forms the core of two popular preference learning models: the Bradley–Terry model [6,26,31] and the Thurstone–Mosteller model [37,28]. Given such a model and a set of pairwise annotations, probabilistic inference can be used to retrieve the latent utilities of the instances.

Besides pairwise comparisons, a random utility model is also employed by MaxDiff [27], a model for best–worst scaling (BWS), in which the annotator chooses the best and worst instances from a set. While the term "best–worst scaling" originally applied to the data collection technique [15], it now also refers to models such as MaxDiff that describe how annotators make discrete choices. Empirical work on BWS has shown that MaxDiff scores (instance utilities) can be inferred using either maximum likelihood or a simple counting procedure that produces linearly scaled approximations of the maximum likelihood scores [16]. The counting procedure defines the score for an instance as the fraction of times the instance was chosen as best, minus the fraction of times it was chosen as worst, out of all comparisons including that instance [23]. From this point on, we refer to the counting procedure as BWS, and apply it to the tasks of inferring scores from both best–worst scaling annotations for metaphor novelty and pairwise annotations for funniness.

Gaussian process preference learning (GPPL) [11] is a Thurstone–Mosteller-based model that accounts for the features of the instances when inferring their scores; it can therefore make predictions for unlabelled instances and copes better with sparse pairwise labels. GPPL uses Bayesian inference, which has been shown to cope well with sparse and noisy data [39,38,4,24], including disagreements between multiple annotators [12,36,14,22]. Through the random utility model, GPPL is able to treat disagreements between annotators as noise, since no label is assigned a selection probability of exactly one. Given a set of pairwise labels and the features of the labelled instances, GPPL can estimate the posterior distribution over the utilities of any instances given their features. Relationships between instances are modelled by a Gaussian process, which computes the covariance between instance utilities as a function of their features [32].
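As a concrete illustration of the choice probability underlying both the random utility model and GPPL's likelihood, the following minimal Python sketch (our own simplification, which omits GPPL's noise-scale term; the function name is ours) computes the Thurstone–Mosteller probability that an annotator prefers one instance over another, given their latent utilities:

    from scipy.stats import norm

    def preference_probability(f_a, f_b):
        """Thurstone-Mosteller choice probability: the probability that an
        annotator prefers instance a over instance b, given their latent
        utilities f_a and f_b.  Similar utilities yield a near-coin-flip
        choice; very different utilities yield a near-certain choice."""
        return norm.cdf(f_a - f_b)

    print(preference_probability(0.1, 0.0))  # ~0.54: utilities are similar
    print(preference_probability(3.0, 0.0))  # ~0.999: a is consistently preferred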
Since typical methods for posterior inference [29] are not scalable (their cost is O(n³), where n is the number of instances), some of the present authors introduced a scalable method for GPPL that permits arbitrarily large numbers of instances and pairs [35]. This method uses stochastic variational inference [20], which limits computational complexity by replacing the instances with a fixed number of inducing points during inference. Our GPPL method has already been applied with good results to ranking arguments by convincingness (which, like funniness, is an abstract linguistic property that is hard to quantify directly) and to ranking English-language one-liners by humorousness [35,34]. In these two tasks, GPPL was found to outperform SVM and BiLSTM regression models that were trained directly on gold-standard scores, and to outperform BWS when given sparse training data, respectively. We therefore elect to use GPPL on the Spanish-language Twitter data of the HAHA@IberLEF2019 shared task. In the interests of replicability, we will be freely releasing the code for running our GPPL system, including the code for the data conversion and subsampling process detailed in §3.2.³

³ https://github.com/UKPLab/haha2019-GPPL

3 Experiments

3.1 Tasks

The HAHA@IberLEF2019 evaluation campaign consists of two tasks. Task 1 is humour detection, where the goal is to predict whether or not a given tweet is humorous, as determined by a gold standard of binary, human-sourced annotations. Systems are scored on the basis of accuracy, precision, recall, and F-measure. Task 2 is humorousness prediction, where the aim is to assign each funny tweet a score approximating the average funniness rating, on a five-point scale, assigned by a set of human annotators. Here system performance is measured by root-mean-squared error (RMSE). For both tasks, the campaign organizers provide a collection of 24 000 manually annotated training examples. The test data consists of a further 6000 tweets whose gold-standard annotations were withheld from the participants.

3.2 Data Preparation

For each of the 24 000 tweets in the HAHA@IberLEF2019 training data, the task organizers asked human annotators to indicate whether the tweet was humorous and, if so, how funny they found it on a scale from 1 ("not funny") to 5 ("excellent"). (This is essentially the same annotation scheme used for the first version of the corpus [9], which was used in the previous iteration of HAHA [8].) As originally distributed, the training data gives the text of each tweet along with the number of annotators who rated it as "not humour", "1", "2", "3", "4", and "5". For the purposes of Task 1, tweets in the positive class received at least three numerical annotations and at least five annotations in total; tweets in the negative class received at least three "not humour" annotations, though possibly fewer than five annotations in total. Only the tweets in the positive class are used in Task 2.

This ordinal data cannot be used as-is with our GPPL system, which requires as input a set of preference judgments between pairs of instances. To work around this, we converted the data into a set of ordered pairs of tweets such that the first tweet has a lower average funniness score than the second. (We consider instances in the negative class to have an average funniness score of 0.)
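For illustration, the derivation of these per-tweet target scores can be sketched as follows (a simplified Python sketch of our conversion; the function and argument names are our own and do not come from the official data files):

    def average_funniness(is_humorous, vote_counts):
        """Compute the target score for one tweet.

        `is_humorous` is the tweet's binary class, and `vote_counts` maps
        each numeric rating (1-5) to the number of annotators who chose it.
        Tweets in the negative class receive an average funniness of 0."""
        if not is_humorous:
            return 0.0
        total_votes = sum(vote_counts.values())
        return sum(rating * count for rating, count in vote_counts.items()) / total_votes

    # Four numeric annotations (one "2", two "3"s, one "4") average to 3.0.
    print(average_funniness(True, {1: 0, 2: 1, 3: 2, 4: 1, 5: 0}))  # 3.0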
While an exhaustive set of pairings would contain 575 976 000 pairs (minus the pairs in which both tweets have the same score), we produced only 10 730 229 pairs, which was the minimal set necessary to accurately order the tweets. For example, if the original data set contained three tweets A, B, and C with average funniness scores 5.0, 3.0, and 1.0, respectively, then our data would contain the pairs (C, B) and (B, A) but not (C, A). To save memory and computation time in the training phase, we produced a random subsample such that the number of pairs in which a given tweet appeared as the funnier one was capped at 500. This resulted in a total of 485 712 pairs. In a second configuration, we subsampled up to 2500 pairs per tweet; to meet memory limitations, we used a random 60% of this set, resulting in 686 098 pairs.

As regards the tweets' textual data, we perform only basic tokenization as preprocessing. For lookup purposes (synset lookup; see §3.3), we also lemmatize the tweets.

3.3 Experimental Setup

As we adapt an existing system that works on English data [34], we generally reuse the features employed there, but use Spanish resources instead. Each tweet is represented by the vector resulting from a concatenation of the following:

– The average of the word embedding vectors of the tweet's tokens, for which we use 200-dimensional pretrained Spanish Twitter embeddings [13].
– The average frequency of the tweet's tokens, as determined by a Wikipedia dump.⁴
– The average word polysemy, i.e., the number of synsets per lemma of the tweet's tokens, as given by the Multilingual Central Repository (MCR 3.0, release 2016) [17].

Using the test data from the HAHA@IberLEF2018 task [8] as a development set, we further identified the following features from the UO UPV system [30] as helpful:

– The heuristically estimated turn count (i.e., the number of tokens beginning with - or --) and a binary dialogue heuristic (i.e., whether the turn count is greater than 2).
– The number of hashtags (i.e., tokens beginning with #).
– The number of URLs (i.e., tokens beginning with www or http).
– The number of emoticons.⁵
– The character and token counts, as well as the mean token length.
– The counts of exclamation marks and of other punctuation (.,;?).

⁴ https://dumps.wikimedia.org/eswiki/20190420/eswiki-20190420-pages-articles.xml.bz2; last accessed on 2019-06-15.
⁵ https://en.wikipedia.org/wiki/List_of_emoticons#Western, Western list; last accessed on 2019-06-15.

We adapt the existing GPPL implementation⁶ using the authors' recommended hyperparameter defaults [35]: batch size |P_i| = 200, scale hyperparameters α₀ = 2 and β₀ = 200, and the number of inducing points (i.e., the smaller number of data points that act as substitutes for the tweets in the dataset) M = 500. The maximum number of iterations was set to 2000. Using these feature vectors, hyperparameter settings, and data pairs, training takes roughly two hours on a 24-core cluster with 2 GHz CPU cores.

⁶ https://github.com/UKPLab/tacl2018-preference-convincing

After training the model, an additional step is necessary to transform the GPPL output values to the original funniness range (0, 1–5). For this purpose, we train a Gaussian process regressor, supplying it with the output values of the GPPL system as features and the corresponding HAHA@IberLEF2018 test data values as targets. However, this model can still yield results outside the desired range when applied to the GPPL output of the HAHA@IberLEF2019 test data; we therefore map too-large and too-small values onto the range boundaries. We further set an empirically determined threshold on the predicted score for binary humour classification.
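This post-processing step can be sketched as follows, using scikit-learn's GaussianProcessRegressor (a minimal sketch under our own assumptions; the function name, default kernel, and placeholder threshold are illustrative rather than our exact implementation):

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor

    def map_to_funniness(gppl_dev, gold_dev, gppl_test, threshold=1.0):
        """Map raw GPPL utilities onto the 0-5 funniness range.

        Fits a GP regressor on development data (GPPL outputs vs. gold average
        funniness scores), predicts scores for the test outputs, clips
        out-of-range predictions to the range boundaries, and thresholds the
        scores for the binary task.  The threshold value is a placeholder;
        ours was determined empirically."""
        regressor = GaussianProcessRegressor()
        regressor.fit(np.asarray(gppl_dev).reshape(-1, 1), np.asarray(gold_dev))
        scores = regressor.predict(np.asarray(gppl_test).reshape(-1, 1))
        scores = np.clip(scores, 0.0, 5.0)   # map too-large/too-small values onto the boundaries
        is_humorous = scores >= threshold    # binary decision for Task 1
        return scores, is_humorous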
3.4 Results and Discussion

Tables 1 and 2 report results for the binary classification setup (Task 1) and the regression task (Task 2), respectively. Included in each table are the scores of our own system, as well as those of the top-performing system and a naïve baseline. For Task 1, the naïve baseline makes a random classification for each tweet (with a uniform distribution over the two classes); for Task 2, it assigns a funniness score of 3.0 to each tweet.

Table 1. Results for Task 1 (humour detection)

System     F1     Precision  Recall  Accuracy
winner     0.821  0.791      0.852   0.855
OFAI–UKP   0.660  0.588      0.753   0.698
baseline   0.440  0.394      0.497   0.505

Table 2. Results for Task 2 (funniness score prediction)

System     RMSE
winner     0.736
OFAI–UKP   1.810
baseline   2.455

In the binary classification setup, our system achieved an F-measure of 0.660 on the test data, representing a precision of 0.588 and a recall of 0.753. In the regression task, we achieved an RMSE of 1.810. These results are based on the second data subsample (up to 2500 pairs per tweet); the results for the first (up to 500 pairs per tweet) were slightly lower.

Our results for both tasks, while handily beating those of the naïve baseline, are markedly worse than those reported by some other systems in the evaluation campaign, including of course the winner. This is somewhat surprising given GPPL's very good performance in our previous English-language experiments [34]. Unfortunately, our lack of fluency in Spanish and lack of access to the gold-standard scores for the test set tweets preclude us from performing a detailed qualitative error analysis. However, we speculate that our system's less than stellar performance can partly be attributed to the information loss in converting between the numeric scores used in the HAHA@IberLEF2019 tasks and the preference judgments used by our GPPL system. In support of this explanation, we note that the output of our GPPL system is rather uniform; the scores occur in a narrow range with very few outliers. (Figure 1 shows this outcome for the HAHA@IberLEF2018 test data.) Possibly this effect would have been less pronounced had we used a much larger subsample, or even the entirety, of the possible training pairs, though as discussed in §3.2, technical and temporal limitations prevented us from doing so. We also speculate that the Gaussian process regressor we used may not have been the best way of mapping our GPPL scores back onto the task's funniness scale (albeit still better than a linear mapping).

Figure 1. Gold values of the HAHA@IberLEF2018 test data vs. the GPPL output of our system, before mapping to the expected funniness range using GPR. The lowest output value (−1400) was removed from the plot to obtain a better visualization.

Apart from the difficulties posed by the differences in the annotation and scoring, our system may have been affected by the mismatch between its language resources and the language of the test data.
That is, while we relied on language resources like Wikipedia and the MCR that reflect standardized registers and prestige dialects, the HAHA@IberLEF2019 data is drawn from unedited social media, whose language is less formal, treats a different range of topics, and may reflect a wider range of dialects and writing styles. Twitter data in particular is known to present problems for vanilla NLP systems, at least without extensive cleaning and normalization [1]. This is reflected in our choice of word embeddings: while we achieved a Spearman rank correlation of ρ = 0.52 with the HAHA@IberLEF2018 test data using embeddings based on Twitter data [13], the same system using more "standard" Wikipedia-/news-/Web-based embeddings [2] resulted in a correlation near zero.

4 Conclusion

This paper has presented the OFAI–UKP system for predicting both binary and graded humorousness. It employs Gaussian process preference learning, a Bayesian method that learns to rank and rate instances by exploiting pairwise preference judgments. By providing additional feature data (in our case, shallow linguistic features), the method can learn to predict scores for previously unseen items.

Though our system had previously achieved good results with rudimentary, task-agnostic linguistic features on two English-language tasks (including one involving the gradation of humorousness), its performance on the Spanish-language Twitter data of HAHA@IberLEF2019 was less impressive. We tentatively attribute this to the information loss involved in the (admittedly artificial) conversion between the numeric annotations used in the task and the preference judgments required as input to our method, and to the fact that we do not normalize the Twitter data to match our linguistic resources. Possible future work includes mitigating these two problems (for example, by normalizing the language of the tweets, by devising a better way of converting between humour annotation formats, or by sourcing new preference judgments from Spanish-speaking annotators) and using additional, humour-specific features, including some of those used in past work as well as those inspired by the prevailing linguistic theories of humour [3]. The benefits of including word frequency also point to possible further improvements using n-grams, tf–idf, or other task-agnostic linguistic features.

Acknowledgments

This work has been supported by the German Federal Ministry of Education and Research (BMBF) under the promotional reference 01UG1816B (CEDIFOR), by the German Research Foundation (DFG) as part of the QA-EduInf project (grants GU 798/18-1 and RI 803/12-1), by the DFG-funded research training group "Adaptive Preparation of Information from Heterogeneous Sources" (AIPHES; GRK 1994/1), and by the Austrian Science Fund (FWF) under project M 2625-N31. The Austrian Research Institute for Artificial Intelligence is supported by the Austrian Federal Ministry for Science, Research and Economy.

References

1. Proceedings of the ACL 2015 Workshop on Noisy User-generated Text. Association for Computational Linguistics (Jul 2015). https://doi.org/10.18653/v1/W15-43, https://www.aclweb.org/anthology/W15-4300
2. Almeida, A., Bilbao, A.: Spanish 3B words word2vec embeddings (version 1.0) (2018). https://doi.org/10.5281/zenodo.1410403
3. Attardo, S.: Linguistic Theories of Humor. Mouton de Gruyter, Berlin (1994). https://doi.org/10.1515/9783110219029
4.
Beck, D., Cohn, T., Specia, L.: Joint emotion analysis via multi-task Gaussian processes. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. pp. 1798–1803. Association for Computational Linguistics (2014). https://doi.org/10.3115/v1/D14-1190
5. Bell, N.D.: Failed humor. In: Attardo, S. (ed.) The Routledge Handbook of Language and Humor, pp. 356–370. Routledge Handbooks in Linguistics, Routledge, New York (Feb 2017), https://www.routledgehandbooks.com/doi/10.4324/9781315731162.ch25
6. Bradley, R.A., Terry, M.E.: Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika 39(3/4), 324–345 (Dec 1952). https://doi.org/10.2307/2334029
7. Carretero-Dios, H., Pérez, C., Buela-Casal, G.: Assessing the appreciation of the content and structure of humor: Construction of a new scale. Humor: International Journal of Humor Research 23(3), 307–325 (Aug 2010). https://doi.org/10.1515/humr.2010.014
8. Castro, S., Chiruzzo, L., Rosá, A.: Overview of the HAHA task: Humor analysis based on human annotation at IberEval 2018. In: Rosso, P., Gonzalo, J., Martínez, R., Montalvo, S., de Albornoz, J.C. (eds.) Proceedings of the Third Workshop on Evaluation of Human Language Technologies for Iberian Languages. CEUR Workshop Proceedings, vol. 2150, pp. 187–194. Spanish Society for Natural Language Processing (Sep 2018), http://ceur-ws.org/Vol-2150/overview-HAHA.pdf
9. Castro, S., Chiruzzo, L., Rosá, A., Garat, D., Moncecchi, G.: A crowd-annotated Spanish corpus for humor analysis. In: Proceedings of the Sixth International Workshop on Natural Language Processing for Social Media. pp. 7–11. Association for Computational Linguistics (2018), http://aclweb.org/anthology/W18-3502
10. Chiruzzo, L., Castro, S., Etcheverry, M., Garat, D., Prada, J.J., Rosá, A.: Overview of HAHA at IberLEF 2019: Humor analysis based on human annotation. In: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019). CEUR Workshop Proceedings (Sep 2019)
11. Chu, W., Ghahramani, Z.: Preference learning with Gaussian processes. In: Proceedings of the 22nd International Conference on Machine Learning. pp. 137–144. ACM (2005). https://doi.org/10.1145/1102351.1102369
12. Cohn, T., Specia, L.: Modelling annotator bias with multi-task Gaussian processes: An application to machine translation quality estimation. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics. vol. 1, pp. 32–42. Association for Computational Linguistics (2013), http://aclweb.org/anthology/P13-1004
13. Deriu, J., Lucchi, A., De Luca, V., Severyn, A., Müller, S., Cieliebak, M., Hoffmann, T., Jaggi, M.: Leveraging large amounts of weakly supervised data for multi-language sentiment classification. In: Proceedings of the 26th International World Wide Web Conference. pp. 1045–1052. International World Wide Web Conferences Steering Committee (2017)
14. Felt, P., Ringger, E.K., Seppi, K.D.: Semantic annotation aggregation with conditional crowdsourcing models and word embeddings. In: Proceedings of the 26th International Conference on Computational Linguistics. pp. 1787–1796 (2016), http://aclweb.org/anthology/C16-1168
15. Finn, A., Louviere, J.J.: Determining the appropriate response to evidence of public concern: The case of food safety. Journal of Public Policy & Marketing 11(2), 12–25 (Sep 1992). https://doi.org/10.1177/074391569201100202
16.
Flynn, T.N., Marley, A.A.J.: Best–worst scaling: Theory and methods. In: Hess, S., Daly, A. (eds.) Handbook of Choice Modelling, pp. 178–201. Edward Elgar Publishing, Cheltenham, UK (2014). https://doi.org/10.4337/9781781003152.00014
17. Gonzalez-Agirre, A., Laparra, E., Rigau, G.: Multilingual Central Repository version 3.0. In: Proceedings of the 8th International Conference on Language Resources and Evaluation. pp. 2525–2529. European Language Resources Association (2012)
18. Hempelmann, C.F.: Paronomasic Puns: Target Recoverability Towards Automatic Generation. Ph.D. thesis, Purdue University, West Lafayette, IN, USA (Aug 2003)
19. Hempelmann, C.F.: Computational humor: Beyond the pun? In: Raskin, V. (ed.) The Primer of Humor Research, pp. 333–360. No. 8 in Humor Research, Mouton de Gruyter, Berlin (2008). https://doi.org/10.1515/9783110198492.333
20. Hoffman, M.D., Blei, D.M., Wang, C., Paisley, J.W.: Stochastic variational inference. Journal of Machine Learning Research 14, 1303–1347 (May 2013), http://jmlr.org/papers/v14/hoffman13a.html
21. Holton, A.E., Lewis, S.C.: Journalists, social media, and the use of humor on Twitter. Electronic Journal of Communication 21(1&2) (2011), http://www.cios.org/EJCPUBLIC/021/1/021121.html
22. Kido, H., Okamoto, K.: A Bayesian approach to argument-based reasoning for attack estimation. In: Sierra, C. (ed.) Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence. pp. 249–255. International Joint Conferences on Artificial Intelligence (2017). https://doi.org/10.24963/ijcai.2017/36
23. Kiritchenko, S., Mohammad, S.M.: Capturing reliable fine-grained sentiment associations by crowdsourcing and best–worst scaling. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pp. 811–817. Association for Computational Linguistics (2016). https://doi.org/10.18653/v1/N16-1095
24. Lampos, V., Aletras, N., Preoţiuc-Pietro, D., Cohn, T.: Predicting and characterising user impact on Twitter. In: Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics. pp. 405–413. Association for Computational Linguistics (2014). https://doi.org/10.3115/v1/E14-1043
25. Lippman, L.G., Dunn, M.L.: Contextual connections within puns: Effects on perceived humor and memory. Journal of General Psychology 127(2), 185–197 (Apr 2000). https://doi.org/10.1080/00221300009598578
26. Luce, R.D.: On the possible psychophysical laws. Psychological Review 66(2), 81–95 (1959). https://doi.org/10.1037/h0043178
27. Marley, A.A.J., Louviere, J.J.: Some probabilistic models of best, worst, and best–worst choices. Journal of Mathematical Psychology 49(6), 464–480 (2005). https://doi.org/10.1016/j.jmp.2005.05.003
28. Mosteller, F.: Remarks on the method of paired comparisons: I. The least squares solution assuming equal standard deviations and equal correlations. Psychometrika 16(1), 3–9 (Mar 1951). https://doi.org/10.1007/BF02313422
29. Nickisch, H., Rasmussen, C.E.: Approximations for binary Gaussian process classification. Journal of Machine Learning Research 9, 2035–2078 (Oct 2008), http://www.jmlr.org/papers/volume9/nickisch08a/nickisch08a.pdf
30. Ortega-Bueno, R., Muñiz Cuza, C.E., Medina Pagola, J.E., Rosso, P.: UO UPV: Deep linguistic humor detection in Spanish social media.
In: Rosso, P., Gonzalo, J., Martínez, R., Montalvo, S., de Albornoz, J.C. (eds.) Proceedings of the Third Workshop on Evaluation of Human Language Technologies for Iberian Languages. CEUR Workshop Proceedings, vol. 2150, pp. 203–213. Spanish Society for Natural Language Processing (Sep 2018), http://ceur-ws.org/Vol-2150/HAHA_paper2.pdf
31. Plackett, R.L.: The analysis of permutations. Journal of the Royal Statistical Society, Series C (Applied Statistics) 24(2), 193–202 (1975). https://doi.org/10.2307/2346567
32. Rasmussen, C.E., Williams, C.K.I.: Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning, MIT Press, Cambridge, MA, USA (2006), http://www.gaussianprocess.org/gpml/
33. Shifman, L.: Memes in Digital Culture. Essential Knowledge, MIT Press, Cambridge, MA, USA (Oct 2013)
34. Simpson, E., Do Dinh, E.L., Miller, T., Gurevych, I.: Predicting humorousness and metaphor novelty with Gaussian process preference learning. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019). Association for Computational Linguistics (Jul 2019), to appear
35. Simpson, E., Gurevych, I.: Finding convincing arguments using scalable Bayesian preference learning. Transactions of the Association for Computational Linguistics 6, 357–371 (2018), http://aclweb.org/anthology/Q18-1026
36. Simpson, E.D., Venanzi, M., Reece, S., Kohli, P., Guiver, J., Roberts, S.J., Jennings, N.R.: Language understanding in the wild: Combining crowdsourcing and machine learning. In: Proceedings of the 24th International Conference on World Wide Web. pp. 992–1002. International World Wide Web Conferences Steering Committee (2015). https://doi.org/10.1145/2736277.2741689
37. Thurstone, L.L.: A law of comparative judgment. Psychological Review 34(4), 273–286 (1927). https://doi.org/10.1037/h0070288
38. Titov, I., Klementiev, A.: A Bayesian approach to unsupervised semantic role induction. In: Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics. pp. 12–22. Association for Computational Linguistics (2012), http://aclweb.org/anthology/E12-1003
39. Xiong, H.Y., Barash, Y., Frey, B.J.: Bayesian prediction of tissue-regulated splicing using RNA sequence and cellular context. Bioinformatics 27(18), 2554–2562 (Sep 2011). https://doi.org/10.1093/bioinformatics/btr444