A Post-Modern Approach to Automatic Metaphor Identification

Dario Del Fante

Federico Manzella

Guido Sciavicco

Eduard I. Stan

Free University of Bozen-Bolzano, Italy; University of Ferrara, Italy

2023

This paper provides the theoretical bases for a symbolic approach to text classification, particularly metaphor identification, that generalizes the existing ones and is inspired by similar generalizations of symbolic approaches to learning models for non-text-related tasks.

Keywords: Automatic metaphor detection and interpretation, Symbolic learning, NLP, Modal logic

1. Introduction

Metaphors involve talking and, potentially, thinking of one thing in terms of another; the two things are different, but we can perceive sets of correspondences between them. In other words, a metaphor corresponds to moving a word or phrase from the context in which it is expected to occur to another context, where it is not expected to occur [1]. Metaphors are ubiquitous in language [2]: they cannot be considered merely an artistic ornament that pertains exclusively to literary discourse; they are essential for the development of language and culture [3, 4, 5].

Much work has been devoted to discussing the metaphor identification and interpretation process, as in [6]. In this sense, a qualitative approach represents the safest methodology, since metaphors concern an aspect of language that can occasionally be ambiguous. For example, two speakers from the same linguistic and cultural context can interpret the same metaphor differently. However, this approach is time-consuming and requires more than two human coders to be effectively reliable. Metaphor identification also remains a computationally hard task, given the many structural problems that prevent automatic identification from being readily effective [7]. Scholars working between digital humanities and computational linguistics have developed different approaches to support automatic identification. Indeed, the recent improvements in artificial intelligence and machine learning might substantially impact metaphor research in terms of the time required and the quantity of text analyzed [8].

From a computational point of view, metaphor identification is a particular case of text classification. The recent literature on general text classification, particularly metaphor identification, is quite broad and includes both top-down approaches [9] and bottom-up ones. Top-down approaches start from a human-designed theory of the phenomenon, which is later digitalized to provide automatic identification. Bottom-up, or data-driven, approaches, on the other hand, aim to perform identification starting from a dataset of examples. Bottom-up strategies can be, in turn, separated into symbolic and sub-symbolic approaches. Sub-symbolic approaches, commonly realized via several types of neural networks, produce black-box models which in some cases can be very accurate [10]. Along with the application of pre-trained and large language models, they are currently a de facto standard for text-related learning tasks, and quite a lot of results exist even in the narrow field of metaphor identification (see, among many others, [11, 10, 12, 13, 14]). Conversely, the purpose of a symbolic approach is to provide an identification model and a statistically validated theory of the phenomenon, written in a suitable logical language. While symbolic systems are sometimes used for text-related tasks in general, their application to the case of metaphor identification needs to be addressed.

2. A Logic-Based Post-Modern Approach

Symbolic and sub-symbolic approaches to text-related tasks are different in spirit. In both cases, the key idea is to provide a representation of the text that is later used for learning. However, in the case of sub-symbolic strategies, such a representation, usually referred to as embedding, is numerical. The most famous examples of sub-symbolic representations are (all variants of) vectorizations of tokens (i.e., words, sentences, or paragraphs). Each token is mapped to a point of a high-dimensional space so that mathematical tools can be used to reason about texts, and a learned model, for example, for metaphor identification, takes the form of a mathematical function.

In symbolic approaches, on the other hand, we encode a token (typically, an entire sentence or paragraph) as a logical model. In the simplest cases, following the so-called bag-of-words methodology, a text is encoded starting from a fixed (arbitrarily long) dictionary; it is translated into a binary vector of length N, N being the size of the dictionary, where the i-th component takes the value 1 if and only if the i-th word of the dictionary occurs in the text. Text-based encodings are easily generalized along two directions: bag-of-words become bag-of-n-grams, and vector components become counters, so that the i-th component takes value k if and only if the i-th n-gram of the fixed n-gram vocabulary occurs exactly k times in the text (in this context, n-grams are not used in their canonical, probabilistic version, that is, to predict the n-th element from the previous n − 1 ones, but, instead, in their crisp one, that is, a straightforward generalization of single words). In most cases, experiments show that using 2-grams attains the best compromise between the computational complexity of the tasks and the performance of the learned models. The logical interpretation of symbolic encoding emerges by introducing propositional letters to represent the text by the presence of relevant n-grams. Simplifying, a symbolic encoding classification model can be described by (sets of) rule(s) of the type:

If ‘flood of immigrants’ occurs then metaphor.

In the above example, ‘flood of immigrants’ is a 2-gram (shown here before tokenization, stemming, and stop-word elimination), and the rule that has been learned checks whether or not that particular 2-gram occurs. Towards an abstract representation, 2-grams can be encoded into propositional letters, which can represent not only their occurrence but also other interesting properties, such as the number of times that they occur. In the end, a text is represented as a model of propositional logic, and a (set of) propositional rule(s) can be statistically learned from a dataset of texts.

A further generalization of symbolic text-based encodings requires two steps: generalizing the concept of n-gram and increasing the expressive power of the logic that we use to describe texts. Both ideas are simple.

Focusing on 2-grams, specifically, the most natural generalization consists of eliminating the constraint that the two words must be adjacent to form a 2-gram. So a generalized 2-gram can be defined as any pair of words occurring in order, not necessarily consecutively. Such a generalization has two main consequences: first, the label of a generalized 2-gram may be much richer than the label of a standard one, and second, the encoding of a text using generalized 2-grams can be much more expressive than the encoding of the same text using standard ones.

Let us focus on labeling. As explained above, a standard 2-gram is logically labeled using (the number of times) that it occurs. A generalized 2-gram, on the other hand, can be labeled using the occurrences of the words in between. In Fig. 1, we see the abstract idea of a generalized 2-gram: the pair of words w3, w7 forms a generalized 2-gram (they are two, possibly non-consecutive, words) and, in the encoding, they are represented by a propositional letter (in the example, p). The meaning of such a propositional letter is no longer limited to the occurrence of w3, w7, either separately or together. On the contrary, one can use the entire sentence between w3 and w7 to build p; examples may range from the topic of the sentence, to its length, the semantic category of any word between the extremes of the generalized 2-gram, and so on.

[Figure 1: a sequence of words w1, ..., w9, illustrating the abstract idea of a generalized 2-gram.]
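The bag-of-2-grams encoding and a rule of the kind "if a given 2-gram occurs then metaphor" can be sketched as follows. This is a minimal illustration, not the authors' implementation: the stop-word list, the function names, and the absence of stemming are all assumptions made for the example. The last helper enumerates ordered, possibly non-adjacent word pairs together with the words in between, which is the raw material from which a richer label could be built.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = frozenset({"a", "an", "the", "of"})

def tokenize(text):
    """Lowercase, split into words, drop stop words (no stemming here)."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def bigram_counts(tokens):
    """Bag of 2-grams: how many times each pair of adjacent tokens occurs."""
    return Counter(zip(tokens, tokens[1:]))

def generalized_bigrams(tokens):
    """Every ordered pair of positions i < j, with the tokens strictly in between."""
    for i, j in combinations(range(len(tokens)), 2):
        yield (tokens[i], tokens[j]), tokens[i + 1:j]

def rule_metaphor(text):
    """Propositional rule: if 'flood of immigrants' occurs then metaphor."""
    return bigram_counts(tokenize(text))[("flood", "immigrants")] >= 1

print(rule_metaphor("A flood of immigrants reached the coast."))  # True
print(rule_metaphor("The flood damaged the coast."))  # False
```

Using a counter rather than a set keeps the number of occurrences available, so the same structure supports both the binary and the counting variants of the encoding.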

Concerning the expressive power of the encoding, now observe that in the standard text-based approaches, the relative ordering of the original sentences is lost, while only the order of the constituents of each n-gram is preserved. Generalized 2-grams, instead, are naturally linked to a qualitative, more-than-propositional logic that allows one to preserve the ordering in a very expressive way. The key idea is that a sentence can be seen as a linearly ordered sequence of words, which entails, in turn, a temporal order, as also proposed in other models, such as BiLSTMs [15, 16]. Thus, a generalized 2-gram is an interval in such an order, and any two intervals on a linear order can be qualitatively related to each other in exactly one of thirteen ways. The family of logics that allow one to describe propositional properties of intervals on a linear order is called interval temporal logics, and they belong to the more general category of modal logics. Originally studied by Allen in the early 80s, interval temporal logics were formalized a few years later, and the most representative language for expressing propositional properties of intervals is the modal logic of time intervals, or HS [17]. In HS, each of the possible binary relations that may exist between two intervals becomes an accessibility relation; it can be immediately verified that they are, in fact, thirteen: six basic relations, their six inverses, and equality. The basic relations are after (capturing an interval that starts at the end of the current one, usually denoted by ⟨A⟩), later (capturing an interval that starts past the end of the current one, ⟨L⟩), overlaps (capturing an interval that starts during the current one and ends after it, ⟨O⟩), during (capturing an interval that starts and ends within the current one, ⟨D⟩), begins (capturing an interval that starts at the start of the current one and ends before its end, ⟨B⟩), and ends (capturing an interval that starts within the current one and ends with it, ⟨E⟩). Continuing with the example in Fig. 1, the generalized 2-gram w4, w6 is during the generalized 2-gram w3, w7.

Working with the relations/operators as they were originally introduced may not always be suitable; interval temporal logics such as HS have been simplified for specific tasks in several ways. Among them, the most relevant proposals include the so-called topological versions of interval temporal logic, in which the relations are, in fact, disjunctions of Allen's relations. So, for example, in the case of HS3 [18], two intervals can just have at least one point in common or can be completely separated; languages such as this one may in fact be designed, and their expressive power modulated, depending on the task. Symbolic learning algorithms for interval temporal logic have been recently studied [19] and used for learning interval temporal properties in very different contexts, mostly, but not exclusively, in the medical sciences (see [20, 21], among others); in those cases, the objects being encoded are multivariate time series, via a process that eventually produces interval temporal models from which rules are ultimately learned. It is worth noting that such diverse contexts, including text-related tasks, can in fact be approached with the same methodology.

[Figure 2: two texts in which generalized 2-grams are identified, with a possible rule for each.]

Top text: These people. They arrive forming a continuous wave, an endless flow that changes societies at all levels, swirling together different and irreconcilable cultures. These are the migrants, often considered a problem.

Bottom text: These people, the migrants, arrive on dilapidated boats at the mercy of the waves and flows. They risk their lives and when they arrive they are often rejected, because, it is believed, they risk changing societies at all levels, including cultural ones.

⟨⟩(topic ‘migrants’ ∧ ⟨⟩topic ‘fluid’) ⇒ metaphor

⟨⟩(topic ‘migrants’ ∧ ¬[]topic ‘fluid’) ⇒ not a metaphor

In Fig. 2, we show how, in a text, relevant generalized 2-grams are identified; in both texts, two generalized 2-grams are identified. Focusing on the top paragraph, the first generalized 2-gram, in red, is captured by the words people and migrants; the entire text in between (even ignoring the full stop, thus ignoring that they belong to two different sentences) is categorized as topic ‘migrants’, thus imitating a human reader who, reading the complete text, can identify when the writer starts referring to some category of persons, when he/she stops doing that, and which category this is. The second generalized 2-gram, in blue, is captured by the words wave and swirling, and the entire text in between is categorized as topic ‘fluid’ (observe the frequency of words that refer to fluids, and water in particular, in the blue-highlighted text). The bottom paragraph shows similar words in a similar, but not identical, order. Both topics are still present and identified in the same way. However, the two topics are in a different topological order. On the right-hand side, we propose a possible rule, written in propositional HS, that uses the topological order of the topics to distinguish between metaphoric and non-metaphoric text. Most interestingly, ChatGPT (version 3.5, consulted on the prompt in September 2023) classifies both texts as metaphoric, probably because metaphors linking fluids and migrants are statistically common.
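The qualitative relations between intervals described in this section can be sketched in a few lines. This is only an illustration under stated assumptions: intervals are pairs of word positions (x, y) with x < y, the function names are hypothetical, and the end points chosen for the ‘fluid’ interval in the bottom text (waves, flows) are a guess, since the source only states the end points for the top text. The coarser two-way distinction mirrors the description of HS3 given above.

```python
import re

def _direct(a, b):
    """One of the six basic relations (or equality) from a to b, else None."""
    (x, y), (z, t) = a, b
    if z == x and t == y:
        return "equals"
    if z == y:
        return "after"
    if z > y:
        return "later"
    if x < z < y < t:
        return "overlaps"
    if x < z and t < y:
        return "during"
    if z == x and t < y:
        return "begins"
    if x < z and t == y:
        return "ends"
    return None

def allen_relation(a, b):
    """Name the single relation, out of thirteen, that leads from a to b."""
    rel = _direct(a, b)
    return rel if rel is not None else "inverse " + _direct(b, a)

def hs3_relation(a, b):
    """Coarse view: the intervals share at least one point, or are separated."""
    (x, y), (z, t) = a, b
    return "intersects" if max(x, z) <= min(y, t) else "disjoint"

def interval(tokens, first, last):
    """Positions of a generalized 2-gram given its two extreme words."""
    i = tokens.index(first)
    return (i, tokens.index(last, i + 1))

top = re.findall(r"[a-z]+", (
    "These people. They arrive forming a continuous wave, an endless flow "
    "that changes societies at all levels, swirling together different and "
    "irreconcilable cultures. These are the migrants."
).lower())
bottom = re.findall(r"[a-z]+", (
    "These people, the migrants, arrive on dilapidated boats at the mercy "
    "of the waves and flows."
).lower())

migrants_top = interval(top, "people", "migrants")
fluid_top = interval(top, "wave", "swirling")
migrants_bottom = interval(bottom, "people", "migrants")
fluid_bottom = interval(bottom, "waves", "flows")  # assumed end points

print(allen_relation((3, 7), (4, 6)))                 # during
print(allen_relation(migrants_top, fluid_top))        # during
print(allen_relation(migrants_bottom, fluid_bottom))  # later
print(hs3_relation(migrants_top, fluid_top))          # intersects
print(hs3_relation(migrants_bottom, fluid_bottom))    # disjoint
```

Under these assumptions, the two texts differ exactly in how the ‘fluid’ interval relates to the ‘migrants’ interval, which is the kind of difference an interval temporal rule can express and a bag of 2-grams cannot.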

3. Conclusions

We will further verify our hypotheses by conducting some tests on an annotated newspaper corpus, human-labeled as metaphor or non-metaphor, consisting of 13,000 tokens and 2,000 different words. The label pertains to the entire text, and the task will consist of recognizing its metaphoric expressions.

This work represents an initial attempt to approach symbolic learning for text-related tasks like metaphor detection. A symbolic approach can extract a theory from a specific linguistic phenomenon, which raises at least three problems: first, determining whether a theory of a phenomenon should exist and in what terms; second, finding the appropriate logic for the extraction process; and third, ensuring the existence of an automatic method for extracting the theory in that logic. In this work, we have attempted to address the first and second points, and we did so using a logical formalism for which a solution to the third one already exists. Should this approach be successful, it can be used to address other text-related challenges, such as all variants of text classification. Additionally, our generalized 2-gram encoding can be further generalized to partially benefit from well-known word-to-vec approaches without compromising its symbolic essence.