=Paper=
{{Paper
|id=Vol-3625/paper9
|storemode=property
|title=
Enriching Natural Language Processing Systems with Semantic-Pragmatic Information through Communicative Intentions
|pdfUrl=https://ceur-ws.org/Vol-3625/paper9.pdf
|volume=Vol-3625
|authors=María Miró Maestre
|dblpUrl=https://dblp.org/rec/conf/sepln/Maestre23
}}
==
Enriching Natural Language Processing Systems with Semantic-Pragmatic Information through Communicative Intentions
==
Enriching Natural Language Processing Systems with
Semantic-Pragmatic Information through
Communicative Intentions
María Miró Maestre1
1
Department of Software and Computing Systems, University of Alicante, 03690 Alicante, Spain
Abstract
Communicative intentions are one of the linguistic elements that usually determine the content of any
message we want to express. However, regardless of the high precision Natural Language Processing
(NLP) systems are acquiring these days, thanks to the revolution derived from the explosion of the
latest large language models (LLMs), these architectures still show a lack of appropriate training in
order to detect the intention of a message correctly. For the purpose of improving these systems, the
present research project aims to create a communicative intention annotation scheme based on the
taxonomy presented in the Speech Act Theory. Such resource could help NLP architectures to consider
communicative intentions as a starting point to classify any message depending first on the intention
it reflects. With this aim, the scheme will be created with the help of an already annotated corpus in
Spanish. Subsequently, we will test the scheme within a classification system so that we can verify the
accuracy of the intention patterns detected. In this way, it will be possible to check if NLP systems are
capable of identifying Spanish communicative intentions or even generate messages that reflect a given
intention, therefore enriching the linguistic information these architectures can infer automatically.
Keywords
communicative intentions, speech acts, natural language processing, annotation scheme, classification
system
1. Introduction and Motivation
Natural Language Processing (NLP) systems, and more concretely Natural Language Generation
systems (NLG), are nowadays at their peak due to the evolution that the large language models
have shown these last years. Therefore, we currently have at our disposal more and more
precise classification and generation systems when it comes to identifying the linguistic patterns
demanded in the text to be processed or generated. This is the case of the task of abstractive
summary generation in the NLG research branch, or the detection of offensive or humorous
messages if we focus on current NLP tasks. Both represent a few examples of how automatic
learning systems are starting to correctly detect and generate more concrete and ambiguous
linguistic features each time.
Nevertheless, despite the excellent results that these systems show when detecting linguistic
patterns belonging to levels of analysis such as morphology, syntax, and semantics, there is still
Doctoral Symposium on Natural Language Processing from the Proyecto ILENIA, 28 September 2023, Jaén, Spain.
$ maria.miro@ua.es (M. M. Maestre)
0000-0001-7996-4440 (M. M. Maestre)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR
Workshop
Proceedings
http://ceur-ws.org
ISSN 1613-0073
CEUR Workshop Proceedings (CEUR-WS.org)
CEUR
ceur-ws.org
Workshop ISSN 1613-0073
Proceedings
a long way to refine such architectures so that they can finally capture the natural aspect of a
message. One of the elements that mainly helps to define the structure and, generally, the sense
of a message is its communicative intention. In fact, two of the main scopes of research that are
receiving a lot of attention nowadays in the context of Artificial Intelligence are the tasks of
intention detection in the scope of NLP, and conscious text generation with the inclusion of
external knowledge in the NLG branch of research.
These tasks are usually oriented normally to creating conversational agents where a human
tries to keep a conversation with a robot. Consequently, researchers need to study the linguistic
parameters that denote a given communicative intention and the best approach to integrate
them into automatically generated responses so that they are created according to the contextual
and linguistic requirements of the conversation. However, regardless of the diverse works we
also find on intention detection in written textual genres, the scientific output is still reserved
for a small percentage of languages, with English at the head.
Consequently, our goal with this research proposal is to create a communicative intention
annotation scheme in Spanish to incorporate more semantic-pragmatic information in a NLG
system thanks to the linguistic indicators of each intention that we have gathered in our
guidelines for the annotation task. Moreover, this initial corpus will serve us as a base for testing
several semi-supervised classification techniques to augment the final weight of the corpus. In
this way, we will provide the NLP research community with a valuable linguistic resource for
identifying more linguistic patterns automatically in a language other than English.
The remainder of this article is organised as follows: Section 2 focuses on the different
approaches made in NLP in order to tackle the automatic classification of semantic-pragmatic
elements of language, then Section 3 shows the main hypotheses and objectives planned for this
research. Subsequently, we explain the methodology proposed for fulfilling each project task in
Section 4, and Section 5 sets out the different research issues we may need to face throughout
the experimentation. Finally, the bibliography used for this study is included at the end of the
paper.
2. Related Work
Regardless of the numerous innovative techniques available nowadays to do research on different
linguistic tasks within the scope of NLP, there are well-known traditional linguistic theories
that still shed light on how to approach some of the most difficult NLP tasks to resolve. This is
the case of the Speech Act Theory (SAT) founded by Austin [1] and extended by Searle [2, 3],
who defended that language can serve as a means to perform actions depending on the uttered
message. To verify so, it was Austin [1] who first investigated verbs to identify how they could
denote actions on their own (called performative verbs or either describe reality (descriptive
verbs). Subsequent to this first pragmatic division, Austin focused his research on one of the
aspects that comprises the act of uttering a message: the illocutionary act (i.e., the intention
of a utterance). With this element, he created a 5-fold typology of intentions, although it was
Searle’s Searle [2] modified version the one generally accepted by the research community,
given its more thorough and well delimited approach (see Table 1).
Later on, Searle [3] also made a distinction between the types of intentions aforementioned,
Intention Description Examples
Assertives we commit to the veracity of the message declare, manifest, conclude, explain,
expressed etc.
Directives the speaker uses this type to make the lis- ask for, dare, invite, command, chal-
tener do something lenge, etc.
Commissives they commit the speaker to do an action swear, promise, commit, intend, etc.
in the future
Expressives they express the psychological state of the thank, forgive, excuse, congratulate,
speaker with respect to a topic specified in etc.
the message
Declaratives when uttering them we get the content of declare, designate, resign, marry,
the message to coincide with reality, that etc.
is, the action is performed, or in Searle’s
own words: ‘saying makes it so’
Table 1
Taxonomy of intentions according to the SAT.
known as direct speech acts because the relation between the meaning and the intention of
the message is straightforward, and other type of illocutionary acts called indirect speech acts.
In the latter, the relation between the message and the intention requires other inferential
processes (i.e., cultural references, social context, etc.) to successfully interpret the intention of
the message, as in those texts containing irony, sarcasm or rhetorical questions, among others.
Despite the difficulties that the inclusion of pragmatic elements inside NLP and NLG systems
entailed, several studies focused on this linguistic level to make progress in these domains of
computational linguistics [4, 5, 6]. Therefore, we now find manifold research works enrich-
ing systems with pragmatic knowledge, and more concretely with communicative intentions
classifications, to improve their efficiency.
A very prolific area of research is that devoted to the study of computer-mediated communica-
tion (CMC) [7], which includes the study of the language used in different social media platforms
in its research scope. Specifically, within the scope of social media users’ communicative inten-
tions, several works have been published these last years where the Speech Act Theory serves as
a base to identify users’ intentions in their tweets, as in Saha et al. [8] and Zhang et al. [9]. Even
some researchers have proposed to mix several NLP tasks in CMC corpora as in [10, 11] or [12],
where authors test the implication of both sentiment analysis and emotion recognition tasks
when trying to detect the intention of a tweet. If we broaden our scope of research to further
NLP tasks, the SAT taxonomy meant a key point for studying the best approach to develop
systems that could also automatically identify text intentions in task-oriented conversational
systems [13].
Nevertheless, it is clear that most research on this subject handles English documents so
far. However, we can still find a few examples of works where Spanish speech acts are used to
either improve task-oriented dialogues as in Martínez-Hinarejos et al. [14] and Caballero et al.
[15], analyse pathological language extracted from clinical oral data in Spanish, as shown in
Gallardo Paúls and Fernández Urquiza [16], or even study CMC travel blogs [17] and the types
of speech acts found in the social network Facebook [18].
3. Main hypotheses and objectives
Because of the numerous NLP scenarios in which we can make use of the SAT nowadays, our
research proposal is based on the creation of a communicative intention annotation scheme in
Spanish to use it as a resource to solve some of the most currently studied tasks in our field. In
this way, we will try to improve the ability of current language models when identifying more
semantic-pragmatic aspects of language to successfully reproduce them. More concretely, the
main research questions that support this project are:
• RQ 1) Which linguistic features can help us detect the intention of a given message in the
Spanish language?
• RQ 2) Is it possible to identify those linguistic features in a CMC corpus, given its colloquial
style and lexical variety?
• RQ 3) Can a language model learn to differentiate between different types of intentions
with a training dataset, regardless of the ambiguities inherent to language?
• RQ 4) Can both sentiment analysis and emotion detection help to identify the commu-
nicative intention of a message with better precision?
• RQ 5) Does the automatic annotation of communicative intentions benefit NLP applica-
tions such as an automatic text generator?
4. Methodology and proposed experiments
To integrate communicative intentions in some NLP current tasks to enrich systems with further
semantic-pragmatic information, we will focus on Searle’s classification of direct speech acts as
explained in Section 2 and other linguistic features that also reflect the intention of the message
straightforwardly. To create the corresponding annotation scheme, and to test its validity in an
NLP application, several linguistic resources and computing tools were used to complete each
of the experimentations that shape our research project:
1. Corpus creation with the Shared Task on Hope Speech Detection for Equality,
Diversity and Inclusion [19] and UMUCorpusClassifier [20]
The lack of sufficient datasets in languages other than English forces researchers to either
modify those existing corpora to accomplish the objective of their research work or
create their own resources so that they can analyse language concentrating on particular
linguistic phenomena. For our research, we first examined the corpus compiled for the
Shared Task on Hope Speech Detection for Equality, Diversity, and Inclusion [19], but
as it was focused on the task of hope speech detection, we didn’t find enough results
to create the first version of our corpus. Consequently, we completed the selection of
tweets we found with intention indicators in this corpus with a compilation of tweets
extracted through the Twitter API thanks to the extracting tool UMUCorpusClassifier
[20]. By combining both resources, we were able to compile a corpus of Spanish tweets
about the LGTBIQ+ community. The final amount of tweets is 454, which gives us a
corpus of 996 instances to analyse, as we decided to tag the intention of each of the
utterances comprised in the same tweet, not the tweet as a whole. We made this decision
after noticing that different utterances of the same tweet can show linguistic patterns
linked to different intentions, so we preferred to separate the tweets in their utterances
to not confuse the recognition of a given intention.
2. Communicative intentions annotation scheme
Parallel to the corpus creation, we compiled linguistic patterns linked to a particular
intention according to the SAT classification we explained previously in Section 2. To
this end, several resources were also of help to gather the best representation -within
our means- of the Spanish linguistic structures that reflect an intention when used
appropriately. On the one hand, we translated the verb lexicon comprised in [21] to get a
Spanish equivalent of the verbs that, according to Austin and Searle, reflect a particular
intention. This book provides in-detailed semantic descriptions of around 200 of the
most frequent speech act verbs used in English. In this way, by studying the semantic
particularities of each English verb, we could look for the equivalent verbs in Spanish
that kept each semantic nuance so that the speech act verb classification would not
differ from one language to another. On the other hand, we revisited two of the most
extendedly used grammar references of the Spanish language to study their approach to
detect speech acts through grammatical features in Spanish [22, 23, 24, 25].
3. Corpus annotation with INCEpTION [26]
The linguistic tool used to annotate our corpus of tweets was the platform created for
semantic annotation with intelligent assistance INCEpTION [26]. Thanks to its intuitive
interface, this platform allows users to individually manage, curate and modify annotation
projects in the same environment by assigning each project to the corresponding
annotators. In our case, two experts in the field of Spanish linguistics served as the
annotators for our corpus together with the author. As previously mentioned, the tweets
were uploaded in the platform and annotated utterance by utterance, as shown in Figure
1, so we could identify as many intention indicators as possible even within the same
tweet. Once the annotation task was completed, several metrics were calculated in
order to check the inter-annotator agreement achieved between the three annotators.
INCEpTION also includes a section that calculates both Fleiss’ Kappa and Krippendorf’s
Kappa automatically in the annotation project, so we checked for the results and
confirmed that our annotation scheme could be validated, thanks to achieving a 0.77 of
agreement with both measures.
4. Proposed experiment A): enriching our corpus while training the classification
system through active learning
Once we validated our annotation scheme, another task we currently study is exploiting
the "active learning" functionality that the annotation tool INCEpTION includes within
its platform. This machine learning method consists in training the classification system
Figure 1: Example of annotating a tweet with intentions in INCEpTION
with the instances we have manually annotated. Then, once we add new instances to be
automatically annotated by it, we check which examples the recommender can annotate
correctly or not. Those more difficult examples would be the ones to annotate manually
so that the recommender system keeps learning on those more ambiguous examples until
it finally classifies well those difficult tags without manual help. Consequently, with this
technique we will both boost the classifier performance and augment the final weight of
our corpus.
5. Proposed experiment B): combining SAT with sentiments and emotions
The second experiment to be fulfilled during our research is combining the identification
of communicative intentions in tweets with the tasks of sentiment analysis and emotion
detection. Following [12], they demonstrated that previously classifying the sentiment
and emotion of a given tweet in English could help to better identify the intention of the
tweet. Therefore, as we previously mentioned in the research questions of our doctoral
thesis, one of the main tasks we want to accomplish is to check to which point this
combined classification can also help to improve the identification of Spanish intentions
in tweets. In this way, we would enrich NLP systems with further semantic-pragmatic
information and establish more linguistic patterns that help detect the natural essence of
a given message. It is also worth mentioning that this research work is being done in
collaboration with the Laboratoire Interdisciplinaire des Sciences du Numérique from the
Université Paris-Saclay, as one of the research outputs derived from our international
Ph.D. stay at the Sémantique et Extraction d’Information Research Group.
6. Proposed experiment C): incorporating our corpus as a training dataset in a NLG
system
Finally, the last experiment we want to test with our enriched corpus is including it in a
NLG architecture as its training dataset so that the system can learn from our already
validated examples of messages with a given intention. In this way, we would check
if such systems would generate automatic messages with a clear intention following
the taxonomy we established in our annotation scheme. To accomplish so, we will
follow the methodology established in the task of commonsense text generation, where
external knowledge is included as an input in the generation system so that it can
generate messages with further world knowledge and linguistic context. In our case, our
intention-annotated corpus would be the contextual seed that teaches the system how to
recognise an intention, and then generate a new message keeping that same intention,
therefore improving the performance of such system by adding more semantic-pragmatic
information in its architecture.
5. Research issues to discuss
Given the suggestions and comments received in the previous editions of the Doctoral Sympo-
sium, we solved some of the research issues we came up with through the development of our
study. However, as an inherent part of this project, new research questions arise that need to be
discussed to ensure a good research quality that provides new knowledge within our research
area in Spanish:
• Should we add indirect speech act examples to our corpus to check if the classification
system is capable of correctly detecting the semantic differences between direct and
indirect speech acts?
• Do we have enough intention indicators so that the automatic classification system is
capable of differentiating between the different types of intentions we included in our
scheme?
• Would it be possible to apply our annotation scheme to other textual typologies outside
CMC?
• What if we try to test LLMs ability to generate sentences with a given intention, to check
whether there are inconsistencies regarding their intention classification, or if they could
be of help to find even more intent linguistic patterns in Spanish?
Acknowledgments
This research work is part of the R&D project "CORTEX: Conscious Natural Text Generation"
(PID2021-123956OB-I00), funded by MCIN/AEI/10.13039/501100011033/ and by “ERDF A way
of making Europe”.
References
[1] J. L. Austin, How to Do Things with Words, Oxford at the Clarendon Press, 1962.
[2] J. R. Searle, Speech Acts: An Essay in the Philosophy of Language, volume 626, Cambridge
University Press, 1969.
[3] J. R. Searle, Expression and meaning: Studies in the theory of speech acts, Cambridge
University Press, 1985.
[4] W. C. Mann, Toward a Speech Act Theory for Natural Language Processing, Technical
Report, University of Southern California Marina del Rey Information Science Inst, 1980.
[5] S. C. Herring, D. Stein, T. Virtanen, Introduction to the pragmatics of computer-mediated
communication, in: Pragmatics of Computer-Mediated Communication, De Gruyter
Mouton, 2013, pp. 3–32. doi:10.1515/9783110214468.
[6] C. Bonial, L. Donatelli, M. Abrams, S. Lukin, S. Tratz, M. Marge, R. Artstein, D. Traum,
C. Voss, Dialogue-amr: abstract meaning representation for dialogue, in: Proceedings of
the 12th Language Resources and Evaluation Conference, 2020, pp. 684–695.
[7] A. Georgakopoulou, Computer-mediated communication, in: J. Verschueren, J.-O. Öst-
man, J. Blommaert, C. Bulcaen (Eds.), Pragmatics in Practice, volume 9, John Benjamins
Publishing Co, 2011, pp. 93–110.
[8] T. Saha, S. Saha, P. Bhattacharyya, Tweet act classification: A deep learning based classifier
for recognizing speech acts in twitter, in: 2019 International Joint Conference on Neural
Networks (IJCNN), IEEE, 2019, pp. 1–8. doi:10.1109/IJCNN.2019.8851805.
[9] R. Zhang, D. Gao, W. Li, What are tweeters doing: Recognizing speech acts in twitter,
in: Workshops at the Twenty-Fifth AAAI Conference on Artificial Intelligence, 2011, pp.
86–91. URL: https://www.aaai.org/ocs/index.php/WS/AAAIW11/paper/view/3803.
[10] Y. Tian, T. Galery, G. Dulcinati, E. Molimpakis, C. Sun, Facebook sentiment: Reactions
and emojis, in: Proceedings of the Fifth International Workshop on Natural Language
Processing for Social Media, ACL, 2017, pp. 11–16. doi:10.18653/v1/W17-1102.
[11] T. Mahler, W. Cheung, M. Elsner, D. King, M.-C. de Marneffe, C. Shain, S. Stevens-Guille,
M. White, Breaking NLP: Using morphosyntax, semantics, pragmatics and world knowl-
edge to fool sentiment analysis systems, in: Proceedings of the First Workshop on
Building Linguistically Generalizable NLP Systems, Association for Computational Lin-
guistics, Copenhagen, Denmark, 2017, pp. 33–39. URL: https://aclanthology.org/W17-5405.
doi:10.18653/v1/W17-5405.
[12] T. Saha, A. Upadhyaya, S. Saha, P. Bhattacharyya, Towards sentiment and emotion aided
multi-modal speech act classification in Twitter, in: Proceedings of the 2021 Conference
of the North American Chapter of the Association for Computational Linguistics: Human
Language Technologies, Association for Computational Linguistics, Online, 2021, pp.
5727–5737. URL: https://aclanthology.org/2021.naacl-main.456. doi:10.18653/v1/2021.
naacl-main.456.
[13] I. Casanueva, T. Temčinas, D. Gerz, M. Henderson, I. Vulić, Efficient intent detection
with dual sentence encoders, in: Proceedings of the 2nd Workshop on Natural Language
Processing for Conversational AI, Association for Computational Linguistics, Online, 2020,
pp. 38–45. URL: https://aclanthology.org/2020.nlp4convai-1.5. doi:10.18653/v1/2020.
nlp4convai-1.5.
[14] C. D. Martínez-Hinarejos, J. M. Benedí, V. Tamarit, Unsegmented dialogue act annotation
and decoding with n-gram transducers, IEEE/ACM Transactions on Audio, Speech, and
Language Processing 23 (2014) 198–211. doi:10.1109/TASLP.2014.2377595.
[15] M. Caballero, L. Díaz, M. Taulé, Guía de anotación del corpus FerroviELE, 2014.
[16] B. Gallardo Paúls, M. Fernández Urquiza, Etiquetado pragmático de datos clínicos, e-AESLA
(2015) 1–12.
[17] D. Pascual, Speech acts in travel blogs: Users’corpus-driven pragmatic intentions and
discursive realisations, ELIA: Estudios de Lingüística Inglesa Aplicada (2021) 85–123.
[18] S. Ridao Rodrigo, Actos de habla en redes sociales: perfiles privados versus perfiles públicos,
Literatura y lingüística (2021) 429–446.
[19] B. R. Chakravarthi, V. Muralidaran, R. Priyadharshini, S. C. Navaneethakrishnan, J. P.
McCrae, M. Á. García-Cumbreras, S. M. Jiménez-Zafra, R. Valencia-García, Shared task
on hope speech detection for equality, diversity, and inclusion - ACL, 2022. URL: https:
//competitions.codalab.org/competitions/36393#learn_the_details-organizers.
[20] J. A. García-Díaz, Á. Almela, G. Alcaraz-Mármol, R. Valencia-García, UMUCorpusClassifier:
Compilation and evaluation of linguistic corpus for natural language processing tasks,
Procesamiento del Lenguaje Natural 65 (2020) 139–142.
[21] A. Wierzbicka, English Speech Act Verbs: A Semantic Dictionary, Academic Press, 1987.
[22] R. A. Española, et al., Nueva gramática de la lengua española, volume 2, Espasa Madrid,
2009.
[23] V. Demonte, Gramática descriptiva de la lengua española: Sintaxis básica de las clases de
palabras, volume 1, Espasa, 1999.
[24] V. Demonte, Gramática descriptiva de la lengua española: Las construcciones sintácticas
fundamentales, volume 2, Espasa, 1999.
[25] I. Bosque, Gramática descriptiva de la lengua española: Entre la oración y el discurso.
Morfología, volume 3, Espasa, 1999.
[26] J.-C. Klie, M. Bugert, B. Boullosa, R. E. de Castilho, I. Gurevych, The INCEpTION
platform: Machine-assisted and knowledge-oriented interactive annotation, in: Pro-
ceedings of the 27th International Conference on Computational Linguistics: System
Demonstrations, Association for Computational Linguistics, 2018, pp. 5–9. URL: http:
//tubiblio.ulb.tu-darmstadt.de/106270/.