Exploring Discourse Corpora Using Process Mining Techniques

Samantha Kent¹, Hans-Christian Schmitz¹

¹Fraunhofer Institute for Communication, Information Processing and Ergonomics FKIE, Fraunhoferstr. 20, 53343 Wachtberg, Germany

Abstract
In our paper we will introduce and discuss Process Mining (PM) as a means for conducting conversational-analytic, linguistic and rhetorical investigations into discourse processes. PM is a technique for automatically deriving and further analyzing process models from event data. It is mainly applied for the analysis of business processes. We will argue that conversations and other kinds of discourse can be treated as processes too, and that PM enables us to systematically investigate large quantities of discourse transcripts. Many different linguistic research questions could be examined in this way, such as the conditions of turn-taking in dialogue or the pragmasemantics of discourse particles like English "well" or German "halt", among others.¹

Keywords
Discourse Processing, Process Mining, Discourse Process Mining, Corpus Analysis, Information Extraction, Unsupervised Learning

1. Introduction

Process Mining (PM) is a young interdisciplinary research field that sits between machine learning and data mining on the one side and process modelling on the other [2]. The main difference to classic data-oriented types of analysis is that process mining focuses on the process as a whole, rather than just a specific aspect. Compared to process modelling, process mining relies on real-life raw data to model what is actually happening. Knowledge is extracted from raw data stored in event logs to discover, monitor and improve real processes. The data recorded by information systems can thus provide better insight into existing processes, and the quality of process models can be improved. There are three main types of PM: process discovery, conformance checking, and process enhancement [2].
Process discovery refers to the initial step in which a model is extracted from an event log. In conformance checking, data extracted from an event log are combined with an existing, predefined process model and the discrepancies are recorded as a diagnostic tool. It shows the difference between a model derived without data, i.e. what is supposed to happen, and the event log data, i.e. what is actually happening. Model enhancement combines the previous types and extends or improves existing process models. It also allows the analysis of further aspects, such as time behaviour and consumed resources. PM, which is sometimes also referred to as Business Intelligence, is applied in a wide variety of fields. Examples of typical use cases are optimizing the processes in a hospital, order handling, or examining mortgage application processing in a bank. There are a number of prerequisites for analysing data using PM tools. In general, the data is stored in an event log and contains activities that are further labelled with information such as an activity label, an event time stamp, the amount of resources needed, and other additional information. In order to conduct an analysis, at least the process ID, activity label and time stamp are needed.

¹ The work in this paper is a continuation of our extended abstract [1]. It has been supported through funding from Philip Morris Impact as part of the Fraud Information Fusion Intelligence Project.
Humanities-Centred AI (CHAI), Workshop at the 44th German Conference on Artificial Intelligence, September 28, 2021, Berlin, Germany
samantha.kent@fkie.fraunhofer.de (S. Kent); hans-christian.schmitz@fkie.fraunhofer.de (H. Schmitz)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.
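To make these prerequisites concrete, the following sketch builds a toy event log in plain Python and groups it into per-case traces. The conversation IDs, activity labels and timestamps are invented for illustration only; real PM tools expect the same three fields, typically as columns in a CSV file or an XES event log.

```python
from collections import defaultdict
from datetime import datetime

# A minimal event log: each event carries the three required fields.
# All IDs, labels and timestamps below are invented for illustration.
event_log = [
    {"case_id": "conv-01", "activity": "greeting", "timestamp": "2021-09-28T10:00:00"},
    {"case_id": "conv-01", "activity": "question", "timestamp": "2021-09-28T10:00:05"},
    {"case_id": "conv-02", "activity": "greeting", "timestamp": "2021-09-28T11:00:00"},
    {"case_id": "conv-01", "activity": "answer",   "timestamp": "2021-09-28T10:00:12"},
]

def to_traces(log):
    """Group events by case ID and order each case by timestamp."""
    cases = defaultdict(list)
    for event in log:
        cases[event["case_id"]].append(event)
    return {
        case_id: [e["activity"]
                  for e in sorted(events,
                                  key=lambda e: datetime.fromisoformat(e["timestamp"]))]
        for case_id, events in cases.items()
    }

# conv-01 maps to ['greeting', 'question', 'answer'], conv-02 to ['greeting']
print(to_traces(event_log))
```

Note that the timestamp only serves to order events within a case here; as discussed above, some tools can fall back on input order when no timestamp is available.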
In this paper, we use Process Mining techniques to explore the structure of discourse corpora. Section two introduces the concept of Discourse Process Mining. Section three reviews previous work. Section four provides an overview of potential corpora that can be used for Discourse Process Mining, and the following section provides an exploratory example analysis of three of the corpora. Finally, we illustrate the potential of Discourse Process Mining as a research method.

2. Discourse Process Mining

In this paper, we introduce Discourse Process Mining (DPM) and the idea that discourse structures can be modelled using PM techniques and tools. We argue that discourse, or more generally spoken and written conversation, has a specific structure, a notion already explored in linguistics, e.g. in Discourse Analysis (DA) and Speech Act Theory [3, 4]. Generally, dialogue proceeds in a linear manner, where the interlocutors exchange units of conversation that are functionally related to one another [5]. Successful discourse contains some type of organizational structure; this can be seen in turn-taking, opening and closing sequences of conversations, general conversational routines and repairs, and more specifically in adjacency pairs. In DPM, we apply the techniques and tools used in traditional Process Mining to automatically extract structure from linguistic discourse corpora. We assume that DPM has great potential for conducting investigations in both dialogue analysis and linguistics more generally. Firstly, it would be extremely useful in optimizing processes where dialogue plays a central role, for example in customer service calls. It is imaginable that customer service centers keep transcripts of some of the conversations that also include time stamps and some type of customer satisfaction rating.
It would therefore be possible to annotate the data and provide an overview of the general structure of satisfactory and less satisfactory customer service calls. This is true for any conversational processes that are formulaic, short and fulfil a specific purpose. It would also be possible to examine many different linguistic research questions, ranging from detailed questions pertaining to specific words in a discourse to more general questions about conversational structure. One example is the use of specific discourse markers in a dialogue, such as the English "well" or "like" and the German "halt", and the structural elements that precede or follow these markers. Another research question could relate to the use of speech acts by native speakers and foreign language learners. This question has already been extensively researched, for the English language in particular, but a structural approach using Discourse Process Mining could shed new light on an old research question. Furthermore, rhetorical analyses can reveal under which circumstances an argument is compelling and, thus, successful without necessarily being sound and valid. To us, the notion that dialogues can be interpreted as processes is quite compelling, and it might also be worth investigating whether monological texts can effectively be handled as processes too. Especially in the domain of academic writing, there are specific structural guidelines on how to write an introduction, and with DPM it would be possible to automatically analyse this type of structural information in academic papers. If corpora contain similar types of annotations, such research questions could not only be answered for a specific corpus or domain, but could also provide a basis for a more general examination. Further details of specific corpora that may be suitable are given below. There are a number of key requirements for conducting Discourse Process Mining.
Let us assume that we are provided with a corpus of conversations. Each conversation is treated as a separate process, each conversational move corresponds to a specific event, and each event has an annotated tag. The order of the events is equivalent to the order of the conversational moves in the corpus. In order to extract this information, a corpus needs to be annotated using specific structural markers, such as speech or dialogue acts, and the corpus needs to be transformed in such a way that the annotations can be extracted and processed automatically. There are three key requirements for using PM tools to automatically process the data: each speech/dialogue act needs a case ID, an activity and a timestamp. The case ID is the key of a process; it shows which process an utterance belongs to. The activity is the type of event, in other words the tag that belongs to the utterance. Finally, the timestamp is needed to ensure that the sequence of utterances in the conversation is kept intact. Some tools automatically adopt the input order and do not necessarily require an explicit time stamp; however, the inclusion of time stamps would enable a different type of analysis. Data preparation is key to successfully extracting meaningful information from a corpus.

3. Related Work

There has been some previous research using PM to explore dialogue structures. Most recently, Vakulenko et al. applied process mining techniques to discover patterns in conversational transcripts of information-seeking dialogues [6]. These patterns are then used to develop their own model of conversation. The authors state that their model represents conversations in this domain better than previous models, because it better reflects the flow observed in real information-seeking conversations. While Vakulenko et al. focus on the analysis of a specific conversational domain, a handful of other studies focus on a specific corpus. Wang et al.
described the application of PM techniques to analyze a corpus of online discussion threads from the Apple support forum [7]. Similarly, Compagno et al. developed a fine-grained, corpus-independent classification of speech acts. They apply their annotations to a corpus of digital conversations extracted from the website Reddit and use PM tools to explore the written conversations [8]. Finally, Richetti et al. combine speech act theory and PM to automatically extract structure from customer service conversations [9].

4. Corpora

As is often the case with computational language analysis, one of the more challenging aspects of DPM is the availability of suitable data. As discussed above, a corpus needs to be annotated using an appropriate annotation scheme so that it can be transferred into an event log and automatically processed using DPM. We have found a number of different annotated corpora that would potentially be suitable for exploration using DPM, some of which are listed below. We distinguish between two categories of corpora: classic linguistic corpora that have been used in traditional linguistic corpus studies, and newer corpora that have been collected for the purpose of examining language use for speech recognition systems. The Switchboard Dialogue Act corpus (SwDA) contains a collection of 1,155 five-minute telephone conversations between two participants [10]. It is annotated using the SWBD-DAMSL tag set. Once the initial data formatting is complete, DPM tools make it possible to automatically explore the conversational flow and the connections between the specific tags. The ICSI Meeting Recorder Dialogue Act (MRDA) corpus contains about 180,000 hand-annotated dialogue act tags and accompanying adjacency pair annotations [11]. Interestingly, this corpus contains transcripts of meeting recordings, and provides a sample of unconstrained speech in both a formal setting and more casual conversation between multiple speakers.
It has been annotated using the same tag set (SWBD-DAMSL) as the SwDA corpus, and is therefore often used as a comparison corpus. In contrast to the two corpora above, the Spaadia corpus is a task-specific corpus of human-human train booking conversations. There are two different versions, and specifically the latest version, annotated with the DART taxonomy, seems suitable for DPM [12]. A more recent discourse corpus that has been developed for task-oriented dialogue modelling is the MultiWOZ corpus [13, 14]. It contains written human-human conversations that have been annotated with dialogue acts. It is the largest of the corpora, containing just over 10,000 dialogues, and spans different domains, including restaurant, hotel, police, and hospital, among others. Similarly, the Microsoft Dialogue Challenge consists of three domain-specific corpora, taxi, restaurant and movie booking, collected for spoken dialogue modelling purposes [15]. The major difference is that the conversations in the corpus do not take place between two humans, but rather simulate a conversation between a human and a conversational agent. So far, all of the corpora introduced above have been monolingual English corpora. A corpus that is available in multiple languages is the HCRC Map Task corpus [16]. This would allow for a comparative analysis of discourse structure in different languages. The current discussion has centered around the processing of a selection of readily available discourse corpora. It would also be possible to create a corpus and annotate it so that its content can be analyzed using DPM. Given that hand annotation is a very time-consuming task, there is currently ongoing research into providing tools to automatically annotate corpora. On the one hand, this would drastically reduce the time it would take to annotate a corpus and provide many new opportunities to explore previously unavailable discourse material.
On the other hand, automatic annotation bears the risk that a further analysis leads to insights into the annotation algorithm rather than into the original research question. Whilst all of the corpora discussed above contain some type of dialogue act annotation, they all differ in complexity, and are therefore more or less challenging to process using DPM. Based on our experience experimenting with the different corpora, we explore some of the potential research questions initially proposed in the following section.

Figure 1: An extract from the Switchboard Dialogue Act corpus. A and B are the speakers, utt stands for utterance, and the tags stand for hedge, interruption, open question and statement opinion respectively.

5. Answering Potential Research Questions

The main goal of this paper is to introduce the concept of DPM and explore some of its potential uses. The research questions in this section serve as initial example analyses that illustrate the concept in general, rather than providing a detailed analysis. The analyses range from a structural dialogue-level analysis to a more detailed analysis of a specific linguistic phenomenon. We assume that there is a difference between task-specific corpora and unconstrained conversational corpora, and illustrate this by exploring different types of research questions based on the type of corpus. To start with, we applied DPM to the Switchboard corpus, which consists of unconstrained telephone conversations, albeit in a specific domain. One of the challenges in this corpus is the large amount of variation. In total, there are 220 different tags in 42 different classes, and the conversations are quite long. This means there are many different possible combinations of tags within one conversation. Furthermore, the corpus was transcribed and annotated with linguistic analysis in mind.
It therefore contains a detailed analysis of hesitations, overlaps, and other features that are not necessarily relevant for a more pragmatic structural analysis. An example of the transcript can be found in Figure 1. When processing these dialogues using DPM, it quickly became apparent that editing and carefully selecting the data was crucial to gaining meaningful insights for a corpus with so much variance. Figure 2 shows the process map of a partial structural corpus analysis; the outcome can be described as a spaghetti model [2]. While it is easy to input the data and automatically create these models, they are difficult to interpret and not suitable for this type of research. We therefore conclude that the corpus would be more suitable as a basis for exploring more specific linguistic research questions, such as the pragmatic use of the discourse marker "well". Whilst such an analysis is beyond the scope of this paper, we plan to address these questions in future work.

Figure 2: The process map, a so-called spaghetti model, from the SwDA corpus. The data was analyzed using the Process Mining tool Disco by Fluxicon.

In contrast, the results from the automatic structural analysis of a task-specific corpus, such as the train booking dialogues from the Spaadia corpus, show great potential. Figure 3 shows the process map of an analysis of 35 human-human train booking conversations. Because there is much less variance in a task-specific dialogue, the process map is much more condensed and shows some of the general structure of the conversations. Starting from the top of the process, almost all of the conversations start with a greeting and the speakers identify themselves. In the middle of the conversation there is a split into different paths. Interestingly, the model also shows that a number of conversations contain multiple bookings or requests: the dark arrow shows that the process starts again with an information request rather than ending directly.
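At their core, process maps like the ones discussed above are built from directly-follows counts: how often one activity immediately follows another, aggregated across all cases. The sketch below illustrates this counting step using invented dialogue-act tags rather than actual Spaadia or SwDA annotations; dedicated tools such as Disco add frequency-based filtering and visualization on top. A corpus with many distinct but rare transitions yields a spaghetti model, while a task-specific corpus concentrates its frequency on a few edges.

```python
from collections import Counter

def directly_follows(traces):
    """Count how often activity a is immediately followed by activity b
    across all conversations - the core statistic behind a process map."""
    dfg = Counter()
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            dfg[(a, b)] += 1
    return dfg

# Invented dialogue-act sequences standing in for annotated task-specific
# conversations (not taken from any of the corpora discussed here).
task_corpus = [
    ["greet", "request_info", "inform", "book", "thank", "bye"],
    ["greet", "request_info", "inform", "book", "bye"],
    ["greet", "request_info", "inform", "request_info", "inform", "book", "thank", "bye"],
]

dfg = directly_follows(task_corpus)
# The fewer distinct edges relative to their total frequency, the more
# condensed the resulting process map.
print(sorted(dfg.items(), key=lambda kv: -kv[1]))
```

In this toy corpus the edge (request_info, inform) dominates, which is the kind of concentration that makes a task-specific process map readable; the SwDA corpus, with its 220 tags, spreads the counts over far more edges.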
To further illustrate the concept of DPM, we also analyzed the Microsoft Dialogue Challenge corpus, specifically because it is also a constrained, task-oriented corpus, but unlike the Spaadia corpus, it involves a human and a conversational agent. In total, the corpus contains the transcriptions of about 3,000 conversations in which a system user interacts with a movie booking agent. Figure 4 shows that DPM enables the quick analysis of the overall structure of the dialogues in the corpus. The process map shows some basic information, for example that the two most used dialogue acts are request and inform, and that they often follow one another. More interestingly, it also shows that the conversations often do not contain a traditional ending: only 1,169 occurrences of thanking, and consequently only 50 instances of welcome, were found in the data. It is important to note that generalizations about conversational structure only hold for the specific corpus that is being researched. Especially in the case of the examples above, the human-agent dialogues seem to be more structured than the human-human ones, as conversational turns are dictated by the system used by the agent. Taxi bookings, or bookings more generally, seem to show more variance, as in the Spaadia corpus. However, Discourse Process Mining can be used as a tool to further explore conversational structure in a way that would have been difficult to achieve so rapidly using more traditional methods. It provides an automated process to gain quick insights into different types of conversational corpora.

Figure 3: The process map based on the Spaadia corpus. The data was analyzed using the Process Mining tool Disco by Fluxicon.

6. Conclusion

The aim of this paper was to introduce the idea of applying automatic Process Mining (PM) techniques to the analysis of discourse phenomena. A discourse is a sequence of speech acts or conversational moves.
From a corpus of conversations, in other words different discourse sequences, a comprehensive discourse model can be derived. Intuitively, there is a difference between conducting Discourse Process Mining (DPM) on corpora containing unconstrained dialogue versus highly constrained, task-oriented dialogue. Our initial research supports this hypothesis, as was demonstrated using the Switchboard Dialogue Act corpus, the Spaadia corpus and the Microsoft Dialogue Challenge corpus.

Figure 4: The analysis of the movie-booking task in the Microsoft Dialogue Challenge corpus. The process map was created using the Process Mining tool Disco by Fluxicon.

There are various tools for conducting Process Mining and, therefore, Discourse Process Mining. For our initial analyses, we used the tool Disco by Fluxicon (https://www.fluxicon.com/), which offers high usability. There is also a very active PM community. Therefore, experiments in DPM are easy to conduct. One major constraint is the availability of a suitable discourse corpus that (a) contains a sufficient number of the phenomena to be investigated and (b) is reliably annotated. As is often the case in data-driven linguistics, the availability of data can be a problem, and tools for the automatic annotation of suitable corpora might be of use here. In general, we assume that discourse process models can effectively support the further investigation of discourse structure, as well as of the speakers' means to control discourse, in particular during conversations with multiple participants. In future work we would like to explore some of the potential research questions introduced in this paper, in particular the examination of discourse particles, such as "well" or the German "halt", in large unconstrained speech corpora.

References

[1] S. Kent, H.-C. Schmitz, Discourse process mining, Humanities-Centred AI (CHAI) (2021). URL: https://doi.org/10.25592/uhhfdm.9672.
[2] W. M. P.
van der Aalst, Process Mining: Data Science in Action, 2nd ed., Springer, Heidelberg, 2016. doi:10.1007/978-3-662-49851-4.
[3] J. L. Austin, How to Do Things with Words, Harvard University Press, Cambridge, MA, 1962.
[4] J. R. Searle, Speech Acts, Cambridge University Press, Cambridge, UK, 1969.
[5] E. A. Schegloff, H. Sacks, Opening up closings, Semiotica 8 (1973) 289–327.
[6] S. Vakulenko, K. Revoredo, C. D. Ciccio, M. de Rijke, QRFA: A data-driven model of information-seeking dialogues, CoRR abs/1812.10720 (2018). URL: http://arxiv.org/abs/1812.10720. arXiv:1812.10720.
[7] G. A. Wang, H. J. Wang, J. Li, A. S. Abrahams, W. Fan, An analytical framework for understanding knowledge-sharing processes in online Q&A communities, ACM Trans. Manag. Inf. Syst. 5 (2015) 18:1–18:31. URL: http://dblp.uni-trier.de/db/journals/tmis/tmis5.html#WangWLAF15.
[8] D. Compagno, E. Epure, R. Deneckère, C. Salinesi, Exploring digital conversation corpora with process mining, Corpus Pragmatics 2 (2018). doi:10.1007/s41701-018-0030-6.
[9] P. H. P. Richetti, J. C. de A. R. Gonçalves, F. A. Baião, F. M. Santoro, Analysis of knowledge-intensive processes focused on the communication perspective, in: J. Carmona, G. Engels, A. Kumar (Eds.), BPM, volume 10445 of Lecture Notes in Computer Science, Springer, 2017, pp. 269–285. URL: http://dblp.uni-trier.de/db/conf/bpm/bpm2017.html.
[10] D. Jurafsky, E. Shriberg, D. Biasca, Switchboard SWBD-DAMSL shallow-discourse-function annotation coders manual, Technical Report Draft 13, University of Colorado, Institute of Cognitive Science, 1997.
[11] E. Shriberg, R. Dhillon, S. Bhagat, J. Ang, H. Carvey, The ICSI Meeting Recorder Dialog Act (MRDA) corpus, in: M. Strube, C. L. Sidner (Eds.), SIGDIAL Workshop, The Association for Computational Linguistics, 2004, pp. 97–100. URL: http://dblp.uni-trier.de/db/conf/sigdial/sigdial2004.html.
[12] M. Weisser, DART – the Dialogue Annotation and Research Tool, Corpus Linguistics and Linguistic Theory 12 (2016).
doi:10.1515/cllt-2014-0051.
[13] P. Budzianowski, T.-H. Wen, B.-H. Tseng, I. Casanueva, S. Ultes, O. Ramadan, M. Gašić, MultiWOZ – a large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2018.
[14] X. Zang, A. Rastogi, S. Sunkara, R. Gupta, J. Zhang, J. Chen, MultiWOZ 2.2: A dialogue dataset with additional annotation corrections and state tracking baselines, in: Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI, ACL 2020, 2020, pp. 109–117.
[15] X. Li, S. Panda, J. Liu, J. Gao, Microsoft Dialogue Challenge: Building end-to-end task-completion dialogue systems, CoRR abs/1807.11125 (2018). URL: http://arxiv.org/abs/1807.11125. arXiv:1807.11125.
[16] A. H. Anderson, M. Bader, E. G. Bard, E. Boyle, G. Doherty, S. Garrod, S. Isard, J. Kowtko, J. McAllister, J. Miller, C. Sotillo, H. S. Thompson, R. Weinert, The HCRC Map Task corpus, Language and Speech 34 (1991) 351–366. URL: https://doi.org/10.1177/002383099103400404.