Dual-Task Dialogue Understanding

Sibgha Anwar, Nirmalie Wiratunga and Mark Snaith
School of Computing, Engineering and Technology, Robert Gordon University, Aberdeen AB10 7GJ, Scotland, UK

Abstract
In dialogue systems, utterances do not occur in isolation: a single conversation may involve interactions between several speakers. Determining the intention behind each utterance is crucial in multi-party conversations, where more than two interlocutors interact. Rather than capturing the speaker's intention directly, our proposed model first identifies speakers from their utterances and, building on this knowledge, classifies the corresponding dialogue acts. For speaker identification, we extract speaker-related linguistic features from conversations and incorporate them during fine-tuning, which is particularly beneficial when dealing with multiple speakers. Our model then aims to improve dialogue act recognition baselines on shorter utterances through a pipelining approach built on the speaker model's predictions. The effectiveness of our approach is demonstrated on two benchmark datasets, MRDA and SwDA, which comprise multi-party and two-party conversations, respectively.

Keywords
Speaker identification, dialogue act recognition, dual-task learning, conversational structure learning

Conversations, both written and verbal, are crucial for human communication. Speaker identification (SI) and dialogue act recognition (DAR) are essential tasks for understanding spoken language, identifying speakers, and supporting human-computer interaction applications. SI and DAR have historically been treated as separate tasks in natural language processing (NLP) and speech processing [1, 2, 3]. Notable contributions to SI include [3, 4], while [5, 6] conducted significant research on dialogue act recognition. The goal of DAR is to label each utterance in a conversation with its communicative function, such as question, statement, command, or request, which aids in understanding conversational flow and anticipating future exchanges.

The literature identifies several major challenges in dialogue act recognition. First, dialogue act models rely heavily on statistical patterns rather than speaker-specific traits, predicting acts from similar label sequences without considering unique speaking styles [7, 8]. Second, because dialogue act recognition algorithms generalise across all speakers without accounting for individual characteristics, they frequently produce inaccurate classifications and fail to detect specific speakers' speech patterns or personal styles [9, 10]. Finally, most current methods treat dialogue acts as discrete categories, which may overlook the subtle variations in how different speakers convey identical intents, along with their pragmatic implications [11, 12].

Research shows that speaker identification is a crucial component of dialogue act recognition systems, enabling personalised recognition and distinguishing between orders, enquiries, and assertions based on speaker-specific patterns [13]. It is especially helpful for ambiguous utterances such as "Really?" and aids in identifying and adapting to unusual or non-standard dialogue behaviours; mis-identification, by contrast, can lead to inaccurate categorisation and disrupted discourse. SI tracks conversational roles and interactions, which preserves dialogue flow, improves turn-taking modelling, and increases DAR accuracy [10, 14].
Table 1 displays a conversation snippet tagged with speaker IDs and dialogue acts from the MRDA corpus. The literature suggests that combining SI and DAR can enhance conversational flow modelling, particularly in multi-party interactions. However, the challenge of simultaneously identifying the speaker and their intent has not been adequately addressed in the literature. Our research therefore combines SI models with DAR to tackle dialogue act recognition challenges, forming the basis for the following research questions.

Table 1
Example conversation snippet annotated with dialogue acts from the MRDA corpus.

Speaker   Utterance                                                                      DA
me012     Who would be the subject of this trial run?                                    Question
mn015     Pardon me?                                                                     Request
me012     Is one of you going to be the subject?                                         Question
mn015     Liz volunteered to be the first subject, which might be even better than us.   Statement
fe004     Good.                                                                          Agreement
me003     One of us.                                                                     Acknowledgement

• What is the most effective method to combine speaker identification and dialogue act recognition into a single framework that improves our overall understanding of multi-party conversations?
• How does the accuracy of dialogue act recognition systems improve when speaker-specific data such as speech patterns and styles, response frequency and timing, and personalised phrasing are included?
• How does the proposed system handle imprecise speaker transitions, such as overlapping speech and unexpected topic shifts, in real-world multi-party interactions?
• What strategies can guarantee this dual-task system's performance and scalability in intricate multi-party interactions across a range of conversational contexts?

1. Related Work

In natural language processing (NLP), speaker identification (SI) and dialogue act recognition (DAR) are important research fields [5]. Early speaker identification techniques focused primarily on linguistic information derived from speech transcriptions: discourse patterns, grammatical structures, and semantic content are key indicators that provide valuable insights into the linguistic preferences of specific speakers. The study in [15] classifies film dialogue speakers based on discrete stylistic features using the K-Nearest Neighbour algorithm, the Naive Bayes classifier, and Conditional Random Fields [3, 16, 17]. These approaches struggled to handle varied language styles and complex transcription conditions, even though they worked well in controlled settings. Owing to their rich contextual representations, pre-trained language models such as BERT and RoBERTa have since shown success in speech processing and conversational tasks [18, 19].

Traditionally, DAR relied on statistical models and rule-based systems, including Hidden Markov Models (HMMs) and Conditional Random Fields (CRFs), to categorise dialogue acts according to lexical and syntactic features [5, 20]. Deep learning approaches improve DAR accuracy by using LSTM networks to better represent the ambiguity of real-world conversations in which multiple participants are involved [21].
Recent advances with transformer-based models such as BERT and RoBERTa improve the modelling of conversational relationships by fine-tuning for dialogue act recognition, and performance improves further when they are paired with dialogue-specific data such as dialogue history and speaker information [1, 22]. These models nevertheless have drawbacks, especially on shorter utterances, where inadequate context leads to poor performance [1]. Furthermore, models that rely solely on pre-trained embeddings to identify dialogue acts frequently ignore the full conversational context and speaker-specific information. This lack of contextual richness and personalisation hinders the model's ability to differentiate between dialogue acts, making it harder to comprehend and interpret the conversation's intended meaning. Adding speaker-specific data can improve dialogue act recognition by supplying the required context.

The integration of SI with DAR has received little attention, since SI has traditionally relied heavily on acoustic features that are not directly applicable to text transcriptions [23]. Additional research is required for text-based, multi-party conversational contexts [24]. Recent works have investigated the use of speaker embeddings in dialogue act models, offering a basis for improving DAR through speaker identification. The study in [25] indicates that discourse structure plays an important role in understanding utterance purpose, enhancing model performance, and recognising dialogue acts. Recent studies have also explored techniques combining discourse structure analysis and speaker identification for dialogue act recognition [26]. Our study therefore intends to create sophisticated, context-aware dialogue systems by utilising discourse structure and speaker identity to increase dialogue act recognition accuracy and coherence.

2. Datasets

The dual-task learning method will be tested on two publicly available datasets to demonstrate its reliability in accurately identifying speakers regardless of conversational complexity or speaking style. The ICSI Meeting Recorder Dialogue Act (MRDA) corpus [27] contains over 180,000 utterances from real-world academic and professional meetings, while the Switchboard Dialogue Act (SwDA) corpus [28] contains 223,605 utterances from telephone conversations between two speakers on predetermined topics. A minimal sketch of the shared record format we assume for both corpora is given below.
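For our purposes, both corpora reduce to sequences of (speaker, utterance, dialogue act) triples. The following minimal sketch shows one way such records could be represented; the class and field names are our illustrative choices, not the corpora's native annotation schemas.

```python
from dataclasses import dataclass

@dataclass
class DialogueTurn:
    """One annotated utterance. Field names are illustrative and do not
    mirror the corpora's native annotation schemas."""
    conversation_id: str  # meeting ID (MRDA) or call ID (SwDA)
    speaker_id: str       # e.g. "me012" in MRDA transcripts
    utterance: str        # transcribed text of the turn
    dialogue_act: str     # gold dialogue act label, e.g. "Question"

# The opening of the Table 1 snippet expressed in this format:
snippet = [
    DialogueTurn("meeting_01", "me012", "Who would be the subject of this trial run?", "Question"),
    DialogueTurn("meeting_01", "mn015", "Pardon me?", "Request"),
    DialogueTurn("meeting_01", "me012", "Is one of you going to be the subject?", "Question"),
]
```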
3. Proposed Approach

The proposed method aims to enhance dialogue act recognition by integrating speaker identification into the conversational analysis pipeline. This dual-task learning approach addresses the drawbacks of existing models in handling intricate conversational contexts: by integrating speaker identification with dialogue act recognition, it better captures the link between discourse purpose and speaker identity. The major elements of our proposed methodology follow. Figure 1 illustrates the overall workflow and how the different components are linked in the proposed SIDAR model.

Figure 1: Workflow of the proposed dual-task system for dialogue understanding, integrating Speaker Identification and Dialogue Act Recognition (SIDAR). The method improves both tasks' accuracy by utilising conversational context and speaker-specific features, enabling more effective understanding of multi-party conversations.

3.1. Speaker Identification (SI) Model

The Speaker Identification (SI) model plays a major role in our dual-task method by enhancing contextual awareness in multi-party conversations. It strengthens the identification process by capturing distinct speech and behaviour patterns through the addition of speaker-specific information.

3.1.1. Features of SI Model

The SI model improves the handling of interactions between multiple speakers in text-based transcriptions by incorporating speaker-specific features that influence communication styles. We aim to utilise the following features; a sketch of how they might be computed is given after the list.

• Speech Style and Patterns: Each speaker uses different syntactic patterns, repeats particular phrases, and builds utterances in different ways. By integrating these patterns, the model enhances its ability to identify dialogue acts and improves its understanding of the speaker's communication style and context [20, 29].

• Personalised Phrasing: Speakers frequently employ characteristic words and phrases such as "Would you mind" and "Can you". By identifying personalised phrases, the model can more accurately predict the meaning of an utterance, since certain expressions are suggestive of specific dialogue acts such as requests, directions, and queries [3, 4].

• Frequency and Timing of Responses: The SI model analyses how often a speaker replies as well as the timing of those replies, including start and end times, to understand their role in the conversation. Faster responses indicate active engagement, while slower responses suggest more considered contributions; people who take longer to reply, for instance, may be composing more intricate or in-depth remarks, such as long suggestions. Overall, the SI model enhances the DAR model's performance by understanding the speaker's identity and the correlation between response timing and the dialogue act [4, 30].
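To illustrate, the following minimal sketch turns the three feature groups into numeric per-speaker cues over DialogueTurn-style records from Section 2. The phrase-marker list, feature names, and the timing attributes (start_time, end_time) are illustrative assumptions rather than a finalised feature set.

```python
import re

# Illustrative "personalised phrasing" markers; a real system would
# learn per-speaker phrase inventories rather than fix them by hand.
PHRASE_MARKERS = [r"\bwould you mind\b", r"\bcan you\b", r"\bcould you\b"]

def speaker_features(turns, speaker_id):
    """Aggregate per-speaker cues from one conversation.

    `turns` is a chronologically ordered list of objects with
    .speaker_id, .utterance and (an assumption on our part) the
    timing attributes .start_time and .end_time in seconds.
    """
    own = [t for t in turns if t.speaker_id == speaker_id]
    texts = [t.utterance.lower() for t in own]
    n = max(len(own), 1)

    # Speech style and patterns: utterance length and interrogative tendency.
    avg_words = sum(len(t.split()) for t in texts) / n
    question_rate = sum(t.rstrip().endswith("?") for t in texts) / n

    # Personalised phrasing: rate of characteristic markers per turn.
    phrasing_rate = sum(
        len(re.findall(p, t)) for p in PHRASE_MARKERS for t in texts) / n

    # Frequency and timing: share of turns, and mean latency between the
    # previous turn ending and this speaker starting to reply.
    turn_share = len(own) / max(len(turns), 1)
    latencies = [cur.start_time - prev.end_time
                 for prev, cur in zip(turns, turns[1:])
                 if cur.speaker_id == speaker_id]
    mean_latency = sum(latencies) / max(len(latencies), 1)

    return {"avg_words": avg_words, "question_rate": question_rate,
            "phrasing_rate": phrasing_rate, "turn_share": turn_share,
            "mean_latency": mean_latency}
```

Features of this kind could then be appended to the token-level representations during fine-tuning, in line with the feature-augmented fine-tuning described in the abstract.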
3.1.2. SI Model Architecture

Advanced pre-trained language models, such as DeBERTa, RoBERTa, BART and Llama, are used in this work to identify the speaker in text-based transcriptions; these models are well suited to conversations that vary widely in context. According to [22, 24], such models capture long-range interactions in conversational text, supporting effective speaker recognition and gathering rich contextual information. The research aims to increase speaker identification performance by combining the contextual representations of these models with the extracted linguistic features. Byte-pair encoding (BPE) adds context by breaking utterances into subword units and assigning each token an embedding vector. Speaker embeddings capture distinct speech patterns, whereas position embeddings preserve word order within an utterance [23, 24]; position embeddings also help identify speaker transitions while preserving sentence structure by tracking the conversation's flow.

Large numbers of speakers are a challenge for traditional speaker identification techniques such as tf-idf vectors, speaker tokens, and word2vec [2, 3, 29, 31, 32]. Our work instead uses speaker embeddings to improve the model's ability to distinguish between speakers in intricate multi-party conversations, even across comparable utterances [24, 29]. The proposed model is therefore trained on the conversational datasets described in Section 2, which feature multiple speakers in varied contexts. By learning speaker identification and dialogue act recognition simultaneously, the dual-task technique maximises the effectiveness of both, leveraging speaker-specific traits for enhanced recognition in scenarios where the speaker's identity is crucial to the discourse [25, 30]. Attention mechanisms are also intended to be used to dynamically balance the importance of conversational segments by focusing on pertinent parts of the input utterance [26, 33].

3.2. Dialogue Act Recognition (DAR) Model

DAR models categorise utterances in conversations by communicative goal, such as enquiry, statement, or order [5]. The most advanced models capture linguistic and contextual features using pre-trained language models such as BERT or RoBERTa. Advances in spoken language processing have recently produced transformer ensembles built on lexical techniques (BERT); however, these models frequently perform poorly on shorter utterances. For instance, the utterance "Sure" in a customer support chat can signify agreement, acknowledgement, or confirmation [1].

Input representations, such as word embeddings, positional embeddings, and speaker-specific data, are used to analyse the data: transformer encoders process the input, while multi-head self-attention mechanisms capture conversation progression and the connections between dialogue turns. Traditional models often struggle because short utterances lack context. To overcome this limitation, we include speaker-specific embeddings from speaker identification models built on DeBERTa, RoBERTa, BART or Llama, enhancing the model's ability to handle brief or unclear utterances, particularly in multi-party interactions [34]. These personalised embeddings help the model understand speaker behaviour, including customers' reaction patterns; by clarifying the goal and making speech acts simpler to identify, this improves the interpretation of interaction patterns. The model's final layer predicts dialogue acts based on the speaker's position in the conversation and the content of the utterance, improving overall accuracy in real-world conversational scenarios. To improve dialogue act recognition for shorter utterances, our SIDAR model will identify speakers first, potentially providing additional information about unique speech patterns and conversational styles. A minimal sketch of this pipelined design is given below.
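Since this paper presents a conceptual framework, the following sketch is one way the pipeline could be realised rather than a finalised implementation. It assumes a RoBERTa encoder via the Hugging Face Transformers library, a learned speaker-embedding table, and fusion by concatenation; all three are illustrative choices.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class SIDAR(nn.Module):
    """Sketch of the pipelined dual-task design: one shared encoder, a
    speaker identification head, and a dialogue act head conditioned on
    a speaker embedding. Base model and sizes are assumptions."""

    def __init__(self, n_speakers, n_acts, base="roberta-base",
                 hidden=768, spk_dim=64):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(base)
        self.speaker_head = nn.Linear(hidden, n_speakers)     # SI task
        self.speaker_emb = nn.Embedding(n_speakers, spk_dim)  # speaker traits
        self.act_head = nn.Linear(hidden + spk_dim, n_acts)   # DAR task

    def forward(self, input_ids, attention_mask, gold_speaker=None):
        out = self.encoder(input_ids, attention_mask=attention_mask)
        utt = out.last_hidden_state[:, 0]    # [CLS]-style utterance vector
        spk_logits = self.speaker_head(utt)  # step 1: identify the speaker
        # Pipeline: condition DAR on the SI prediction at inference time;
        # during training, gold speaker IDs can be fed instead, since
        # argmax is not differentiable.
        spk = gold_speaker if gold_speaker is not None else spk_logits.argmax(-1)
        da_logits = self.act_head(torch.cat([utt, self.speaker_emb(spk)], dim=-1))
        return spk_logits, da_logits

tok = AutoTokenizer.from_pretrained("roberta-base")
batch = tok(["Pardon me?"], return_tensors="pt")
model = SIDAR(n_speakers=50, n_acts=12)
spk_logits, da_logits = model(batch["input_ids"], batch["attention_mask"])
```

Feeding gold speaker IDs to the dialogue act head during training, as the gold_speaker argument allows, is a standard teacher-forcing choice; at inference the pipeline runs on the SI head's own predictions, as proposed above.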
4. Conclusion and Future Work

In this work, we have presented a proposed methodology that integrates SI and DAR in a dual-task learning approach to improve the precision of conversational analysis. In light of the shortcomings of traditional speaker identification techniques and the limitations BERT-based models have when recognising dialogue acts, especially for shorter utterances, we suggest using state-of-the-art language models such as DeBERTa, RoBERTa, BART and Llama. By including speaker-specific information and conversational history in the dialogue act recognition process, our methodology aims to address the inadequacies of current techniques. This research provides a conceptual framework; further work is needed to implement the suggested approaches and validate them experimentally. The findings are expected to have a substantial impact on the domains of dialogue act recognition and speaker identification, ultimately improving the efficacy of conversational AI systems.

5. Acknowledgements

The authors would like to thank Robert Gordon University, which supported this work through a funded PhD studentship.

References

[1] H. Maltby, J. Wall, T. Goodluck Constance, M. Moniri, C. Glackin, M. Rajwadi, N. Cannings, Short utterance dialogue act classification using a transformer ensemble, UA-DIGITAL 2023: UA Digital Theme Research Twinning (2023).
[2] D. Holmer, L. Ahrenberg, J. Monsen, A. Jönsson, M. Apel, M. B. Grimaldi, Who said what? Speaker identification from anonymous minutes of meetings, in: The 24th Nordic Conference on Computational Linguistics, 2023.
[3] K. Ma, C. Xiao, J. D. Choi, Text-based speaker identification on multiparty dialogues using multi-document convolutional neural networks, in: Proceedings of ACL 2017, Student Research Workshop, 2017, pp. 49–55.
[4] S. Salim, S. Shahnawazuddin, W. Ahmad, Automatic speaker verification system for dysarthric speakers using prosodic features and out-of-domain data augmentation, Applied Acoustics 210 (2023) 109412.
[5] V. Raheja, J. Tetreault, Dialogue act classification with context-aware self-attention, arXiv preprint arXiv:1904.02594 (2019).
[6] Y. Si, L. Wang, J. Dang, M. Wu, A. Li, A hierarchical model for dialogue act recognition considering acoustic and lexical context information, in: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2020, pp. 7994–7998.
[7] A. Stolcke, K. Ries, N. Coccaro, E. Shriberg, R. Bates, D. Jurafsky, P. Taylor, R. Martin, C. Van Ess-Dykema, M. Meteer, Dialogue act modeling for automatic tagging and recognition of conversational speech, Computational Linguistics 26 (2000) 339–373.
[8] T. Saha, S. Srivastava, M. Firdaus, S. Saha, A. Ekbal, P. Bhattacharyya, Exploring machine learning and deep learning frameworks for task-oriented dialogue act classification, in: 2019 International Joint Conference on Neural Networks (IJCNN), IEEE, 2019, pp. 1–8.
[9] M. Kim, H. Kim, Integrated neural network model for identifying speech acts, predicators, and sentiments of dialogue utterances, Pattern Recognition Letters 101 (2018) 1–5.
[10] A. Qamar, A. Pyarelal, R. Huang, Who is speaking? Speaker-aware multiparty dialogue act classification, in: Findings of the Association for Computational Linguistics: EMNLP 2023, 2023, pp. 10122–10135.
[11] C. Sun, L.-P. Morency, Dialogue act recognition using reweighted speaker adaptation, in: Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue, 2012, pp. 118–125.
[12] A. Enayet, G. Sukthankar, An analysis of dialogue act sequence similarity across multiple domains, in: Proceedings of the Thirteenth Language Resources and Evaluation Conference, 2022, pp. 3122–3130.
[13] Z. He, L. Tavabi, K. Lerman, M. Soleymani, Speaker turn modeling for dialogue act classification, arXiv preprint arXiv:2109.05056 (2021).
[14] P. Żelasko, R. Pappagari, N. Dehak, What helps transformers recognize conversational structure? Importance of context, punctuation, and labels in dialog act recognition, Transactions of the Association for Computational Linguistics 9 (2021) 1163–1179.
[15] A. Kundu, D. Das, S. Bandyopadhyay, Speaker identification from film dialogues, in: 2012 4th International Conference on Intelligent Human Computer Interaction (IHCI), IEEE, 2012, pp. 1–4.
[16] R. Lowe, N. Pow, I. Serban, J. Pineau, The Ubuntu Dialogue Corpus: A large dataset for research in unstructured multi-turn dialogue systems, arXiv preprint arXiv:1506.08909 (2015).
[17] M. K. Singh, S. Manusha, K. Balaramakrishna, S. Gamini, Speaker identification analysis based on long-term acoustic characteristics with minimal performance, International Journal of Electrical and Electronics Research 10 (2022) 848–852.
[18] C. S. Xia, Y. Wei, L. Zhang, Practical program repair in the era of large pre-trained language models, arXiv preprint arXiv:2210.14179 (2022).
[19] B. Min, H. Ross, E. Sulem, A. P. B. Veyseh, T. H. Nguyen, O. Sainz, E. Agirre, I. Heintz, D. Roth, Recent advances in natural language processing via large pre-trained language models: A survey, ACM Computing Surveys 56 (2023) 1–40.
[20] H. Kumar, A. Agarwal, R. Dasgupta, S. Joshi, Dialogue act sequence labeling using hierarchical encoder with CRF, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
[21] C. Bothe, C. Weber, S. Magg, S. Wermter, A context-based approach for dialogue act recognition using simple recurrent neural networks, arXiv preprint arXiv:1805.06280 (2018).
[22] G. Guaquiere, P. ENSAE, A. N. T. Son, RoBERTa vs BERT for intent classification (2021).
[23] T. Kinnunen, H. Li, An overview of text-independent speaker recognition: From features to supervectors, Speech Communication 52 (2010) 12–40.
[24] Z. Jia, Y. Shi, W. Liu, Z. Huang, X. Sun, Speaker-aware interactive graph attention network for emotion recognition in conversation, ACM Transactions on Asian and Low-Resource Language Information Processing 22 (2023) 1–18.
[25] Z. Shi, M. Huang, A deep sequential model for discourse parsing on multi-party dialogues, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 2019, pp. 7007–7014.
[26] C.-J. Peng, Y.-J. Chan, C. Yu, S.-S. Wang, Y. Tsao, T.-S. Chi, Attention-based multi-task learning for speech-enhancement and speaker-identification in multi-speaker dialogue scenario, in: 2021 IEEE International Symposium on Circuits and Systems (ISCAS), IEEE, 2021, pp. 1–5.
[27] E. Shriberg, R. Dhillon, S. Bhagat, J. Ang, H. Carvey, The ICSI Meeting Recorder Dialog Act (MRDA) corpus, in: Proceedings of the 5th SIGdial Workshop on Discourse and Dialogue at HLT-NAACL 2004, 2004, pp. 97–100.
[28] D. Jurafsky, E. Shriberg, D. Biasca, Switchboard SWBD-DAMSL shallow-discourse-function annotation coders manual, draft 13, University of Colorado at Boulder & SRI International (1997).
[29] M.-Q. Nghiem, N. Roberts, D. Sityaev, Speaker role identification in call centre dialogues: Leveraging opening sentences and large language models, in: Proceedings of the 24th Meeting of the Special Interest Group on Discourse and Dialogue, 2023, pp. 388–392.
[30] R. Le, W. Hu, M. Shang, Z. You, L. Bing, D. Zhao, R. Yan, Who is speaking to whom? Learning to identify utterance addressee in multi-party conversations, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 1909–1919.
[31] D. Baum, Recognising speakers from the topics they talk about, Speech Communication 54 (2012) 1132–1142.
[32] E. Ekstedt, G. Skantze, TurnGPT: A transformer-based language model for predicting turn-taking in spoken dialog, arXiv preprint arXiv:2010.10874 (2020).
[33] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, arXiv preprint arXiv:1409.0473 (2014).
[34] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, et al., Transformers: State-of-the-art natural language processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2020, pp. 38–45.