<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Rethinking Dialogue Disentanglement for LLMs via Dialogue-Level Assignment and Subsequent Context</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Naoki Takada</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tatsunori Mori</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Yokohama National University</institution>
          ,
          <addr-line>79-7 Tokiwadai, Hodogaya-ku, Yokohama, Kanagawa, 240-8501</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>In online chat platforms, multiple dialogues are often entangled in a complex manner. Disentangling them into coherent dialogues is essential for understanding conversations. Supervised models have long dominated this task. The application of large language models (LLMs) to this problem remains underexplored; moreover, preliminary studies have shown that their performance is substantially inferior to that of conventional non-LLM methods. This raises fundamental questions: Do LLMs inherently lack the capability for dialogue disentanglement, and what approaches are necessary to enhance their performance? We answer these questions by introducing two novel methods: dialogue-level assignment (DLA), which tasks LLMs with assigning an utterance to a dialogue, and subsequent context (SC), which provides subsequent context as auxiliary evidence. Experiments on the benchmark IRC dataset show that our method, equipped with DLA+SC, achieves new state-of-the-art results across all evaluation metrics. These results demonstrate that LLMs possess a strong capability for the studied task. Furthermore, an ablation study on open-source models investigates the effectiveness of DLA and SC, revealing that it is model-dependent. Nevertheless, it also demonstrates that both methods are key factors in improving performance. The findings of this study mark a paradigm shift for dialogue disentanglement, transitioning from conventional non-LLM approaches to applying LLMs.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The 10th Linguistic and Cognitive Approaches to Dialog Agents Workshop at the 40th AAAI conference, January 26, 2026, Singapore. https://haniwara.github.io/ (N. Takada); https://forest.ynu.ac.jp/mori (T. Mori). CEUR Workshop Proceedings, ISSN 1613-0073.</p>
      <p>In multi-party chat platforms such as Slack and Internet Relay Chat (IRC), multiple concurrent dialogues frequently become intertwined. This entanglement leads to a chaotic conversational flow. For human participants, following individual dialogues becomes challenging, imposing a substantial cognitive burden. For systems, these entangled dialogues introduce noise into the conversational context and complicate subsequent processing. To address this challenge, the concept of dialogue disentanglement has been proposed [1, 2]. Dialogue disentanglement aims to partition entangled utterances into coherent clusters connected through reply-to relations (Figure 1). This approach supports a wide range of downstream applications, including dialogue state tracking [3, 4] and response generation [5<xref ref-type="bibr" rid="ref6 ref7">, 6, 7, 8</xref>]. This task has predominantly been addressed using pairwise relation classification based on supervised machine learning [2, 9, 10]. In this approach, a model scores every candidate utterance pair within a conversation and makes a local binary decision: is one utterance a direct reply to the other? A global dialogue structure is then assembled from these local links. This approach operationalizes disentanglement by atomizing decisions for machine learning.</p>
      <p>Meanwhile, large language models (LLMs) offer strong contextual reasoning and adaptation, suggesting their potential to revolutionize dialogue disentanglement. However, initial attempts to apply LLMs to dialogue disentanglement resulted in substantially worse performance than conventional non-LLM methods [11]. This performance gap motivates a re-examination of the task's formulation for LLMs. We propose two methods that recast dialogue disentanglement as a problem suited to LLMs. First, we propose the dialogue-level assignment (DLA) method. Rather than identifying one parent utterance, thereby replicating traditional pairwise relation classification, we apply a method that identifies entire dialogues with reply-to relations. The LLM is provided with the dialogue clusters identified from previous assignments. It must then assign the target utterance to one of these existing dialogues or classify it as the start of a new one. Second, we introduce the subsequent context (SC) method, in which the model receives subsequent utterances as auxiliary evidence to inform the assignment.</p>
      <p>Our contributions are threefold. First, we focus explicitly on applying LLMs to dialogue disentanglement and develop methods that improve their disentanglement capability. Second, we show that our method, applied to LLMs, achieves state-of-the-art (SOTA) performance on the IRC benchmark. Third, we conduct an ablation study on the effectiveness of DLA and SC. The analysis shows that the optimal method is model-dependent, but establishes that DLA and SC are key components for improved accuracy.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Dialogue Disentanglement</title>
        <p>Dialogue disentanglement, also known as conversation disentanglement [1, 2], conversation management [12], or thread detection [13], is a crucial research topic for understanding entangled dialogues. Early approaches framed this task as a two-stage problem. First, a classifier assesses every potential utterance pair to determine whether one is a direct reply to the other. Second, based on this pairwise relation classification, a clustering algorithm assembles these pairs into a dialogue [1, 2]. These initial models relied heavily on handcrafted features, such as lexical overlap, time gaps, and explicit user mentions. The availability of large-scale annotated corpora, notably the Ubuntu IRC dataset [14], facilitated the development of data-driven, end-to-end neural models. A separate, subsequent advancement came with the adoption of fine-tuned pre-trained language models (PLMs), such as BERT [15], which substantially improved performance and set a new standard for these tasks [16, 9].</p>
        <p>However, these powerful encoders typically treated conversations as simple, unstructured utterance sequences and thus failed to leverage the structure of the discourse. This prompted a subsequent line of research that focused on explicitly integrating discourse structure to better capture conversational dynamics. Models started to incorporate static, dialogue-specific features, such as speaker identity and user-mention dependencies [10]. Several key refinements have been introduced in the current SOTA model [11], enriching the model with dynamic discourse information, such as time gaps and continuously updated reply chains. A hierarchical learning loss has also been integrated with an easy-first decoding algorithm to capture global conversational properties.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. LLMs for Dialogue Disentanglement</title>
        <p>The rise of LLMs has shifted natural language processing paradigms, as these models have shown strong proficiency across many applications [17]. Their power stems from their ability to adapt to new problems through prompts. Foundational paradigms such as in-context learning and few-shot prompting, in which the model learns from a handful of examples provided directly in context, enable LLMs to perform tasks without task-specific training or fine-tuning [18].</p>
        <p>Despite these powerful capabilities, the application of LLMs to dialogue disentanglement remains largely underexplored. To the best of our knowledge, the only notable investigation in this area is a preliminary experiment by Li et al. [11]. They framed the task as a zero-shot classification problem, asking an LLM to identify a parent utterance without examples. The task is performed through an approach similar to the pairwise relation classification method used in conventional dialogue disentanglement; that is, LLMs are asked to perform utterance-level assignments. This approach performed significantly worse than conventional non-LLM methods. Consequently, the potential of LLMs to succeed in dialogue disentanglement has not been fully assessed to date. In this study, we reframed the task by introducing two new formulations: DLA and SC. To our knowledge, this is the first study to place LLMs at the center of dialogue disentanglement—a task long dominated by conventional non-LLM approaches.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methods</title>
      <sec id="sec-3-1">
        <title>3.1. Task Definition</title>
        <p>To formally define the dialogue disentanglement task addressed in this study, we first clarify the terminology. We refer to the entire observed sequence of utterances, in which multiple dialogues are entangled, as a conversation C. The set D represents the dialogues partitioned from C, where each d_i in D is a separate semantic unit comprising a chain of response-related utterances sharing a specific topic or purpose. Accordingly, the task is formalized as follows. The input is a conversation C = (u_1, u_2, …, u_N) comprising utterances in chronological order. Each utterance u_i in C is a tuple u_i = (t_i, s_i, m_i), where t_i is the timestamp, s_i is the speaker ID, and m_i is the message content. The goal is to partition C into a set of mutually disjoint dialogues D = {d_1, d_2, …, d_K}. This partitioning must be a strict partition of C that satisfies the following two conditions: exhaustiveness (the union of all d_i in D equals C) and exclusiveness (for all i ≠ j, d_i ∩ d_j = ∅). Therefore, dialogue disentanglement is the problem of estimating the optimal set D that satisfies these conditions for a given C.</p>
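<p>The exhaustiveness and exclusiveness conditions can be checked mechanically. The following minimal sketch (the function name and data layout are ours, not the paper's) validates a candidate partition:</p>

```python
def is_valid_partition(conversation, dialogues):
    """Return True iff `dialogues` is a strict partition of `conversation`.

    conversation: iterable of utterance IDs in chronological order.
    dialogues: list of sets of utterance IDs, one set per dialogue d_i.
    """
    assigned = [u for d in dialogues for u in d]
    # Exhaustiveness: the union of all dialogues covers the conversation.
    exhaustive = set(assigned) == set(conversation)
    # Exclusiveness: no utterance belongs to two dialogues.
    exclusive = len(assigned) == len(set(assigned))
    return exhaustive and exclusive

conv = [1, 2, 3, 4, 5]
print(is_valid_partition(conv, [{1, 3}, {2, 4, 5}]))   # True: strict partition
print(is_valid_partition(conv, [{1, 3}, {2, 3, 4, 5}]))  # False: 3 appears in both
```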
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Dialogue Disentanglement Methods for LLMs</title>
        <sec id="sec-3-2-1">
          <title>3.2.1. Core Framework</title>
          <p>For the ablation study, we instantiated four methods. First, we describe the common core framework shared by all four. The core framework is an iterative greedy algorithm. The LLM processes each utterance chronologically, from u_2 to u_N. The local context for each target utterance u_i is constructed based on two fixed hyperparameters: the previous and subsequent window sizes, respectively. These values remain constant across all utterances within a single experimental run, rather than being dynamically adjusted. In this study, we selected the window sizes experimentally to evaluate the sensitivity of the model to context length. The LLM's task is to return a single assignment, indicating whether u_i belongs to an existing dialogue or initiates a new one. The system updates its state, the set of dialogues D, based on the LLM's decision. This process repeats until all utterances have been assigned, yielding the final partition. The prompts are simple, zero-shot prompts. They ask the model to determine whether the target utterance should connect to exactly one candidate or start a new dialogue, guided by a locality prior suggesting that replies are often temporally close. We did not conduct an ablation study of this locality hint; we used it to align with the prompt formulation of Li et al. [11], which ensures experimental consistency with their preliminary study. The LLM is constrained to output a JSON object containing three fields: is_new_dialogue (a boolean), an assignment ID, and reason (a textual explanation). The DLA and DLA+SC methods task the LLMs with dialogue-level assignment and therefore require a “dialogue_id” that links the utterance to an entire dialogue cluster. The Baseline and SC methods perform utterance-level assignment, which requires an “utterance_id” specifying a direct parent utterance. To handle malformed outputs, we implemented an error-handling and retry mechanism. If the generated JSON is invalid or its fields fail type or range validation, the query is re-sent with a corrective instruction appended to the prompt. We set a maximum of five retries for each assignment. Details of the prompts used in this study are listed in Appendix B.</p>
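<p>The core loop can be sketched as follows. This is a minimal illustration under our own assumptions, not the authors' implementation: ask_llm is a hypothetical stand-in for the real prompt/completion call, the IDs are zero-based by our choice, and the fallback when every retry fails is our assumption.</p>

```python
import json

MAX_RETRIES = 5  # the paper allows up to five retries per assignment

def parse_assignment(raw, num_dialogues):
    """Validate the model's JSON output; raise ValueError if malformed."""
    obj = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(obj.get("is_new_dialogue"), bool):
        raise ValueError("is_new_dialogue must be a boolean")
    if not obj["is_new_dialogue"]:
        did = obj.get("dialogue_id")
        if not isinstance(did, int) or not 0 <= did < num_dialogues:
            raise ValueError("dialogue_id failed range validation")
    return obj

def disentangle(utterances, ask_llm):
    """Iterative greedy loop: dialogue-level assignment in chronological order."""
    dialogues = [[utterances[0]]]  # the first utterance starts dialogue 0
    for target in utterances[1:]:
        correction = None
        obj = {"is_new_dialogue": True}  # fallback if all retries fail (our choice)
        for _ in range(1 + MAX_RETRIES):
            try:
                obj = parse_assignment(ask_llm(target, dialogues, correction),
                                       len(dialogues))
                break
            except ValueError:
                correction = "Return valid JSON with the required fields."
        if obj["is_new_dialogue"]:
            dialogues.append([target])                    # start a new dialogue
        else:
            dialogues[obj["dialogue_id"]].append(target)  # join an existing one
    return dialogues
```

In the full framework the prompt would also carry the windowed context and the locality hint; both are omitted here for brevity.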
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2.2. Framework Variations</title>
          <p>Next, we define the four framework variations used in our experiments, which are distinguished by how they provide the LLMs with utterance sequences. The Baseline method provides LLMs with the previous context as a simple chronological sequence of utterances. The LLM's task is to perform utterance-level assignment by identifying which specific previous utterance the target is replying to. This formulation closely mirrors the approach of Li et al. [11]. However, our implementation modifies it by including not only the utterances within the immediate context window but also all other utterances previously assigned to the same dialogues represented in that window. This change ensures consistent information availability across methods for a fair ablation. Further improvements are then applied on top of this Baseline method. The DLA method organizes the previous context into coherent dialogue clusters. This structure is built dynamically, reflecting the assignment decisions made at each prior step. The LLM's task is thus transformed into assigning the target utterance to one of these previously established dialogues or initiating a new one. The SC method uses the same utterance-level assignment but additionally provides subsequent utterances as auxiliary evidence. The DLA+SC method combines DLA and SC. A schematic workflow of the proposed method is shown in Figure 2. The complete DLA+SC definition is provided in Algorithm 1. The other framework variations are described in Appendix B.</p>
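<p>The difference between the flat (Baseline/SC) and clustered (DLA/DLA+SC) previous context can be illustrated with a small sketch. The helper below is hypothetical (the function name and data layout are ours); both modes include every utterance of any dialogue represented in the window, mirroring the modification described above:</p>

```python
def build_previous_context(history, assignments, window, dialogue_level):
    """Assemble the previous context for the next target utterance.

    history: already-processed utterances in chronological order.
    assignments: dialogue ID assigned to each utterance in `history`.
    window: previous-context window size (number of recent utterances).
    """
    start = max(0, len(history) - window)
    represented = set(assignments[start:])  # dialogues visible in the window
    if dialogue_level:
        # DLA / DLA+SC: group candidate utterances into dialogue clusters.
        clusters = {}
        for utt, did in zip(history, assignments):
            if did in represented:
                clusters.setdefault(did, []).append(utt)
        return clusters
    # Baseline / SC: flat chronological list of candidate parent utterances.
    return [u for u, d in zip(history, assignments) if d in represented]

msgs = ["hi all", "anyone tried hoary?", "hello!", "yes, works fine"]
dialogue_ids = [0, 1, 0, 1]
print(build_previous_context(msgs, dialogue_ids, window=2, dialogue_level=True))
# {0: ['hi all', 'hello!'], 1: ['anyone tried hoary?', 'yes, works fine']}
```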
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>4.2. Evaluation Metrics</title>
        <p>For direct comparison, we adopted the evaluation metrics from the current SOTA method, Li et al. [11]. These metrics evaluate dialogue disentanglement from three perspectives. First, we used the Variation of Information (VI) [19], Adjusted Rand Index (ARI) [20], and Normalized Mutual Information (NMI) [21] to measure the overall similarity between the gold standard data and the predicted utterance clusters. Second, we used Loc-3 [2], a local precision metric that indicates prediction accuracy over every window of three utterances. Third, we used One-to-One (1-1) [2], Shen-F1 (S-F) [13], and exact-matching Precision, Recall, and F1 score (P, R, and F1) [14] to show how many clusters match between the gold standard data and the predictions. Among these, the most stringent metrics are P, R, and F1, which measure the percentage of predicted dialogues that perfectly match the gold standard data.</p>
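<p>The exact-matching P, R, and F1 described above reduce to set operations over whole clusters. A minimal sketch (the function name and cluster representation are ours):</p>

```python
def exact_match_prf(gold, pred):
    """Exact-match Precision/Recall/F1 over whole dialogue clusters.

    gold, pred: lists of dialogues, each a set of utterance IDs.
    A predicted dialogue counts only if it equals a gold dialogue exactly.
    """
    gold_set = {frozenset(d) for d in gold}
    pred_set = {frozenset(d) for d in pred}
    matched = len(gold_set & pred_set)  # perfectly reproduced dialogues
    p = matched / len(pred_set) if pred_set else 0.0
    r = matched / len(gold_set) if gold_set else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# One of two predicted dialogues matches a gold dialogue exactly.
print(exact_match_prf([{1, 2}, {3, 4, 5}], [{1, 2}, {3, 4}]))  # (0.5, 0.5, 0.5)
```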
      </sec>
      <sec id="sec-3-4">
        <title>4.3. Comparison Methods</title>
        <p>
          We evaluated the proposed method against methods representing the major paradigms in dialogue disentanglement. These comparators include a seminal statistical model reliant on handcrafted features [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], a neural model that treats the task as a sequence labeling problem [14], and a transition-based model that formulates disentanglement as an online state-transition process [22]. We also benchmarked against a generative model that predicts reply-to links [23], methods that leverage PLMs for contextual understanding [9], and those incorporating discourse structure [10]. Our primary baseline was the current SOTA method, which integrates additional discourse structure information [11].
        </p>
      </sec>
      <sec id="sec-3-5">
        <title>4.4. Implementation Details</title>
        <p>We evaluated the methods using both proprietary and open-source LLMs. For the proprietary models, we utilized GPT4.1 and GPT4.1-mini via the Azure OpenAI Service (API version 2025-01-01-preview). We also employed Google's Gemini2.5-pro and Gemini2.5-flash from the stable June 2025 release. For all proprietary-model experiments, we set the temperature to 0.0 to minimize variance. However, strict determinism is not guaranteed; we therefore report results from a single trial. It is worth noting that rigorous reproducibility protocols for API-based models remain unestablished. The output token limit was set to 8,192. The previous context window was fixed at 70 utterances. This size was informed by a preliminary experiment in which we observed reply dependencies spanning up to 66 utterances. The subsequent window was set to 50 utterances. For the open-source models, we used the Ollama framework with qwen3-32b (Ollama tag qwen3:32b; library ID 030ee887880f), gemma3-27b (Ollama tag gemma3:27b; library ID a418f5838eaf), gpt-oss-20b (Ollama tag gpt-oss:20b; library ID 17052f91a42e), and gpt-oss-120b (Ollama tag gpt-oss:120b; library ID a951a23b46a1). Additionally, we fixed the previous context window at 40 utterances. This reduction was necessary because larger contexts frequently resulted in malformed JSON outputs. We then systematically evaluated subsequent windows of 5, 10, 15, and 20 utterances. All open-source models were configured with a maximum input context (num_ctx) of 16,384 and a maximum output token limit of 8,192. As with the proprietary models, the temperature was set to 0.0. For the gpt-oss-120b and gpt-oss-20b models specifically, we set the reasoning effort parameter to “high” in the system prompt.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Results</title>
      <sec id="sec-4-1">
        <title>5.1. Preliminary Experiments</title>
        <p>We compared our DLA+SC method against the preliminary experiments using DiHRL [11]. Although our Baseline method is conceptually related, it is not identical to the DiHRL method, as it incorporates the modifications discussed in Section 3.2.2. We performed this evaluation on a 500-utterance subset of the IRC development set (IDs: 2005-06-27_12, 2005-08-08_01). Table 1 presents the results, with the superior method for each model underlined and the overall best value shown in bold. As these results confirm the effectiveness of DLA+SC across all models, we selected this approach for the main experiments on the test set.</p>
      </sec>
      <sec id="sec-4-2">
        <title>5.2. Comparison with SOTA Methods</title>
      </sec>
      <sec id="sec-4-3">
        <title>5.3. Ablation Study</title>
        <p>We ablated DLA and SC on four open-source LLMs, using exact-match F1 as the primary metric. The results are summarized in Figure 3, where sub indicates the number of utterances following the target utterance provided as subsequent context. These results show that the optimal formulation is model-dependent; each model required a different method configuration to achieve its peak performance. Among all setups, qwen3-32b achieved the highest absolute F1, indicating that DLA and SC are effective components. Notably, its F1 score is comparable to that of the existing SOTA methods shown in Table 2, despite dataset differences. Full results for all experimental setups are provided in Appendix A.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Analysis and Discussion</title>
      <p>The effectiveness of subsequent context is model-dependent, as demonstrated by gemma3-27b. This model's accuracy consistently deteriorated as its subsequent context window was enlarged. We hypothesize that the model misinterpreted these future utterances as valid parent candidates, which complicated the assignment task. This hypothesis is supported by our ablation study results. We initially configured the open-source models to use the same context window size as the proprietary models, but were compelled to reduce this value owing to a high rate of hallucinations. With longer input contexts, the model made errors; it incorrectly assigned the target utterance to a subsequent one. For certain LLMs, subsequent context may therefore act as a distractor rather than as auxiliary evidence, impairing performance. By contrast, the three models other than gemma3-27b demonstrated improved accuracy with the inclusion of subsequent context. This suggests that providing a limited window of subsequent utterances can be effective for dialogue disentanglement. However, its effectiveness is not universal and depends on the model's ability to differentiate evidence from assignment candidates.</p>
      <p>We conducted a quantitative failure analysis on the results from all open-source models to identify conditions leading to assignment errors. We evaluated the average success rate by classifying each assignment decision as a binary success or failure based on dialogue cluster overlap. For a target utterance u_i, an assignment is considered successful only if the assigned dialogue cluster shares common utterances with the ground-truth cluster within the previous context (utterances u_j with j &lt; i). We investigated the correlation between the average success rate and three potential factors: the number of utterances in the prompt, the count of assignment candidates, and the positional distance of the reply-to relation. Specifically, we analyzed trends in the success rate relative to variations in these factors. This examination aimed to determine whether specific quantitative increases negatively impact performance. Our analysis revealed no significant correlation between these factors and assignment accuracy; for brevity, detailed results are omitted here. However, this analysis is limited to the window sizes used in this study and should not be interpreted as a general characteristic of LLMs' dialogue disentanglement. In future work, large-scale experiments with varying window sizes, including larger contexts, are required to rigorously verify LLMs' dialogue disentanglement ability.</p>
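<p>The binary success criterion used in this analysis can be stated compactly. The sketch below is our reading of the overlap rule (the function name is ours): an assignment succeeds iff the predicted and gold clusters share at least one utterance that precedes the target.</p>

```python
def assignment_success(assigned_cluster, gold_cluster, target_index):
    """Binary success test for a single assignment decision.

    Clusters are sets of utterance indices; only utterances strictly
    before the target (index < target_index) count as evidence.
    """
    prior_assigned = {u for u in assigned_cluster if u < target_index}
    prior_gold = {u for u in gold_cluster if u < target_index}
    return bool(prior_assigned & prior_gold)

print(assignment_success({2, 4, 10}, {4, 7, 10}, 10))  # True: overlap at 4
print(assignment_success({2, 10}, {4, 7, 10}, 10))     # False: no shared prior utterance
```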
      <p>Additionally, we conducted a qualitative error analysis specifically on the results from our best-performing open-source configuration: qwen3-32b with DLA+SC (sub = 5). This analysis identified
two primary factors contributing to errors. The first major failure mode involved the misclassification
of short, non-substantive utterances such as reactions (“WTF!”), backchannels (“it is”), or greetings
(“hi”). Correctly assigning these utterances requires identifying them as reactions within a dialogue the
speaker is already participating in. An examination of the model’s generated reasoning suggested that
it often failed to properly recognize speaker identity. The model appeared to prioritize the semantic
content of an utterance over speaker consistency, often defaulting to linking it with a temporally
proximate utterance regardless of the speaker. This behavior may be exacerbated by the locality hint
provided in our prompt, which instructs the model that temporally closer utterances are more likely
to belong to the same dialogue. Misclassified short reactions were often temporally distant from the
utterance they responded to. The model thus tended to make locally plausible but globally incorrect
assignments. This finding suggests that the inclusion of a locality hint may hinder performance. The
second type of common error was due to domain-specific terms, such as “hoary,” the codename for
Ubuntu 5.04. The model’s generated reasoning showed that it understood that the term is related to
Ubuntu but failed to correctly link it to the ongoing conversation. It frequently misinterpreted the
jargon as the start of a new topic, thereby incorrectly fragmenting a single, coherent dialogue. For
technical chats, retrieval-augmented generation could provide domain-specific knowledge and improve
accuracy [25].</p>
    </sec>
    <sec id="sec-6">
      <title>7. Conclusion</title>
      <p>In this work, we demonstrated that LLMs can achieve SOTA performance in dialogue disentanglement. Applying our proposed methods to proprietary LLMs yielded accuracy surpassing that of conventional non-LLM methods. We introduced two novel formulations to facilitate this achievement: DLA and SC. However, our ablation study with open-source models revealed that the effectiveness of DLA and SC is model-dependent. Although the optimal method varies by model, DLA and SC proved to be valuable components for improving LLM-based dialogue disentanglement. The findings of this study mark a paradigm shift for dialogue disentanglement, moving from conventional non-LLM approaches toward LLMs.</p>
    </sec>
    <sec id="sec-7">
      <title>8. Future Work</title>
      <p>There are three key avenues for future work. First, a more extensive analysis is necessary. Our ablation
study was confined to a limited selection of LLMs. Future work should evaluate a wider range of
open-source LLMs. Additionally, further investigation into context window sizes is necessary to identify
optimal configurations for different models. Second, this study relied on simple, zero-shot prompts
with a fixed locality hint. Consequently, comprehensive prompt optimization is required. Future
research should investigate advanced strategies such as few-shot and chain-of-thought prompting.
Furthermore, the impact of the locality hint must be rigorously ablated. Developing other appropriate
hints and refining instruction methods are also essential steps. Finally, the domain dependency must be
investigated. This research only focused on IRC data. It is crucial to assess the generalizability of our
approach to other conversational contexts. Previous work has shown that supervised models trained
on IRC data do not transfer well to Slack chat data without retraining [26].</p>
    </sec>
    <sec id="sec-8">
      <title>9. Limitations</title>
      <p>Our study has three primary limitations. First, our method is constrained by a context window. It
cannot identify reply-to relations beyond the window’s boundaries, which inherently limits its ability
to capture long-range relations. Although expanding the window size could mitigate this issue, larger
window sizes increase prompt lengths, risking performance degradation and higher computational
costs. Second, the use of proprietary LLMs introduces significant financial and temporal costs. The
iterative, per-utterance assignment triggers many API calls, and large prompts increase token usage and
fees, making the approach slow and costly. Third, deploying open-source models presents challenges
regarding computational resources and processing time. Our experiments required a high-end GPU,
such as an NVIDIA RTX 6000 Ada, and consumed over 23 GB of VRAM. Furthermore, processing 2,500
utterances took approximately 24 h. These hardware and time requirements may limit the scalability of
our approach for larger datasets and may prove prohibitive for researchers without access to substantial
computational resources.</p>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgments</title>
      <p>This paper is based on results obtained from a project, JPNP24003, commissioned by the New Energy
and Industrial Technology Development Organization (NEDO). This research is also supported in part
by JSPS KAKENHI Grant Numbers JP24K15084 and JP23H00491.</p>
    </sec>
    <sec id="sec-10">
      <title>Declaration on Generative AI</title>
      <p>The authors used Gemini2.5-pro and GPT5 for assistance with linguistic and technical formatting tasks.
An initial draft was composed in our native language. We then utilized the specified generative AI to
translate this draft into English. The tool was also used for grammar correction, style refinement, and
assistance with formatting the LaTeX code. Following this process, the authors conducted a thorough
review and edited the entire manuscript. We take full responsibility for this paper.</p>
    </sec>
    <sec id="sec-references">
      <title>References</title>
      <p>[8] P. Ren, Z. Chen, Z. Ren, E. Kanoulas, C. Monz, M. De Rijke, Conversations with search engines:
Serp-based conversational response generation, ACM Transactions on Information Systems (TOIS)
39 (2021) 1–29.
[9] T. Li, J.-C. Gu, X. Zhu, Q. Liu, Z.-H. Ling, Z. Su, S. Wei, Dialbert: A hierarchical pre-trained model
for conversation disentanglement, arXiv preprint arXiv:2004.03760 (2020).
[10] X. Ma, Z. Zhang, H. Zhao, Structural characterization for dialogue disentanglement, arXiv preprint
arXiv:2110.08018 (2021).
[11] B. Li, H. Fei, F. Li, S. Wu, L. Liao, Y. Wei, T.-S. Chua, D. Ji, Revisiting conversation discourse for
dialogue disentanglement, ACM Transactions on Information Systems 43 (2025) 1–34.
[12] D. R. Traum, S. Robinson, J. Stephan, Evaluation of multi-party virtual reality dialogue interaction.,
in: LREC, volume 4, 2004, pp. 1699–1702.
[13] D. Shen, Q. Yang, J.-T. Sun, Z. Chen, Thread detection in dynamic text message streams, in:
Proceedings of the 29th annual international ACM SIGIR conference on Research and development
in information retrieval, 2006, pp. 35–42.
[14] J. K. Kummerfeld, S. R. Gouravajhala, J. Peper, V. Athreya, C. Gunasekara, J. Ganhotra, S. S. Patel,
L. Polymenakos, W. S. Lasecki, A large-scale corpus for conversation disentanglement, arXiv
preprint arXiv:1810.11118 (2018).
[15] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers
for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long
and Short Papers), Association for Computational Linguistics, 2019, pp. 4171–4186.
[16] H. Zhu, F. Nan, Z. Wang, R. Nallapati, B. Xiang, Who did they respond to? conversation structure
modeling using masked hierarchical transformer, in: Proceedings of the AAAI conference on
artificial intelligence, volume 34, 2020, pp. 9741–9748.
[17] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung,
C. Sutton, S. Gehrmann, et al., Palm: Scaling language modeling with pathways, Journal of
Machine Learning Research 24 (2023) 1–113.
[18] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam,
G. Sastry, A. Askell, S. Agarwal, A. Herbert‑Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh,
D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark,
C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few‑shot
learners, in: Advances in Neural Information Processing Systems, volume 33, Vancouver, Canada,
2020, pp. 1877–1901. ArXiv:2005.14165.
[19] M. Meilă, Comparing clusterings by the variation of information, in: Learning Theory and Kernel
Machines: 16th Annual Conference on Learning Theory and 7th Kernel Workshop, COLT/Kernel
2003, Washington, DC, USA, August 24-27, 2003. Proceedings, Springer, 2003, pp. 173–187.
[20] L. Hubert, P. Arabie, Comparing partitions, Journal of classification 2 (1985) 193–218.
[21] A. F. McDaid, D. Greene, N. Hurley, Normalized mutual information to evaluate overlapping
community finding algorithms, arXiv preprint arXiv:1110.2515 (2011).
[22] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov,
RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019).
[23] T. Yu, S. Joty, Online conversation disentanglement with pointer networks, arXiv preprint
arXiv:2010.11080 (2020).
[24] R. Zhu, J. H. Lau, J. Qi, Findings on conversation disentanglement, arXiv preprint arXiv:2112.05346
(2021).
[25] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih,
T. Rocktäschel, et al., Retrieval-augmented generation for knowledge-intensive NLP tasks, Advances
in neural information processing systems 33 (2020) 9459–9474.
[26] P. Chatterjee, K. Damevski, N. A. Kraft, L. Pollock, Software-related slack chats with disentangled
conversations, in: Proceedings of the 17th international conference on mining software repositories,
2020, pp. 588–592.</p>
    </sec>
    <sec id="sec-11">
      <title>A. Full Results of the Ablation Study</title>
      <p>The highest value for each model is underlined, and the highest value across all results is shown in bold.</p>
    </sec>
    <sec id="sec-12">
      <title>B. Method Details: Algorithms and Prompts</title>
      <sec id="sec-12-1">
        <title>B.1. Baseline</title>
        <p>Algorithm 2 Baseline (Utterance-Level-Assignment, No Subsequent Context)
1: Input:  = ( 1,  2, … ,   ), where  = ( ,   ,   ) with  being the timestam p, the speaker ID, and

2: Output: A partition of into dialogue s= { 1,  2, … ,   }
3: function GreedyDisentangle(, 
: instruction to return the most suitable existing utt eorran“ncew”</p>
        <p>: instruction to specify the output format (e.g., JSON)
▷ Assignment on persistent failure
# Instruction
You are given a multi-user chat with each line labeled with an index number, timestamp, speaker's name, and text message for an utterance. Your task is to identify to which previous utterance the target utterance is responding. Assign the target utterance to exactly one existing utterance from candidate utterance IDs, or determine it starts a new dialogue. Note that the utterance is more likely to be responding to a nearby one.
# Rule
The chat contains system messages that are fundamentally different from user-generated messages. The following rule should be followed when handling these messages.
- "speaker" is "system_message".
- "message" starts with "===".
Example:
{
"utterance_id": 1,
"timestamp": "2023-01-15T10:00:00.000000",
"speaker": "system_message",
"message": "=== jack33 [jack33@ca-29palms-cmts2d-189.losaca.adelphia.net] has entered #channel "
}
## How to assign
Most system messages have no response relationship to other utterances. Generally, you should set "is_new_dialogue" to true.
# Chat data
[
{
"utterance_id": 138,
"timestamp": "2016-12-19T21:21:00.000000",
"speaker": "worktoner",
"message": "Did the 'top' program get replaced from ubuntu 10 to 14?"
},
{
"utterance_id": 199,
"timestamp": "2016-12-19T21:34:00.000000",
"speaker": "nacc",
"message": "figure002: can you pastebin the command and output you are using/get?"</p>
        <p>}
]
## Target utterance
{
"utterance_id": 200,
"timestamp": "2016-12-19T21:35:00.000000",
"speaker": "froglok",
"message": "I installed apache2.. what other packages could I be missing?"
}
Output example:
{
"is_new_dialogue": false,
"utterance_id": "17",
"reason": "..."
}
21: function FormatPrompt(…)
B.2. DLA
Algorithm 3 DLA (Dialogue-Level Assignment)
1: Input: U = (u_1, u_2, …, u_n), where u_i = (t_i, s_i, m_i), with t_i being the timestamp, s_i the speaker ID, and m_i the message content; window size w
2: Output: A partition of U into dialogues D = {D_1, D_2, …, D_k}
3: function GreedyDisentangle(U, w)</p>
        <p>▷ Unique list of speakers participating in the dialogue
▷ Timestamp of last utterance in the dialogue
▷ Time gap from target to last utterance in the dialogue
: instruction to return the most suitable existing dialogue or "new"</p>
        <p>: instruction to specify the output format (e.g., JSON)
# Instruction
You are given a multi-user chat with each line labeled with an index number, timestamp, speaker's name, and text message for an utterance. Your task is to identify to which previous dialogue the target utterance is responding. Assign the target utterance to exactly one existing dialogue from candidate dialogue IDs, or determine it starts a new dialogue. Note that the utterance is more likely to be responding to a nearby one.
# Rule
The chat contains system messages that are fundamentally different from user-generated messages. The following rule should be followed when handling these messages.
- "speaker" is "system_message".
- "message" starts with "===".
Example:
{
"utterance_id": 1,
"timestamp": "2023-01-15T10:00:00.000000",
"speaker": "system_message",
"message": "=== jack33 [jack33@ca-29palms-cmts2d-189.losaca.adelphia.net] has entered #channel "
}
## How to assign
Most system messages have no response relationship to other utterances. Generally, you should set "is_new_dialogue" to true.
# Chat data
[
{
"utterance_id": 190,
"timestamp": "2016-12-19T21:34:00.000000",
"speaker": "Elementalist",
"message": "pavlos here?"
}
]
## Target utterance
{
"utterance_id": 200,
"timestamp": "2016-12-19T21:35:00.000000",
"speaker": "froglok",
"message": "I installed apache2.. what other packages could I be missing?"
}
# Confirm
You assign this target utterance.
{
"utterance_id": 200,
"timestamp": "2016-12-19T21:35:00.000000",
"speaker": "froglok",
"message": "I installed apache2.. what other packages could I be missing?"
}
# Output (JSON ONLY)
Constraints:
- "is_new_dialogue": boolean (true if the target is determined to be the start of a new dialogue, false if it is a continuation of an existing dialogue).
- "dialogue_id": If is_new_dialogue is true, set to null. If is_new_dialogue is false, set the ID of the selected candidate dialogue.
- "reason": Detailed explanation as to why that decision (start of a new dialogue/continuation of an existing dialogue) was made.</p>
        <p>Output example:
{ ... }
25: function FormatPrompt(…)
B.3. SC
Algorithm 4 SC (Utterance-Level Assignment with Subsequent Context)
1: Input: U = (u_1, u_2, …, u_n), where u_i = (t_i, s_i, m_i), with t_i being the timestamp, s_i the speaker ID, and m_i the message content; window size w
2: Output: A partition of U into dialogues D = {D_1, D_2, …, D_k}
3: function GreedyDisentangle(U, w)
: instruction to return the most suitable existing utterance or "new"</p>
        <p>: instruction to specify the output format (e.g., JSON)
# Instruction
You are given a multi-user chat with each line labeled with an index number, timestamp, speaker's name, and text message for an utterance. Your task is to identify to which previous utterance the target utterance is responding. Assign the target utterance to exactly one existing utterance from candidate utterance IDs, or determine it starts a new dialogue. Note that the utterance is more likely to be responding to a nearby one. Subsequent utterances are provided only as reference information - never select the id from Subsequent utterances.
# Rule
The chat contains system messages that are fundamentally different from user-generated messages. The following rule should be followed when handling these messages.
- "speaker" is "system_message".
- "message" starts with "===".</p>
        <p>Example:
## How to assign
Most system messages have no response relationship to other utterances. Generally, you
should set "is_new_dialogue" to true.
"utterance_id": 199,
"timestamp": "2016-12-19T21:34:00.000000",
"speaker": "nacc",
"message": "figure002: can you pastebin the command and output you are
using/get?"</p>
        <p>}
]
## Target utterance
{
"utterance_id": 200,
"timestamp": "2016-12-19T21:35:00.000000",
"speaker": "froglok",
"message": "I installed apache2.. what other packages could I be missing?"
}
# Confirm
You assign this target utterance.
{
"utterance_id": 200,
"timestamp": "2016-12-19T21:35:00.000000",
"speaker": "froglok",
"message": "I installed apache2.. what other packages could I be missing?"
}
# Output (JSON ONLY)
Constraints:
- "is_new_dialogue": boolean (true if the target is determined to be the start of a new dialogue, false if it is a continuation of an existing utterance).
- "utterance_id": If is_new_dialogue is true, set to null. If is_new_dialogue is false, set the ID of the selected candidate utterance.
- "reason": Detailed explanation as to why that decision (start of a new dialogue/continuation of an existing utterance) was made.</p>
        <p>Output example:
{
B.4. DLA+SC
The algorithm is shown in Algorit1h,mso we only show the prompt.</p>
        <p># Instruction
You are given a multi-user chat with each line labeled with an index number,
timestamp, speaker's name, and text message for an utterance. Your task is to identify
to which previous dialogue the target utterance is responding. Assign the target
utterance to exactly one existing dialogue from candidate dialogue IDs, or determine
it starts a new dialogue. Note that the utterance is more likely to be responding to a
nearby one. Subsequent utterances are provided only as reference information - never
select the id from Subsequent utterances.
# Rule
chat contains system messages that are fundamentally different from user-generated
messages. The following rule should be followed when handling these messages.
- "speaker" is "system_message".
- "message" starts with "===".</p>
        <p>Example:
{
## How to assign
Most system messages have no response relationship to other utterances. Generally, you
should set "is_new_dialogue" to true.
# Chat data
"utterance_id": 172,
"timestamp": "2016-12-19T21:27:00.000000",
"speaker": "worktoner",
"message": "Ahh I see they've changed around the commands."
"utterance_id": 199,
"timestamp": "2016-12-19T21:34:00.000000",
"speaker": "nacc",
"message": "figure002: can you pastebin the command and output you are
using/get?"</p>
        <p>}
]
## Target utterance
{
"utterance_id": 200,
"timestamp": "2016-12-19T21:35:00.000000",
"speaker": "froglok",
"message": "I installed apache2.. what other packages could I be missing?"
}
## Subsequent utterances
[</p>
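        <p>All four variants above drive the same greedy loop: walk the chat once, build a prompt from a window of prior context (candidate utterances or dialogues) plus, for the SC variants, a few subsequent utterances, and let the LLM either attach the target to an existing cluster or start a new one. The following is a minimal Python sketch of that loop under stated assumptions: the assign callback stands in for the actual LLM request and JSON parsing, and its name and signature are our own illustration, not the paper's implementation.</p>

```python
from typing import Callable

Utterance = dict  # keys: "utterance_id", "timestamp", "speaker", "message"

def greedy_disentangle(
    utterances: list[Utterance],
    window: int,
    n_subsequent: int,
    assign: Callable[[list[dict], Utterance, list[Utterance]], dict],
) -> list[list[int]]:
    """Greedy dialogue-level assignment (DLA) with subsequent context (SC).

    `assign` is a hypothetical callback wrapping the LLM call: it receives the
    candidate dialogues, the target utterance, and the subsequent utterances,
    and returns a dict shaped like the prompt's output constraints:
    {"is_new_dialogue": bool, "dialogue_id": int or None}.
    """
    dialogues: list[list[int]] = []  # each dialogue is a list of utterance indices
    for i, target in enumerate(utterances):
        lo = max(0, i - window)
        # Candidate dialogues: those with at least one utterance in the window.
        candidates = [
            {"dialogue_id": k, "utterances": [utterances[j] for j in idxs]}
            for k, idxs in enumerate(dialogues)
            if any(lo <= j < i for j in idxs)
        ]
        # Subsequent context is reference information only (never selectable).
        subsequent = utterances[i + 1 : i + 1 + n_subsequent]
        decision = assign(candidates, target, subsequent)
        if decision.get("is_new_dialogue") or decision.get("dialogue_id") is None:
            dialogues.append([i])  # target starts a new dialogue
        else:
            dialogues[decision["dialogue_id"]].append(i)
    return dialogues
```

        <p>With a deterministic stub in place of assign, the loop reduces to ordinary greedy clustering; the prompts above are the only LLM-specific part of the method.</p>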
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Elsner</surname>
          </string-name>
          , E. Charniak,
          <article-title>You talking to me? a corpus and algorithm for conversation disentanglement, in: Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics (ACL</article-title>
          ),
          <source>Association for Computational Linguistics</source>
          ,
          <year>2008</year>
          , pp.
          <fpage>834</fpage>
          -
          <lpage>842</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Elsner</surname>
          </string-name>
          , E. Charniak,
          <article-title>Disentangling chat</article-title>
          ,
          <source>in: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</source>
          ,
          <year>2010</year>
          , pp.
          <fpage>117</fpage>
          -
          <lpage>126</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ouyang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Dialogue state tracking with explicit slot connection modeling, in: Proceedings of the 58th annual meeting of the association for computational linguistics</article-title>
          ,
          <year>2020</year>
          , pp.
          <fpage>34</fpage>
          -
          <lpage>40</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , L. Liao,
          <string-name>
            <given-names>M.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          , T.-S. Chua,
          <article-title>Neural multimodal belief tracker with adaptive attention for dialogue systems</article-title>
          ,
          <source>in: The world wide web conference</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>2401</fpage>
          -
          <lpage>2412</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>X.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pu</surname>
          </string-name>
          ,
          <article-title>Memory graph with message rehearsal for multi‑turn dialogue generation</article-title>
          ,
          <source>in: Proceedings of the 31st ACM International Conference on Information &amp; Knowledge Management (CIKM)</source>
          ,
          <article-title>Association for Computing Machinery</article-title>
          , Atlanta, GA, USA,
          <year>2022</year>
          , pp.
          <fpage>108</fpage>
          -
          <lpage>117</lpage>
          . doi:10.1145/3511808.3557392, CC BY-SA 4.0.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Geng</surname>
          </string-name>
          , D. Jiang,
          <article-title>HeterMPC: A heterogeneous graph neural network for response generation in multi‑party conversations, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</article-title>
          ,
          <source>Association for Computational Linguistics</source>
          , Dublin, Ireland,
          <year>2022</year>
          , pp.
          <fpage>5086</fpage>
          -
          <lpage>5091</lpage>
          . doi:10.18653/v1/2022.acl-long.349.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Infusing multi-source knowledge with heterogeneous graph neural network for emotional conversation generation</article-title>
          ,
          <source>in: Proceedings of the AAAI conference on artificial intelligence</source>
          , volume
          <volume>35</volume>
          ,
          <year>2021</year>
          , pp.
          <fpage>13343</fpage>
          -
          <lpage>13352</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>