<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Rethinking Dialogue Disentanglement for LLMs via Dialogue-Level Assignment and Subsequent Context</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Naoki Takada</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tatsunori Mori</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Yokohama National University</institution>
          ,
          <addr-line>79-7 Tokiwadai, Hodogaya-ku, Yokohama, Kanagawa, 240-8501</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>In online chat platforms, multiple dialogues are often entangled in a complex manner. Disentangling them into coherent dialogues is essential for understanding conversations. Supervised models have long dominated this task. The application of large language models (LLMs) to this problem remains underexplored; moreover, preliminary studies have shown that their performance is substantially inferior to that of conventional non-LLM methods. This raises fundamental questions: Do LLMs inherently lack the capability for dialogue disentanglement, and what approaches are necessary to enhance their performance? We answer these questions by introducing two novel methods: dialogue-level assignment (DLA), which tasks LLMs with assigning an utterance to a dialogue, and subsequent context (SC), which provides subsequent context as auxiliary evidence. Experiments on the benchmark IRC dataset show that our method, equipped with DLA+SC, achieves new state-of-the-art results across all evaluation metrics. These results demonstrate that LLMs possess a strong capability for the studied task. Furthermore, an ablation study on open-source models investigates the effectiveness of DLA and SC, revealing that it is model-dependent. Nevertheless, it also demonstrates that both methods are key factors in improving performance. The findings of this study mark a paradigm shift for dialogue disentanglement, transitioning from conventional non-LLM approaches to applying LLMs.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The 10th Linguistic and Cognitive Approaches to Dialog Agents Workshop at the 40th AAAI conference, January 26, 2026, Singapore. https://haniwara.github.io/ (N. Takada); https://forest.ynu.ac.jp/mori (T. Mori). CEUR Workshop Proceedings, ISSN 1613-0073.</p>
      <p>In multi-party chat platforms such as Slack and Internet Relay Chat (IRC), multiple concurrent dialogues frequently become intertwined. This entanglement leads to a chaotic conversational flow. For human participants, following individual dialogues becomes challenging, imposing a substantial cognitive burden. For systems, these entangled dialogues introduce noise into the conversational context and complicate subsequent processing. To address this challenge, the concept of dialogue disentanglement has been proposed [1, 2]. Dialogue disentanglement aims to partition entangled utterances into coherent clusters connected through reply-to relations (Figure 1). This approach supports a wide range of downstream applications, including dialogue state tracking [3, 4] and response generation [5<xref ref-type="bibr" rid="ref6 ref7">, 6, 7, 8</xref>]. This task has predominantly been addressed using pairwise relation classification based on supervised machine learning [2, 9, 10]. In this approach, a model scores every candidate utterance pair within a conversation and makes a local binary decision: is one utterance a direct reply to the other? A global dialogue structure is then assembled from these local links. This approach operationalizes disentanglement by atomizing decisions for machine learning.</p>
      <p>Meanwhile, large language models (LLMs) offer strong contextual reasoning and adaptation, suggesting their potential to revolutionize dialogue disentanglement. However, initial attempts to apply LLMs to dialogue disentanglement resulted in substantially worse performance than conventional non-LLM methods [11]. This performance gap motivates a re-examination of the task's formulation for LLMs. We propose two methods that recast dialogue disentanglement as a problem suited to LLMs. First, we propose the dialogue-level assignment (DLA) method. Rather than identifying one parent utterance, thereby replicating traditional pairwise relation classification, we apply a method that identifies entire dialogues with reply-to relations. The LLM is provided with the dialogue clusters identified from previous assignments. It must then assign the target utterance to one of these existing dialogues or classify it as the start of a new one. Second, we introduce the subsequent context (SC) method, in which the model receives subsequent utterances as auxiliary evidence to inform the assignment.</p>
      <p>Our contributions are threefold. First, we focus explicitly on applying LLMs to dialogue disentanglement and develop methods that improve their disentanglement capability. Second, we show that our method, applied to LLMs, achieves state-of-the-art (SOTA) performance on the IRC benchmark. Third, we conduct an ablation study on the effectiveness of DLA and SC. The analysis shows that the optimal method is model-dependent, but establishes that DLA and SC are key components for improved accuracy.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Dialogue Disentanglement</title>
        <p>Dialogue disentanglement, also known as conversation disentanglement [1, 2], conversation management [12], or thread detection [13], is a crucial research topic for understanding entangled dialogues. Early approaches framed this task as a two-stage problem. First, a classifier assesses every potential utterance pair to determine whether one is a direct reply to the other. Second, based on this pairwise relation classification, a clustering algorithm assembles these pairs into a dialogue [1, 2]. These initial models relied heavily on handcrafted features, such as lexical overlap, time gaps, and explicit user mentions. The availability of large-scale annotated corpora, notably the Ubuntu IRC dataset [14], facilitated the development of data-driven, end-to-end neural models. A separate, subsequent advancement came with the adoption of fine-tuned pre-trained language models (PLMs), such as BERT [15], which substantially improved performance and set a new standard for these tasks [16, 9].</p>
        <p>However, these powerful encoders typically treated conversations as simple, unstructured utterance sequences and thus failed to leverage the structure of the discourse. This prompted a subsequent line of research that focused on explicitly integrating discourse structure to better capture conversational dynamics. Models started to incorporate static, dialogue-specific features, such as speaker identity and user-mention dependencies [10]. Several key refinements have been introduced in the current SOTA model [11], enriching the model with dynamic discourse information, such as time gaps and continuously updated reply chains. A hierarchical learning loss has also been integrated with an easy-first decoding algorithm to capture global conversational properties.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. LLMs for Dialogue Disentanglement</title>
        <p>The rise of LLMs has shifted natural language processing paradigms, as these models have shown strong proficiency across many applications [17]. Their power stems from their ability to adapt to new problems through prompts. Foundational paradigms such as in-context learning and few-shot prompting, in which the model learns from a handful of examples provided directly in context, enable LLMs to perform tasks without task-specific training or fine-tuning [18].</p>
        <p>Despite these powerful capabilities, the application of LLMs to dialogue disentanglement remains largely underexplored. To the best of our knowledge, the only notable investigation in this area is a preliminary experiment by Li et al. [11]. They framed the task as a zero-shot classification problem, asking an LLM to identify a parent utterance without examples. The task is performed through an approach similar to the pairwise relation classification method used in conventional dialogue disentanglement; that is, LLMs are asked to perform utterance-level assignments. This approach performed significantly worse than conventional non-LLM methods. Consequently, the potential of LLMs to succeed in dialogue disentanglement has not been fully assessed to date. In this study, we reframed the task by introducing two new formulations: DLA and SC. To our knowledge, this is the first study to place LLMs at the center of dialogue disentanglement—a task long dominated by conventional non-LLM approaches.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methods</title>
      <sec id="sec-3-1">
        <title>3.1. Task Definition</title>
        <p>To formally define the dialogue disentanglement task addressed in this study, we first clarify the terminology. We refer to the entire observed sequence of utterances, in which multiple dialogues are entangled, as a conversation C. The set D represents the dialogues partitioned from C, where each d_i in D is a separate semantic unit comprising a chain of response-related utterances sharing a specific topic or purpose. Accordingly, the task is formalized as follows. The input is a conversation C = (u_1, u_2, …, u_N) comprising utterances in chronological order. Each utterance u_i in C is a tuple u_i = (t_i, s_i, m_i), where t_i is the timestamp, s_i is the speaker ID, and m_i is the message content. The goal is to partition C into a set of mutually disjoint dialogues D = {d_1, d_2, …, d_K}. This partitioning must be a strict partition of C that satisfies the following two conditions: exhaustiveness (the union of all d_i in D equals C) and exclusiveness (for all i ≠ j, d_i ∩ d_j = ∅). Therefore, dialogue disentanglement is the problem of estimating the optimal set D that satisfies these conditions for a given C.</p>
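<p>The exhaustiveness and exclusiveness conditions can be checked mechanically. The following minimal sketch (the function name and data layout are ours, not the paper's) validates a candidate partition:</p>

```python
def is_valid_partition(conversation, dialogues):
    """Return True iff `dialogues` is a strict partition of `conversation`.

    conversation: iterable of utterance IDs in chronological order.
    dialogues: list of sets of utterance IDs, one set per dialogue d_i.
    """
    assigned = [u for d in dialogues for u in d]
    # Exhaustiveness: the union of all dialogues covers the conversation.
    exhaustive = set(assigned) == set(conversation)
    # Exclusiveness: no utterance belongs to two dialogues.
    exclusive = len(assigned) == len(set(assigned))
    return exhaustive and exclusive

conv = [1, 2, 3, 4, 5]
print(is_valid_partition(conv, [{1, 3}, {2, 4, 5}]))   # True: strict partition
print(is_valid_partition(conv, [{1, 3}, {2, 3, 4, 5}]))  # False: 3 appears in both
```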
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Dialogue Disentanglement Methods for LLMs</title>
        <sec id="sec-3-2-1">
          <title>3.2.1. Core Framework</title>
          <p>For the ablation study, we instantiated four methods. First, we describe the common core framework shared by all four. The core framework is an iterative greedy algorithm. The LLM processes each utterance chronologically, from u_2 to u_N. The local context for each target utterance u_i is constructed based on two fixed hyperparameters: the previous and subsequent window sizes, respectively. These values remain constant across all utterances within a single experimental run, rather than being dynamically adjusted. In this study, we selected the window sizes experimentally to evaluate the sensitivity of the model to context length. The LLM's task is to return a single assignment, indicating whether u_i belongs to an existing dialogue or initiates a new one. The system updates its state, the set of dialogues D, based on the LLM's decision. This process repeats until all utterances have been assigned, yielding the final partition. The prompts are simple, zero-shot prompts. They ask the model to determine whether the target utterance should connect to exactly one candidate or start a new dialogue, guided by a locality prior suggesting that replies are often temporally close. We did not conduct an ablation study of this locality hint; we used it to align with the prompt formulation of Li et al. [11], which ensures experimental consistency with their preliminary study. The LLM is constrained to output a JSON object containing three fields: is_new_dialogue (a boolean), an assignment ID, and reason (a textual explanation). The DLA and DLA+SC methods task the LLMs with dialogue-level assignment and therefore require a “dialogue_id” that links the utterance to an entire dialogue cluster. The Baseline and SC methods perform utterance-level assignment, which requires an “utterance_id” specifying a direct parent utterance. To handle malformed outputs, we implemented an error-handling and retry mechanism. If the generated JSON is invalid or its fields fail type or range validation, the query is re-sent with a corrective instruction appended to the prompt. We set a maximum of five retries for each assignment. Details of the prompts used in this study are listed in Appendix B.</p>
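<p>The core loop can be sketched as follows. This is a minimal illustration under our own assumptions, not the authors' implementation: ask_llm is a hypothetical stand-in for the real prompt/completion call, the IDs are zero-based by our choice, and the fallback when every retry fails is our assumption.</p>

```python
import json

MAX_RETRIES = 5  # the paper allows up to five retries per assignment

def parse_assignment(raw, num_dialogues):
    """Validate the model's JSON output; raise ValueError if malformed."""
    obj = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(obj.get("is_new_dialogue"), bool):
        raise ValueError("is_new_dialogue must be a boolean")
    if not obj["is_new_dialogue"]:
        did = obj.get("dialogue_id")
        if not isinstance(did, int) or not 0 <= did < num_dialogues:
            raise ValueError("dialogue_id failed range validation")
    return obj

def disentangle(utterances, ask_llm):
    """Iterative greedy loop: dialogue-level assignment in chronological order."""
    dialogues = [[utterances[0]]]  # the first utterance starts dialogue 0
    for target in utterances[1:]:
        correction = None
        obj = {"is_new_dialogue": True}  # fallback if all retries fail (our choice)
        for _ in range(1 + MAX_RETRIES):
            try:
                obj = parse_assignment(ask_llm(target, dialogues, correction),
                                       len(dialogues))
                break
            except ValueError:
                correction = "Return valid JSON with the required fields."
        if obj["is_new_dialogue"]:
            dialogues.append([target])                    # start a new dialogue
        else:
            dialogues[obj["dialogue_id"]].append(target)  # join an existing one
    return dialogues
```

In the full framework the prompt would also carry the windowed context and the locality hint; both are omitted here for brevity.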
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2.2. Framework Variations</title>
          <p>Next, we define the four framework variations used in our experiments, which are distinguished by how they provide the LLMs with utterance sequences. The Baseline method provides LLMs with the previous context as a simple chronological sequence of utterances. The LLM's task is to perform utterance-level assignment by identifying which specific previous utterance the target is replying to. This formulation closely mirrors the approach of Li et al. [11]. However, our implementation modifies it by including not only the utterances within the immediate context window but also all other utterances previously assigned to the same dialogues represented in that window. This change ensures consistent information availability across methods for a fair ablation. Further improvements are then applied on top of this Baseline method. The DLA method organizes the previous context into coherent dialogue clusters. This structure is built dynamically, reflecting the assignment decisions made at each prior step. The LLM's task is thus transformed into assigning the target utterance to one of these previously established dialogues or initiating a new one. The SC method uses the same utterance-level assignment but additionally provides subsequent utterances as auxiliary evidence. The DLA+SC method combines DLA and SC. A schematic workflow of the proposed method is shown in Figure 2. The complete DLA+SC definition is provided in Algorithm 1. The other framework variations are described in Appendix B.</p>
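<p>The difference between the flat (Baseline/SC) and clustered (DLA/DLA+SC) previous context can be illustrated with a small sketch. The helper below is hypothetical (the function name and data layout are ours); both modes include every utterance of any dialogue represented in the window, mirroring the modification described above:</p>

```python
def build_previous_context(history, assignments, window, dialogue_level):
    """Assemble the previous context for the next target utterance.

    history: already-processed utterances in chronological order.
    assignments: dialogue ID assigned to each utterance in `history`.
    window: previous-context window size (number of recent utterances).
    """
    start = max(0, len(history) - window)
    represented = set(assignments[start:])  # dialogues visible in the window
    if dialogue_level:
        # DLA / DLA+SC: group candidate utterances into dialogue clusters.
        clusters = {}
        for utt, did in zip(history, assignments):
            if did in represented:
                clusters.setdefault(did, []).append(utt)
        return clusters
    # Baseline / SC: flat chronological list of candidate parent utterances.
    return [u for u, d in zip(history, assignments) if d in represented]

msgs = ["hi all", "anyone tried hoary?", "hello!", "yes, works fine"]
dialogue_ids = [0, 1, 0, 1]
print(build_previous_context(msgs, dialogue_ids, window=2, dialogue_level=True))
# {0: ['hi all', 'hello!'], 1: ['anyone tried hoary?', 'yes, works fine']}
```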
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>4.2. Evaluation Metrics</title>
        <p>For direct comparison, we adopted the evaluation metrics from the current SOTA method, Li et al. [11]. These metrics evaluate dialogue disentanglement from three perspectives. First, we used the Variation of Information (VI) [19], Adjusted Rand Index (ARI) [20], and Normalized Mutual Information (NMI) [21] to measure the overall similarity between the gold standard data and the predicted utterance clusters. Second, we used Loc-3 [2], a local precision metric that indicates prediction accuracy over every window of three utterances. Third, we used One-to-One (1-1) [2], Shen-F1 (S-F) [13], and exact-matching Precision, Recall, and F1 score (P, R, and F1) [14] to show how many clusters match between the gold standard data and the predictions. Among these, the most stringent metrics are P, R, and F1, which measure the percentage of predicted dialogues that perfectly match the gold standard data.</p>
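<p>The exact-matching P, R, and F1 described above reduce to set operations over whole clusters. A minimal sketch (the function name and cluster representation are ours):</p>

```python
def exact_match_prf(gold, pred):
    """Exact-match Precision/Recall/F1 over whole dialogue clusters.

    gold, pred: lists of dialogues, each a set of utterance IDs.
    A predicted dialogue counts only if it equals a gold dialogue exactly.
    """
    gold_set = {frozenset(d) for d in gold}
    pred_set = {frozenset(d) for d in pred}
    matched = len(gold_set & pred_set)  # perfectly reproduced dialogues
    p = matched / len(pred_set) if pred_set else 0.0
    r = matched / len(gold_set) if gold_set else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# One of two predicted dialogues matches a gold dialogue exactly.
print(exact_match_prf([{1, 2}, {3, 4, 5}], [{1, 2}, {3, 4}]))  # (0.5, 0.5, 0.5)
```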
      </sec>
      <sec id="sec-3-4">
        <title>4.3. Comparison Methods</title>
        <p>
          We evaluated the proposed method against methods representing the major paradigms in dialogue disentanglement. These comparators include a seminal statistical model reliant on handcrafted features [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], a neural model that treats the task as a sequence labeling problem [14], and a transition-based model that formulates disentanglement as an online state-transition process [22]. We also benchmarked against a generative model that predicts reply-to links [23], methods that leverage PLMs for contextual understanding [9], and those incorporating discourse structure [10]. Our primary baseline was the current SOTA method, which integrates additional discourse structure information [11].
        </p>
      </sec>
      <sec id="sec-3-5">
        <title>4.4. Implementation Details</title>
        <p>We evaluated the methods using both proprietary and open-source LLMs. For the proprietary models, we utilized GPT4.1 and GPT4.1-mini via the Azure OpenAI Service (API version 2025-01-01-preview). We also employed Google's Gemini2.5-pro and Gemini2.5-flash from the stable June 2025 release. For all proprietary-model experiments, we set the temperature to 0.0 to minimize variance. However, strict determinism is not guaranteed; we therefore report results from a single trial. It is worth noting that rigorous reproducibility protocols for API-based models remain unestablished. The output token limit was set to 8,192. The previous context window was fixed at 70 utterances. This size was informed by a preliminary experiment in which we observed reply dependencies spanning up to 66 utterances. The subsequent window was set to 50 utterances. For the open-source models, we used the Ollama framework with qwen3-32b (Ollama tag qwen3:32b; library ID 030ee887880f), gemma3-27b (Ollama tag gemma3:27b; library ID a418f5838eaf), gpt-oss-20b (Ollama tag gpt-oss:20b; library ID 17052f91a42e), and gpt-oss-120b (Ollama tag gpt-oss:120b; library ID a951a23b46a1). Additionally, we fixed the previous context window at 40 utterances. This reduction was necessary because larger contexts frequently resulted in malformed JSON outputs. We then systematically evaluated subsequent windows of 5, 10, 15, and 20 utterances. All open-source models were configured with a maximum input context (num_ctx) of 16,384 and a maximum output token limit of 8,192. As with the proprietary models, the temperature was set to 0.0. For the gpt-oss-120b and gpt-oss-20b models specifically, we set the reasoning effort parameter to “high” in the system prompt.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Results</title>
      <sec id="sec-4-1">
        <title>5.1. Preliminary Experiments</title>
        <p>We compared our DLA+SC method against the preliminary experiments using DiHRL [11]. Although our Baseline method is conceptually related, it is not identical to the DiHRL method, as it incorporates the modifications discussed in Section 3.2.2. We performed this evaluation on a 500-utterance subset of the IRC development set (IDs: 2005-06-27_12, 2005-08-08_01). Table 1 presents the results, with the superior method for each model underlined and the overall best value shown in bold. As these results confirm the effectiveness of DLA+SC across all models, we selected this approach for the main experiments on the test set.</p>
      </sec>
      <sec id="sec-4-2">
        <title>5.2. Comparison with SOTA Methods</title>
      </sec>
      <sec id="sec-4-3">
        <title>5.3. Ablation Study</title>
        <p>We ablated DLA and SC on four open-source LLMs, using exact-match F1 as the primary metric. The results are summarized in Figure 3, where sub indicates the number of utterances following the target utterance provided as subsequent context. These results show that the optimal formulation is model-dependent; each model required a different method configuration to achieve its peak performance. Among all setups, qwen3-32b achieved the highest absolute F1, indicating that DLA and SC are effective components. Notably, its F1 score is comparable to that of the existing SOTA methods shown in Table 2, despite dataset differences. Full results for all experimental setups are provided in Appendix A.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Analysis and Discussion</title>
      <p>The effectiveness of subsequent context is model-dependent, as demonstrated by gemma3-27b. This model's accuracy consistently deteriorated as its subsequent context window was enlarged. We hypothesize that the model misinterpreted these future utterances as valid parent candidates, which complicated the assignment task. This hypothesis is supported by our ablation study results. We initially configured the open-source models to use the same context window size as the proprietary models, but were compelled to reduce this value owing to a high rate of hallucinations. With longer input contexts, the model made errors; it incorrectly assigned the target utterance to a subsequent one. For certain LLMs, subsequent context may therefore act as a distractor rather than as auxiliary evidence, impairing performance. By contrast, the three models other than gemma3-27b demonstrated improved accuracy with the inclusion of subsequent context. This suggests that providing a limited window of subsequent utterances can be effective for dialogue disentanglement. However, its effectiveness is not universal and depends on the model's ability to differentiate evidence from assignment candidates.</p>
      <p>We conducted a quantitative failure analysis on the results from all open-source models to identify conditions leading to assignment errors. We evaluated the average success rate by classifying each assignment decision as a binary success or failure based on dialogue cluster overlap. For a target utterance u_i, an assignment is considered successful only if the assigned dialogue cluster shares common utterances with the ground-truth cluster within the previous context (utterances u_j with j &lt; i). We investigated the correlation between the average success rate and three potential factors: the number of utterances in the prompt, the count of assignment candidates, and the positional distance of the reply-to relation. Specifically, we analyzed trends in the success rate relative to variations in these factors. This examination aimed to determine whether specific quantitative increases negatively impact performance. Our analysis revealed no significant correlation between these factors and assignment accuracy; for brevity, detailed results are omitted here. However, this analysis is limited to the window sizes used in this study and should not be interpreted as a general characteristic of LLMs' dialogue disentanglement. In future work, large-scale experiments with varying window sizes, including larger contexts, are required to rigorously verify LLMs' dialogue disentanglement ability.</p>
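<p>The binary success criterion used in this analysis can be stated compactly. The sketch below is our reading of the overlap rule (the function name is ours): an assignment succeeds iff the predicted and gold clusters share at least one utterance that precedes the target.</p>

```python
def assignment_success(assigned_cluster, gold_cluster, target_index):
    """Binary success test for a single assignment decision.

    Clusters are sets of utterance indices; only utterances strictly
    before the target (index < target_index) count as evidence.
    """
    prior_assigned = {u for u in assigned_cluster if u < target_index}
    prior_gold = {u for u in gold_cluster if u < target_index}
    return bool(prior_assigned & prior_gold)

print(assignment_success({2, 4, 10}, {4, 7, 10}, 10))  # True: overlap at 4
print(assignment_success({2, 10}, {4, 7, 10}, 10))     # False: no shared prior utterance
```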
      <p>Additionally, we conducted a qualitative error analysis specifically on the results from our best-performing open-source configuration: qwen3-32b with DLA+SC (sub = 5). This analysis identified
two primary factors contributing to errors. The first major failure mode involved the misclassification
of short, non-substantive utterances such as reactions (“WTF!”), backchannels (“it is”), or greetings
(“hi”). Correctly assigning these utterances requires identifying them as reactions within a dialogue the
speaker is already participating in. An examination of the model’s generated reasoning suggested that
it often failed to properly recognize speaker identity. The model appeared to prioritize the semantic
content of an utterance over speaker consistency, often defaulting to linking it with a temporally
proximate utterance regardless of the speaker. This behavior may be exacerbated by the locality hint
provided in our prompt, which instructs the model that temporally closer utterances are more likely
to belong to the same dialogue. Misclassified short reactions were often temporally distant from the
utterance they responded to. The model thus tended to make locally plausible but globally incorrect
assignments. This finding suggests that the inclusion of a locality hint may hinder performance. The
second type of common error was due to domain-specific terms, such as “hoary,” the codename for
Ubuntu 5.04. The model’s generated reasoning showed that it understood that the term is related to
Ubuntu but failed to correctly link it to the ongoing conversation. It frequently misinterpreted the
jargon as the start of a new topic, thereby incorrectly fragmenting a single, coherent dialogue. For
technical chats, retrieval-augmented generation could provide domain-specific knowledge and improve
accuracy [25].</p>
    </sec>
    <sec id="sec-6">
      <title>7. Conclusion</title>
      <p>In this work, we demonstrated that LLMs can achieve SOTA performance in dialogue disentanglement. Applying our proposed methods to proprietary LLMs yielded accuracy surpassing that of conventional non-LLM methods. We introduced two novel formulations to facilitate this achievement: DLA and SC. However, our ablation study with open-source models revealed that the effectiveness of DLA and SC is model-dependent. Although the optimal method varies by model, DLA and SC proved to be valuable components for improving LLM-based dialogue disentanglement. The findings of this study mark a paradigm shift for dialogue disentanglement, moving from conventional non-LLM approaches toward LLMs.</p>
    </sec>
    <sec id="sec-7">
      <title>8. Future Work</title>
      <p>There are three key avenues for future work. First, a more extensive analysis is necessary. Our ablation
study was confined to a limited selection of LLMs. Future work should evaluate a wider range of
open-source LLMs. Additionally, further investigation into context window sizes is necessary to identify
optimal configurations for different models. Second, this study relied on simple, zero-shot prompts
with a fixed locality hint. Consequently, comprehensive prompt optimization is required. Future
research should investigate advanced strategies such as few-shot and chain-of-thought prompting.
Furthermore, the impact of the locality hint must be rigorously ablated. Developing other appropriate
hints and refining instruction methods are also essential steps. Finally, the domain dependency must be
investigated. This research only focused on IRC data. It is crucial to assess the generalizability of our
approach to other conversational contexts. Previous work has shown that supervised models trained
on IRC data do not transfer well to Slack chat data without retraining [26].</p>
    </sec>
    <sec id="sec-8">
      <title>9. Limitations</title>
      <p>Our study has three primary limitations. First, our method is constrained by a context window. It
cannot identify reply-to relations beyond the window’s boundaries, which inherently limits its ability
to capture long-range relations. Although expanding the window size could mitigate this issue, larger
window sizes increase prompt lengths, risking performance degradation and higher computational
costs. Second, the use of proprietary LLMs introduces significant financial and temporal costs. The
iterative, per-utterance assignment triggers many API calls, and large prompts increase token usage and
fees, making the approach slow and costly. Third, deploying open-source models presents challenges
regarding computational resources and processing time. Our experiments required a high-end GPU,
such as an NVIDIA RTX 6000 Ada, and consumed over 23 GB of VRAM. Furthermore, processing 2,500
utterances took approximately 24 h. These hardware and time requirements may limit the scalability of
our approach for larger datasets and may prove prohibitive for researchers without access to substantial
computational resources.</p>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgments</title>
      <p>This paper is based on results obtained from a project, JPNP24003, commissioned by the New Energy
and Industrial Technology Development Organization (NEDO). This research is also supported in part
by JSPS KAKENHI Grant Numbers JP24K15084 and JP23H00491.</p>
    </sec>
    <sec id="sec-10">
      <title>Declaration on Generative AI</title>
      <p>The authors used Gemini2.5-pro and GPT5 for assistance with linguistic and technical formatting tasks.
An initial draft was composed in our native language. We then utilized the specified generative AI to
translate this draft into English. The tool was also used for grammar correction, style refinement, and
assistance with formatting the LaTeX code. Following this process, the authors conducted a thorough
review and edited the entire manuscript. We take full responsibility for this paper.</p>
    </sec>
    <sec id="sec-references">
      <title>References</title>
      <p>[8] P. Ren, Z. Chen, Z. Ren, E. Kanoulas, C. Monz, M. De Rijke, Conversations with search engines:
Serp-based conversational response generation, ACM Transactions on Information Systems (TOIS)
39 (2021) 1–29.
[9] T. Li, J.-C. Gu, X. Zhu, Q. Liu, Z.-H. Ling, Z. Su, S. Wei, Dialbert: A hierarchical pre-trained model
for conversation disentanglement, arXiv preprint arXiv:2004.03760 (2020).
[10] X. Ma, Z. Zhang, H. Zhao, Structural characterization for dialogue disentanglement, arXiv preprint
arXiv:2110.08018 (2021).
[11] B. Li, H. Fei, F. Li, S. Wu, L. Liao, Y. Wei, T.-S. Chua, D. Ji, Revisiting conversation discourse for
dialogue disentanglement, ACM Transactions on Information Systems 43 (2025) 1–34.
[12] D. R. Traum, S. Robinson, J. Stephan, Evaluation of multi-party virtual reality dialogue interaction.,
in: LREC, volume 4, 2004, pp. 1699–1702.
[13] D. Shen, Q. Yang, J.-T. Sun, Z. Chen, Thread detection in dynamic text message streams, in:
Proceedings of the 29th annual international ACM SIGIR conference on Research and development
in information retrieval, 2006, pp. 35–42.
[14] J. K. Kummerfeld, S. R. Gouravajhala, J. Peper, V. Athreya, C. Gunasekara, J. Ganhotra, S. S. Patel,
L. Polymenakos, W. S. Lasecki, A large-scale corpus for conversation disentanglement, arXiv
preprint arXiv:1810.11118 (2018).
[15] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers
for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long
and Short Papers), Association for Computational Linguistics, 2019, pp. 4171–4186.
[16] H. Zhu, F. Nan, Z. Wang, R. Nallapati, B. Xiang, Who did they respond to? conversation structure
modeling using masked hierarchical transformer, in: Proceedings of the AAAI conference on
artificial intelligence, volume 34, 2020, pp. 9741–9748.
[17] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung,
C. Sutton, S. Gehrmann, et al., Palm: Scaling language modeling with pathways, Journal of
Machine Learning Research 24 (2023) 1–113.
[18] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam,
G. Sastry, A. Askell, S. Agarwal, A. Herbert‑Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh,
D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark,
C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few‑shot
learners, in: Advances in Neural Information Processing Systems, volume 33, Vancouver, Canada,
2020, pp. 1877–1901. ArXiv:2005.14165.
[19] M. Meilă, Comparing clusterings by the variation of information, in: Learning Theory and Kernel
Machines: 16th Annual Conference on Learning Theory and 7th Kernel Workshop, COLT/Kernel
2003, Washington, DC, USA, August 24-27, 2003. Proceedings, Springer, 2003, pp. 173–187.
[20] L. Hubert, P. Arabie, Comparing partitions, Journal of classification 2 (1985) 193–218.
[21] A. F. McDaid, D. Greene, N. Hurley, Normalized mutual information to evaluate overlapping
community finding algorithms, arXiv preprint arXiv:1110.2515 (2011).
[22] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov,
RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019).
[23] T. Yu, S. Joty, Online conversation disentanglement with pointer networks, arXiv preprint
arXiv:2010.11080 (2020).
[24] R. Zhu, J. H. Lau, J. Qi, Findings on conversation disentanglement, arXiv preprint arXiv:2112.05346
(2021).
[25] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih,
T. Rocktäschel, et al., Retrieval-augmented generation for knowledge-intensive NLP tasks, Advances
in neural information processing systems 33 (2020) 9459–9474.
[26] P. Chatterjee, K. Damevski, N. A. Kraft, L. Pollock, Software-related slack chats with disentangled
conversations, in: Proceedings of the 17th international conference on mining software repositories,
2020, pp. 588–592.</p>
    </sec>
    <sec id="sec-11">
      <title>A. Full Results of the Ablation Study</title>
      <p>The highest value for each model is underlined, and the highest value across all results is shown in bold.</p>
    </sec>
    <sec id="sec-12">
      <title>B. Method Details: Algorithms and Prompts</title>
      <sec id="sec-12-1">
        <title>B.1. Baseline</title>
        <p>Algorithm 2 Baseline (Utterance-Level-Assignment, No Subsequent Context)
1: Input:  = ( 1,  2, … ,   ), where  = ( ,   ,   ) with  being the timestam p, the speaker ID, and

2: Output: A partition of into dialogue s= { 1,  2, … ,   }
3: function GreedyDisentangle(, 
: instruction to return the most suitable existing utt eorran“ncew”</p>
        <p>: instruction to specify the output format (e.g., JSON)
▷ Assignment on persistent failure
# Instruction
You are given a multi-user chat with each line labeled with an index number, timestamp, speaker's name, and text message for an utterance. Your task is to identify to which previous utterance the target utterance is responding. Assign the target utterance to exactly one existing utterance from candidate utterance IDs, or determine it starts a new dialogue. Note that the utterance is more likely to be responding to a nearby one.
# Rule
The chat contains system messages that are fundamentally different from user-generated messages. The following rule should be followed when handling these messages.
- "speaker" is "system_message".
- "message" starts with "===".
Example:
{
"utterance_id": 1,
"timestamp": "2023-01-15T10:00:00.000000",
"speaker": "system_message",
"message": "=== jack33 [jack33@ca-29palms-cmts2d-189.losaca.adelphia.net] has entered #channel "
}
## How to assign
Most system messages have no response relationship to other utterances. Generally, you should set "is_new_dialogue" to true.
# Chat data
[
{
"utterance_id": 138,
"timestamp": "2016-12-19T21:21:00.000000",
"speaker": "worktoner",
"message": "Did the 'top' program get replaced from ubuntu 10 to 14?"
},
{
"utterance_id": 199,
"timestamp": "2016-12-19T21:34:00.000000",
"speaker": "nacc",
"message": "figure002: can you pastebin the command and output you are using/get?"</p>
        <p>}
]
## Target utterance
{
"utterance_id": 200,
"timestamp": "2016-12-19T21:35:00.000000",
"speaker": "froglok",
"message": "I installed apache2.. what other packages could I be missing?"
}
Output example:
{
"is_new_dialogue": false,
"utterance_id": "17",
"reason": "..."
}
21: function FormatPrompt(…)
B.2. DLA
Algorithm 3 DLA (Dialogue-Level Assignment)
1: Input: U = (u_1, u_2, …, u_n), where u_i = (t_i, s_i, m_i), with t_i being the timestamp, s_i the speaker ID, and m_i the message content; window size w
2: Output: A partition of U into dialogues D = {D_1, D_2, …, D_k}
3: function GreedyDisentangle(U, w)</p>
        <p>▷ Unique list of speakers participating in the dialogue
▷ Timestamp of last utterance in the dialogue
▷ Time gap from target to last utterance in the dialogue
: instruction to return the most suitable existing dialogue or "new"</p>
        <p>: instruction to specify the output format (e.g., JSON)
# Instruction
You are given a multi-user chat with each line labeled with an index number, timestamp, speaker's name, and text message for an utterance. Your task is to identify to which previous dialogue the target utterance is responding. Assign the target utterance to exactly one existing dialogue from candidate dialogue IDs, or determine it starts a new dialogue. Note that the utterance is more likely to be responding to a nearby one.
# Rule
The chat contains system messages that are fundamentally different from user-generated messages. The following rule should be followed when handling these messages.
- "speaker" is "system_message".
- "message" starts with "===".
Example:
{
"utterance_id": 1,
"timestamp": "2023-01-15T10:00:00.000000",
"speaker": "system_message",
"message": "=== jack33 [jack33@ca-29palms-cmts2d-189.losaca.adelphia.net] has entered #channel "
}
## How to assign
Most system messages have no response relationship to other utterances. Generally, you should set "is_new_dialogue" to true.
# Chat data
[
{
"utterance_id": 190,
"timestamp": "2016-12-19T21:34:00.000000",
"speaker": "Elementalist",
"message": "pavlos here?"
}
]
## Target utterance
{
"utterance_id": 200,
"timestamp": "2016-12-19T21:35:00.000000",
"speaker": "froglok",
"message": "I installed apache2.. what other packages could I be missing?"
}
# Confirm
You assign this target utterance.
{
"utterance_id": 200,
"timestamp": "2016-12-19T21:35:00.000000",
"speaker": "froglok",
"message": "I installed apache2.. what other packages could I be missing?"
}
# Output (JSON ONLY)
Constraints:
- "is_new_dialogue": boolean (true if the target is determined to be the start of a new dialogue, false if it is a continuation of an existing dialogue).
- "dialogue_id": If is_new_dialogue is true, set to null. If is_new_dialogue is false, set the ID of the selected candidate dialogue.
- "reason": Detailed explanation as to why that decision (start of a new dialogue/continuation of an existing dialogue) was made.</p>
        <p>Output example:
{ ... }
25: function FormatPrompt(…)
B.3. SC
Algorithm 4 SC (Utterance-Level Assignment with Subsequent Context)
1: Input: U = (u_1, u_2, …, u_n), where u_i = (t_i, s_i, m_i), with t_i being the timestamp, s_i the speaker ID, and m_i the message content; window size w
2: Output: A partition of U into dialogues D = {D_1, D_2, …, D_k}
3: function GreedyDisentangle(U, w)
: instruction to return the most suitable existing utterance or "new"</p>
        <p>: instruction to specify the output format (e.g., JSON)
# Instruction
You are given a multi-user chat with each line labeled with an index number, timestamp, speaker's name, and text message for an utterance. Your task is to identify to which previous utterance the target utterance is responding. Assign the target utterance to exactly one existing utterance from candidate utterance IDs, or determine it starts a new dialogue. Note that the utterance is more likely to be responding to a nearby one. Subsequent utterances are provided only as reference information - never select the id from Subsequent utterances.
# Rule
The chat contains system messages that are fundamentally different from user-generated messages. The following rule should be followed when handling these messages.
- "speaker" is "system_message".
- "message" starts with "===".</p>
        <p>Example:
## How to assign
Most system messages have no response relationship to other utterances. Generally, you
should set "is_new_dialogue" to true.
"utterance_id": 199,
"timestamp": "2016-12-19T21:34:00.000000",
"speaker": "nacc",
"message": "figure002: can you pastebin the command and output you are
using/get?"</p>
        <p>}
]
## Target utterance
{
"utterance_id": 200,
"timestamp": "2016-12-19T21:35:00.000000",
"speaker": "froglok",
"message": "I installed apache2.. what other packages could I be missing?"
}
# Confirm
You assign this target utterance.
{
"utterance_id": 200,
"timestamp": "2016-12-19T21:35:00.000000",
"speaker": "froglok",
"message": "I installed apache2.. what other packages could I be missing?"
}
# Output (JSON ONLY)
Constraints:
- "is_new_dialogue": boolean (true if the target is determined to be the start of a new dialogue, false if it is a continuation of an existing utterance).
- "utterance_id": If is_new_dialogue is true, set to null. If is_new_dialogue is false, set the ID of the selected candidate utterance.
- "reason": Detailed explanation as to why that decision (start of a new dialogue/continuation of an existing utterance) was made.</p>
        <p>Output example:
{
B.4. DLA+SC
The algorithm is shown in Algorit1h,mso we only show the prompt.</p>
        <p># Instruction
You are given a multi-user chat with each line labeled with an index number,
timestamp, speaker's name, and text message for an utterance. Your task is to identify
to which previous dialogue the target utterance is responding. Assign the target
utterance to exactly one existing dialogue from candidate dialogue IDs, or determine
it starts a new dialogue. Note that the utterance is more likely to be responding to a
nearby one. Subsequent utterances are provided only as reference information - never
select the id from Subsequent utterances.
# Rule
chat contains system messages that are fundamentally different from user-generated
messages. The following rule should be followed when handling these messages.
- "speaker" is "system_message".
- "message" starts with "===".</p>
        <p>Example:
{
## How to assign
Most system messages have no response relationship to other utterances. Generally, you
should set "is_new_dialogue" to true.
# Chat data
"utterance_id": 172,
"timestamp": "2016-12-19T21:27:00.000000",
"speaker": "worktoner",
"message": "Ahh I see they've changed around the commands."
"utterance_id": 199,
"timestamp": "2016-12-19T21:34:00.000000",
"speaker": "nacc",
"message": "figure002: can you pastebin the command and output you are
using/get?"</p>
        <p>}
]
## Target utterance
{
"utterance_id": 200,
"timestamp": "2016-12-19T21:35:00.000000",
"speaker": "froglok",
"message": "I installed apache2.. what other packages could I be missing?"
}
## Subsequent utterances
[</p>
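        <p>All four variants above drive the same greedy loop: walk the chat once, build a prompt from a window of prior context (candidate utterances or dialogues) plus, for the SC variants, a few subsequent utterances, and let the LLM either attach the target to an existing cluster or start a new one. The following is a minimal Python sketch of that loop under stated assumptions: the assign callback stands in for the actual LLM request and JSON parsing, and its name and signature are our own illustration, not the paper's implementation.</p>

```python
from typing import Callable

Utterance = dict  # keys: "utterance_id", "timestamp", "speaker", "message"

def greedy_disentangle(
    utterances: list[Utterance],
    window: int,
    n_subsequent: int,
    assign: Callable[[list[dict], Utterance, list[Utterance]], dict],
) -> list[list[int]]:
    """Greedy dialogue-level assignment (DLA) with subsequent context (SC).

    `assign` is a hypothetical callback wrapping the LLM call: it receives the
    candidate dialogues, the target utterance, and the subsequent utterances,
    and returns a dict shaped like the prompt's output constraints:
    {"is_new_dialogue": bool, "dialogue_id": int or None}.
    """
    dialogues: list[list[int]] = []  # each dialogue is a list of utterance indices
    for i, target in enumerate(utterances):
        lo = max(0, i - window)
        # Candidate dialogues: those with at least one utterance in the window.
        candidates = [
            {"dialogue_id": k, "utterances": [utterances[j] for j in idxs]}
            for k, idxs in enumerate(dialogues)
            if any(lo <= j < i for j in idxs)
        ]
        # Subsequent context is reference information only (never selectable).
        subsequent = utterances[i + 1 : i + 1 + n_subsequent]
        decision = assign(candidates, target, subsequent)
        if decision.get("is_new_dialogue") or decision.get("dialogue_id") is None:
            dialogues.append([i])  # target starts a new dialogue
        else:
            dialogues[decision["dialogue_id"]].append(i)
    return dialogues
```

        <p>With a deterministic stub in place of assign, the loop reduces to ordinary greedy clustering; the prompts above are the only LLM-specific part of the method.</p>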
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Elsner</surname>
          </string-name>
          , E. Charniak,
          <article-title>You talking to me? a corpus and algorithm for conversation disentanglement, in: Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics (ACL</article-title>
          ),
          <source>Association for Computational Linguistics</source>
          ,
          <year>2008</year>
          , pp.
          <fpage>834</fpage>
          -
          <lpage>842</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Elsner</surname>
          </string-name>
          , E. Charniak,
          <article-title>Disentangling chat</article-title>
          ,
          <source>in: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</source>
          ,
          <year>2010</year>
          , pp.
          <fpage>117</fpage>
          -
          <lpage>126</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ouyang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Dialogue state tracking with explicit slot connection modeling, in: Proceedings of the 58th annual meeting of the association for computational linguistics</article-title>
          ,
          <year>2020</year>
          , pp.
          <fpage>34</fpage>
          -
          <lpage>40</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , L. Liao,
          <string-name>
            <given-names>M.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          , T.-S. Chua,
          <article-title>Neural multimodal belief tracker with adaptive attention for dialogue systems</article-title>
          ,
          <source>in: The world wide web conference</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>2401</fpage>
          -
          <lpage>2412</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>X.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pu</surname>
          </string-name>
          ,
          <article-title>Memory graph with message rehearsal for multi‑turn dialogue generation</article-title>
          ,
          <source>in: Proceedings of the 31st ACM International Conference on Information &amp; Knowledge Management (CIKM)</source>
          ,
          <article-title>Association for Computing Machinery</article-title>
          , Atlanta, GA, USA,
          <year>2022</year>
          , pp.
          <fpage>108</fpage>
          -
          <lpage>117</lpage>
          . doi:10.1145/3511808.3557392, CC BY-SA 4.0.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Geng</surname>
          </string-name>
          , D. Jiang,
          <article-title>HeterMPC: A heterogeneous graph neural network for response generation in multi‑party conversations, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</article-title>
          ,
          <source>Association for Computational Linguistics</source>
          , Dublin, Ireland,
          <year>2022</year>
          , pp.
          <fpage>5086</fpage>
          -
          <lpage>5091</lpage>
          . doi:10.18653/v1/2022.acl-long.349.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Infusing multi-source knowledge with heterogeneous graph neural network for emotional conversation generation</article-title>
          ,
          <source>in: Proceedings of the AAAI conference on artificial intelligence</source>
          , volume
          <volume>35</volume>
          ,
          <year>2021</year>
          , pp.
          <fpage>13343</fpage>
          -
          <lpage>13352</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>