<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evaluating Linguistic Speaker Profiles on Response Selection in Multi-Party Dialogue</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Maryam Sajedinia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Seyed Mahed Mousavi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Valerio Basile</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Modeling &amp; Simulation of Techno-Social Systems</institution>
          ,
          <addr-line>Fondazione Bruno Kessler</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Signals &amp; Interactive Systems Lab, University of Trento</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Turin</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>We investigate whether incorporating linguistically derived speaker profiles improves the response selection capabilities of instruction-tuned large language models (LLMs) in multi-party dialogues. Using the Wikipedia Talk Page dataset, we construct lightweight profiles for each speaker based on features extracted from their prior messages, including frequent nouns and verbs, and sentiment tendency. These profiles are incorporated into the input prompts and evaluated using in-context learning with LLaMA 3.2 Instruct (1B and 8B) and GPT-4o, without any model fine-tuning. We compare performance across models and prompt settings, with and without speaker profiles, and analyze the effect of different profile configurations. Results are compared against a Random baseline and a supervised Siamese RNN (with GRU units) trained on the same data. Our results show that incorporating speaker profiles improves response selection performance across most LLM settings, with the strongest gains observed in larger models such as LLaMA 3.2 (8B). Lexical features (frequent nouns and verbs) yield greater improvements than sentiment information, particularly in low-context or underspecified scenarios. However, profile effectiveness varies by model scale and prompt format, and provides limited benefit in cases where distractors are lexically and semantically similar to the ground-truth response.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Model</kwd>
        <kwd>Multiparty Dialogue</kwd>
        <kwd>User Profile</kwd>
        <kwd>Response Selection</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>… prompt format, and profile composition? We evaluate this approach using LLaMA 3.2 Instruct (1B and 8B) and GPT-4o, comparing performance with and without speaker profiles under zero-shot and one-shot prompting. To contextualize results, we include two baselines: a random ranking strategy and a supervised Siamese RNN with GRU units trained on the same dataset. All models are tested on a standardized response selection task using MPDs from the Wikipedia Talk Page dataset.</p>
      <p>Our goal is not to build a personalized dialogue system, but to assess whether minimal linguistic speaker information can influence LLM behavior in a selection setting. We do not assume access to long-term user history or stable user identities, and we make no changes to model parameters. Instead, we treat speaker profiling as a lightweight, model-agnostic addition to the input prompt. This setup allows us to isolate the effect of speaker-level information on model performance and to compare its impact across multiple instruction-tuned LLMs.</p>
      <p>Our results show that speaker profiles can enhance response selection performance, particularly for larger models and in low-context scenarios. The most consistent gains are observed with lexical profiles (frequent nouns and verbs), while sentiment information yields marginal or mixed improvements. However, model scale and prompt format (e.g. 0-shot and 1-shot) significantly mediate the effectiveness of speaker profiles. Our contributions can be summarized as follows:
• We introduce a prompt-based method for incorporating lightweight, linguistically derived speaker profiles into LLM-based response selection for multi-party dialogue1.
• We conduct a systematic evaluation across model scales (1B, 8B, GPT-4o), prompt formats (zero-shot, one-shot), and profile configurations (lexical, lexical+sentiment).
• We present detailed analysis highlighting when and how speaker profiles help, supported by both aggregate performance and error case breakdowns.</p>
      <p>1 The code and implementation details will be published in our repository.</p>
      <p>[Table 1 residue: detected topic categories — Business &amp; Entrepreneurs; Celebrity &amp; Pop Culture; Diaries &amp; Daily Life; Arts &amp; Culture; Learning &amp; Educational; Science &amp; Technology; News &amp; Social Concern; Relationships; Technology. Numeric distribution not recovered.]</p>
    </sec>
    <sec id="sec-rw">
      <title>2. Related Work</title>
      <p>Recent work on MPD has explored a range of strategies for modeling speaker identity, roles, and interaction structure. Mahajan and Shaikh [<xref ref-type="bibr" rid="ref12">12</xref>] introduce a graph-based transformer that incorporates speaker and addressee personas as structured metadata, using crowdsourced profiles to condition response generation. Similarly, Ju et al. [<xref ref-type="bibr" rid="ref4">4</xref>] build a graph representation of utterances and speaker personas to guide generation through a hierarchical encoder and structured aggregation. These methods emphasize user profile incorporation but assume access to annotated profiles and require complex modeling. Sun et al. [<xref ref-type="bibr" rid="ref13">13</xref>] use contrastive learning to model speaker-specific discourse patterns without explicit profiles, learning latent speaker distinctions optimized for generation tasks. Penzo et al. [<xref ref-type="bibr" rid="ref5">5</xref>] take a diagnostic approach, analyzing how conversation structure affects performance in response selection and addressee recognition. They show that LLMs rely heavily on surface content for response selection and are sensitive to prompt formulation and structural variation. Finally, Hu et al. [<xref ref-type="bibr" rid="ref9">9</xref>] propose a role-aware modeling framework that combines role-context pretraining with decoding constraints to favor role-consistent outputs. While effective across multiple MPD tasks, the approach depends on predefined role labels and supervised training. Collectively, these studies highlight the importance of speaker- and role-level information in MPD, though most rely on supervised learning, structured annotations, or architectural specialization.</p>
    </sec>
    <sec id="sec-2">
      <title>3. Experiments</title>
      <p>We evaluate the effect of incorporating linguistic speaker profiles on response selection in MPDs using a set of instruction-tuned LLMs and baseline models. All experiments are conducted on the same unseen test set, using consistent prompt formatting and evaluation metrics. Below, we describe the models, data, and profile features used in our setup.</p>
      <sec id="sec-2-1">
        <title>3.1. Dataset</title>
        <p>We use the Wikipedia Talk Page Conversations dataset [<xref ref-type="bibr" rid="ref14">14</xref>], which contains 124,957 multi-party dialogues involving 38,462 unique users and a total of 4,023,376 tokens with a vocabulary size of 108,416. The user activity in the dataset is not balanced, i.e. the top 10 most active users account for over 12% of all turns in the dataset.</p>
        <p>To model multi-party interactions, each conversation is represented as a tree, with the root corresponding to the initial post and branches representing reply chains. For each reply path, we extract a linear dialogue history leading to a candidate response. Each instance is framed as a response selection task with one ground-truth response and nine distractors drawn from the same structural depth within other conversations.</p>
        <p>We segment this subset into three partitions: a held-out test set of 2,500 previously unseen dialogues shared across all models; a training set of 206,633 samples used to train the Siamese RNN and construct one-shot prompts; and a development set of 25,830 samples used only for tuning the supervised model's architecture and hyperparameters. To better understand the conversational domain of the dialogues, we applied topic classification using GPT-4o, following the categorization and methodology of Antypas et al. [<xref ref-type="bibr" rid="ref15">15</xref>] (50 samples were randomly selected and manually controlled to ensure prediction validity). Table 1 presents the distribution of detected topics across the dialogues involving these ten users, covering a broad range of domains including business, popular culture, education, and technology.</p>
        <p>3.2. Models</p>
        <p>We evaluate three types of models:
• Random Baseline generates a uniformly ranked list of candidate responses for each input context. This serves as a lower-bound reference point and helps contextualize performance in the absence of data-driven inference.
• Siamese RNN is a supervised neural baseline using two GRU encoders with shared weights to compute the similarity between a dialogue context and a candidate response. The model outputs a matching score based on pairwise similarity and is trained using labeled context-response pairs with cross-entropy loss. Each GRU encoder uses the following hyperparameters: MAX_LENGTH = 300, input_size = 100, hidden_size = 300, num_layers = 2, dropout = 0, and bidirectional = True. The model is trained for 10 epochs with a batch_size = 128 and a learning rate of 0.0001.
• Instruction-Tuned LLMs include LLaMA 3.2 Instruct (1B and 8B) and GPT-4o, representing two families of recent state-of-the-art LLMs. LLaMA 3.2 Instruct is a publicly available model family released by Meta, trained on a diverse multilingual corpus and further instruction-tuned to follow natural language prompts. We include both the 1B and 8B variants to examine the effect of model scale on profile sensitivity. GPT-4o is a proprietary model released by OpenAI, optimized for multimodal interaction and known for strong instruction-following capabilities in both zero-shot and few-shot settings. All models are used via API in inference-only mode without any additional fine-tuning. Inputs are provided as structured natural language prompts, including a system instruction, dialogue history, and a list of candidate responses. When speaker profiles are used, they are appended to the input as plain-text feature descriptions associated with the target speaker. We experiment with both zero-shot prompting (task description only) and one-shot prompting (including a single example of the desired input-output format). Inference is run with a temperature of 0.2, top-p of 1.0, and a 50-token output limit, and predictions are parsed to compute Recall@1/2/5.</p>
        <p>Evaluation Metric We evaluate model performance using Recall@k, a standard metric for response selection tasks. For each dialogue instance, the model ranks a set of ten candidate responses, consisting of the ground-truth response and nine distractors sampled from the same depth level in the conversation tree. Recall@k measures the proportion of instances where the correct response appears in the top k predictions. We report Recall@1, Recall@2, and Recall@5 to assess performance at different levels of ranking sensitivity.</p>
        <p>Prompt Design We structure prompts for the response selection task using a consistent template for ranking responses based on the context. Each prompt comprises three components: (i) a task instruction explaining the ranking objective and expected output format, (ii) a content section containing the dialogue history and 10 candidate responses, and (iii) an optional speaker profile, appended when profiling is enabled.</p>
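        <p>The parsing-and-scoring step described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' released code; the helper names are our own.</p>

```python
from typing import List

def parse_ranking(output: str) -> List[int]:
    """Parse model output lines like '1. 3' into a ranked list of candidate indexes."""
    ranking = []
    for line in output.strip().splitlines():
        parts = line.split(".")
        if len(parts) == 2 and parts[1].strip().isdigit():
            ranking.append(int(parts[1]))
    return ranking

def recall_at_k(rankings: List[List[int]], gold: List[int], k: int) -> float:
    """Fraction of instances whose ground-truth index appears in the top k."""
    hits = sum(1 for ranked, g in zip(rankings, gold) if g in ranked[:k])
    return hits / len(gold)

# One instance: the model ranks candidate 3 first; the ground truth is 3.
ranking = parse_ranking("1. 3\n2. 4\n3. 1")
print(ranking)                         # [3, 4, 1]
print(recall_at_k([ranking], [3], 1))  # 1.0
print(recall_at_k([ranking], [1], 2))  # 0.0
```

        <p>Recall@2 and Recall@5 follow by changing k.</p>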
        <sec id="sec-2-1-1">
          <title>System Prompt (abbreviated)</title>
          <p>&lt;|begin_of_text|&gt;
You will be given:
- A conversation transcript with numbered turns
- 10 candidate responses
- A user profile containing the most frequent nouns and
verbs used by the next speaker
Your task:</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>Rank the candidate response indexes from best to</title>
          <p>worst based on how well they continue the conversation
and match the speaker profile.</p>
        </sec>
        <sec id="sec-2-1-3">
          <title>Example output format:</title>
          <p>1. 3
2. 4
...</p>
          <p>Do NOT provide an explanation but the list of numbers.
&lt;|eot_id|&gt;</p>
        </sec>
        <sec id="sec-2-1-4">
          <title>User Prompt (example structure)</title>
          <p>&lt;CONVERSATION&gt;
Turn 1: Hi, how are you?
Turn 2: I’m doing well, thanks. You?
...
&lt;/CONVERSATION&gt;
&lt;Responses&gt;
1. I’m glad to hear that!
2. What’s new with you?
...
&lt;/Responses&gt;
&lt;User Profile&gt;
thank, update, read, discuss, feel, ...
&lt;/User Profile&gt;
&lt;|eot_id|&gt;</p>
          <p>We construct speaker profiles for each user in the dataset, using linguistic features extracted from their prior messages. Each profile is fixed per speaker and remains constant across all dialogue instances in which the user appears. We create a lexical profile consisting of the 10 most frequent nouns and the 10 most frequent verbs used by the speaker, extracted using the spaCy dependency parser. These tokens reflect habitual vocabulary choices and serve as coarse indicators of speaker identity and discourse tendencies. This profile is then augmented with a coarse-grained sentiment distribution. Each message authored by the speaker is classified as positive, neutral, or negative using GPT-4o, following prior work [<xref ref-type="bibr" rid="ref16">16</xref>], and the resulting counts are normalized to produce a speaker-level sentiment distribution (predictions were manually verified for 50 randomly sampled messages to ensure classifier quality). Profiles are incorporated into the prompt and are explicitly associated with the speaker expected to produce the next turn. This design allows instruction-tuned LLMs to condition their ranking decisions on user-specific linguistic traits without requiring model fine-tuning or structural modifications.</p>
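          <p>The aggregation into a fixed per-speaker profile can be sketched as follows. The paper extracts nouns and verbs with the spaCy dependency parser and classifies sentiment with GPT-4o; this stdlib-only sketch assumes those steps have already produced (lemma, POS) pairs and per-message sentiment labels, and the function name is our own.</p>

```python
from collections import Counter
from typing import Dict, List, Tuple

def build_profile(tagged_tokens: List[Tuple[str, str]],
                  sentiments: List[str],
                  top_n: int = 10) -> Dict[str, object]:
    """Aggregate (lemma, POS) pairs and message-level sentiment labels
    into a lexical + sentiment speaker profile."""
    nouns = Counter(t for t, pos in tagged_tokens if pos == "NOUN")
    verbs = Counter(t for t, pos in tagged_tokens if pos == "VERB")
    sent_counts = Counter(sentiments)
    total = sum(sent_counts.values()) or 1
    return {
        "frequent_nouns": [w for w, _ in nouns.most_common(top_n)],
        "frequent_verbs": [w for w, _ in verbs.most_common(top_n)],
        # normalized sentiment distribution over the speaker's messages
        "sentiment": {label: sent_counts[label] / total
                      for label in ("positive", "neutral", "negative")},
    }

tokens = [("update", "VERB"), ("page", "NOUN"), ("thank", "VERB"),
          ("update", "VERB"), ("source", "NOUN"), ("page", "NOUN")]
profile = build_profile(tokens, ["neutral", "neutral", "positive", "negative"])
print(profile["frequent_verbs"][0])     # update
print(profile["sentiment"]["neutral"])  # 0.5
```

          <p>In the actual pipeline, the (lemma, POS) pairs would come from spaCy and the sentiment labels from the GPT-4o classifier described above.</p>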
          <p>Figure 2 presents the overall sentiment distribution across the turns in the dataset. The majority are neutral (43%), followed by negative (37%) and positive (20%), indicating a generally balanced emotional tone. Figures 1 and 3 show heatmaps of the top 10 most frequent verbs and nouns, respectively, for the 10 most frequent users. Each heatmap reveals strong user-specific vocabulary patterns: the most frequent items for a given user tend to be rarely used by others. This lexical asymmetry suggests that even simple word-level statistics can encode informative signals about speaker identity. As a result, lexical profiles may help disambiguate responses in MPD by aligning candidate utterances with user-specific vocabulary preferences.</p>
          <p>The speaker profile provides the most frequent nouns and verbs used by the next speaker, i.e. the user who is expected to respond, extracted from their prior messages. The full prompt is framed in natural language and formatted using system and user tags. The model is explicitly instructed to return a ranked list of response indices without any explanation or commentary. In one-shot settings, we prepend a demonstration example showing the exact input-output structure. The speaker profile, when present, is enclosed in a &lt;User Profile&gt; section and labeled accordingly. This design follows the practices for LLM prompting in prior work [<xref ref-type="bibr" rid="ref16">16</xref>]. We provide the prompt template in Table 2.</p>
          <p>4. Evaluation</p>
          <p>We evaluate the effect of incorporating linguistic speaker profiles on the response selection performance of instruction-tuned LLMs in MPDs. Our analysis compares three models, GPT-4o, LLaMA 3.2 Instruct (1B and 8B), and a Siamese RNN baseline, under both zero-shot and one-shot prompting conditions. We assess each model’s performance with and without speaker profile information, using two profile configurations: frequent nouns and verbs, and the addition of sentiment tendency.</p>
          <p>Baseline Behavior The Siamese RNN performs moderately well in the profile-free condition, achieving 31% Recall@1. However, its performance declines when profiles are added. This suggests that the architecture may not effectively integrate linguistic profile information, or that the additional features introduce noise in the learned similarity space.</p>
        </sec>
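        <p>For illustration, the user message above can be assembled programmatically; the following is a minimal sketch with a hypothetical helper, and the section tags follow the example structure shown above (the paper’s exact template wording is abbreviated):</p>

```python
from typing import List, Optional

def build_user_prompt(turns: List[str],
                      candidates: List[str],
                      profile_terms: Optional[List[str]] = None) -> str:
    """Assemble the user message: conversation, candidate responses,
    and an optional <User Profile> section."""
    lines = ["<CONVERSATION>"]
    lines += [f"Turn {i}: {t}" for i, t in enumerate(turns, start=1)]
    lines.append("</CONVERSATION>")
    lines.append("<Responses>")
    lines += [f"{i}. {c}" for i, c in enumerate(candidates, start=1)]
    lines.append("</Responses>")
    if profile_terms:  # speaker profile is optional (profiling on/off)
        lines.append("<User Profile>")
        lines.append(", ".join(profile_terms))
        lines.append("</User Profile>")
    return "\n".join(lines)

prompt = build_user_prompt(
    ["Hi, how are you?", "I'm doing well, thanks. You?"],
    ["I'm glad to hear that!", "What's new with you?"],
    profile_terms=["thank", "update", "read"],
)
print(prompt.splitlines()[0])  # <CONVERSATION>
```

        <p>The system prompt and, in one-shot settings, the demonstration example would be prepended analogously.</p>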
        <sec id="sec-2-1-5">
          <title>Results</title>
          <p>[Table 3 residue: Recall scores for Random, Siamese-RNN, Llama 3.2 1B, Llama 3.2 8B, and GPT-4o, each evaluated without a user profile, with frequent nouns &amp; verbs, and with added sentiment; the numeric cells were not recovered.]</p>
          <p>The random baseline performs as expected, confirming that all models operate well above chance.</p>
          <p>LLM Performance Table 3 presents the performance scores across models, prompt settings, and speaker profile configurations. GPT-4o achieves the highest performance in all conditions, with Recall@1 reaching 62% under one-shot prompting with profile information. LLaMA 3.2 Instruct (8B) performs substantially better than its 1B variant, particularly in the zero-shot setting, where the addition of speaker profiles yields the largest relative improvements.</p>
          <p>Speaker Profiles Incorporating speaker profiles leads to consistent gains across most LLM configurations. For LLaMA 3.2 Instruct (8B), the inclusion of frequent nouns and verbs improves Recall@1 from 20% to 40% in the zero-shot setting. However, sentiment augmentation does not produce additional gains and, in some cases, slightly degrades performance. In contrast, the smaller LLaMA model (1B) shows minimal sensitivity to profile input, suggesting that profile utility may depend on model size. Meanwhile, GPT-4o demonstrates strong baseline performance without profiles, but still benefits from profile inclusion. The highest Recall@1 for GPT-4o is 62% with both lexical and sentiment features in the one-shot setting. These improvements, though smaller in magnitude than for LLaMA 8B, indicate that even high-performing models can leverage cost-effective linguistic speaker information.</p>
          <p>Prompt Structure Prompting style has a non-uniform impact on models’ performance. For LLaMA 3.2 Instruct (8B), zero-shot prompting outperforms one-shot in several configurations, particularly when profiles are included. In contrast, GPT-4o benefits more consistently from one-shot prompting, though the margin is small. These results highlight interactions between model scale, prompt format, and profile effectiveness.</p>
        </sec>
        <sec id="sec-2-1-6">
          <title>4.1. Error Analysis</title>
          <p>To better understand the limitations and strengths of speaker profiles, we manually analyzed several subsets of the test set. In our analysis, we define a misclassified instance as one in which the ground-truth (GT) response does not appear among the top five ranked candidates (i.e., not within Recall@5), and a correct instance as one where the GT response is ranked first (i.e., Recall@1).</p>
          <p>Out of 2,500 total instances, 1,500 cases were consistently misclassified by all models across all conditions. In these cases, the distractors were often semantically and lexically similar to the GT responses, making the ranking task inherently difficult. Moreover, frequent nouns and verbs extracted for profile construction were typically generic (e.g., “thanks,” “help,” “response”), and occurred in both GTs and distractors, limiting their discriminative value. In such cases, the profile provided little to no additional context to support accurate disambiguation.</p>
          <p>In contrast, 611 instances were correctly classified by all models across all settings. Here, the GT responses were clearly more contextually grounded and lexically aligned with the dialogue history, and the distractors were often generic acknowledgements (e.g., “thanks,” “okay”) or off-topic continuations. The linguistic profiles were more distinctive in these examples and appeared to support the model’s ability to prioritize the correct response.</p>
          <p>Finally, in 77 cases, all models failed without speaker profiles but all correctly selected the GT response once profile information was added. These instances were typically characterized by minimal dialogue history (one-turn inputs), where contextual grounding was insufficient for accurate prediction. The added speaker profile appeared to serve as auxiliary context that supported correct ranking in these otherwise under-specified dialogues. Conversely, there were 2 cases in which the inclusion of sentiment in the profile led to improved predictions in all models. These examples featured strong affective alignment between the dialogue history and the GT response, while the distractors were neutral and short, allowing the model to benefit from the added sentiment context.</p>
          <p>Interestingly, in 12 cases the models ranked the correct response at R@1 without speaker profiles, but failed to do so when profiles were added. In these cases, the sentiment distribution was nearly uniform across responses, providing no additional signal. Furthermore, the distractors were uniformly generic, with some including non-English text or irrelevant long-form content. Thus, the profile content introduced noise rather than useful contrast, confusing the model.</p>
          <p>Overall, speaker profiles provide the most benefit when dialogue context is minimal or generic, but lose effectiveness when distractors are lexically similar or the profiles themselves are noisy.</p>
        </sec>
        <sec id="sec-2-1-7">
          <title>5. Conclusion</title>
          <p>We investigate whether linguistically derived speaker profiles can improve the response selection capabilities of instruction-tuned LLMs in multi-party dialogue. We constructed user profiles based on frequent nouns, verbs, and sentiment tendencies from prior utterances, and incorporated them into prompts without any model fine-tuning. Our experiments with LLaMA 3.2 and GPT-4o show that lexical speaker profiles improve performance in nearly all LLM settings, especially for larger models and in zero-shot conditions. This supports RQ1, demonstrating that even lightweight user information can help response selection in MPD. In addressing RQ2, we find that model scale and prompt design play a crucial role in how effectively speaker profiles are used. Larger models benefit more from profile information, suggesting that they can better leverage user context. However, the sentiment features show mixed results, in some cases adding noise rather than clarity. We also observe that profiles are particularly useful in low-context situations, but their impact diminishes when distractors are semantically close or when the profiles themselves lack specificity.</p>
          <p>In future work, we plan to explore richer profile representations, investigate cross-domain generalizability, and test the applicability of this approach in real-time or streaming dialogue systems. We also see potential in extending our method to multilingual MPD and combining profile signals with structural or discourse-level features.</p>
          <p>Limitations</p>
          <p>This study relies exclusively on in-context learning and does not involve any fine-tuning of the evaluated models. While this makes our approach lightweight and accessible, it also constrains the models’ ability to adapt more deeply to user-specific behaviors. Due to computational constraints, we did not experiment with larger LLMs beyond LLaMA 3.2 (8B) and GPT-4o, and were unable to explore open-weight models at scales requiring GPU access. Our data is limited to English Wikipedia Talk Pages, which restricts the generalizability of our findings to multilingual or informal conversational domains. Additionally, speaker profiles are based on automatic extraction of lexical and sentiment features, which may introduce noise or inaccuracies that affect profile quality. Finally, we focus exclusively on response selection and did not experiment with response generation. While this choice enables robust and reproducible automatic evaluation, it leaves open the question of how linguistic speaker profiles might affect the quality of generated responses in more open-ended dialogue settings.</p>
          <p>Declaration on Generative AI</p>
          <p>During the preparation of this work, the author(s) used Grammarly in order to: Improve writing style and Grammar and spelling check. After using these tool(s)/service(s), the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication’s content.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Bosco</surname>
          </string-name>
          , E. Ježek,
          <string-name>
            <given-names>M.</given-names>
            <surname>Polignano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanguinetti</surname>
          </string-name>
          ,
          <article-title>Preface to the Eleventh Italian Conference on Computational Linguistics (CLiC-it</article-title>
          <year>2025</year>
          ),
          <source>in: Proceedings of the Eleventh Italian Conference on Computational Linguistics</source>
          (CLiC-it
          <year>2025</year>
          ),
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Alghisi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rizzoli</surname>
          </string-name>
          , G. Roccabruna,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Mousavi</surname>
          </string-name>
          , G. Riccardi,
          <article-title>Should we fine-tune or RAG? evaluating diferent techniques to adapt LLMs for dialogue</article-title>
          , in: S. Mahamood,
          <string-name>
            <given-names>N. L.</given-names>
            <surname>Minh</surname>
          </string-name>
          , D. Ippolito (Eds.),
          <source>Proceedings of the 17th International Natural Language Generation Conference</source>
          , Association for Computational Linguistics, Tokyo, Japan,
          <year>2024</year>
          , pp.
          <fpage>180</fpage>
          -
          <lpage>197</lpage>
          . URL: https://aclanthology.org/2024.inlg-main.15/. doi:10.18653/v1/2024.inlg-main.15.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Mousavi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Caldarella</surname>
          </string-name>
          , G. Riccardi,
          <article-title>Response generation in longitudinal dialogues: Which knowledge representation helps?</article-title>
          , in: Y.
          <string-name>
            <surname>-N. Chen</surname>
            ,
            <given-names>A</given-names>
          </string-name>
          . Rastogi (Eds.),
          <source>Proceedings of the 5th Workshop on NLP for Conversational AI</source>
          (NLP4ConvAI
          <year>2023</year>
          ),
          <article-title>Association for Computational Linguistics</article-title>
          , Toronto, Canada,
          <year>2023</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>11</lpage>
          . URL: https://aclanthology.org/2023.nlp4convai-1.1/. doi:10.18653/v1/2023.nlp4convai-1.1.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><given-names>D.</given-names> <surname>Ju</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Feng</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Lv</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Zhang</surname></string-name>,
          <article-title>Learning to improve persona consistency in multi-party dialogue generation via text knowledge enhancement</article-title>,
          in: N. Calzolari, C.-R. Huang, H. Kim, J. Pustejovsky, L. Wanner, K.-S. Choi, P.-M. Ryu, H.-H. Chen, L. Donatelli, H. Ji, S. Kurohashi, P. Paggio, N. Xue, S. Kim, Y. Hahm, Z. He, T. K. Lee, E. Santus, F. Bond, S.-H. Na (Eds.),
          <source>Proceedings of the 29th International Conference on Computational Linguistics</source>,
          International Committee on Computational Linguistics, Gyeongju, Republic of Korea,
          <year>2022</year>, pp. <fpage>298</fpage>-<lpage>309</lpage>.
          URL: https://aclanthology.org/2022.coling-1.23/.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>N.</given-names> <surname>Penzo</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Sajedinia</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Lepri</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Tonelli</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Guerini</surname></string-name>,
          <article-title>Do LLMs suffer from multi-party hangover? A diagnostic approach to addressee recognition and response selection in conversations</article-title>,
          in: Y. Al-Onaizan, M. Bansal, Y.-N. Chen (Eds.),
          <source>Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing</source>,
          Association for Computational Linguistics, Miami, Florida, USA,
          <year>2024</year>, pp. <fpage>11210</fpage>-<lpage>11233</lpage>.
          URL: https://aclanthology.org/2024.emnlp-main.628/. doi:10.18653/v1/2024.emnlp-main.628.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name><given-names>Z.</given-names> <surname>Yin</surname></string-name>,
          <string-name><given-names>Q.</given-names> <surname>Sun</surname></string-name>,
          <string-name><given-names>Q.</given-names> <surname>Guo</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Zeng</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Sun</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Chang</surname></string-name>,
          <string-name><given-names>Q.</given-names> <surname>Cheng</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Mou</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Qiu</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Huang</surname></string-name>,
          <article-title>Aggregation of reasoning: A hierarchical framework for enhancing answer selection in large language models</article-title>,
          in: N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, N. Xue (Eds.),
          <source>Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)</source>,
          ELRA and ICCL, Torino, Italia,
          <year>2024</year>, pp. <fpage>609</fpage>-<lpage>625</lpage>.
          URL: https://aclanthology.org/2024.lrec-main.53/.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name><given-names>Y.</given-names> <surname>Feng</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Lu</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Liu</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Zhan</surname></string-name>,
          <string-name><given-names>X.-M.</given-names> <surname>Wu</surname></string-name>,
          <article-title>Towards LLM-driven dialogue state tracking</article-title>,
          in: H. Bouamor, J. Pino, K. Bali (Eds.),
          <source>Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing</source>,
          Association for Computational Linguistics, Singapore,
          <year>2023</year>, pp. <fpage>739</fpage>-<lpage>755</lpage>.
          URL: https://aclanthology.org/2023.emnlp-main.48/. doi:10.18653/v1/2023.emnlp-main.48.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name><given-names>Z.</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Chen</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Ross</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Huber</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Moon</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Lin</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Dong</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Sagar</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Yan</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Crook</surname></string-name>,
          <article-title>Large language models as zero-shot dialogue state tracker through function calling</article-title>,
          in: L.-W. Ku, A. Martins, V. Srikumar (Eds.),
          <source>Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>,
          Association for Computational Linguistics, Bangkok, Thailand,
          <year>2024</year>, pp. <fpage>8688</fpage>-<lpage>8704</lpage>.
          URL: https://aclanthology.org/2024.acl-long.471/. doi:10.18653/v1/2024.acl-long.471.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name><given-names>Z.</given-names> <surname>Hu</surname></string-name>,
          <string-name><given-names>Q.</given-names> <surname>He</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Zhao</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Wang</surname></string-name>,
          <article-title>Advancing multi-party dialogue framework with speaker-aware contrastive learning</article-title>,
          <year>2025</year>. URL: https://arxiv.org/abs/2501.11292. arXiv:2501.11292.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name><given-names>S.</given-names> <surname>Liu</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Fan</surname></string-name>,
          <string-name><given-names>Q.</given-names> <surname>Zhu</surname></string-name>,
          <article-title>Enhancing multiparty dialogue discourse parsing with explanation generation</article-title>,
          in: O. Rambow, L. Wanner, M. Apidianaki, H. Al-Khalifa, B. D. Eugenio, S. Schockaert (Eds.),
          <source>Proceedings of the 31st International Conference on Computational Linguistics</source>,
          Association for Computational Linguistics, Abu Dhabi, UAE,
          <year>2025</year>, pp. <fpage>1531</fpage>-<lpage>1544</lpage>.
          URL: https://aclanthology.org/2025.coling-main.103/.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name><given-names>S. M.</given-names> <surname>Mousavi</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Roccabruna</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Lorandi</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Caldarella</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Riccardi</surname></string-name>,
          <article-title>Evaluation of response generation models: Shouldn't it be shareable and replicable?</article-title>,
          in: A. Bosselut, K. Chandu, K. Dhole, V. Gangal, S. Gehrmann, Y. Jernite, J. Novikova, L. Perez-Beltrachini (Eds.),
          <source>Proceedings of the 2nd Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)</source>,
          Association for Computational Linguistics, Abu Dhabi, United Arab Emirates (Hybrid),
          <year>2022</year>, pp. <fpage>136</fpage>-<lpage>147</lpage>.
          URL: https://aclanthology.org/2022.gem-1.12/. doi:10.18653/v1/2022.gem-1.12.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name><given-names>K.</given-names> <surname>Mahajan</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Shaikh</surname></string-name>,
          <article-title>Persona-aware multi-party conversation response generation</article-title>,
          in: N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, N. Xue (Eds.),
          <source>Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)</source>,
          ELRA and ICCL, Torino, Italia,
          <year>2024</year>, pp. <fpage>12712</fpage>-<lpage>12723</lpage>.
          URL: https://aclanthology.org/2024.lrec-main.1113/.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name><given-names>T.</given-names> <surname>Sun</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Qian</surname></string-name>,
          <string-name><given-names>W.</given-names> <surname>Wang</surname></string-name>,
          <article-title>Contrastive speaker-aware learning for multi-party dialogue generation with LLMs</article-title>,
          <year>2025</year>. URL: https://arxiv.org/abs/2503.08842. arXiv:2503.08842.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name><given-names>C.</given-names> <surname>Danescu-Niculescu-Mizil</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Lee</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Pang</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Kleinberg</surname></string-name>,
          <article-title>Echoes of power: language effects and power differences in social interaction</article-title>,
          in: <source>Proceedings of the 21st International Conference on World Wide Web, WWW '12</source>,
          Association for Computing Machinery, New York, NY, USA,
          <year>2012</year>, pp. <fpage>699</fpage>-<lpage>708</lpage>.
          URL: https://doi.org/10.1145/2187836.2187931. doi:10.1145/2187836.2187931.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name><given-names>D.</given-names> <surname>Antypas</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Ushio</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Camacho-Collados</surname></string-name>,
          <string-name><given-names>V.</given-names> <surname>Silva</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Neves</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Barbieri</surname></string-name>,
          <article-title>Twitter topic classification</article-title>,
          in: N. Calzolari, C.-R. Huang, H. Kim, J. Pustejovsky, L. Wanner, K.-S. Choi, P.-M. Ryu, H.-H. Chen, L. Donatelli, H. Ji, S. Kurohashi, P. Paggio, N. Xue, S. Kim, Y. Hahm, Z. He, T. K. Lee, E. Santus, F. Bond, S.-H. Na (Eds.),
          <source>Proceedings of the 29th International Conference on Computational Linguistics</source>,
          International Committee on Computational Linguistics, Gyeongju, Republic of Korea,
          <year>2022</year>, pp. <fpage>3386</fpage>-<lpage>3400</lpage>.
          URL: https://aclanthology.org/2022.coling-1.299/.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name><given-names>Y.</given-names> <surname>Zhao</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Nasukawa</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Muraoka</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Bhattacharjee</surname></string-name>,
          <article-title>A simple yet strong domain-agnostic debias method for zero-shot sentiment classification</article-title>,
          in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.),
          <source>Findings of the Association for Computational Linguistics: ACL 2023</source>,
          Association for Computational Linguistics, Toronto, Canada,
          <year>2023</year>, pp. <fpage>3923</fpage>-<lpage>3931</lpage>.
          URL: https://aclanthology.org/2023.findings-acl.242/. doi:10.18653/v1/2023.findings-acl.242.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>