<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evaluating Linguistic Speaker Profiles on Response Selection in Multi-Party Dialogue</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Maryam Sajedinia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Seyed Mahed Mousavi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Valerio Basile</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Modeling &amp; Simulation of Techno-Social Systems</institution>
          ,
          <addr-line>Fondazione Bruno Kessler</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Signals &amp; Interactive Systems Lab, University of Trento</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Turin</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>We investigate whether incorporating linguistically derived speaker profiles improves the response selection capabilities of instruction-tuned large language models (LLMs) in multi-party dialogues. Using the Wikipedia Talk Page dataset, we construct lightweight profiles for each speaker based on features extracted from their prior messages, including frequent nouns and verbs, and sentiment tendency. These profiles are incorporated into the input prompts and evaluated using in-context learning with LLaMA 3.2 Instruct (1B and 8B) and GPT-4o, without any model fine-tuning. We compare performance across models and prompt settings, with and without speaker profiles, and analyze the effect of different profile configurations. Results are compared against a Random baseline and a supervised Siamese RNN (with GRU units) trained on the same data. Our results show that incorporating speaker profiles improves response selection performance across most LLM settings, with the strongest gains observed in larger models such as LLaMA 3.2 (8B). Lexical features (frequent nouns and verbs) yield greater improvements than sentiment information, particularly in low-context or underspecified scenarios. However, profile effectiveness varies by model scale and prompt format, and provides limited benefit in cases where distractors are lexically and semantically similar to the ground-truth response.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Model</kwd>
        <kwd>Multiparty Dialogue</kwd>
        <kwd>User Profile</kwd>
        <kwd>Response Selection</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>… prompt format, and profile composition? We evaluate this approach using LLaMA 3.2 Instruct (1B and 8B) and GPT-4o, comparing performance with and without speaker profiles under zero-shot and one-shot prompting. To contextualize results, we include two baselines: a random ranking strategy and a supervised Siamese RNN with GRU units trained on the same dataset. All models are tested on a standardized response selection task using MPDs from the Wikipedia Talk Page dataset.</p>
      <p>Our goal is not to build a personalized dialogue system, but to assess whether minimal linguistic speaker information can influence LLM behavior in a selection setting. We do not assume access to long-term user history or stable user identities, and we make no changes to model parameters. Instead, we treat speaker profiling as a lightweight, model-agnostic addition to the input prompt. This setup allows us to isolate the effect of speaker-level information on model performance and to compare its impact across multiple instruction-tuned LLMs.</p>
      <p>Our results show that speaker profiles can enhance response selection performance, particularly for larger models and in low-context scenarios. The most consistent gains are observed with lexical profiles (frequent nouns and verbs), while sentiment information yields marginal or mixed improvements. However, model scale and prompt format (e.g. 0-shot and 1-shot) significantly mediate the effectiveness of speaker profiles. Our contributions can be summarized as follows:
• We introduce a prompt-based method for incorporating lightweight, linguistically derived speaker profiles into LLM-based response selection for multi-party dialogue1.
• We conduct a systematic evaluation across model scales (1B, 8B, GPT-4o), prompt formats (zero-shot, one-shot), and profile configurations (lexical, lexical+sentiment).
• We present detailed analysis highlighting when and how speaker profiles help, supported by both aggregate performance and error case breakdowns.</p>
      <p>1 The code and implementation details will be published in our repository.</p>
      <p>[Table 1 residue: detected topic categories — Business &amp; Entrepreneurs; Celebrity &amp; Pop Culture; Diaries &amp; Daily Life; Arts &amp; Culture; Learning &amp; Educational; Science &amp; Technology; News &amp; Social Concern; Relationships; Technology. Numeric distribution not recovered.]</p>
    </sec>
    <sec id="sec-rw">
      <title>2. Related Work</title>
      <p>Recent work on MPD has explored a range of strategies for modeling speaker identity, roles, and interaction structure. Mahajan and Shaikh [<xref ref-type="bibr" rid="ref12">12</xref>] introduce a graph-based transformer that incorporates speaker and addressee personas as structured metadata, using crowdsourced profiles to condition response generation. Similarly, Ju et al. [<xref ref-type="bibr" rid="ref4">4</xref>] build a graph representation of utterances and speaker personas to guide generation through a hierarchical encoder and structured aggregation. These methods emphasize user profile incorporation but assume access to annotated profiles and require complex modeling. Sun et al. [<xref ref-type="bibr" rid="ref13">13</xref>] use contrastive learning to model speaker-specific discourse patterns without explicit profiles, learning latent speaker distinctions optimized for generation tasks. Penzo et al. [<xref ref-type="bibr" rid="ref5">5</xref>] take a diagnostic approach, analyzing how conversation structure affects performance in response selection and addressee recognition. They show that LLMs rely heavily on surface content for response selection and are sensitive to prompt formulation and structural variation. Finally, Hu et al. [<xref ref-type="bibr" rid="ref9">9</xref>] propose a role-aware modeling framework that combines role-context pretraining with decoding constraints to favor role-consistent outputs. While effective across multiple MPD tasks, the approach depends on predefined role labels and supervised training. Collectively, these studies highlight the importance of speaker- and role-level information in MPD, though most rely on supervised learning, structured annotations, or architectural specialization.</p>
    </sec>
    <sec id="sec-2">
      <title>3. Experiments</title>
      <p>We evaluate the effect of incorporating linguistic speaker profiles on response selection in MPDs using a set of instruction-tuned LLMs and baseline models. All experiments are conducted on the same unseen test set, using consistent prompt formatting and evaluation metrics. Below, we describe the models, data, and profile features used in our setup.</p>
      <sec id="sec-2-1">
        <title>3.1. Dataset</title>
        <p>We use the Wikipedia Talk Page Conversations dataset [<xref ref-type="bibr" rid="ref14">14</xref>], which contains 124,957 multi-party dialogues involving 38,462 unique users and a total of 4,023,376 tokens with a vocabulary size of 108,416. The user activity in the dataset is not balanced, i.e. the top 10 most active users account for over 12% of all turns in the dataset.</p>
        <p>To model multi-party interactions, each conversation is represented as a tree, with the root corresponding to the initial post and branches representing reply chains. For each reply path, we extract a linear dialogue history leading to a candidate response. Each instance is framed as a response selection task with one ground-truth response and nine distractors drawn from the same structural depth within other conversations.</p>
        <p>We segment this subset into three partitions: a held-out test set of 2,500 previously unseen dialogues shared across all models; a training set of 206,633 samples used to train the Siamese RNN and construct one-shot prompts; and a development set of 25,830 samples used only for tuning the supervised model's architecture and hyperparameters. To better understand the conversational domain of the dialogues, we applied topic classification using GPT-4o, following the categorization and methodology of Antypas et al. [<xref ref-type="bibr" rid="ref15">15</xref>] (50 samples were randomly selected and manually controlled to ensure prediction validity). Table 1 presents the distribution of detected topics across the dialogues involving these ten users, covering a broad range of domains including business, popular culture, education, and technology.</p>
        <p>3.2. Models</p>
        <p>We evaluate three types of models:
• Random Baseline generates a uniformly ranked list of candidate responses for each input context. This serves as a lower-bound reference point and helps contextualize performance in the absence of data-driven inference.
• Siamese RNN is a supervised neural baseline using two GRU encoders with shared weights to compute the similarity between a dialogue context and a candidate response. The model outputs a matching score based on pairwise similarity and is trained using labeled context-response pairs with cross-entropy loss. Each GRU encoder uses the following hyperparameters: MAX_LENGTH = 300, input_size = 100, hidden_size = 300, num_layers = 2, dropout = 0, and bidirectional = True. The model is trained for 10 epochs with a batch_size = 128 and a learning rate of 0.0001.
• Instruction-Tuned LLMs include LLaMA 3.2 Instruct (1B and 8B) and GPT-4o, representing two families of recent state-of-the-art LLMs. LLaMA 3.2 Instruct is a publicly available model family released by Meta, trained on a diverse multilingual corpus and further instruction-tuned to follow natural language prompts. We include both the 1B and 8B variants to examine the effect of model scale on profile sensitivity. GPT-4o is a proprietary model released by OpenAI, optimized for multimodal interaction and known for strong instruction-following capabilities in both zero-shot and few-shot settings. All models are used via API in inference-only mode without any additional fine-tuning. Inputs are provided as structured natural language prompts, including a system instruction, dialogue history, and a list of candidate responses. When speaker profiles are used, they are appended to the input as plain-text feature descriptions associated with the target speaker. We experiment with both zero-shot prompting (task description only) and one-shot prompting (including a single example of the desired input-output format). Inference is run with a temperature of 0.2, top-p of 1.0, and a 50-token output limit, and predictions are parsed to compute Recall@1/2/5.</p>
        <p>Evaluation Metric We evaluate model performance using Recall@k, a standard metric for response selection tasks. For each dialogue instance, the model ranks a set of ten candidate responses, consisting of the ground-truth response and nine distractors sampled from the same depth level in the conversation tree. Recall@k measures the proportion of instances where the correct response appears in the top k predictions. We report Recall@1, Recall@2, and Recall@5 to assess performance at different levels of ranking sensitivity.</p>
        <p>Prompt Design We structure prompts for the response selection task using a consistent template for ranking responses based on the context. Each prompt comprises three components: (i) a task instruction explaining the ranking objective and expected output format, (ii) a content section containing the dialogue history and 10 candidate responses, and (iii) an optional speaker profile, appended when profiling is enabled.</p>
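        <p>The parsing-and-scoring step described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' released code; the helper names are our own.</p>

```python
from typing import List

def parse_ranking(output: str) -> List[int]:
    """Parse model output lines like '1. 3' into a ranked list of candidate indexes."""
    ranking = []
    for line in output.strip().splitlines():
        parts = line.split(".")
        if len(parts) == 2 and parts[1].strip().isdigit():
            ranking.append(int(parts[1]))
    return ranking

def recall_at_k(rankings: List[List[int]], gold: List[int], k: int) -> float:
    """Fraction of instances whose ground-truth index appears in the top k."""
    hits = sum(1 for ranked, g in zip(rankings, gold) if g in ranked[:k])
    return hits / len(gold)

# One instance: the model ranks candidate 3 first; the ground truth is 3.
ranking = parse_ranking("1. 3\n2. 4\n3. 1")
print(ranking)                         # [3, 4, 1]
print(recall_at_k([ranking], [3], 1))  # 1.0
print(recall_at_k([ranking], [1], 2))  # 0.0
```

        <p>Recall@2 and Recall@5 follow by changing k.</p>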
        <sec id="sec-2-1-1">
          <title>System Prompt (abbreviated)</title>
          <p>&lt;|begin_of_text|&gt;
You will be given:
- A conversation transcript with numbered turns
- 10 candidate responses
- A user profile containing the most frequent nouns and
verbs used by the next speaker
Your task:</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>Rank the candidate response indexes from best to</title>
          <p>worst based on how well they continue the conversation
and match the speaker profile.</p>
        </sec>
        <sec id="sec-2-1-3">
          <title>Example output format:</title>
          <p>1. 3
2. 4
...</p>
          <p>Do NOT provide an explanation but the list of numbers.
&lt;|eot_id|&gt;</p>
        </sec>
        <sec id="sec-2-1-4">
          <title>User Prompt (example structure)</title>
          <p>&lt;CONVERSATION&gt;
Turn 1: Hi, how are you?
Turn 2: I’m doing well, thanks. You?
...
&lt;/CONVERSATION&gt;
&lt;Responses&gt;
1. I’m glad to hear that!
2. What’s new with you?
...
&lt;/Responses&gt;
&lt;User Profile&gt;
thank, update, read, discuss, feel, ...
&lt;/User Profile&gt;
&lt;|eot_id|&gt;</p>
          <p>We construct speaker profiles for each user in the dataset, using linguistic features extracted from their prior messages. Each profile is fixed per speaker and remains constant across all dialogue instances in which the user appears. We create a lexical profile consisting of the 10 most frequent nouns and the 10 most frequent verbs used by the speaker, extracted using the spaCy dependency parser. These tokens reflect habitual vocabulary choices and serve as coarse indicators of speaker identity and discourse tendencies. This profile is then augmented with a coarse-grained sentiment distribution. Each message authored by the speaker is classified as positive, neutral, or negative using GPT-4o, following prior work [<xref ref-type="bibr" rid="ref16">16</xref>], and the resulting counts are normalized to produce a speaker-level sentiment distribution (predictions were manually verified for 50 randomly sampled messages to ensure classifier quality). Profiles are incorporated into the prompt and are explicitly associated with the speaker expected to produce the next turn. This design allows instruction-tuned LLMs to condition their ranking decisions on user-specific linguistic traits without requiring model fine-tuning or structural modifications.</p>
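          <p>The aggregation into a fixed per-speaker profile can be sketched as follows. The paper extracts nouns and verbs with the spaCy dependency parser and classifies sentiment with GPT-4o; this stdlib-only sketch assumes those steps have already produced (lemma, POS) pairs and per-message sentiment labels, and the function name is our own.</p>

```python
from collections import Counter
from typing import Dict, List, Tuple

def build_profile(tagged_tokens: List[Tuple[str, str]],
                  sentiments: List[str],
                  top_n: int = 10) -> Dict[str, object]:
    """Aggregate (lemma, POS) pairs and message-level sentiment labels
    into a lexical + sentiment speaker profile."""
    nouns = Counter(t for t, pos in tagged_tokens if pos == "NOUN")
    verbs = Counter(t for t, pos in tagged_tokens if pos == "VERB")
    sent_counts = Counter(sentiments)
    total = sum(sent_counts.values()) or 1
    return {
        "frequent_nouns": [w for w, _ in nouns.most_common(top_n)],
        "frequent_verbs": [w for w, _ in verbs.most_common(top_n)],
        # normalized sentiment distribution over the speaker's messages
        "sentiment": {label: sent_counts[label] / total
                      for label in ("positive", "neutral", "negative")},
    }

tokens = [("update", "VERB"), ("page", "NOUN"), ("thank", "VERB"),
          ("update", "VERB"), ("source", "NOUN"), ("page", "NOUN")]
profile = build_profile(tokens, ["neutral", "neutral", "positive", "negative"])
print(profile["frequent_verbs"][0])     # update
print(profile["sentiment"]["neutral"])  # 0.5
```

          <p>In the actual pipeline, the (lemma, POS) pairs would come from spaCy and the sentiment labels from the GPT-4o classifier described above.</p>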
          <p>Figure 2 presents the overall sentiment distribution across the turns in the dataset. The majority are neutral (43%), followed by negative (37%) and positive (20%), indicating a generally balanced emotional tone. Figures 1 and 3 show heatmaps of the top 10 most frequent verbs and nouns, respectively, for the 10 most frequent users. Each heatmap reveals strong user-specific vocabulary patterns: the most frequent items for a given user tend to be rarely used by others. This lexical asymmetry suggests that even simple word-level statistics can encode informative signals about speaker identity. As a result, lexical profiles may help disambiguate responses in MPD by aligning candidate utterances with user-specific vocabulary preferences.</p>
          <p>The speaker profile provides the most frequent nouns and verbs used by the next speaker, i.e. the user who is expected to respond, extracted from their prior messages. The full prompt is framed in natural language and formatted using system and user tags. The model is explicitly instructed to return a ranked list of response indices without any explanation or commentary. In one-shot settings, we prepend a demonstration example showing the exact input-output structure. The speaker profile, when present, is enclosed in a &lt;User Profile&gt; section and labeled accordingly. This design follows the practices for LLM prompting in prior work [<xref ref-type="bibr" rid="ref16">16</xref>]. We provide the prompt template in Table 2.</p>
          <p>4. Evaluation</p>
          <p>We evaluate the effect of incorporating linguistic speaker profiles on the response selection performance of instruction-tuned LLMs in MPDs. Our analysis compares three models, GPT-4o, LLaMA 3.2 Instruct (1B and 8B), and a Siamese RNN baseline, under both zero-shot and one-shot prompting conditions. We assess each model’s performance with and without speaker profile information, using two profile configurations: frequent nouns and verbs, and the addition of sentiment tendency.</p>
          <p>Baseline Behavior The Siamese RNN performs moderately well in the profile-free condition, achieving 31% Recall@1. However, its performance declines when profiles are added. This suggests that the architecture may not effectively integrate linguistic profile information, or that the additional features introduce noise in the learned similarity space.</p>
        </sec>
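        <p>For illustration, the user message above can be assembled programmatically; the following is a minimal sketch with a hypothetical helper, and the section tags follow the example structure shown above (the paper’s exact template wording is abbreviated):</p>

```python
from typing import List, Optional

def build_user_prompt(turns: List[str],
                      candidates: List[str],
                      profile_terms: Optional[List[str]] = None) -> str:
    """Assemble the user message: conversation, candidate responses,
    and an optional <User Profile> section."""
    lines = ["<CONVERSATION>"]
    lines += [f"Turn {i}: {t}" for i, t in enumerate(turns, start=1)]
    lines.append("</CONVERSATION>")
    lines.append("<Responses>")
    lines += [f"{i}. {c}" for i, c in enumerate(candidates, start=1)]
    lines.append("</Responses>")
    if profile_terms:  # speaker profile is optional (profiling on/off)
        lines.append("<User Profile>")
        lines.append(", ".join(profile_terms))
        lines.append("</User Profile>")
    return "\n".join(lines)

prompt = build_user_prompt(
    ["Hi, how are you?", "I'm doing well, thanks. You?"],
    ["I'm glad to hear that!", "What's new with you?"],
    profile_terms=["thank", "update", "read"],
)
print(prompt.splitlines()[0])  # <CONVERSATION>
```

        <p>The system prompt and, in one-shot settings, the demonstration example would be prepended analogously.</p>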
        <sec id="sec-2-1-5">
          <title>Results</title>
          <p>[Table 3 residue: Recall scores for Random, Siamese-RNN, Llama 3.2 1B, Llama 3.2 8B, and GPT-4o, each evaluated without a user profile, with frequent nouns &amp; verbs, and with added sentiment; the numeric cells were not recovered.]</p>
          <p>The random baseline performs as expected, confirming that all models operate well above chance.</p>
          <p>LLM Performance Table 3 presents the performance scores across models, prompt settings, and speaker profile configurations. GPT-4o achieves the highest performance in all conditions, with Recall@1 reaching 62% under one-shot prompting with profile information. LLaMA 3.2 Instruct (8B) performs substantially better than its 1B variant, particularly in the zero-shot setting, where the addition of speaker profiles yields the largest relative improvements.</p>
          <p>Speaker Profiles Incorporating speaker profiles leads to consistent gains across most LLM configurations. For LLaMA 3.2 Instruct (8B), the inclusion of frequent nouns and verbs improves Recall@1 from 20% to 40% in the zero-shot setting. However, sentiment augmentation does not produce additional gains and, in some cases, slightly degrades performance. In contrast, the smaller LLaMA model (1B) shows minimal sensitivity to profile input, suggesting that profile utility may depend on model size. Meanwhile, GPT-4o demonstrates strong baseline performance without profiles, but still benefits from profile inclusion. The highest Recall@1 for GPT-4o is 62% with both lexical and sentiment features in the one-shot setting. These improvements, though smaller in magnitude than for LLaMA 8B, indicate that even high-performing models can leverage cost-effective linguistic speaker information.</p>
          <p>Prompt Structure Prompting style has a non-uniform impact on models’ performance. For LLaMA 3.2 Instruct (8B), zero-shot prompting outperforms one-shot in several configurations, particularly when profiles are included. In contrast, GPT-4o benefits more consistently from one-shot prompting, though the margin is small. These results highlight interactions between model scale, prompt format, and profile effectiveness.</p>
        </sec>
        <sec id="sec-2-1-6">
          <title>4.1. Error Analysis</title>
          <p>To better understand the limitations and strengths of speaker profiles, we manually analyzed several subsets of the test set. In our analysis, we define a misclassified instance as one in which the ground-truth (GT) response does not appear among the top five ranked candidates (i.e., not within Recall@5), and a correct instance as one where the GT response is ranked first (i.e., Recall@1).</p>
          <p>Out of 2,500 total instances, 1,500 cases were consistently misclassified by all models across all conditions. In these cases, the distractors were often semantically and lexically similar to the GT responses, making the ranking task inherently difficult. Moreover, frequent nouns and verbs extracted for profile construction were typically generic (e.g., “thanks,” “help,” “response”), and occurred in both GTs and distractors, limiting their discriminative value. In such cases, the profile provided little to no additional context to support accurate disambiguation.</p>
          <p>In contrast, 611 instances were correctly classified by all models across all settings. Here, the GT responses were clearly more contextually grounded and lexically aligned with the dialogue history, and the distractors were often generic acknowledgements (e.g., “thanks,” “okay”) or off-topic continuations. The linguistic profiles were more distinctive in these examples and appeared to support the model’s ability to prioritize the correct response.</p>
          <p>Finally, in 77 cases, all models failed without speaker profiles but all correctly selected the GT response once profile information was added. These instances were typically characterized by minimal dialogue history (one-turn inputs), where contextual grounding was insufficient for accurate prediction. The added speaker profile appeared to serve as auxiliary context that supported correct ranking in these otherwise under-specified dialogues. Conversely, there were 2 cases in which the inclusion of sentiment in the profile led to improved predictions in all models. These examples featured strong affective alignment between the dialogue history and the GT response, while the distractors were neutral and short, allowing the model to benefit from the added sentiment context.</p>
          <p>Interestingly, in 12 cases the models ranked the correct response at R@1 without speaker profiles, but failed to do so when profiles were added. In these cases, the sentiment distribution was nearly uniform across responses, providing no additional signal. Furthermore, the distractors were uniformly generic, with some including non-English text or irrelevant long-form content. Thus, the profile content introduced noise rather than useful contrast, confusing the model.</p>
          <p>Overall, speaker profiles provide the most benefit when dialogue context is minimal or generic, but lose effectiveness when distractors are lexically similar or the profiles themselves are noisy.</p>
        </sec>
        <sec id="sec-2-1-7">
          <title>5. Conclusion</title>
          <p>We investigate whether linguistically derived speaker profiles can improve the response selection capabilities of instruction-tuned LLMs in multi-party dialogue. We constructed user profiles based on frequent nouns, verbs, and sentiment tendencies from prior utterances, and incorporated them into prompts without any model fine-tuning. Our experiments with LLaMA 3.2 and GPT-4o show that lexical speaker profiles improve performance in nearly all LLM settings, especially for larger models and in zero-shot conditions. This supports RQ1, demonstrating that even lightweight user information can help response selection in MPD. In addressing RQ2, we find that model scale and prompt design play a crucial role in how effectively speaker profiles are used. Larger models benefit more from profile information, suggesting that they can better leverage user context. However, the sentiment features show mixed results, in some cases adding noise rather than clarity. We also observe that profiles are particularly useful in low-context situations, but their impact diminishes when distractors are semantically close or when the profiles themselves lack specificity.</p>
          <p>In future work, we plan to explore richer profile representations, investigate cross-domain generalizability, and test the applicability of this approach in real-time or streaming dialogue systems. We also see potential in extending our method to multilingual MPD and combining profile signals with structural or discourse-level features.</p>
          <p>Limitations</p>
          <p>This study relies exclusively on in-context learning and does not involve any fine-tuning of the evaluated models. While this makes our approach lightweight and accessible, it also constrains the models’ ability to adapt more deeply to user-specific behaviors. Due to computational constraints, we did not experiment with larger LLMs beyond LLaMA 3.2 (8B) and GPT-4o, and were unable to explore open-weight models at scales requiring GPU access. Our data is limited to English Wikipedia Talk Pages, which restricts the generalizability of our findings to multilingual or informal conversational domains. Additionally, speaker profiles are based on automatic extraction of lexical and sentiment features, which may introduce noise or inaccuracies that affect profile quality. Finally, we focus exclusively on response selection and did not experiment with response generation. While this choice enables robust and reproducible automatic evaluation, it leaves open the question of how linguistic speaker profiles might affect the quality of generated responses in more open-ended dialogue settings.</p>
          <p>Declaration on Generative AI</p>
          <p>During the preparation of this work, the author(s) used Grammarly in order to: Improve writing style and Grammar and spelling check. After using these tool(s)/service(s), the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication’s content.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Bosco</surname>
          </string-name>
          , E. Ježek,
          <string-name>
            <given-names>M.</given-names>
            <surname>Polignano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanguinetti</surname>
          </string-name>
          ,
          <article-title>Preface to the Eleventh Italian Conference on Computational Linguistics (CLiC-it</article-title>
          <year>2025</year>
          ),
          <source>in: Proceedings of the Eleventh Italian Conference on Computational Linguistics</source>
          (CLiC-it
          <year>2025</year>
          ),
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Alghisi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rizzoli</surname>
          </string-name>
          , G. Roccabruna,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Mousavi</surname>
          </string-name>
          , G. Riccardi,
          <article-title>Should we fine-tune or RAG? evaluating diferent techniques to adapt LLMs for dialogue</article-title>
          , in: S. Mahamood,
          <string-name>
            <given-names>N. L.</given-names>
            <surname>Minh</surname>
          </string-name>
          , D. Ippolito (Eds.),
          <source>Proceedings of the 17th International Natural Language Generation Conference</source>
          , Association for Computational Linguistics, Tokyo, Japan,
          <year>2024</year>
          , pp.
          <fpage>180</fpage>
          -
          <lpage>197</lpage>
          . URL: https://aclanthology.org/2024.inlg-main.15/. doi:10.18653/v1/2024.inlg-main.15.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Mousavi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Caldarella</surname>
          </string-name>
          , G. Riccardi,
          <article-title>Response generation in longitudinal dialogues: Which knowledge representation helps?</article-title>
          , in: Y.
          <string-name>
            <surname>-N. Chen</surname>
            ,
            <given-names>A</given-names>
          </string-name>
          . Rastogi (Eds.),
          <source>Proceedings of the 5th Workshop on NLP for Conversational AI</source>
          (NLP4ConvAI
          <year>2023</year>
          ),
          <article-title>Association for Computational Linguistics</article-title>
          , Toronto, Canada,
          <year>2023</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>11</lpage>
          . URL: https://aclanthology.org/2023.nlp4convai-1.1/. doi:10.18653/v1/2023.nlp4convai-1.1.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><given-names>D.</given-names> <surname>Ju</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Feng</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Lv</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Zhang</surname></string-name>,
          <article-title>Learning to improve persona consistency in multi-party dialogue generation via text knowledge enhancement</article-title>,
          in: N. Calzolari, C.-R. Huang, H. Kim, J. Pustejovsky, L. Wanner, K.-S. Choi, P.-M. Ryu, H.-H. Chen, L. Donatelli, H. Ji, S. Kurohashi, P. Paggio, N. Xue, S. Kim, Y. Hahm, Z. He, T. K. Lee, E. Santus, F. Bond, S.-H. Na (Eds.),
          <source>Proceedings of the 29th International Conference on Computational Linguistics</source>,
          International Committee on Computational Linguistics, Gyeongju, Republic of Korea,
          <year>2022</year>, pp. <fpage>298</fpage>-<lpage>309</lpage>.
          URL: https://aclanthology.org/2022.coling-1.23/.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>N.</given-names> <surname>Penzo</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Sajedinia</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Lepri</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Tonelli</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Guerini</surname></string-name>,
          <article-title>Do LLMs suffer from multi-party hangover? A diagnostic approach to addressee recognition and response selection in conversations</article-title>,
          in: Y. Al-Onaizan, M. Bansal, Y.-N. Chen (Eds.),
          <source>Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing</source>,
          Association for Computational Linguistics, Miami, Florida, USA,
          <year>2024</year>, pp. <fpage>11210</fpage>-<lpage>11233</lpage>.
          URL: https://aclanthology.org/2024.emnlp-main.628/. doi:10.18653/v1/2024.emnlp-main.628.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name><given-names>Z.</given-names> <surname>Yin</surname></string-name>,
          <string-name><given-names>Q.</given-names> <surname>Sun</surname></string-name>,
          <string-name><given-names>Q.</given-names> <surname>Guo</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Zeng</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Sun</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Chang</surname></string-name>,
          <string-name><given-names>Q.</given-names> <surname>Cheng</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Mou</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Qiu</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Huang</surname></string-name>,
          <article-title>Aggregation of reasoning: A hierarchical framework for enhancing answer selection in large language models</article-title>,
          in: N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, N. Xue (Eds.),
          <source>Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)</source>,
          ELRA and ICCL, Torino, Italia,
          <year>2024</year>, pp. <fpage>609</fpage>-<lpage>625</lpage>.
          URL: https://aclanthology.org/2024.lrec-main.53/.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name><given-names>Y.</given-names> <surname>Feng</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Lu</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Liu</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Zhan</surname></string-name>,
          <string-name><given-names>X.-M.</given-names> <surname>Wu</surname></string-name>,
          <article-title>Towards LLM-driven dialogue state tracking</article-title>,
          in: H. Bouamor, J. Pino, K. Bali (Eds.),
          <source>Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing</source>,
          Association for Computational Linguistics, Singapore,
          <year>2023</year>, pp. <fpage>739</fpage>-<lpage>755</lpage>.
          URL: https://aclanthology.org/2023.emnlp-main.48/. doi:10.18653/v1/2023.emnlp-main.48.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name><given-names>Z.</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Chen</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Ross</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Huber</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Moon</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Lin</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Dong</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Sagar</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Yan</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Crook</surname></string-name>,
          <article-title>Large language models as zero-shot dialogue state tracker through function calling</article-title>,
          in: L.-W. Ku, A. Martins, V. Srikumar (Eds.),
          <source>Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>,
          Association for Computational Linguistics, Bangkok, Thailand,
          <year>2024</year>, pp. <fpage>8688</fpage>-<lpage>8704</lpage>.
          URL: https://aclanthology.org/2024.acl-long.471/. doi:10.18653/v1/2024.acl-long.471.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name><given-names>Z.</given-names> <surname>Hu</surname></string-name>,
          <string-name><given-names>Q.</given-names> <surname>He</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Zhao</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Wang</surname></string-name>,
          <article-title>Advancing multi-party dialogue framework with speaker-aware contrastive learning</article-title>,
          <year>2025</year>. URL: https://arxiv.org/abs/2501.11292. arXiv:2501.11292.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name><given-names>S.</given-names> <surname>Liu</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Fan</surname></string-name>,
          <string-name><given-names>Q.</given-names> <surname>Zhu</surname></string-name>,
          <article-title>Enhancing multiparty dialogue discourse parsing with explanation generation</article-title>,
          in: O. Rambow, L. Wanner, M. Apidianaki, H. Al-Khalifa, B. D. Eugenio, S. Schockaert (Eds.),
          <source>Proceedings of the 31st International Conference on Computational Linguistics</source>,
          Association for Computational Linguistics, Abu Dhabi, UAE,
          <year>2025</year>, pp. <fpage>1531</fpage>-<lpage>1544</lpage>.
          URL: https://aclanthology.org/2025.coling-main.103/.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name><given-names>S. M.</given-names> <surname>Mousavi</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Roccabruna</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Lorandi</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Caldarella</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Riccardi</surname></string-name>,
          <article-title>Evaluation of response generation models: Shouldn't it be shareable and replicable?</article-title>,
          in: A. Bosselut, K. Chandu, K. Dhole, V. Gangal, S. Gehrmann, Y. Jernite, J. Novikova, L. Perez-Beltrachini (Eds.),
          <source>Proceedings of the 2nd Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)</source>,
          Association for Computational Linguistics, Abu Dhabi, United Arab Emirates (Hybrid),
          <year>2022</year>, pp. <fpage>136</fpage>-<lpage>147</lpage>.
          URL: https://aclanthology.org/2022.gem-1.12/. doi:10.18653/v1/2022.gem-1.12.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name><given-names>K.</given-names> <surname>Mahajan</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Shaikh</surname></string-name>,
          <article-title>Persona-aware multi-party conversation response generation</article-title>,
          in: N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, N. Xue (Eds.),
          <source>Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)</source>,
          ELRA and ICCL, Torino, Italia,
          <year>2024</year>, pp. <fpage>12712</fpage>-<lpage>12723</lpage>.
          URL: https://aclanthology.org/2024.lrec-main.1113/.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name><given-names>T.</given-names> <surname>Sun</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Qian</surname></string-name>,
          <string-name><given-names>W.</given-names> <surname>Wang</surname></string-name>,
          <article-title>Contrastive speaker-aware learning for multi-party dialogue generation with LLMs</article-title>,
          <year>2025</year>. URL: https://arxiv.org/abs/2503.08842. arXiv:2503.08842.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name><given-names>C.</given-names> <surname>Danescu-Niculescu-Mizil</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Lee</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Pang</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Kleinberg</surname></string-name>,
          <article-title>Echoes of power: language effects and power differences in social interaction</article-title>,
          in: <source>Proceedings of the 21st International Conference on World Wide Web, WWW '12</source>,
          Association for Computing Machinery, New York, NY, USA,
          <year>2012</year>, pp. <fpage>699</fpage>-<lpage>708</lpage>.
          URL: https://doi.org/10.1145/2187836.2187931. doi:10.1145/2187836.2187931.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name><given-names>D.</given-names> <surname>Antypas</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Ushio</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Camacho-Collados</surname></string-name>,
          <string-name><given-names>V.</given-names> <surname>Silva</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Neves</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Barbieri</surname></string-name>,
          <article-title>Twitter topic classification</article-title>,
          in: N. Calzolari, C.-R. Huang, H. Kim, J. Pustejovsky, L. Wanner, K.-S. Choi, P.-M. Ryu, H.-H. Chen, L. Donatelli, H. Ji, S. Kurohashi, P. Paggio, N. Xue, S. Kim, Y. Hahm, Z. He, T. K. Lee, E. Santus, F. Bond, S.-H. Na (Eds.),
          <source>Proceedings of the 29th International Conference on Computational Linguistics</source>,
          International Committee on Computational Linguistics, Gyeongju, Republic of Korea,
          <year>2022</year>, pp. <fpage>3386</fpage>-<lpage>3400</lpage>.
          URL: https://aclanthology.org/2022.coling-1.299/.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name><given-names>Y.</given-names> <surname>Zhao</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Nasukawa</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Muraoka</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Bhattacharjee</surname></string-name>,
          <article-title>A simple yet strong domain-agnostic debias method for zero-shot sentiment classification</article-title>,
          in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.),
          <source>Findings of the Association for Computational Linguistics: ACL 2023</source>,
          Association for Computational Linguistics, Toronto, Canada,
          <year>2023</year>, pp. <fpage>3923</fpage>-<lpage>3931</lpage>.
          URL: https://aclanthology.org/2023.findings-acl.242/. doi:10.18653/v1/2023.findings-acl.242.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>