<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>from Holocaust Diaries with Ensemble LLMs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Angelina Parfenova</string-name>
          <email>angelina.parfenova@tum.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>In: R. Campos, A. Jorge, A. Jatowt, S. Bhatia, M. Litvak (eds.): Proceedings of the Text2Story'25 Workshop</institution>
          ,
          <addr-line>Lucca</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Inductive coding</institution>
          ,
          <addr-line>Holocaust diaries, Ensemble models, Retrieval-Augmented Generation, Qualitative analysis</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Lucerne University of Applied Sciences and Arts</institution>
          ,
          <addr-line>Rotkreuz</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Technical University of Munich</institution>
          ,
          <addr-line>Munich</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This paper presents a novel application of ensemble-based large language models (LLMs) with Retrieval-Augmented Generation (RAG) for automated inductive coding of Holocaust children's diaries. Our approach integrates multiple smaller LLMs, fine-tuned via Low-Rank Adaptation (LoRA), and employs a moderator-based mechanism to simulate collaborative human consensus. We evaluate our best model on a curated dataset of diaries, demonstrating significant improvements in coding consistency and specificity. Our results highlight the potential of ensemble-based LLMs with RAG for analyzing sensitive historical texts, offering a scalable and efficient alternative to manual coding while preserving the nuanced emotional and thematic content of the diaries.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Inductive coding is a qualitative analysis approach in which codes emerge directly from the data
rather than being predefined. A code represents a concise label that captures the core meaning of a text
segment. This approach is part of thematic analysis, a method for identifying and structuring patterns
in qualitative data [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The process typically involves iteratively generating codes, clustering them into
broader categories, and refining themes to represent the data’s underlying structure. Inductive coding
is particularly useful for exploratory studies, such as historical text analysis, where themes emerge
organically. However, manual thematic analysis is time-consuming and subjective, posing scalability
challenges for large textual datasets.
      </p>
      <p>
        In this work, we propose a novel framework for automated inductive coding using ensemble-based
large language models (LLMs) with Retrieval-Augmented Generation (RAG) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Our approach leverages
the strengths of multiple smaller LLMs (7B and 8B parameters) in an ensemble framework, combining
their outputs and feeding them into a larger moderator LLM to generate high-quality codes that reflect
the thematic and emotional complexity of the texts. To ensure consistency and reduce redundancy, we
integrate RAG, which references previously assigned codes to maintain coherence across similar inputs.
      </p>
      <p>
        This combination of ensemble modeling and RAG addresses key limitations of existing methods [
        <xref ref-type="bibr" rid="ref10 ref5">10, 5</xref>
        ],
offering a scalable and efficient alternative to manual coding while preserving the nuanced content.
      </p>
      <p>We apply our framework to a curated dataset of Holocaust children’s diaries, demonstrating its
effectiveness in capturing recurring themes such as family separation, fear, and hope. Our results
show significant improvements in coding consistency, specificity, and alignment with human-coded
benchmarks, highlighting the potential of ensemble-based LLMs with RAG for analyzing sensitive
historical texts.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>
        Qualitative data analysis (QDA) is one of the main methods in social science research, allowing
researchers to identify, categorize, and interpret patterns within textual data [
        <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
        ]. Central to this
process is the concept of coding, where meaningful segments of text are assigned concise labels, or
codes, that capture their essence (see Figure 1). According to Saldana [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], a code is “a word or short
phrase that symbolically assigns a summative, salient, essence-capturing, and/or evocative attribute for
a portion of language-based or visual data.” In thematic analysis, one of the most widely used methods
in QDA, these codes are further grouped into broader categories to reveal hierarchical relationships
and underlying themes within the data [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>
        Recent advances in natural language processing (NLP) have introduced the use of large language
models (LLMs) to automate qualitative coding tasks [
        <xref ref-type="bibr" rid="ref10 ref14 ref15">14, 10, 15</xref>
        ]. However, two critical challenges
remain unaddressed in this domain. First, traditional evaluation metrics such as BERTScore and ROUGE,
while effective for summarization tasks, are insufficient to assess the quality of qualitative codes [
        <xref ref-type="bibr" rid="ref10 ref5">10, 5</xref>
        ].
Recent work by Chen et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] introduced unsupervised metrics tailored for code evaluation, but these
approaches lack the ability to directly compare model outputs to human annotations. In this work, we
address this gap by proposing a supervised evaluation framework that aligns model-generated codes
with human-coded benchmarks.
      </p>
      <p>
        Second, while individual LLMs demonstrate remarkable performance, their outputs often vary due to
differences in training data, architectures, and model parameters [
        <xref ref-type="bibr" rid="ref16 ref17">16, 17</xref>
        ]. This variability mirrors the
subjectivity inherent in human coding, where individual coders may interpret the same text differently.
To address this challenge, ensemble methods, which aggregate multiple models, have been explored
as a way to combine the strengths of diverse models and improve overall performance [
        <xref ref-type="bibr" rid="ref18 ref19">18, 19</xref>
        ]. For example, Jiang
et al. [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] demonstrated the effectiveness of ensembling in complex natural language generation tasks,
while Cai et al. [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] highlighted the potential of mixture-of-experts (MoE) frameworks for specialized
sub-tasks.
      </p>
      <p>This study builds upon the concept of ensemble methods but diverges from existing approaches by
adopting a moderator-based framework. Unlike fusion techniques that combine outputs probabilistically,
our approach incorporates a final decision-making model tasked with selecting the best candidate or
proposing a novel output. This design reflects the dynamics of human collaboration with a leader, where
consensus is driven by a final arbiter, rather than by averaging or blending opinions. By employing this
moderator model, we aim to mimic this decision-making process and demonstrate its effectiveness in
automating inductive coding tasks, particularly for sensitive historical texts such as personal diaries.</p>
      <p>
        In the context of Holocaust studies, NLP has been increasingly applied to analyze historical texts,
including survivor testimony, diaries, and archival documents. For instance, Schwartz et al. [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] used
topic modeling to identify recurring themes in Holocaust survivor testimonies, while Eisenstein et al.
[
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] employed sentiment analysis to explore emotional patterns in wartime diaries. However, these
studies often rely on traditional NLP techniques, which struggle to capture the nuanced emotional and
thematic content of Holocaust texts.
      </p>
      <p>
        Ensemble learning is a well-established strategy for improving model performance by combining the
strengths of multiple models, often referred to as “weaker models” [
        <xref ref-type="bibr" rid="ref18 ref24">18, 24</xref>
        ]. Common approaches include
weighting individual models based on their performance or aggregating diverse outputs to produce a
unified result. For example, the Mixture-of-Experts (MoE) framework [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] employs specialized sub-models
to make predictions and merges their outputs for improved accuracy. Similarly, LLM-Blender [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]
demonstrates the potential of ensembling by combining ranked outputs from multiple models to achieve
superior performance in complex natural language generation tasks.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>Our pipeline consists of three key stages: (1) input processing by multiple smaller LLMs, (2) moderation
and refinement of outputs by larger LLMs, and (3) retrieval-augmented generation to ensure consistency.
The steps are described below.</p>
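The three stages can be sketched as a simple control loop. Every callable below is a stub standing in for a real model, so the names and signatures are illustrative, not taken from our implementation:

```python
def run_pipeline(entries, models, moderator, rag_lookup):
    """Illustrative control flow of the three pipeline stages."""
    results = []
    for entry in entries:
        # Stage 1: each fine-tuned smaller LLM codes the entry independently.
        candidates = [model(entry) for model in models]
        # Stage 2: a larger moderator LLM consolidates the candidates,
        # picking the best suggestion or proposing a new code.
        code = moderator(entry, candidates)
        # Stage 3: RAG either reuses a similar previously assigned code
        # or registers this code as new.
        results.append(rag_lookup(code))
    return results
```

With three stub models and a moderator that simply takes the first candidate, the loop yields one final code per diary entry.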
      <sec id="sec-3-1">
        <title>3.1. Ensemble Model Framework</title>
        <p>
          Our ensemble framework combines three smaller LLMs (7B and 8B parameters) to process each input
diary entry independently. These models were fine-tuned using Low-Rank Adaptation (LoRA) [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ] on
a diverse corpus of social science data (see Table 1), enabling them to capture domain-specific patterns
while maintaining computational efficiency. LoRA fine-tuning allows for efficient adaptation of
pretrained models to specialized tasks, such as inductive coding, without requiring extensive retraining or
large-scale datasets.
        </p>
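The parameter savings behind LoRA can be illustrated numerically. This toy NumPy sketch is not our training code; the dimensions and rank are invented. It shows that the zero-initialized adapter leaves the pretrained output unchanged at the start of training while introducing far fewer trainable parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 64, 8, 16   # illustrative dimensions, LoRA rank, scaling

W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in))          # trainable down-projection
B = np.zeros((d_out, r))                # trainable up-projection, zero-initialized

x = rng.normal(size=d_in)
base = W @ x
adapted = (W + (alpha / r) * B @ A) @ x  # LoRA-adapted forward pass

# B is zero at initialization, so the adapted model reproduces the base model.
assert np.allclose(base, adapted)

# The adapter trains d_out*r + r*d_in parameters instead of d_out*d_in.
lora_params = B.size + A.size           # 64*8 + 8*64 = 1024
full_params = W.size                    # 64*64 = 4096
```

Only A and B are updated during fine-tuning, which is why LoRA adapts a 7B or 8B model without retraining all of its weights.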
        <p>
          The outputs from these models are evaluated by a moderator model, which refines and consolidates
the results (see Appendix B). The moderator is tasked with assessing the quality and relevance of the
generated codes, ensuring that the final output reflects a consensus among the ensemble. This approach
reduces variability and improves the quality of the generated codes, addressing the inherent subjectivity
of individual LLMs [
          <xref ref-type="bibr" rid="ref16 ref17">16, 17</xref>
          ].
        </p>
        <p>
          RAG is integrated into our pipeline to ensure consistency and reduce redundancy in the coding process.
RAG operates by referencing a database of previously assigned codes, which are retrieved based on
semantic similarity to the current input. For each input, RAG computes the cosine similarity between
the input embedding e_x and the embeddings of previously assigned codes e_c. If the similarity
exceeds a threshold τ, the retrieved code is reused; otherwise, a new code is generated. The integration of RAG
also addresses the challenge of code redundancy, a common issue in automated qualitative coding.
By aligning new outputs with historical coding decisions, RAG ensures that similar inputs receive
consistent labels.
        </p>
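The retrieval decision can be written out in a few lines of NumPy. The two-dimensional embeddings and the threshold of 0.85 below are toy values chosen for illustration only:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def retrieve_or_create(input_emb, code_bank, threshold=0.85):
    """Reuse the most similar stored code if its similarity exceeds the threshold.

    Returns (code, similarity); code is None when a new code should be generated.
    """
    best_code, best_sim = None, -1.0
    for code, emb in code_bank:
        sim = cosine(input_emb, emb)
        if sim > best_sim:
            best_code, best_sim = code, sim
    if best_sim >= threshold:
        return best_code, best_sim     # reuse the existing code
    return None, best_sim              # signal that a new code is needed
```

In the full pipeline, the bank would hold embeddings of every previously assigned code, and a miss would trigger generation of a fresh code that is then added to the bank.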
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Evaluation Metrics</title>
        <p>
          We evaluate our approach using a combination of quantitative metrics (e.g., composite score, ROUGE
[
          <xref ref-type="bibr" rid="ref26">26</xref>
          ], BERTScore [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ]) and qualitative analysis. The composite score, which incorporates semantic,
lexical, and structural alignment, serves as the primary metric for assessing coding quality.
        </p>
        <p>
          Composite Score. To provide a comprehensive evaluation of coding quality, we introduce a Composite
Score S that combines multiple normalized metrics:
        </p>
        <p>S = (1/4) [ C̃ + M̃ + (1 − L̃) + (1 − J̃) ],  (1)</p>
        <p>
          where C̃ is the normalized cosine similarity between code embeddings [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ], measuring semantic alignment
with human-coded references; M̃ is the scaled METEOR score [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ], which balances precision and recall
while accounting for synonymy and stemming; L̃ is the normalized code length percentile, where shorter
codes are preferred to avoid verbosity; and J̃ is the normalized Jensen-Shannon divergence [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ], which quantifies
the distributional similarity between generated and reference codes. Each raw metric value x is normalized using
min-max scaling:
        </p>
        <p>x̃ = (x − min x) / (max x − min x),  (2)</p>
        <p>
          ensuring that all components contribute equally to the Composite Score. The terms (1 − L̃) and
(1 − J̃) invert the code length and divergence metrics, respectively, so that higher values indicate better
performance across all dimensions.
        </p>
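The two formulas transcribe directly into code. The component values in the usage example are invented for illustration, and the variable names are ours, not the paper's notation:

```python
def minmax(values):
    """Min-max scale a list of raw metric values to [0, 1] (Eq. 2)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def composite_score(cos_n, meteor_n, length_n, jsd_n):
    """Composite Score (Eq. 1). Length and divergence enter inverted,
    so higher scores are uniformly better."""
    return 0.25 * (cos_n + meteor_n + (1 - length_n) + (1 - jsd_n))

# Illustrative, already-normalized component values for one generated code:
# high semantic similarity, moderate METEOR, short code, low divergence.
score = composite_score(cos_n=0.9, meteor_n=0.6, length_n=0.2, jsd_n=0.1)
```

A code with perfect semantic and lexical alignment, minimal length, and zero divergence would score exactly 1.0.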
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments and Results</title>
      <p>
        Our experiments began with the training and evaluation of ensemble models using a dataset of 1,000
code-quote pairs compiled from social science research studies and the SemEval-2014 Task 4 dataset
[
        <xref ref-type="bibr" rid="ref31">31</xref>
        ] (see Table 1). The dataset included 600 examples from social science studies and 400 examples from
reviews, each annotated by 3–5 coders to establish mutually agreed golden standard codes. The dataset
was split into training (900 examples) and test (100 examples) sets, with hyperparameters selected based
on training performance.
      </p>
      <p>
        Model Selection and Fine-Tuning We evaluated several open-source LLMs, including Llama3
[
        <xref ref-type="bibr" rid="ref32">32</xref>
        ], Falcon [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ], Mistral [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ], Vicuna [
        <xref ref-type="bibr" rid="ref35">35</xref>
        ], Gemma [
        <xref ref-type="bibr" rid="ref36">36</xref>
        ], and TinyLlama [
        <xref ref-type="bibr" rid="ref37">37</xref>
        ]. Each model was
fine-tuned using Low-Rank Adaptation (LoRA) [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] on the training dataset, enabling efficient adaptation
to the inductive coding task. The fine-tuned models generated an output code for each input quote, which
was evaluated using BERTScore and ROUGE. The top three performing models (Llama3, Falcon, and
Mistral) were selected for the ensemble framework (see Appendix A).
      </p>
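The selection step amounts to ranking models by their evaluation scores and keeping the top three. The numbers below are invented placeholders, not our measured results; only the ranking mechanics are the point:

```python
# Hypothetical (BERTScore F1, ROUGE-L) pairs per fine-tuned model; the values
# are placeholders chosen solely to illustrate the selection step.
scores = {
    "Llama3": (0.89, 0.42), "Falcon": (0.87, 0.40), "Mistral": (0.88, 0.41),
    "Vicuna": (0.84, 0.37), "Gemma": (0.83, 0.36), "TinyLlama": (0.79, 0.31),
}

# Average the two metrics per model and keep the three best performers.
mean_score = {name: sum(pair) / len(pair) for name, pair in scores.items()}
top3 = sorted(mean_score, key=mean_score.get, reverse=True)[:3]
```

With these placeholder values, the top three are Llama3, Mistral, and Falcon, matching the set of models used in the ensemble.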
      <p>Table 1. Overview of dataset characteristics used for LoRA training. (A) Data sources and descriptions, including 600
quotes from social science studies and 400 quotes from SemEval 2014 Task 4. (B) Dataset statistics and splits,
with 900 examples for training and 100 for testing. This dataset was annotated by multiple coders to create a
golden standard and served as the foundation for fine-tuning the base 7B and 8B models.</p>
      <p>Table 1(A). Data sources (N quotes, description). Social Science Studies Data, 600 quotes: interaction with
self-tracking devices (interviews); life transitions and mobility (interviews); interaction with voice assistants
(interviews); museums and cultural experiences (interviews); doctors’ experiences with pregnant women (interviews);
universal and national values (interviews); procrastination and budget planning (interviews); technology interactions
and user feedback (reviews); social expectations (interviews). SemEval 2014, Task 4, 400 quotes: restaurant reviews;
laptop reviews.</p>
      <p>[Table 1(B). Dataset statistics: Total Quotes; Social Science Data; SemEval Data; Num of Data Sources;
Unique Codes; Avg. Quote Length; Avg. Code Length]</p>
      <p>RAG promotes consistent codes by aligning new outputs with previously assigned codes: the cosine
similarity between the input embedding and previously assigned code embeddings is computed, and if it
exceeds a threshold, the existing code is reused; otherwise, a new code is generated. As demonstrated in
Table 2, RAG-enhanced ensembles produce more concise outputs, achieving an average code length
reduction from 6.83 to 4.00 tokens, a 41.5% improvement over non-RAG ensembles.</p>
      <p>Further analysis highlights the impact of RAG on code diversity. While the human gold standard
comprises 47 unique codes with an average length of 2.79 tokens, non-RAG models exhibit excessive code
proliferation, often generating unique codes for each input. In contrast, RAG integration significantly
reduces this redundancy, with Llama3.3 70B Ensemble+RAG and Mixtral 8x7B Ensemble+RAG producing
53 and 71 unique codes, respectively. This brings the models closer to human-like coding efficiency, as
illustrated in Table 2.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Holocaust Dataset Analysis</title>
      <p>
        To evaluate the generalizability of our framework, we applied the best-performing ensemble model
(Mixtral 8x7B with RAG) to a curated dataset of 224 Holocaust children’s diaries. The dataset was
constructed from the book Children in the Holocaust and World War II: Their Secret Diaries by Laurel
Holliday [
        <xref ref-type="bibr" rid="ref38">38</xref>
        ]. We selected diary entries that were explicitly labeled with both day and year, ensuring
temporal consistency and facilitating the analysis of chronological patterns in the children’s experiences.
      </p>
      <sec id="sec-5-1">
        <title>5.1. Results</title>
        <p>Temporal Distribution of Diary Entries. The dataset spans from 1939 to 1945, capturing key
moments in World War II from the perspective of children. Figure 4 shows the distribution of diary
entries over time, revealing a notable increase in the density of entries around major historical events.
For example, the invasion of Poland in 1939 and the intensification of bombings and deportations in
later years are reflected in the children’s writings. This temporal distribution demonstrates how the
evolving wartime environment influenced the frequency and content of their diary entries.</p>
        <p>Thematic Analysis of Codes. Our framework generated a diverse array of codes that reflect the
children’s experiences and emotional states. Early entries, such as those from Janine Phillips in August
and September 1939, focus on themes like Impact of unexpected war news and Family Reunion; Prepared
for War. As the war progressed, the model identified more intense and emotionally charged themes, such
as Devastating bombing begins, War-time scarcity; community support, and Fear of war’s soul-crushing
impact.</p>
        <p>Recurring codes like Loneliness, despair, longing for relief and Severe hunger, bread scarcity illustrate
the isolation and deprivation imposed on the children. At the same time, the model captured moments of
resilience, such as Found purpose, devoted to homeland and Dreaming of peace amidst chaos, highlighting
the children’s capacity for hope and adaptation even in dire circumstances. These findings demonstrate
the model’s ability to capture both the emotional depth and thematic complexity of the diaries.</p>
        <p>Individual Variations. The diaries reveal significant individual variations in how children responded
to their experiences. For instance, Janine Phillips’ entries focus on the immediate shock and logistical
challenges of war, while others, such as those from anonymous authors, emphasize personal reflections
on family, loss, and survival (see Figure 5). For example, one entry describing the emotional toll of being
separated from family members was labeled as Longing for family; emotional isolation, while another
reflecting on the resilience of children in the face of adversity was coded as Hope amidst despair; finding
strength. These examples highlight the model’s sensitivity to the nuanced emotional and thematic
content of the diaries.</p>
        <p>Table 3 presents the most frequently occurring codes generated by the Mixtral 8x7B Ensemble RAG
model. These codes reflect the dominant themes and emotional states documented by children during
the Holocaust. The frequency of each code provides insight into the shared experiences and collective
trauma of the children, as well as their individual responses to the evolving wartime environment.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <p>Our study demonstrates the effectiveness of ensemble-based LLMs with Retrieval-Augmented
Generation (RAG) for automating inductive coding tasks. The results highlight the framework’s ability
to capture the emotional and thematic complexity of sensitive historical texts while maintaining
consistency and reducing redundancy. Below, we discuss the key implications of our findings, address
limitations, and outline directions for future research.</p>
      <sec id="sec-6-1">
        <title>6.1. Ensembles Improve Coding Consistency</title>
        <p>A major finding of our study is that ensemble models consistently outperform individual models in
inductive coding tasks, as shown in Table 2. This suggests that aggregating multiple model outputs
helps reduce inconsistencies, reflecting the consensus-building process employed by human coders in
thematic analysis.</p>
        <p>
          The increased consistency observed in ensemble-generated codes aligns with findings from prior
research on LLM evaluation, which suggest that individual models often introduce unwanted variability
in their outputs due to differences in training data and architectural biases [
          <xref ref-type="bibr" rid="ref16 ref19">16, 19</xref>
          ]. In contrast, ensemble
methods mitigate this variability by integrating diverse inputs, thereby improving robustness. Our
results indicate that this effect holds even for smaller models, making ensemble approaches a practical
solution for qualitative coding tasks.
        </p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. RAG Enhances Code Stability</title>
        <p>The integration of RAG significantly improves code stability, as demonstrated by higher composite and
ROUGE scores in RAG-enhanced ensembles (Table 2). By referencing previously assigned codes, RAG
reduces redundancy and promotes consistency across similar inputs. This is particularly evident in the
reduction of unique code counts (e.g., 53 for Llama3.3 70B+RAG vs. 100 for non-RAG models) and code
length (41.5% reduction), bringing model outputs closer to human-like efficiency.</p>
        <p>In the context of Holocaust diaries, RAG’s ability to align new outputs with historical coding decisions
is crucial for capturing recurring themes such as fear, loss, and resilience. For example, entries describing
the emotional toll of family separation are consistently labeled as Longing for family; emotional isolation,
while reflections on the resilience of children are coded as Hope amidst despair; finding strength. This
consistency enhances the interpretability and usability of the generated codes, making the framework a
valuable tool for analyzing large collections of historical texts.</p>
      </sec>
      <sec id="sec-6-3">
        <title>6.3. Balancing Abstraction and Specificity</title>
        <p>
          This finding reflects a fundamental trade-off in LLM-based coding: while abstraction improves
generalizability, excessive abstraction can obscure critical nuances. Prior work has noted that LLMs trained on
diverse corpora tend to favor generalized patterns over domain-specific details [
          <xref ref-type="bibr" rid="ref14 ref16">16, 14</xref>
          ]. Our results
suggest that ensemble approaches can mitigate this issue by combining diverse levels of abstraction,
thereby producing more balanced and contextually grounded outputs. For example, the Mixtral 8x7B
ensemble generates codes like Devastating bombing begins and Found purpose, devoted to homeland,
which capture both the emotional depth and thematic specificity of the diaries.
        </p>
      </sec>
      <sec id="sec-6-4">
        <title>6.4. Insights into Holocaust Diaries</title>
        <p>The application of our framework to Holocaust children’s diaries provides valuable insights into the
experiences of children during World War II. The frequent codes generated by the model, such as Impact
of unexpected war news, Devastating bombing begins, and Found purpose, devoted to homeland, reflect
the diversity of responses to the war, from shock and despair to resilience and hope. These findings
contribute to a deeper understanding of the emotional and psychological impact of the Holocaust
on children, shedding light on their capacity for adaptation and survival in the face of unimaginable
hardship.</p>
        <p>Moreover, the framework captures individual variations in the diaries, such as Janine
Phillips’ focus on the immediate shock of war versus other children’s reflections on family, loss, and survival.</p>
        <p>Despite its successes, our framework has several limitations that need consideration. First, the
reliance on pre-trained LLMs introduces potential biases inherent in the training data, which may affect
the quality and fairness of the generated codes. While ensemble methods and RAG mitigate some of
these biases, further work is needed to develop bias detection methods.</p>
        <p>Second, the evaluation of automated coding frameworks remains challenging, as no single metric
can fully capture the nuances of human judgment. While our composite score combines multiple
dimensions of coding quality, it may not fully reflect the interpretative depth required for sensitive
historical texts. Future work should explore more sophisticated evaluation frameworks, incorporating
human preference modeling and interactive evaluation setups.</p>
        <p>Finally, the generalizability of our framework to other languages and cultural contexts remains
untested. The Holocaust diaries analyzed in this study are written in English, and the framework’s
performance on multilingual or non-Western texts may differ. Extending the framework to other
languages and cultural settings could reveal additional challenges and opportunities for improvement.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>Our study demonstrates the potential of ensemble-based LLMs with RAG for automating inductive
coding tasks in sensitive and historically significant contexts. The framework’s ability to capture the
emotional and thematic complexity of Holocaust children’s diaries, while maintaining consistency and
scalability, highlights its value for qualitative research. By addressing the limitations and exploring future
directions outlined above, we can further enhance the interpretability, fairness, and generalizability of
automated coding, opening new possibilities for research in history and social science.</p>
    </sec>
    <sec id="sec-8">
      <title>A. Detailed fine-tuning results</title>
      <p>These results (see Table 4) demonstrate the performance of various models when fine-tuned on the task
of open coding using different prompts. BERTScore and ROUGE are reported.</p>
    </sec>
    <sec id="sec-9">
      <title>B. Moderator prompt template</title>
      <p>Listing 1: Moderator Prompt Template with Model Suggestions</p>
      <p>You will be given a paragraph from the text, which is: {textdescription}.
Definition of the code: A word or short phrase that symbolically assigns a summative, salient, essence-capturing, and/or evocative attribute for a portion of language-based or visual data.
Here is the excerpt to code: {row['Paragraph']}
Here are three coding suggestions from previous models:
1. {row['Llama3_Code']}
2. {row['Falcon_Code']}
3. {row['Mistral_Code']}
Please suggest a code taking into account all these answers.
Output should be the code with no longer than 5 words.</p>
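Filling the template in Listing 1 is straightforward string interpolation. This sketch assumes a dict-like `row` holding the excerpt and the three candidate codes, mirroring the field names in the listing; the helper name is ours:

```python
def build_moderator_prompt(text_description, row):
    """Assemble the moderator prompt from the Listing 1 template."""
    return (
        f"You will be given a paragraph from the text, which is: {text_description}.\n"
        "Definition of the code: A word or short phrase that symbolically assigns "
        "a summative, salient, essence-capturing, and/or evocative attribute for "
        "a portion of language-based or visual data.\n"
        f"Here is the excerpt to code: {row['Paragraph']}\n"
        "Here are three coding suggestions from previous models:\n"
        f"1. {row['Llama3_Code']}\n"
        f"2. {row['Falcon_Code']}\n"
        f"3. {row['Mistral_Code']}\n"
        "Please suggest a code taking into account all these answers.\n"
        "Output should be the code with no longer than 5 words."
    )
```

The assembled string is what the moderator model receives for each paragraph, with one numbered suggestion per ensemble member.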
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Levi</surname>
          </string-name>
          ,
          <article-title>The Drowned and the Saved</article-title>
          , Summit Books,
          <year>1986</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Saldana</surname>
          </string-name>
          ,
          <article-title>The Coding Manual for Qualitative Researchers</article-title>
          ,
          <source>SAGE Publications</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Boyd</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Crawford</surname>
          </string-name>
          ,
          <article-title>Critical questions for big data</article-title>
          ,
          <source>Information, Communication &amp; Society</source>
          <volume>15</volume>
          (
          <year>2013</year>
          )
          <fpage>662</fpage>
          -
          <lpage>679</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Matter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schirmer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Grinberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pfeffer</surname>
          </string-name>
          ,
          <article-title>Close to human-level agreement: Tracing journeys of violent speech in incel posts with gpt-4-enhanced annotations</article-title>
          ,
          <year>2024</year>
          . arXiv:2401.02001.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lotsos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hullman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Sherin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Wilensky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Horn</surname>
          </string-name>
          ,
          <article-title>A computational method for measuring "open codes" in qualitative analysis</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2411.12142. arXiv:2411.12142.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Ziems</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Held</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Shaikh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Can large language models transform computational social science?</article-title>
          ,
          <source>Computational Linguistics</source>
          <volume>50</volume>
          (
          <year>2024</year>
          )
          <fpage>237</fpage>
          -
          <lpage>291</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hirsch</surname>
          </string-name>
          ,
          <source>Family Frames: Photography, Narrative, and Postmemory</source>
          , Harvard University Press,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>V.</given-names>
            <surname>Braun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Clarke</surname>
          </string-name>
          ,
          <article-title>Thematic analysis: A reflexive approach</article-title>
          ,
          <source>International Journal of Qualitative Research</source>
          <volume>11</volume>
          (
          <year>2019</year>
          )
          <fpage>301</fpage>
          -
          <lpage>310</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Piktus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Petroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Karpukhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Küttler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-t.</given-names>
            <surname>Yih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rocktäschel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kiela</surname>
          </string-name>
          ,
          <article-title>Retrieval-augmented generation for knowledge-intensive nlp tasks</article-title>
          ,
          <year>2021</year>
          . URL: https://arxiv.org/abs/2005.11401. arXiv:2005.11401.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D.</given-names>
            <surname>Parfenova</surname>
          </string-name>
          , et al.,
          <article-title>Automating qualitative analysis with LLMs</article-title>
          ,
          <source>Proceedings of ACL 2024</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>G. A.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Beckwith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Fellbaum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Gross</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. J.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <article-title>Introduction to wordnet: An on-line lexical database</article-title>
          ,
          <source>International journal of lexicography 3</source>
          (
          <year>1990</year>
          )
          <fpage>235</fpage>
          -
          <lpage>244</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Creswell</surname>
          </string-name>
          ,
          <article-title>30 Essential Skills for the Qualitative Researcher</article-title>
          ,
          <source>SAGE Publications</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>V.</given-names>
            <surname>Braun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Clarke</surname>
          </string-name>
          ,
          <article-title>Thematic Analysis: A Practical Guide</article-title>
          ,
          <source>SAGE Publications</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>P.</given-names>
            <surname>Törnberg</surname>
          </string-name>
          ,
          <article-title>Using large language models for automated qualitative coding in the social sciences</article-title>
          ,
          <source>Nature Machine Intelligence</source>
          <volume>5</volume>
          (
          <year>2023</year>
          )
          <fpage>576</fpage>
          -
          <lpage>586</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>T.</given-names>
            <surname>Fischer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Biemann</surname>
          </string-name>
          ,
          <article-title>Exploring large language models for qualitative data analysis</article-title>
          , in:
          <string-name>
            <given-names>M.</given-names>
            <surname>Hämäläinen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Öhman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Miyagawa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Alnajjar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bizzoni</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities</source>
          , Association for Computational Linguistics, Miami, USA,
          <year>2024</year>
          , pp.
          <fpage>423</fpage>
          -
          <lpage>437</lpage>
          . URL: https://aclanthology.org/2024.nlp4dh-1.41/. doi:10.18653/v1/2024.nlp4dh-1.41.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bubeck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chandak</surname>
          </string-name>
          , et al.,
          <article-title>Sparks of artificial general intelligence: Early experiments with GPT-4</article-title>
          ,
          <source>arXiv preprint arXiv:2303.12712</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lavril</surname>
          </string-name>
          , et al.,
          <article-title>LLaMA: Open and efficient foundation language models</article-title>
          ,
          <source>in: Proceedings of the 2023 Annual Conference on Machine Learning (ICML)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>123</fpage>
          -
          <lpage>134</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>O.</given-names>
            <surname>Sagi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Rokach</surname>
          </string-name>
          ,
          <article-title>Ensemble learning: A survey</article-title>
          ,
          <source>Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery</source>
          <volume>8</volume>
          (
          <year>2018</year>
          ) e1249.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>D.</given-names>
            <surname>Jiang</surname>
          </string-name>
          , et al.,
          <article-title>Llm-blender: Ensembling large language models with pairwise ranking and generative fusion</article-title>
          ,
          <source>Proceedings of ACL</source>
          <year>2023</year>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>D.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Llm-blender: Ensembling large language models with pairwise ranking and generative fusion</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2306.02561. arXiv:
          <volume>2306</volume>
          .
          <fpage>02561</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>W.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>A survey on mixture of experts</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2407.06204. arXiv:2407.06204.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>D.</given-names>
            <surname>Schwartz</surname>
          </string-name>
          , et al.,
          <article-title>Topic modeling holocaust survivor testimonies</article-title>
          ,
          <source>Journal of Digital Humanities</source>
          <volume>8</volume>
          (
          <year>2019</year>
          )
          <fpage>45</fpage>
          -
          <lpage>60</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>J.</given-names>
            <surname>Eisenstein</surname>
          </string-name>
          , et al.,
          <article-title>Sentiment analysis of wartime diaries</article-title>
          ,
          <source>Computational Linguistics</source>
          <volume>47</volume>
          (
          <year>2021</year>
          )
          <fpage>601</fpage>
          -
          <lpage>630</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>A.</given-names>
            <surname>Aniol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pietron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Duda</surname>
          </string-name>
          ,
          <article-title>Ensemble approach for natural language question answering problem</article-title>
          ,
          <source>in: 2019 Seventh International Symposium on Computing and Networking Workshops (CANDARW)</source>
          , IEEE,
          <year>2019</year>
          , pp.
          <fpage>180</fpage>
          -
          <lpage>183</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wallis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Allen-Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>LoRA: Low-rank adaptation of large language models</article-title>
          ,
          <year>2021</year>
          . URL: https://arxiv.org/abs/2106.09685. arXiv:2106.09685.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>C.-Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>ROUGE: A package for automatic evaluation of summaries</article-title>
          , in:
          <source>Text Summarization Branches Out</source>
          , Association for Computational Linguistics
          , Barcelona, Spain,
          <year>2004</year>
          , pp.
          <fpage>74</fpage>
          -
          <lpage>81</lpage>
          . URL: https://aclanthology.org/W04-1013.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kishore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Artzi</surname>
          </string-name>
          ,
          <article-title>BERTScore: Evaluating text generation with BERT</article-title>
          ,
          <source>CoRR abs/1904.09675</source>
          (
          <year>2019</year>
          ). URL: http://arxiv.org/abs/1904.09675. arXiv:1904.09675.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>H.</given-names>
            <surname>Steck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ekanadham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kallus</surname>
          </string-name>
          ,
          <article-title>Is cosine-similarity of embeddings really about similarity?</article-title>
          ,
          <source>in: Companion Proceedings of the ACM Web Conference 2024</source>
          , WWW '24, ACM,
          <year>2024</year>
          , pp.
          <fpage>887</fpage>
          -
          <lpage>890</lpage>
          . URL: http://dx.doi.org/10.1145/3589335.3651526. doi:10.1145/3589335.3651526.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>S.</given-names>
            <surname>Banerjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lavie</surname>
          </string-name>
          ,
          <article-title>METEOR: An automatic metric for MT evaluation with improved correlation with human judgments</article-title>
          , in:
          <string-name>
            <given-names>J.</given-names>
            <surname>Goldstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lavie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Voss</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization</source>
          , Association for Computational Linguistics, Ann Arbor, Michigan,
          <year>2005</year>
          , pp.
          <fpage>65</fpage>
          -
          <lpage>72</lpage>
          . URL: https://aclanthology.org/W05-0909/.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Menéndez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pardo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Pardo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pardo</surname>
          </string-name>
          ,
          <article-title>The Jensen-Shannon divergence</article-title>
          ,
          <source>Journal of the Franklin Institute</source>
          <volume>334</volume>
          (
          <year>1997</year>
          )
          <fpage>307</fpage>
          -
          <lpage>318</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>M.</given-names>
            <surname>Pontiki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Galanis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pavlopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Papageorgiou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Androutsopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Manandhar</surname>
          </string-name>
          ,
          <article-title>SemEval-2014 task 4: Aspect based sentiment analysis</article-title>
          , in: P. Nakov, T. Zesch (Eds.),
          <source>Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)</source>
          , Association for Computational Linguistics, Dublin, Ireland,
          <year>2014</year>
          , pp.
          <fpage>27</fpage>
          -
          <lpage>35</lpage>
          . URL: https://aclanthology.org/S14-2004. doi:10.3115/v1/S14-2004.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lavril</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Izacard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Martinet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-A.</given-names>
            <surname>Lachaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lacroix</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Rozière</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hambro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Azhar</surname>
          </string-name>
          , et al.,
          <article-title>LLaMA: Open and efficient foundation language models</article-title>
          ,
          <source>arXiv preprint arXiv:2302.13971</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>F.</given-names>
            <surname>Pineda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Milliere</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vlachos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Yates</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Glaese</surname>
          </string-name>
          , et al.,
          <article-title>The Falcon series of language models</article-title>
          ,
          <source>arXiv preprint arXiv:2306.01116</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <surname>Mistral AI Team</surname>
          </string-name>
          ,
          <source>Mistral: Efficient pretraining of transformer language models</source>
          ,
          <year>2023</year>
          . URL: https://mistral.ai.
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xie</surname>
          </string-name>
          , et al.,
          <article-title>Vicuna: An open-source chatbot</article-title>
          , FastChat: Open Assistant (
          <year>2023</year>
          ). Available at https://vicuna.lmsys.org/.
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>G. A. R.</given-names>
            <surname>Team</surname>
          </string-name>
          ,
          <article-title>Gemma: An instructable, open-source large language model</article-title>
          ,
          <year>2024</year>
          . URL: https://gemma.ai.
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jiang</surname>
          </string-name>
          , et al.,
          <article-title>TinyLlama: Distilling large language models for efficiency</article-title>
          ,
          <source>arXiv preprint arXiv:2310.05637</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>L.</given-names>
            <surname>Holliday</surname>
          </string-name>
          ,
          <article-title>Children in the Holocaust and World War II: Their Secret Diaries</article-title>
          , Washington Square Press,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>