<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>from Holocaust Diaries with Ensemble LLMs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Angelina Parfenova</string-name>
          <email>angelina.parfenova@tum.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>In: R. Campos, A. Jorge, A. Jatowt, S. Bhatia, M. Litvak (eds.): Proceedings of the Text2Story'25 Workshop</institution>
          ,
          <addr-line>Lucca</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Inductive coding</institution>
          ,
          <addr-line>Holocaust diaries, Ensemble models, Retrieval-Augmented Generation, Qualitative analysis</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Lucerne University of Applied Sciences and Arts</institution>
          ,
          <addr-line>Rotkreuz</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Technical University of Munich</institution>
          ,
          <addr-line>Munich</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This paper presents a novel application of ensemble-based large language models (LLMs) with Retrieval-Augmented Generation (RAG) for automated inductive coding of Holocaust children's diaries. Our approach integrates multiple smaller LLMs, fine-tuned via Low-Rank Adaptation (LoRA), and employs a moderator-based mechanism to simulate collaborative human consensus. We evaluate our best model on a curated dataset of diaries, demonstrating significant improvements in coding consistency and specificity. Our results highlight the potential of ensemble-based LLMs with RAG for analyzing sensitive historical texts, offering a scalable and efficient alternative to manual coding while preserving the nuanced emotional and thematic content of the diaries.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Inductive coding is a qualitative analysis approach in which codes emerge directly from the data
rather than being predefined. A code represents a concise label that captures the core meaning of a text
segment. This approach is part of thematic analysis, a method for identifying and structuring patterns
in qualitative data [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The process typically involves iteratively generating codes, clustering them into
broader categories, and refining themes to represent the data’s underlying structure. Inductive coding
is particularly useful for exploratory studies, such as historical text analysis, where themes emerge
organically. However, manual thematic analysis is time-consuming and subjective, posing scalability
challenges for large textual datasets.
      </p>
      <p>
        In this work, we propose a novel framework for automated inductive coding using ensemble-based
large language models (LLMs) with Retrieval-Augmented Generation (RAG) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Our approach leverages
the strengths of multiple smaller LLMs (7B and 8B parameters) in an ensemble framework, combining
their outputs and feeding them into a larger moderator LLM to generate high-quality codes that reflect
the thematic and emotional complexity of the texts. To ensure consistency and reduce redundancy, we
integrate RAG, which references previously assigned codes to maintain coherence across similar inputs.
      </p>
      <p>
        This combination of ensemble modeling and RAG addresses key limitations of existing methods [
        <xref ref-type="bibr" rid="ref10 ref5">10, 5</xref>
        ],
offering a scalable and efficient alternative to manual coding while preserving the nuanced content.
      </p>
      <p>We apply our framework to a curated dataset of Holocaust children’s diaries, demonstrating its
effectiveness in capturing recurring themes such as family separation, fear, and hope. Our results
show significant improvements in coding consistency, specificity, and alignment with human-coded
benchmarks, highlighting the potential of ensemble-based LLMs with RAG for analyzing sensitive
historical texts.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>
        Qualitative data analysis (QDA) is one of the main methods in social science research, allowing
researchers to identify, categorize, and interpret patterns within textual data [
        <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
        ]. Central to this
process is the concept of coding, where meaningful segments of text are assigned concise labels, or
codes, that capture their essence (see Figure 1). According to Saldana [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], a code is “a word or short
phrase that symbolically assigns a summative, salient, essence-capturing, and/or evocative attribute for
a portion of language-based or visual data.” In thematic analysis, one of the most widely used methods
in QDA, these codes are further grouped into broader categories to reveal hierarchical relationships
and underlying themes within the data [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>
        Recent advances in natural language processing (NLP) have introduced the use of large language
models (LLMs) to automate qualitative coding tasks [
        <xref ref-type="bibr" rid="ref10 ref14 ref15">14, 10, 15</xref>
        ]. However, two critical challenges
remain unaddressed in this domain. First, traditional evaluation metrics such as BERTScore and ROUGE,
while effective for summarization tasks, are insufficient to assess the quality of qualitative codes [
        <xref ref-type="bibr" rid="ref10 ref5">10, 5</xref>
        ].
Recent work by Chen et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] introduced unsupervised metrics tailored for code evaluation, but these
approaches lack the ability to directly compare model outputs to human annotations. In this work, we
address this gap by proposing a supervised evaluation framework that aligns model-generated codes
with human-coded benchmarks.
      </p>
      <p>
        Second, while individual LLMs demonstrate remarkable performance, their outputs often vary due to
differences in training data, architectures, and model parameters [
        <xref ref-type="bibr" rid="ref16 ref17">16, 17</xref>
        ]. This variability mirrors the
subjectivity inherent in human coding, where individual coders may interpret the same text differently.
To address this challenge, ensemble methods, which aggregate multiple models, have been explored
as a way to combine the strengths of diverse models and improve overall performance [
        <xref ref-type="bibr" rid="ref18 ref19">18, 19</xref>
        ]. For example, Jiang
et al. [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] demonstrated the effectiveness of ensembling in complex natural language generation tasks,
while Cai et al. [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] highlighted the potential of mixture-of-experts (MoE) frameworks for specialized
sub-tasks.
      </p>
      <p>This study builds upon the concept of ensemble methods but diverges from existing approaches by
adopting a moderator-based framework. Unlike fusion techniques that combine outputs probabilistically,
our approach incorporates a final decision-making model tasked with selecting the best candidate or
proposing a novel output. This design reflects the dynamics of human collaboration with a leader, where
consensus is driven by a final arbiter, rather than by averaging or blending opinions. By employing this
moderator model, we aim to mimic this decision-making process and demonstrate its effectiveness in
automating inductive coding tasks, particularly for sensitive historical texts such as personal diaries.</p>
      <p>
        In the context of Holocaust studies, NLP has been increasingly applied to analyze historical texts,
including survivor testimony, diaries, and archival documents. For instance, Schwartz et al. [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] used
topic modeling to identify recurring themes in Holocaust survivor testimonies, while Eisenstein et al.
[
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] employed sentiment analysis to explore emotional patterns in wartime diaries. However, these
studies often rely on traditional NLP techniques, which struggle to capture the nuanced emotional and
thematic content of Holocaust texts.
      </p>
      <p>
        Ensemble learning is a well-established strategy for improving model performance by combining the
strengths of multiple models, often referred to as “weaker models” [
        <xref ref-type="bibr" rid="ref18 ref24">18, 24</xref>
        ]. Common approaches include
weighting individual models based on their performance or aggregating diverse outputs to produce a
unified result. For example, the Mixture-of-Experts (MoE) framework [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] employs specialized sub-models
to make predictions and merges their outputs for improved accuracy. Similarly, LLM-Blender [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]
demonstrates the potential of ensembling by combining ranked outputs from multiple models to achieve
superior performance in complex natural language generation tasks.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>Our pipeline consists of three key stages: (1) input processing by multiple smaller LLMs, (2) moderation
and refinement of outputs by larger LLMs, and (3) retrieval-augmented generation to ensure consistency.
The steps are described below.</p>
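The three stages can be sketched as a simple control loop. Every callable below is a stub standing in for a real model, so the names and signatures are illustrative, not taken from our implementation:

```python
def run_pipeline(entries, models, moderator, rag_lookup):
    """Illustrative control flow of the three pipeline stages."""
    results = []
    for entry in entries:
        # Stage 1: each fine-tuned smaller LLM codes the entry independently.
        candidates = [model(entry) for model in models]
        # Stage 2: a larger moderator LLM consolidates the candidates,
        # picking the best suggestion or proposing a new code.
        code = moderator(entry, candidates)
        # Stage 3: RAG either reuses a similar previously assigned code
        # or registers this code as new.
        results.append(rag_lookup(code))
    return results
```

With three stub models and a moderator that simply takes the first candidate, the loop yields one final code per diary entry.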
      <sec id="sec-3-1">
        <title>3.1. Ensemble Model Framework</title>
        <p>
          Our ensemble framework combines three smaller LLMs (7B and 8B parameters) to process each input
diary entry independently. These models were fine-tuned using Low-Rank Adaptation (LoRA) [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ] on
a diverse corpus of social science data (see Table 1), enabling them to capture domain-specific patterns
while maintaining computational efficiency. LoRA fine-tuning allows for efficient adaptation of
pretrained models to specialized tasks, such as inductive coding, without requiring extensive retraining or
large-scale datasets.
        </p>
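The parameter savings behind LoRA can be illustrated numerically. This toy NumPy sketch is not our training code; the dimensions and rank are invented. It shows that the zero-initialized adapter leaves the pretrained output unchanged at the start of training while introducing far fewer trainable parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 64, 8, 16   # illustrative dimensions, LoRA rank, scaling

W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in))          # trainable down-projection
B = np.zeros((d_out, r))                # trainable up-projection, zero-initialized

x = rng.normal(size=d_in)
base = W @ x
adapted = (W + (alpha / r) * B @ A) @ x  # LoRA-adapted forward pass

# B is zero at initialization, so the adapted model reproduces the base model.
assert np.allclose(base, adapted)

# The adapter trains d_out*r + r*d_in parameters instead of d_out*d_in.
lora_params = B.size + A.size           # 64*8 + 8*64 = 1024
full_params = W.size                    # 64*64 = 4096
```

Only A and B are updated during fine-tuning, which is why LoRA adapts a 7B or 8B model without retraining all of its weights.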
        <p>
          The outputs from these models are evaluated by a moderator model, which refines and consolidates
the results (see Appendix B). The moderator is tasked with assessing the quality and relevance of the
generated codes, ensuring that the final output reflects a consensus among the ensemble. This approach
reduces variability and improves the quality of the generated codes, addressing the inherent subjectivity
of individual LLMs [
          <xref ref-type="bibr" rid="ref16 ref17">16, 17</xref>
          ].
        </p>
        <p>
          RAG is integrated into our pipeline to ensure consistency and reduce redundancy in the coding process.
RAG operates by referencing a database of previously assigned codes, which are retrieved based on
semantic similarity to the current input. For each input, RAG computes the cosine similarity between
the input embedding e_x and the embeddings of previously assigned codes e_c. If the similarity
exceeds a threshold τ, the retrieved code is reused; otherwise, a new code is generated. The integration of RAG
also addresses the challenge of code redundancy, a common issue in automated qualitative coding.
By aligning new outputs with historical coding decisions, RAG ensures that similar inputs receive
consistent labels.
        </p>
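The retrieval decision can be written out in a few lines of NumPy. The two-dimensional embeddings and the threshold of 0.85 below are toy values chosen for illustration only:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def retrieve_or_create(input_emb, code_bank, threshold=0.85):
    """Reuse the most similar stored code if its similarity exceeds the threshold.

    Returns (code, similarity); code is None when a new code should be generated.
    """
    best_code, best_sim = None, -1.0
    for code, emb in code_bank:
        sim = cosine(input_emb, emb)
        if sim > best_sim:
            best_code, best_sim = code, sim
    if best_sim >= threshold:
        return best_code, best_sim     # reuse the existing code
    return None, best_sim              # signal that a new code is needed
```

In the full pipeline, the bank would hold embeddings of every previously assigned code, and a miss would trigger generation of a fresh code that is then added to the bank.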
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Evaluation Metrics</title>
        <p>
          We evaluate our approach using a combination of quantitative metrics (e.g., composite score, ROUGE
[
          <xref ref-type="bibr" rid="ref26">26</xref>
          ], BERTScore [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ]) and qualitative analysis. The composite score, which incorporates semantic,
lexical, and structural alignment, serves as the primary metric for assessing coding quality.
        </p>
        <p>
          Composite Score. To provide a comprehensive evaluation of coding quality, we introduce a Composite
Score S that combines multiple normalized metrics:
        </p>
        <p>S = (1/4) [ C̃ + M̃ + (1 − L̃) + (1 − J̃) ],  (1)</p>
        <p>
          where C̃ is the normalized cosine similarity between code embeddings [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ], measuring semantic alignment
with human-coded references; M̃ is the scaled METEOR score [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ], which balances precision and recall
while accounting for synonymy and stemming; L̃ is the normalized code length percentile, where shorter
codes are preferred to avoid verbosity; and J̃ is the normalized Jensen-Shannon divergence [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ], which quantifies
the distributional similarity between generated and reference codes. Each raw metric value x is normalized using
min-max scaling:
        </p>
        <p>x̃ = (x − min x) / (max x − min x),  (2)</p>
        <p>
          ensuring that all components contribute equally to the Composite Score. The terms (1 − L̃) and
(1 − J̃) invert the code length and divergence metrics, respectively, so that higher values indicate better
performance across all dimensions.
        </p>
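The two formulas transcribe directly into code. The component values in the usage example are invented for illustration, and the variable names are ours, not the paper's notation:

```python
def minmax(values):
    """Min-max scale a list of raw metric values to [0, 1] (Eq. 2)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def composite_score(cos_n, meteor_n, length_n, jsd_n):
    """Composite Score (Eq. 1). Length and divergence enter inverted,
    so higher scores are uniformly better."""
    return 0.25 * (cos_n + meteor_n + (1 - length_n) + (1 - jsd_n))

# Illustrative, already-normalized component values for one generated code:
# high semantic similarity, moderate METEOR, short code, low divergence.
score = composite_score(cos_n=0.9, meteor_n=0.6, length_n=0.2, jsd_n=0.1)
```

A code with perfect semantic and lexical alignment, minimal length, and zero divergence would score exactly 1.0.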
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments and Results</title>
      <p>
        Our experiments began with the training and evaluation of ensemble models using a dataset of 1,000
code-quote pairs compiled from social science research studies and the SemEval-2014 Task 4 dataset
[
        <xref ref-type="bibr" rid="ref31">31</xref>
        ] (see Table 1). The dataset included 600 examples from social science studies and 400 examples from
reviews, each annotated by 3–5 coders to establish mutually agreed golden standard codes. The dataset
was split into training (900 examples) and test (100 examples) sets, with hyperparameters selected based
on training performance.
      </p>
      <p>
        Model Selection and Fine-Tuning We evaluated several open-source LLMs, including Llama3
[
        <xref ref-type="bibr" rid="ref32">32</xref>
        ], Falcon [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ], Mistral [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ], Vicuna [
        <xref ref-type="bibr" rid="ref35">35</xref>
        ], Gemma [
        <xref ref-type="bibr" rid="ref36">36</xref>
        ], and TinyLlama [
        <xref ref-type="bibr" rid="ref37">37</xref>
        ]. Each model was
fine-tuned using Low-Rank Adaptation (LoRA) [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] on the training dataset, enabling efficient adaptation
to the inductive coding task. The fine-tuned models generated an output code for each input quote, which
was evaluated using BERTScore and ROUGE. The top three performing models (Llama3, Falcon, and
Mistral) were selected for the ensemble framework (see Appendix A).
      </p>
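The selection step amounts to ranking models by their evaluation scores and keeping the top three. The numbers below are invented placeholders, not our measured results; only the ranking mechanics are the point:

```python
# Hypothetical (BERTScore F1, ROUGE-L) pairs per fine-tuned model; the values
# are placeholders chosen solely to illustrate the selection step.
scores = {
    "Llama3": (0.89, 0.42), "Falcon": (0.87, 0.40), "Mistral": (0.88, 0.41),
    "Vicuna": (0.84, 0.37), "Gemma": (0.83, 0.36), "TinyLlama": (0.79, 0.31),
}

# Average the two metrics per model and keep the three best performers.
mean_score = {name: sum(pair) / len(pair) for name, pair in scores.items()}
top3 = sorted(mean_score, key=mean_score.get, reverse=True)[:3]
```

With these placeholder values, the top three are Llama3, Mistral, and Falcon, matching the set of models used in the ensemble.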
      <p>Table 1. Overview of dataset characteristics used for LoRA training. (A) Data sources and descriptions, including 600
quotes from social science studies and 400 quotes from SemEval 2014 Task 4. (B) Dataset statistics and splits,
with 900 examples for training and 100 for testing. This dataset was annotated by multiple coders to create a
golden standard and served as the foundation for fine-tuning the base 7B and 8B models.</p>
      <p>Table 1(A). Data sources (N quotes, description). Social Science Studies Data, 600 quotes: interaction with
self-tracking devices (interviews); life transitions and mobility (interviews); interaction with voice assistants
(interviews); museums and cultural experiences (interviews); doctors’ experiences with pregnant women (interviews);
universal and national values (interviews); procrastination and budget planning (interviews); technology interactions
and user feedback (reviews); social expectations (interviews). SemEval 2014, Task 4, 400 quotes: restaurant reviews;
laptop reviews.</p>
      <p>[Table 1(B). Dataset statistics: Total Quotes; Social Science Data; SemEval Data; Num of Data Sources;
Unique Codes; Avg. Quote Length; Avg. Code Length]</p>
      <p>RAG promotes consistent codes by aligning new outputs with previously assigned codes: the cosine
similarity between the input embedding and previously assigned code embeddings is computed, and if it
exceeds a threshold, the existing code is reused; otherwise, a new code is generated. As demonstrated in
Table 2, RAG-enhanced ensembles produce more concise outputs, achieving an average code length
reduction from 6.83 to 4.00 tokens, a 41.5% improvement over non-RAG ensembles.</p>
      <p>Further analysis highlights the impact of RAG on code diversity. While the human gold standard
comprises 47 unique codes with an average length of 2.79 tokens, non-RAG models exhibit excessive code
proliferation, often generating unique codes for each input. In contrast, RAG integration significantly
reduces this redundancy, with Llama3.3 70B Ensemble+RAG and Mixtral 8x7B Ensemble+RAG producing
53 and 71 unique codes, respectively. This brings the models closer to human-like coding efficiency, as
illustrated in Table 2.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Holocaust Dataset Analysis</title>
      <p>
        To evaluate the generalizability of our framework, we applied the best-performing ensemble model
(Mixtral 8x7B with RAG) to a curated dataset of 224 Holocaust children’s diaries. The dataset was
constructed from the book Children in the Holocaust and World War II: Their Secret Diaries by Laurel
Holliday [
        <xref ref-type="bibr" rid="ref38">38</xref>
        ]. We selected diary entries that were explicitly labeled with both day and year, ensuring
temporal consistency and facilitating the analysis of chronological patterns in the children’s experiences.
      </p>
      <sec id="sec-5-1">
        <title>5.1. Results</title>
        <p>Temporal Distribution of Diary Entries. The dataset spans from 1939 to 1945, capturing key
moments in World War II from the perspective of children. Figure 4 shows the distribution of diary
entries over time, revealing a notable increase in the density of entries around major historical events.
For example, the invasion of Poland in 1939 and the intensification of bombings and deportations in
later years are reflected in the children’s writings. This temporal distribution demonstrates how the
evolving wartime environment influenced the frequency and content of their diary entries.</p>
        <p>Thematic Analysis of Codes. Our framework generated a diverse array of codes that reflect the
children’s experiences and emotional states. Early entries, such as those from Janine Phillips in August
and September 1939, focus on themes like Impact of unexpected war news and Family Reunion; Prepared
for War. As the war progressed, the model identified more intense and emotionally charged themes, such
as Devastating bombing begins, War-time scarcity; community support, and Fear of war’s soul-crushing
impact.</p>
        <p>Recurring codes like Loneliness, despair, longing for relief and Severe hunger, bread scarcity illustrate
the isolation and deprivation imposed on the children. At the same time, the model captured moments of
resilience, such as Found purpose, devoted to homeland and Dreaming of peace amidst chaos, highlighting
the children’s capacity for hope and adaptation even in dire circumstances. These findings demonstrate
the model’s ability to capture both the emotional depth and thematic complexity of the diaries.</p>
        <p>Individual Variations. The diaries reveal significant individual variations in how children responded
to their experiences. For instance, Janine Phillips’ entries focus on the immediate shock and logistical
challenges of war, while others, such as those from anonymous authors, emphasize personal reflections
on family, loss, and survival (see Figure 5). For example, one entry describing the emotional toll of being
separated from family members was labeled as Longing for family; emotional isolation, while another
reflecting on the resilience of children in the face of adversity was coded as Hope amidst despair; finding
strength. These examples highlight the model’s sensitivity to the nuanced emotional and thematic
content of the diaries.</p>
        <p>Table 3 presents the most frequently occurring codes generated by the Mixtral 8x7B Ensemble RAG
model. These codes reflect the dominant themes and emotional states documented by children during
the Holocaust. The frequency of each code provides insight into the shared experiences and collective
trauma of the children, as well as their individual responses to the evolving wartime environment.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <p>Our study demonstrates the effectiveness of ensemble-based LLMs with Retrieval-Augmented
Generation (RAG) for automating inductive coding tasks. The results highlight the framework’s ability
to capture the emotional and thematic complexity of sensitive historical texts while maintaining
consistency and reducing redundancy. Below, we discuss the key implications of our findings, address
limitations, and outline directions for future research.</p>
      <sec id="sec-6-1">
        <title>6.1. Ensembles Improve Coding Consistency</title>
        <p>A major finding of our study is that ensemble models consistently outperform individual models in
inductive coding tasks, as shown in Table 2. This suggests that aggregating multiple model outputs
helps reduce inconsistencies, reflecting the consensus-building process employed by human coders in
thematic analysis.</p>
        <p>
          The increased consistency observed in ensemble-generated codes aligns with findings from prior
research on LLM evaluation, which suggest that individual models often introduce unwanted variability
in their outputs due to differences in training data and architectural biases [
          <xref ref-type="bibr" rid="ref16 ref19">16, 19</xref>
          ]. In contrast, ensemble
methods mitigate this variability by integrating diverse inputs, thereby improving robustness. Our
results indicate that this effect holds even for smaller models, making ensemble approaches a practical
solution for qualitative coding tasks.
        </p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. RAG Enhances Code Stability</title>
        <p>The integration of RAG significantly improves code stability, as demonstrated by higher composite and
ROUGE scores in RAG-enhanced ensembles (Table 2). By referencing previously assigned codes, RAG
reduces redundancy and promotes consistency across similar inputs. This is particularly evident in the
reduction of unique code counts (e.g., 53 for Llama3.3 70B+RAG vs. 100 for non-RAG models) and code
length (41.5% reduction), bringing model outputs closer to human-like efficiency.</p>
        <p>In the context of Holocaust diaries, RAG’s ability to align new outputs with historical coding decisions
is crucial for capturing recurring themes such as fear, loss, and resilience. For example, entries describing
the emotional toll of family separation are consistently labeled as Longing for family; emotional isolation,
while reflections on the resilience of children are coded as Hope amidst despair; finding strength. This
consistency enhances the interpretability and usability of the generated codes, making the framework a
valuable tool for analyzing large collections of historical texts.</p>
      </sec>
      <sec id="sec-6-3">
        <title>6.3. Balancing Abstraction and Specificity</title>
        <p>
          This finding reflects a fundamental trade-off in LLM-based coding: while abstraction improves
generalizability, excessive abstraction can obscure critical nuances. Prior work has noted that LLMs trained on
diverse corpora tend to favor generalized patterns over domain-specific details [
          <xref ref-type="bibr" rid="ref14 ref16">16, 14</xref>
          ]. Our results
suggest that ensemble approaches can mitigate this issue by combining diverse levels of abstraction,
thereby producing more balanced and contextually grounded outputs. For example, the Mixtral 8x7B
ensemble generates codes like Devastating bombing begins and Found purpose, devoted to homeland,
which capture both the emotional depth and thematic specificity of the diaries.
        </p>
      </sec>
      <sec id="sec-6-4">
        <title>6.4. Insights into Holocaust Diaries</title>
        <p>The application of our framework to Holocaust children’s diaries provides valuable insights into the
experiences of children during World War II. The frequent codes generated by the model, such as Impact
of unexpected war news, Devastating bombing begins, and Found purpose, devoted to homeland, reflect
the diversity of responses to the war, from shock and despair to resilience and hope. These findings
contribute to a deeper understanding of the emotional and psychological impact of the Holocaust
on children, shedding light on their capacity for adaptation and survival in the face of unimaginable
hardship.</p>
        <p>Moreover, the framework captures individual variations in the diaries, such as Janine
Phillips’ focus on the immediate shock of war versus other children’s reflections on family, loss, and survival.</p>
        <p>Despite its successes, our framework has several limitations that need consideration. First, the
reliance on pre-trained LLMs introduces potential biases inherent in the training data, which may affect
the quality and fairness of the generated codes. While ensemble methods and RAG mitigate some of
these biases, further work is needed to develop bias detection methods.</p>
        <p>Second, the evaluation of automated coding frameworks remains challenging, as no single metric
can fully capture the nuances of human judgment. While our composite score combines multiple
dimensions of coding quality, it may not fully reflect the interpretative depth required for sensitive
historical texts. Future work should explore more sophisticated evaluation frameworks, incorporating
human preference modeling and interactive evaluation setups.</p>
        <p>Finally, the generalizability of our framework to other languages and cultural contexts remains
untested. The Holocaust diaries analyzed in this study are written in English, and the framework’s
performance on multilingual or non-Western texts may differ. Extending the framework to other
languages and cultural settings could reveal additional challenges and opportunities for improvement.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>Our study demonstrates the potential of ensemble-based LLMs with RAG for automating inductive
coding tasks in sensitive and historically significant contexts. The framework’s ability to capture the
emotional and thematic complexity of Holocaust children’s diaries, while maintaining consistency and
scalability, highlights its value for qualitative research. By addressing the limitations and exploring future
directions outlined above, we can further enhance the interpretability, fairness, and generalizability of
automated coding, opening new possibilities for research in history and social science.</p>
    </sec>
    <sec id="sec-8">
      <title>A. Detailed fine-tuning results</title>
      <p>These results (see Table 4) demonstrate the performance of various models when fine-tuned on the task
of open coding using different prompts. BERTScore and ROUGE are reported.</p>
    </sec>
    <sec id="sec-9">
      <title>B. Moderator prompt template</title>
      <p>Listing 1: Moderator Prompt Template with Model Suggestions</p>
      <p>You will be given a paragraph from the text, which is: {textdescription}.
Definition of the code: A word or short phrase that symbolically assigns a summative, salient, essence-capturing, and/or evocative attribute for a portion of language-based or visual data.
Here is the excerpt to code: {row['Paragraph']}
Here are three coding suggestions from previous models:
1. {row['Llama3_Code']}
2. {row['Falcon_Code']}
3. {row['Mistral_Code']}
Please suggest a code taking into account all these answers.
Output should be the code with no longer than 5 words.</p>
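Filling the template in Listing 1 is straightforward string interpolation. This sketch assumes a dict-like `row` holding the excerpt and the three candidate codes, mirroring the field names in the listing; the helper name is ours:

```python
def build_moderator_prompt(text_description, row):
    """Assemble the moderator prompt from the Listing 1 template."""
    return (
        f"You will be given a paragraph from the text, which is: {text_description}.\n"
        "Definition of the code: A word or short phrase that symbolically assigns "
        "a summative, salient, essence-capturing, and/or evocative attribute for "
        "a portion of language-based or visual data.\n"
        f"Here is the excerpt to code: {row['Paragraph']}\n"
        "Here are three coding suggestions from previous models:\n"
        f"1. {row['Llama3_Code']}\n"
        f"2. {row['Falcon_Code']}\n"
        f"3. {row['Mistral_Code']}\n"
        "Please suggest a code taking into account all these answers.\n"
        "Output should be the code with no longer than 5 words."
    )
```

The assembled string is what the moderator model receives for each paragraph, with one numbered suggestion per ensemble member.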
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Levi</surname>
          </string-name>
          ,
          <article-title>The Drowned and the Saved</article-title>
          , Summit Books,
          <year>1986</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Saldana</surname>
          </string-name>
          ,
          <article-title>The Coding Manual for Qualitative Researchers</article-title>
          ,
          <source>SAGE Publications</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Boyd</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Crawford</surname>
          </string-name>
          ,
          <article-title>Critical questions for big data</article-title>
          ,
          <source>Information, Communication &amp; Society</source>
          <volume>15</volume>
          (
          <year>2013</year>
          )
          <fpage>662</fpage>
          -
          <lpage>679</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Matter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schirmer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Grinberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pfeffer</surname>
          </string-name>
          ,
          <article-title>Close to human-level agreement: Tracing journeys of violent speech in incel posts with gpt-4-enhanced annotations</article-title>
          ,
          <year>2024</year>
          . arXiv:2401.02001.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lotsos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hullman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Sherin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Wilensky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Horn</surname>
          </string-name>
          ,
          <article-title>A computational method for measuring "open codes" in qualitative analysis</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2411.12142. arXiv:2411.12142.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Ziems</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Held</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Shaikh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Can large language models transform computational social science?</article-title>
          ,
          <source>Computational Linguistics</source>
          <volume>50</volume>
          (
          <year>2024</year>
          )
          <fpage>237</fpage>
          -
          <lpage>291</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hirsch</surname>
          </string-name>
          ,
          <source>Family Frames: Photography, Narrative, and Postmemory</source>
          , Harvard University Press,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>V.</given-names>
            <surname>Braun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Clarke</surname>
          </string-name>
          ,
          <article-title>Thematic analysis: A reflexive approach</article-title>
          ,
          <source>International Journal of Qualitative Research</source>
          <volume>11</volume>
          (
          <year>2019</year>
          )
          <fpage>301</fpage>
          -
          <lpage>310</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Piktus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Petroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Karpukhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Küttler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-t.</given-names>
            <surname>Yih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rocktäschel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kiela</surname>
          </string-name>
          ,
          <article-title>Retrieval-augmented generation for knowledge-intensive nlp tasks</article-title>
          ,
          <year>2021</year>
          . URL: https://arxiv.org/abs/2005.11401. arXiv:2005.11401.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D.</given-names>
            <surname>Parfenova</surname>
          </string-name>
          , et al.,
          <article-title>Automating qualitative analysis with LLMs</article-title>
          ,
          <source>Proceedings of ACL 2024</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>G. A.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Beckwith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Fellbaum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Gross</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. J.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <article-title>Introduction to wordnet: An on-line lexical database</article-title>
          ,
          <source>International journal of lexicography 3</source>
          (
          <year>1990</year>
          )
          <fpage>235</fpage>
          -
          <lpage>244</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Creswell</surname>
          </string-name>
          ,
          <article-title>30 Essential Skills for the Qualitative Researcher</article-title>
          ,
          <source>SAGE Publications</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>V.</given-names>
            <surname>Braun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Clarke</surname>
          </string-name>
          ,
          <article-title>Thematic Analysis: A Practical Guide</article-title>
          ,
          <source>SAGE Publications</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>P.</given-names>
            <surname>Törnberg</surname>
          </string-name>
          ,
          <article-title>Using large language models for automated qualitative coding in the social sciences</article-title>
          ,
          <source>Nature Machine Intelligence</source>
          <volume>5</volume>
          (
          <year>2023</year>
          )
          <fpage>576</fpage>
          -
          <lpage>586</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>T.</given-names>
            <surname>Fischer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Biemann</surname>
          </string-name>
          ,
          <article-title>Exploring large language models for qualitative data analysis</article-title>
          , in:
          <string-name>
            <given-names>M.</given-names>
            <surname>Hämäläinen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Öhman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Miyagawa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Alnajjar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bizzoni</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities</source>
          , Association for Computational Linguistics, Miami, USA,
          <year>2024</year>
          , pp.
          <fpage>423</fpage>
          -
          <lpage>437</lpage>
          . URL: https://aclanthology.org/2024.nlp4dh-1.41/. doi:10.18653/v1/2024.nlp4dh-1.41.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bubeck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chandak</surname>
          </string-name>
          , et al.,
          <article-title>Sparks of artificial general intelligence: Early experiments with GPT-4</article-title>
          ,
          <source>arXiv preprint arXiv:2303.12712</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lavril</surname>
          </string-name>
          , et al.,
          <article-title>LLaMA: Open and efficient foundation language models</article-title>
          ,
          <source>in: Proceedings of the 2023 Annual Conference on Machine Learning (ICML)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>123</fpage>
          -
          <lpage>134</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>O.</given-names>
            <surname>Sagi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Rokach</surname>
          </string-name>
          ,
          <article-title>Ensemble learning: A survey</article-title>
          ,
          <source>Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery</source>
          <volume>8</volume>
          (
          <year>2018</year>
          ) e1249.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>D.</given-names>
            <surname>Jiang</surname>
          </string-name>
          , et al.,
          <article-title>Llm-blender: Ensembling large language models with pairwise ranking and generative fusion</article-title>
          ,
          <source>Proceedings of ACL</source>
          <year>2023</year>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>D.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Llm-blender: Ensembling large language models with pairwise ranking and generative fusion</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2306.02561. arXiv:
          <volume>2306</volume>
          .
          <fpage>02561</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>W.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>A survey on mixture of experts</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2407.06204. arXiv:2407.06204.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>D.</given-names>
            <surname>Schwartz</surname>
          </string-name>
          , et al.,
          <article-title>Topic modeling holocaust survivor testimonies</article-title>
          ,
          <source>Journal of Digital Humanities</source>
          <volume>8</volume>
          (
          <year>2019</year>
          )
          <fpage>45</fpage>
          -
          <lpage>60</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>J.</given-names>
            <surname>Eisenstein</surname>
          </string-name>
          , et al.,
          <article-title>Sentiment analysis of wartime diaries</article-title>
          ,
          <source>Computational Linguistics</source>
          <volume>47</volume>
          (
          <year>2021</year>
          )
          <fpage>601</fpage>
          -
          <lpage>630</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>A.</given-names>
            <surname>Aniol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pietron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Duda</surname>
          </string-name>
          ,
          <article-title>Ensemble approach for natural language question answering problem</article-title>
          ,
          <source>in: 2019 Seventh International Symposium on Computing and Networking Workshops (CANDARW)</source>
          , IEEE,
          <year>2019</year>
          , pp.
          <fpage>180</fpage>
          -
          <lpage>183</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wallis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Allen-Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>LoRA: Low-rank adaptation of large language models</article-title>
          ,
          <year>2021</year>
          . URL: https://arxiv.org/abs/2106.09685. arXiv:2106.09685.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>C.-Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>ROUGE: A package for automatic evaluation of summaries</article-title>
          , in:
          <source>Text Summarization Branches Out</source>
          , Association for Computational Linguistics
          , Barcelona, Spain,
          <year>2004</year>
          , pp.
          <fpage>74</fpage>
          -
          <lpage>81</lpage>
          . URL: https://aclanthology.org/W04-1013.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kishore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Artzi</surname>
          </string-name>
          ,
          <article-title>BERTScore: Evaluating text generation with BERT</article-title>
          ,
          <source>CoRR abs/1904.09675</source>
          (
          <year>2019</year>
          ). URL: http://arxiv.org/abs/1904.09675. arXiv:1904.09675.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>H.</given-names>
            <surname>Steck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ekanadham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kallus</surname>
          </string-name>
          ,
          <article-title>Is cosine-similarity of embeddings really about similarity?</article-title>
          ,
          <source>in: Companion Proceedings of the ACM Web Conference 2024</source>
          , WWW '24, ACM,
          <year>2024</year>
          , pp.
          <fpage>887</fpage>
          -
          <lpage>890</lpage>
          . URL: http://dx.doi.org/10.1145/3589335.3651526. doi:10.1145/3589335.3651526.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>S.</given-names>
            <surname>Banerjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lavie</surname>
          </string-name>
          ,
          <article-title>METEOR: An automatic metric for MT evaluation with improved correlation with human judgments</article-title>
          , in:
          <string-name>
            <given-names>J.</given-names>
            <surname>Goldstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lavie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Voss</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization</source>
          , Association for Computational Linguistics, Ann Arbor, Michigan,
          <year>2005</year>
          , pp.
          <fpage>65</fpage>
          -
          <lpage>72</lpage>
          . URL: https://aclanthology.org/W05-0909/.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Menéndez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pardo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Pardo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pardo</surname>
          </string-name>
          ,
          <article-title>The Jensen-Shannon divergence</article-title>
          ,
          <source>Journal of the Franklin Institute</source>
          <volume>334</volume>
          (
          <year>1997</year>
          )
          <fpage>307</fpage>
          -
          <lpage>318</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>M.</given-names>
            <surname>Pontiki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Galanis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pavlopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Papageorgiou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Androutsopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Manandhar</surname>
          </string-name>
          ,
          <article-title>SemEval-2014 task 4: Aspect based sentiment analysis</article-title>
          , in: P. Nakov, T. Zesch (Eds.),
          <source>Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)</source>
          , Association for Computational Linguistics, Dublin, Ireland,
          <year>2014</year>
          , pp.
          <fpage>27</fpage>
          -
          <lpage>35</lpage>
          . URL: https://aclanthology.org/S14-2004. doi:10.3115/v1/S14-2004.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lavril</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Izacard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Martinet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-A.</given-names>
            <surname>Lachaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lacroix</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Rozière</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hambro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Azhar</surname>
          </string-name>
          , et al.,
          <article-title>LLaMA: Open and efficient foundation language models</article-title>
          ,
          <source>arXiv preprint arXiv:2302.13971</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>F.</given-names>
            <surname>Pineda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Milliere</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vlachos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Yates</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Glaese</surname>
          </string-name>
          , et al.,
          <article-title>The Falcon series of language models</article-title>
          ,
          <source>arXiv preprint arXiv:2306.01116</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <surname>Mistral AI Team</surname>
          </string-name>
          ,
          <source>Mistral: Efficient pretraining of transformer language models</source>
          ,
          <year>2023</year>
          . URL: https://mistral.ai.
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xie</surname>
          </string-name>
          , et al.,
          <article-title>Vicuna: An open-source chatbot</article-title>
          , FastChat: Open Assistant (
          <year>2023</year>
          ). Available at https://vicuna.lmsys.org/.
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>G. A. R.</given-names>
            <surname>Team</surname>
          </string-name>
          ,
          <article-title>Gemma: An instructable, open-source large language model</article-title>
          ,
          <year>2024</year>
          . URL: https://gemma.ai.
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jiang</surname>
          </string-name>
          , et al.,
          <article-title>TinyLlama: Distilling large language models for efficiency</article-title>
          ,
          <source>arXiv preprint arXiv:2310.05637</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>L.</given-names>
            <surname>Holliday</surname>
          </string-name>
          ,
          <article-title>Children in the Holocaust and World War II: Their Secret Diaries</article-title>
          , Washington Square Press,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>