                                Direct and Indirect Annotation with Generative AI: A
                                Case Study into Finding Animals and Plants in
                                Historical Text
                                Arjan van Dalfsen1,∗ , Folgert Karsdorp2 , Ayoub Bagheri3 , Dieuwertje Mentink1 ,
                                Thirza van Engelen1 and Els Stronks1
                                1
                                  Department of Language, Literature and Communication, Utrecht University, Trans 10, Utrecht, 3512 JK, The
                                Netherlands
                                2
                                  KNAW Meertens Instituut, Oudezijds Achterburgwal 185, 1012 DK Amsterdam, The Netherlands
                                3
                                  Department of Methods and Statistics, Utrecht University, Padualaan 14, 3584 CH, Utrecht, The Netherlands


                                           Abstract
                                           This study explores the use of generative AI (GenAI) for annotation in the humanities, comparing direct
                                           and indirect annotation approaches with human annotations. Direct annotation involves using GenAI
                                           to annotate the entire corpus, while indirect annotation uses GenAI to create training data for a special-
                                           ized model. The research investigates zero-shot and few-shot methods for direct annotation, alongside
                                           an indirect approach incorporating active learning, few-shotting, and k-NN example retrieval. The task
                                           focuses on identifying words (also referred to as entities) related to plants and animals in Early Modern
                                           Dutch texts. Results show that indirect annotation outperforms zero-shot direct annotation in mimick-
                                           ing human annotations. However, with just a few examples, direct annotation catches up, achieving
                                           similar performance to indirect annotation. Analysis of confusion matrices reveals that GenAI annota-
                                           tors make similar types of mistakes, such as confusing parts and products or failing to identify entities;
                                           these errors are broader in scope than those made by humans. Manual error analysis indicates that each annotation
                                           method (human, direct, and indirect) has some unique errors. Given the limited scale of this study, it is
                                           worthwhile to further explore the relative affordances of direct and indirect GenAI annotation methods.

                                           Keywords
                                           large language models, natural language processing, historical text, token classification, environmental
                                           humanities




                                1. Introduction
                                The introduction of advanced generative AI (GenAI) models has sparked interest among hu-
                                manities scholars in leveraging these tools to extract structured information from texts [3, 23,
                                2, 12, 25, 18, 19, 4, 5]. So far, the use of GenAI in the humanities has primarily involved “direct


                                CHR 2024: Computational Humanities Research Conference, December 4–6, 2024, Aarhus, Denmark
                                ∗
                                 Corresponding author.
                                Email: j.a.vandalfsen@uu.nl (A. v. Dalfsen); folgert.karsdorp@meertens.knaw.nl (F. Karsdorp); a.bagheri@uu.nl
                                (A. Bagheri); d.l.mentink@students.uu.nl (D. Mentink); t.w.e.vanengelen@students.uu.nl (T. v. Engelen);
                                e.stronks@uu.nl (E. Stronks)
                                Web: https://www.karsdorp.io/ (F. Karsdorp); https://ayoubbagheri.nl/ (A. Bagheri)
                                ORCID: 0000-0002-4209-4063 (A. v. Dalfsen); 0000-0002-5958-0551 (F. Karsdorp); 0000-0001-6366-2173 (A. Bagheri);
                                0000-0001-9741-7264 (E. Stronks)
                                         © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).




annotation”, where GenAI analyzes a corpus without further interference. This approach has
shown promise, potentially “supercharging the humanities” [11].
   Researchers in Natural Language Processing (NLP) have proposed an alternative “indirect
annotation” framework. This two-step process involves GenAI generating training data, which
is then used to train a specialized model. This approach offers potential cost and performance
advantages over direct annotation [31, 32]. However, indirect annotation’s effectiveness has
primarily been demonstrated on languages well-represented in GenAI training data, raising
questions about its applicability to texts from smaller languages or historical variants often
encountered in humanities research.
   As a first exploration into its usability in the humanities, our study tests GenAI as an indirect
annotator for nature-entities in historical Dutch texts. We employ the LLMaAA (Large Lan-
guage Models as Active Annotators) framework [32], which combines few-shotting, k-Nearest
Neighbors (k-NN) example retrieval, and active learning. Our research compares the perfor-
mance of indirect annotation, i.e. an LLMaAA-derived model, against both human annotations
and direct annotation with GenAI. We find that our proposed method of indirect GenAI annota-
tion performs better than fully-unsupervised direct GenAI annotation. However, we also find
that providing direct annotation with demonstrations (i.e., examples of annotations) results in
similar performance. Moreover, our study reveals that humans, direct GenAI annotators, and
indirect GenAI annotators each have unique weaknesses and strengths.
   This study is structured as follows: We first examine the broader context of using GenAI for
annotation in humanities research. We then provide an overview of current research on direct
and indirect annotators in NLP. Subsequently, we introduce our specific use-case: identifying
animals and plants in historical texts. Finally, we detail our methodology for comparing the
performances of human annotation, direct GenAI annotation, and indirect GenAI annotation.


2. Related Work
2.1. GenAI annotations in humanities
In the humanities, research on GenAI annotation has primarily focused on direct annotation
experiments. Studies have compared GenAI methods with traditional approaches and human
annotators across various tasks, including sentiment analysis [2, 25, 5], topic detection [18],
and text classification [19]. Findings generally suggest that while GenAI often outperforms
dictionary-based methods, it typically falls short of specialized models. However, Karjus [12]
reports human-level annotations by GenAI across diverse tasks and languages, proposing a
machine-assisted mixed methods approach. These studies underscore the potential of GenAI
in humanities research, while also highlighting the need to explore both direct and indirect
annotation approaches to fully leverage its capabilities.

2.2. GenAI as direct annotators in NLP
Direct GenAI annotation involves prompting GenAI to annotate a dataset for immediate use.
Studies assessing this approach have found that while GenAI generally lags behind state-of-the-
art models [9, 16, 24, 33], it often equals or outperforms crowd-workers [30, 33, 10]. Challenges




in direct GenAI annotation include difficulties with long-tail target types, irrelevant context,
and specific tasks like sequence tagging [16, 24]. These limitations have led to the exploration
of indirect annotation methods, which aim to address these shortcomings by integrating GenAI
in a more targeted manner.

2.3. GenAI as indirect annotator in NLP
In what we term the indirect GenAI annotation framework, GenAI is not employed to perform
the entire annotation task on a given dataset. Instead, it is used to annotate a specific subset of
the dataset, which is then used to fine-tune another model, such as a BERT model.
   Wang et al. [31] found models trained on GenAI-annotated data equal to human-annotated
models and outperforming direct use of GenAI. Ding et al. [7] largely echo this but also high-
light a practical problem when it comes to textual analysis: GenAI is good at finding entities,
but oftentimes struggles with defining the boundaries of these entities. Li et al. [20] pro-
pose a CoAnnotating framework, in which GenAI output-uncertainty is measured and annota-
tions with the highest uncertainty (i.e., a lack of result robustness when confronted with small
prompt perturbations) are sent to a human annotator. While they report promising results,
there is the disadvantage of higher costs. With Large Language Models as Active Annotators
(LLMaAA) by Zhang et al. [32], the idea is to use active annotation to improve the downstream
task-specific model. The framework includes:

    • Few-shotting: putting exemplary annotations in the prompt for the GenAI (this helps
      GenAI to annotate [21]);
    • k-NN example retrieval: sequence embeddings of the text to annotate and the examples
      are used to select the examples closest to the new text for few-shotting;
    • Training cycles: doing step-by-step training, where first a specific model is trained, new
      data is annotated by the GenAI, and the specific model is trained again;
    • Active learning: selecting examples for indirect annotation with which the current
      model struggles;
    • Automatic reweighting: assigning learnable weights to the annotated training sam-
      ples [27] (this makes it possible to reduce the impact of noisy labeling by GenAI).
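Two of these components, k-NN example retrieval and confidence-based active learning, can be illustrated with a minimal sketch. The toy three-dimensional vectors below stand in for real sentence-transformer embeddings, and all function names and data are ours, not the LLMaAA authors' code:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def knn_examples(query_vec, pool, k=2):
    """k-NN retrieval: select the k annotated examples closest to the new text."""
    ranked = sorted(pool, key=lambda ex: cosine(query_vec, ex["vec"]), reverse=True)
    return ranked[:k]

def least_confident(candidates, n=2):
    """Active learning: pick the sentences the current model is least sure about."""
    return sorted(candidates, key=lambda c: c["confidence"])[:n]

# Toy annotated pool with illustrative embeddings.
pool = [
    {"text": "De beer at een appel.", "vec": [0.9, 0.1, 0.0]},
    {"text": "Het schip voer uit.",   "vec": [0.0, 0.2, 0.9]},
    {"text": "De vos ving een haas.", "vec": [0.8, 0.3, 0.1]},
]
shots = knn_examples([0.85, 0.2, 0.05], pool, k=2)
print([s["text"] for s in shots])

# Toy unlabeled candidates with model confidence scores.
candidates = [
    {"text": "zout en peper", "confidence": 0.41},
    {"text": "een oude eik",  "confidence": 0.93},
    {"text": "vleesch",       "confidence": 0.55},
]
print([c["text"] for c in least_confident(candidates, n=2)])
```

The retrieved nearest neighbors become the few-shot demonstrations, while the least-confident candidates are the ones sent to the GenAI annotator in the next round.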

The authors test their method for NER and Information Extraction (modern Chinese and mod-
ern English) and find that the resulting model strongly outperforms zero-shot direct annotation
with GenAI. In comparison to few-shot GenAI annotation (with k-NN-optimized examples),
LLMaAA shows a marginal performance advantage, in addition to clear advantages in
robustness, cost, and speed, making it promising for humanities research.

2.4. Plants and animals
In this study, we research the detection of plants and animals in historical texts, which can be
seen as the traditional NLP task of token (in sequence) classification or NER. Roughly starting
with publications such as Man and the Natural World [29] and The Animal Estate [28], humanities’
scholarly interest in nature has skyrocketed. This is only natural, considering a widely shared




sense of humanity being in environmental, ecological, and climate crises. For cultural histori-
ans, the main question has been how humans observed, interpreted, and represented and thus
perceived nature [22]. Research on this topic has been conducted almost exclusively qualitatively.
Although this approach provides illuminating insights, it is necessarily limited to a relatively
narrow scope given the sheer size of the historical record, resulting in a less precise large-scale
overview of the studied phenomenon. There is, therefore, great potential in complementing
qualitative studies with quantitative research.


3. Methods
In this section, we describe the methodology employed to compare direct and indirect GenAI
annotation strategies for identifying plants and animals in historical Dutch texts. First we ex-
plain the data parsing used to prepare our dataset. Following this, we describe the annotation
procedure undertaken by human annotators. Then, we address the dataset creation. Subse-
quently, we describe the token classification. After this, we describe our prompts. Then, we
document the training process of the indirect annotation models. Finally, we outline the ways
in which we compare the annotation approaches.

3.1. Data parsing
Our study used the Digitale Bibliotheek voor de Nederlandse Letteren (DBNL) [6], comprising
about 1,500 diverse Dutch texts. After preprocessing, the corpus yielded approximately 7 mil-
lion unique sentences [17, 8].

3.2. Annotation procedure
Two texts from the 1750s were selected for manual annotation by two Early Modern Dutch
literature experts. 200 sentences, filtered to a minimum length of 10 and a maximum length
of 100 words, were annotated using the INCEpTION tool [15] (cf. Fig. 1), following iteratively
developed guidelines (Appendix A). The annotation schema tagged entities on three levels:
Category (Plants/Animals), Type (Organism/Part/Product/Collective), and Usage (Literal/Sym-
bolical/Petrified). For example, in “The bear grabbed an apple with its claw”, “bear” would be
tagged as Animals-Organisms-Literal.
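The schema example can be encoded as parallel token and label sequences, the standard format for token classification. The sketch below is illustrative: "bear" is Animals-Organisms-Literal per the text above, but the labels for "apple" and "claw" are our plausible guesses, not taken from the guidelines:

```python
# Token-level encoding of "The bear grabbed an apple with its claw".
tokens = ["The", "bear", "grabbed", "an", "apple", "with", "its", "claw"]
labels = [
    "No Label",
    "Animals-Organisms-Literal",  # given in the example above
    "No Label",
    "No Label",
    "Plants-Parts-Literal",       # assumption: a fruit as part of a plant
    "No Label",
    "No Label",
    "Animals-Parts-Literal",      # assumption: a claw as animal body part
]
assert len(tokens) == len(labels)  # one label per token

for tok, lab in zip(tokens, labels):
    if lab != "No Label":
        print(tok, "->", lab)
```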

3.3. Task description
The annotated sentences were split into demonstration, validation, and test sets. We concep-
tualized the detection of animals and plants as a token classification task. Prompts for both
direct and indirect annotators included the annotation schema, with technical details omitted
to improve performance (full prompts in Appendix B, model settings in Appendix C).
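How such a prompt might combine the annotation schema with retrieved demonstrations can be sketched as follows; the wording is purely illustrative (the actual prompts are given in Appendix B):

```python
# Illustrative schema instruction; the real phrasing is in Appendix B.
SCHEMA = (
    "Tag plant- and animal-related words on three levels: "
    "Category (Plants/Animals), Type (Organism/Part/Product/Collective), "
    "and Usage (Literal/Symbolical/Petrified)."
)

def build_prompt(sentence, demonstrations):
    """Assemble a few-shot prompt from the schema and (sentence, annotation) pairs."""
    shots = "\n".join(
        f"Sentence: {s}\nAnnotation: {a}" for s, a in demonstrations
    )
    return f"{SCHEMA}\n\n{shots}\n\nSentence: {sentence}\nAnnotation:"

demos = [("De beer greep een appel.", "beer -> Animals-Organisms-Literal")]
prompt = build_prompt("De vos ving een haas.", demos)
print(prompt)
```

In the zero-shot setting the demonstrations list would simply be empty, leaving only the schema instruction.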

3.4. Training indirect annotation models
For indirect annotation, we adapted the LLMaAA framework by Zhang et al. [32], integrat-
ing it with Huggingface and OpenAI ecosystems. We used GysBERT [1] for historical Dutch,




Figure 1: The annotation interface of the INCEpTION tool, which was used for conducting the human
annotations.


applied k-NN few-shot selection (with the paraphrase-multilingual-mpnet-base-v2 sentence
transformer [26]), and confidence-based active learning. Automatic reweighting was not in-
cluded. GPT-4o served as the LLM backbone. To address the scarcity of plant and animal enti-
ties, we employed a pre-filtering strategy using GPT-3.5. The specialized model underwent 10
training rounds with 10 epochs each, adding 50 example sentences per round (25 pre-filtered
sentences + 25 sentences with lowest confidence). This process was repeated for two datasets,
using five distinct random seeds, resulting in 10 indirect models.
   It is important to point out that the indirect annotators have not seen any of the human
annotations in their training regime. However, the 500 sentences they are trained on are an-
notated with the help of the human-annotated demonstration set and the performance of the
model is determined by its score on a human-annotated validation set.
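The training regime described above can be summarized schematically: each of the 10 rounds adds 50 GenAI-annotated sentences (25 pre-filtered + 25 lowest-confidence) and retrains the specialized model for 10 epochs. All helpers below are stubs standing in for the real GPT-3.5 pre-filter, GPT-4o annotator, and GysBERT classifier:

```python
def prefilter(pool, n):                 # stub: GPT-3.5 entity pre-filtering
    return pool[:n]

def lowest_confidence(model, pool, n):  # stub: confidence-based active learning
    return pool[-n:]

def genai_annotate(sentence):           # stub: GPT-4o token labels
    return ["No Label"] * len(sentence.split())

def fit(model, data, epochs):           # stub: fine-tuning the token classifier
    return {"seen": len(data), "epochs": epochs}

def train_indirect(pool, rounds=10, per_round=50):
    """Iteratively grow the GenAI-annotated training set and retrain."""
    train_set, model = [], None
    for _ in range(rounds):
        # 25 pre-filtered + 25 least-confident sentences per round.
        batch = (prefilter(pool, per_round // 2)
                 + lowest_confidence(model, pool, per_round // 2))
        train_set += [(s, genai_annotate(s)) for s in batch]
        model = fit(model, train_set, epochs=10)
    return model

model = train_indirect([f"sentence {i}" for i in range(500)])
print(model["seen"])  # 10 rounds x 50 sentences = 500 training pairs
```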

3.5. Comparing annotation strategies
All of the analyses below were done on the annotations of the various strategies on the held-out
test set. Note that the sentences in this set came from the same documents as the demonstration
and validation sets.
   To assess the agreement among human annotators, direct GenAI, and indirect GenAI
approaches, we conducted an inter-annotator agreement analysis, confusion matrix analyses,
and a manual error analysis. We compare each human annotator’s results
against each automatic system’s output:
   1. Human Annotations
         • Human1: Annotations from the first human annotator
         • Human2: Annotations from the second human annotator




   2. Direct GenAI Annotations
         • Direct zero-shot: Zero-shot direct annotation (without examples)
         • Direct few-shot 1: Few-shot direct annotation using examples from Human1
          • Direct few-shot 2: Few-shot direct annotation using examples from Human2
   3. Indirect GenAI Annotations
         • Indirect1: Indirect annotation using examples from Human1
         • Indirect2: Indirect annotation using examples from Human2
   For the inter-annotator agreement, positives-only weighted F1 is used as a metric. The
positives-only weighted F1 is the weighted average of the harmonic means of precision and
recall (F1) of all labeled entities. Thus, words that were not labeled as plant- or animal-related,
by far the most common case in this situation, are disregarded. For
all GenAI annotations (direct and indirect), predictions were done five times. The average and
standard deviation of the inter-annotator agreements were calculated from these iterations.
The observed low variance across models and approaches suggests that these results are likely
stable, despite the relatively small number of iterations.
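A sketch of the positives-only weighted F1 is given below: per-label F1 scores are averaged, weighted by gold-label support, with the no-entity class excluded. The label names and exact tie-breaking are ours; the paper's span handling may differ in detail:

```python
from collections import Counter

def positives_only_weighted_f1(gold, pred, negative="No Label"):
    """Weighted average of per-label F1, ignoring the no-entity class."""
    labels = {l for l in gold if l != negative}
    support = Counter(g for g in gold if g != negative)
    total, weighted = sum(support.values()), 0.0
    for lab in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == lab and p == lab)
        fp = sum(1 for g, p in zip(gold, pred) if g != lab and p == lab)
        fn = sum(1 for g, p in zip(gold, pred) if g == lab and p != lab)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        weighted += f1 * support[lab] / total   # weight by gold support
    return weighted

gold = ["No Label", "Animal", "No Label", "Plant", "Animal", "No Label"]
pred = ["No Label", "Animal", "Plant", "Plant", "No Label", "No Label"]
print(round(positives_only_weighted_f1(gold, pred), 3))  # 0.667
```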
   The confusion matrices were made by comparing the human annotations to single instances
of the other strategies. It should be noted that Human1 had labeled one example as “None-
label” (a token was tagged but no label was chosen), which was later removed for the process
of making the confusion matrices. In addition to the confusion matrix analysis, we performed
a manual error analysis on the annotations for the held-out test set.
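Building such a token-level confusion matrix amounts to counting label pairs between two annotators; a minimal sketch with illustrative labels (the kind of parts/products confusion highlighted in Fig. 2):

```python
from collections import Counter

def confusion(gold, pred):
    """Count (gold label, predicted label) pairs over aligned tokens."""
    return Counter(zip(gold, pred))

# Toy aligned annotations for four tokens (labels illustrative).
human1 = ["No Label", "Animals Parts Literal",
          "Animals Parts Literal", "Plants Organisms Literal"]
direct = ["No Label", "Animals Products Literal",
          "Animals Parts Literal", "No Label"]

cm = confusion(human1, direct)
# One parts-vs-products confusion and one missed plant entity.
print(cm[("Animals Parts Literal", "Animals Products Literal")])  # 1
print(cm[("Plants Organisms Literal", "No Label")])               # 1
```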


4. Results
4.1. Inter-annotator agreement
Table 1 shows the inter-annotator agreement using positives-only weighted F1 as a metric. Key
findings include:
   1. Annotators of the same type resemble each other best (e.g., Human1 is closest to Hu-
      man2).
   2. Zero-shot direct annotation demonstrates lower internal coherence (F1 = 0.74) compared
      to few-shot direct (F1 = 0.9 and 0.88) and indirect annotation (F1 = 0.81 and 0.86).
   3. Direct zero-shot annotation consistently underperforms, while annotations from the
      other human annotator achieve the highest agreement.
   4. Few-shot direct and indirect annotations perform similarly, falling between zero-shot
      and human performance.
   5. GenAI models do not simply mimic the specific human annotator whose examples they
      received, but generalize from those examples.
  These results suggest that indirect annotation and few-shot direct annotation are more reli-
able methods for replicating human-like annotations compared to zero-shot approaches. The
choice between these methods may depend on factors beyond performance, such as ease of
implementation or specific task requirements.




Table 1
Inter-annotator agreement (positives-only weighted F1): The first two columns of the table display the
annotations treated as a reference point, the so-called “gold labels”. The corresponding rows indicate
the level of agreement of the other strategies with these gold labels.
 Method             human1     human2     direct zero-shot   direct few-shot1   direct few-shot2   indirect1   indirect2
 human1             1.0000

 human2             0.8332     1.0000

 direct zero-shot   0.5456     0.5206         0.7352
                    ± 0.0629   ± 0.0856       ± 0.1572
 direct few-shot1   0.5939     0.5842         0.6258             0.9039
                    ± 0.0256   ± 0.0341       ± 0.0869           ± 0.0523
 direct few-shot2   0.5738     0.6002         0.6090             0.8323             0.8822
                    ± 0.0212   ± 0.0249       ± 0.0756           ± 0.0290           ± 0.0650
 indirect1          0.5481     0.5489         0.4828             0.6927             0.6805          0.8099
                    ± 0.0524   ± 0.0330       ± 0.0718           ± 0.0419           ± 0.0428        ± 0.1163
 indirect2          0.5753     0.5785         0.5099             0.7218             0.7149          0.7976      0.8555
                    ± 0.0277   ± 0.0139       ± 0.0746           ± 0.0328           ± 0.0400        ± 0.0589    ± 0.0773




4.2. Confusion matrices
While F1 scores provide an overall measure of performance, they do not offer insights into
how well the annotation methods perform for individual labels. To gain a more nuanced un-
derstanding of the agreement (or lack thereof) between the annotations, we turn to confusion
matrices (Fig. 2), which provide insights into individual label performance across annotation
methods:
   1. Human2’s annotations most closely align with Human1’s.
   2. Zero-shot direct annotation shows low recall, with many entities remaining unlabeled.
   3. All GenAI annotators struggle with both precision and recall:
             • Precision errors: confusion between “Animals Parts Literal” and “Animals Products
               Literal”.
             • Recall errors: suggesting “No Label” for entities labeled by Human1.
   4. GenAI methods, including the indirect approach, produce labels not found by human
      annotators, demonstrating higher label diversity.
These patterns highlight the strengths and weaknesses of each annotation method, emphasiz-
ing the need for careful selection and potential combination of approaches in annotation tasks.

4.3. Manual Error Analysis
Our manual examination of the annotations (Appendix D) reveals distinct error patterns across
different annotation strategies. First, human annotators occasionally overlook words in a sen-
tence. For instance, in sentence 10, Human1 tagged the word vleesch (meat) only twice out
of its three occurrences. Such errors likely stem from simple oversight rather than misunder-
standing (although misunderstandings occur too). Second, few-shot direct annotation strate-
gies sometimes struggle with entity aggregation. A notable example is sentence 41, where




          (a) Human1 vs. Human2                              (b) Human1 vs. Direct Zero-Shot




     (c) Human1 vs. Direct Few-Shot1                         (d) Human1 vs. Direct Few-Shot2




         (e) Human1 vs. Indirect1                                (f) Human1 vs. Indirect2

Figure 2: Confusion matrices comparing Human1 annotations with those from Human2 and different
GenAI strategies. The red box indicates animals parts and animal products, which often proves to be
a hard category for the annotators to decide on. Although all GenAI annotations were done five fold,
only one of these instances is used for the confusion matrix.




“nek van het Varken” (neck of the pig) is incorrectly labeled as a single entity, instead of rec-
ognizing “neck” and “pig” as separate entities (with distinct labels “Animals Products Literal”
and “Animals Organisms Literal”, respectively). Importantly, these errors are incidental, not
systematic, suggesting they are unlikely to be consistently repeated. Third and finally, while
indirect annotation models are (once trained) deterministic, and therefore not susceptible to in-
cidental mistakes, they can produce counter-intuitive systematic errors. A revealing example
is sentence 31, where “salt” is misclassified as “Plant Product Literal”. This error likely stems
from the proximity of salt to spices like pepper and nutmeg in the transformer model’s vector
space.


5. Discussion
This study compared direct and indirect annotation with GenAI in a humanities context, fo-
cusing on identifying plant and animal-related words in Early Modern Dutch texts. While we
studied a specific case, we believe that this method can be used for a wide range of applica-
tions. Our findings reveal both the potential and limitations of various annotation strategies
for historical humanities studies.
   Indirect annotation demonstrates clear advantages over fully-unsupervised zero-shot direct
annotation, particularly in terms of recall. However, few-shot direct annotation achieves com-
parable performance to indirect annotation, suggesting that both approaches have merit in
different contexts. Based on these results, we advise against using zero-shot direct annota-
tions for historical humanities research. Its significantly lower recall compared to the alterna-
tives means that many relevant entities are likely to be missed, potentially skewing research
outcomes. The choice between few-shot direct annotation and indirect annotation is less clear-
cut, as both display similar F1 scores. Here, time, cost, and technical constraints should be
taken into account.
   The unique error patterns suggest two important points. First, it is crucial to investigate
shortcomings of chosen methods on a micro-level to be aware of specific pitfalls. Second,
there’s potential for stacking annotation methods: human, direct, and indirect annotation can
be applied to the same texts, after which points of contention can be analyzed. In this way,
they may bundle their strengths and cover each other’s weaknesses.
   Regarding the generalizability of this explorative study, several points should be noted. To-
ken labeling is a specific task, and the behavior of direct and indirect GenAI annotators may
differ for tasks of another nature. The prompts used for direct annotation were not systematically
tested, and zero-shot direct annotation in particular might have performed better with more
guidance regarding the output format. The held-out test set was small and drawn from the same
documents (though not the same sentences) as the training data, which might have influenced
the results. The indirect annotation model was trained on just 500
examples, a typically low number for fine-tuning its underlying transformer model. Addition-
ally, during the training of the indirect model, automatic reweighting was not applied (as we
deemed its effects in the LLMaAA paper to be marginal), but integrating it might improve the
model.




6. Conclusion
Despite these limitations, this study shows potential for applying GenAI as indirect annota-
tors in humanities research. However, there are notable differences compared to other annotation
strategies. Future research should address questions about indirect GenAI annotation’s perfor-
mance on other tasks (e.g., text classification), the impact of prompt optimizing frameworks
(e.g., DSPy [13, 14]), and the potential of combining human and GenAI annotations to cross-
check one another.


Author Contributions
Conceptualization: Arjan van Dalfsen, Folgert Karsdorp, Ayoub Bagheri, Els Stronks; Data Cu-
ration: Thirza van Engelen, Dieuwertje Mentink; Investigation: Arjan van Dalfsen; Method-
ology: Arjan van Dalfsen; Writing - Original Draft: Arjan van Dalfsen; Writing - Review &
Editing: Folgert Karsdorp, Ayoub Bagheri, Els Stronks; Visualization: Arjan van Dalfsen; Su-
pervision: Folgert Karsdorp, Ayoub Bagheri, Els Stronks.


Acknowledgments
This research would not have been possible without the financial support from Utrecht Uni-
versity AI Labs, the Meertens Instituut, and the Utrecht University focus area Advanced Data
Science. Their generous contributions provided the necessary resources to conduct this study.
Additionally, we would like to extend our gratitude to SURF for providing cloud computing
services, which were instrumental in the analysis and processing of our data.


References
 [1] E. M. Arevalo and L. Fonteyn. “Non-Parametric Word Sense Disambiguation for Histor-
     ical Languages”. In: Proceedings of the 2nd International Workshop on Natural Language
     Processing for Digital Humanities. Taipei, Taiwan, 2022. url: https://aclanthology.org/20
     22.nlp4dh-1.16.
 [2] J. Borst, J. Klähn, and M. Burghardt. “Death of the Dictionary? – The Rise of Zero-
     Shot Sentiment Classification”. In: Computational Humanities Research Conference (CHR).
     Paris, France, 2023, pp. 303–319. url: https://ceur-ws.org/Vol-3558/paper3130.pdf.
 [3] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P.
     Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R.
     Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin,
     S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D.
     Amodei. Language Models are Few-Shot Learners. arXiv preprint https://arxiv.org/abs/2
     005.14165. 2020. doi: 10.48550/arXiv.2005.14165.




 [4] Y. Chen, S. Li, Y. Li, and M. Atari. Surveying the Dead Minds: Historical-Psychological Text
     Analysis with Contextualized Construct Representation (CCR) for Classical Chinese. arXiv
     preprint https://arxiv.org/abs/2403.00509. 2024. doi: https://doi.org/10.48550/arXiv.240
     3.00509.
 [5] T. Dejaeghere, P. Singh, E. Lefever, and J. Birkholz. “Exploring Aspect-Based Senti-
     ment Analysis Methodologies for Literary-Historical Research Purposes”. In: Proceedings
     of the Third Workshop on Language Technologies for Historical and Ancient Languages
     (LT4HALA) LREC-COLING-2024. Torino, Italia: ELRA and ICCL, 2024. url: https://acla
     nthology.org/2024.lt4hala-1.16.
 [6] Digitale Bibliotheek voor de Nederlandse Letteren (DBNL). Collectie publiek domein.
     https://www.dbnl.org/letterkunde/pd/index.php. 2023.
 [7] B. Ding, C. Qin, L. Liu, Y. K. Chia, B. Li, S. Joty, and L. Bing. “Is GPT-3 a Good Data An-
     notator?” In: Proceedings of the 61st Annual Meeting of the Association for Computational
     Linguistics (Volume 1: Long Papers). Toronto, Canada, 2023. doi: 10.18653/v1/2023.acl-lo
     ng.626.
 [8] M. van Gompel. python-ucto [computer software]. https://languagemachines.github.io/u
     cto/. 2023.
 [9] R. Han, T. Peng, C. Yang, B. Wang, L. Liu, and X. Wan. Is Information Extraction Solved by
     ChatGPT? An Analysis of Performance, Evaluation Criteria, Robustness and Errors. arXiv
      preprint https://arxiv.org/abs/2305.14450. 2023. doi: https://doi.org/10.48550/arXiv.230
      5.14450.
[10]   X. He, Z. Lin, Y. Gong, A.-L. Jin, H. Zhang, C. Lin, J. Jiao, S. M. Yiu, N. Duan, and W. Chen.
       AnnoLLM: Making Large Language Models to Be Better Crowdsourced Annotators. arXiv
       preprint https://arxiv.org/abs/2303.16854. 2024. doi: https://doi.org/10.48550/arXiv.230
       3.16854.
[11]   A. Karjus. Large language models to supercharge humanities and cultural analytics re-
       search. Poster presentation at CHR2023 https://2023.computational-humanities-researc
       h.org/programme/. 2023.
[12]   A. Karjus. Machine-assisted mixed methods: augmenting humanities and social sciences
       with artificial intelligence. arXiv preprint https://arxiv.org/abs/2309.14379. 2023. doi:
       https://doi.org/10.48550/arXiv.2309.14379.
[13]   O. Khattab, K. Santhanam, X. L. Li, D. Hall, P. Liang, C. Potts, and M. Zaharia.
       Demonstrate-Search-Predict: Composing Retrieval and Language Models for Knowledge-
       Intensive NLP. arXiv preprint https://arxiv.org/abs/2212.14024. 2022. doi: https://do
       i.org/10.48550/arXiv.2212.14024.
[14]   O. Khattab, A. Singhvi, P. Maheshwari, Z. Zhang, K. Santhanam, S. Vardhamanan, S. Haq,
       A. Sharma, T. T. Joshi, H. Moazam, H. Miller, M. Zaharia, and C. Potts. DSPy: Compiling
       Declarative Language Model Calls into Self-Improving Pipelines. arXiv preprint https://ar
       xiv.org/abs/2310.03714. 2023. doi: https://doi.org/10.48550/arXiv.2310.03714.




[15]   J.-C. Klie, M. Bugert, B. Boullosa, R. E. de Castilho, and I. Gurevych. “The INCEpTION
       Platform: Machine-Assisted and Knowledge-Oriented Interactive Annotation”. In: Pro-
       ceedings of the 27th International Conference on Computational Linguistics: System Demon-
       strations. Santa Fe, New Mexico, 2018. url: https://aclanthology.org/C18-2002.
[16]   J. Kocoń, I. Cichecki, O. Kaszyca, M. Kochanek, D. Szydło, J. Baran, J. Bielaniewicz, M.
       Gruza, A. Janz, K. Kanclerz, A. Kocoń, B. Koptyra, W. Mieleszczenko-Kowszewicz, P.
       Miłkowski, M. Oleksy, M. Piasecki, Ł. Radliński, K. Wojtasik, S. Woźniak, and P. Kazienko.
       “ChatGPT: Jack of all trades, master of none”. In: Information Fusion 99 (2023), p. 101861.
       doi: 10.1016/j.inffus.2023.101861.
[17]   Koninklijke Bibliotheek. Over ons - Diensten DBNL. https://www.kb.nl/over-ons/diensten/dbnl. 2024.
[18]   A. Kosar, G. De Pauw, and W. Daelemans. “Comparative Evaluation of Topic Detection: Humans vs. LLMs”. In: Computational Linguistics in the Netherlands Journal 13 (2024), pp. 91–120. url: https://www.clinjournal.org/clinj/article/view/173.
[19]   L. De Langhe, A. Maladry, B. Vanroy, L. De Bruyne, P. Singh, E. Lefever, and O. De Clercq. “Benchmarking Zero-Shot Text Classification for Dutch”. In: Computational Linguistics in the Netherlands Journal 13 (2024), pp. 63–90. url: https://clinjournal.org/clinj/article/view/172.
[20]   M. Li, T. Shi, C. Ziems, M.-Y. Kan, N. Chen, Z. Liu, and D. Yang. “CoAnnotating:
       Uncertainty-Guided Work Allocation between Human and Large Language Models for
       Data Annotation”. In: Proceedings of the 2023 Conference on Empirical Methods in Natural
       Language Processing. Singapore, 2023. doi: 10.18653/v1/2023.emnlp-main.92.
[21]   J. Liu, D. Shen, Y. Zhang, B. Dolan, L. Carin, and W. Chen. “What Makes Good In-Context
       Examples for GPT-3?” In: Proceedings of Deep Learning Inside Out (DeeLIO 2022): The
       3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures.
       Dublin, Ireland and Online, 2022. doi: 10.18653/v1/2022.deelio-1.10.
[22]   L. Molle. “Inleiding - Een geschiedenis van mensen en (andere) dieren”. In: Tijdschrift
       voor Geschiedenis 125 (2012), pp. 464–475. doi: 10.5117/tvgesch2012.4.moll.
[23]   OpenAI. Introducing ChatGPT. https://openai.com/blog/chatgpt. 2022.
[24]   C. Qin, A. Zhang, Z. Zhang, J. Chen, M. Yasunaga, and D. Yang. Is ChatGPT a General-Purpose Natural Language Processing Task Solver? arXiv preprint https://arxiv.org/abs/2302.06476. 2023. doi: 10.48550/arXiv.2302.06476.
[25]   S. Rebora, M. Lehmann, A. Heumann, W. Ding, and G. Lauer. “Comparing ChatGPT to
       Human Raters and Sentiment Analysis Tools for German Children’s Literature”. In: Com-
       putational Humanities Research Conference (CHR). Paris, France, 2023. url: https://ceur-
       ws.org/Vol-3558/paper3340.pdf.
[26]   N. Reimers and I. Gurevych. “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks”. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Hong Kong, 2019, pp. 3982–3992. doi: 10.18653/v1/D19-1410.




[27]   M. Ren, W. Zeng, B. Yang, and R. Urtasun. Learning to Reweight Examples for Robust Deep Learning. arXiv preprint https://arxiv.org/abs/1803.09050. 2019. doi: 10.48550/arXiv.1803.09050.
[28]     H. Ritvo. The Animal Estate: The English and Other Creatures in the Victorian Age. New
         ed. Cambridge, MA: Harvard University Press, 1989.
[29]     K. Thomas. Man and the Natural World: Changing Attitudes in England 1500-1800. New
         edition. London, UK: Penguin Books Ltd, 1991.
[30]   P. Törnberg. ChatGPT-4 Outperforms Experts and Crowd Workers in Annotating Political Twitter Messages with Zero-Shot Learning. arXiv preprint https://arxiv.org/abs/2304.06588. 2023. doi: 10.48550/arXiv.2304.06588.
[31]     S. Wang, Y. Liu, Y. Xu, C. Zhu, and M. Zeng. “Want To Reduce Labeling Cost? GPT-3 Can
         Help”. In: Findings of the Association for Computational Linguistics: EMNLP 2021. Punta
         Cana, Dominican Republic, 2021. doi: 10.18653/v1/2021.findings-emnlp.354.
[32]     R. Zhang, Y. Li, Y. Ma, M. Zhou, and L. Zou. “LLMaAA: Making Large Language Models as
         Active Annotators”. In: Findings of the Association for Computational Linguistics: EMNLP
         2023. Singapore, 2023. doi: 10.18653/v1/2023.findings-emnlp.872.
[33]     C. Ziems, W. Held, O. Shaikh, J. Chen, Z. Zhang, and D. Yang. Can Large Language Models
         Transform Computational Social Science? arXiv preprint https://arxiv.org/abs/2305.03514.
       2024. doi: 10.48550/arXiv.2305.03514.


7. Appendices

A. Annotation Guidelines
A.1. Annotation Schema
Category

       • Animals: A living thing that can move around to search for food. It usually has ways to
         see, hear, smell, taste, and feel the world around it.
       • Plants: A living thing that usually stays in one place. It creates its own food using
         sunlight, water, and air.

  Type

       • Organisms: A whole, living animal or plant. Think of it like one complete cat, or one
         whole oak tree.
       • Parts: A piece of an animal or plant. Things like a bird’s wing, a flower petal, or a bear’s
         claw.
       • Products: Something we get from a plant or animal that we use. Only first-order products
         count (i.e. it’s the first “product” that comes from the plant/animal, not a product of an
         earlier product). Examples are milk from a cow, honey from bees, or apples from a tree.




    • Collective: A word is collective if it refers to a heterogeneous multitude of
      plants/animals: nature is explicitly and inherently a prominent part of it, but it is
      not 100% clear which kinds of nature are involved. If the collective might belong
      to both categories, you choose the best or least-wrong category. Examples are:
      weide, grastapijt, bos, woud, vee, kudde.

  Usage

    • Literal: When the word means exactly the animal, plant, part, or product itself. If you
      envision the text, you should see it. (“The bear ate a fish.”)
    • Symbolical: When the word is used as a symbol or metaphor, representing something
      else. If you envision the text, you should not see it. (“His heart was as cold as a snake.”)
      Pictures are symbolic. Nicknames are probably symbolical.
    • Petrified: When the plant/animal word is the name of something or someone.
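For reference, the three annotation levels above can be sketched as a small validation helper. This is a minimal sketch; the names below are ours, not from the study's code, and the Category-Type-Usage label format mirrors the one used in the prompts:

```python
# The three closed label sets of the annotation schema.
CATEGORIES = {"Animals", "Plants"}
TYPES = {"Organisms", "Parts", "Products", "Collective"}
USAGES = {"Literal", "Symbolical", "Petrified"}

def make_tag(category, type_, usage):
    """Combine the three levels into a single Category-Type-Usage label,
    rejecting values outside the schema."""
    if category not in CATEGORIES or type_ not in TYPES or usage not in USAGES:
        raise ValueError("label not in the annotation schema")
    return f"{category}-{type_}-{usage}"
```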

A.2. Technicalities
General rule Textual context is always dominant in annotating.

Discontinuous annotations Sometimes an annotation is discontinuous, meaning that there
     are words between the parts to be annotated. An example is: “esschen- en pijnhout”.
     Here, “esschen-” and “hout” should be annotated. This can be done by annotating both
     (here, meaning that you should also make a separate annotation for hout!) and then
     defining a relationship. Step-by-step guide: 1. Annotate “esschen-” and “pijnhout” and
     “hout”. 2. Select the first part (“esschen-”). 3. Right-click the second part (“hout”). 4.
     Click “Link to” and select “discontinuous entity”.

Part – Whole constructions Sometimes part-whole constructions occur, e.g. “de wortel van
      de brem”. Here, it is important to look at the parts that are separately referential (wortel
      and brem, here). If the text has “bremwortel” there is just one separate referential entity.

Syntactic head Concerning compound words, we annotate based on the syntactic head. You
     can find the syntactic head by doing a reference test: to what part of the compound can
     you refer? (“hazenpad”: not annotated; “padhaas”: annotated). In Dutch, the syntactic
     head is normally on the right side of the word.
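The reference test above can be approximated mechanically: for a Dutch compound, check whether its rightmost part is itself a plant/animal word. A toy sketch (the lexicon and function name are illustrative, not from the study's code):

```python
# Toy lexicon of animal/plant words (illustrative; a real one is far larger).
LEXICON = {"haas"}

def head_is_entity(parts):
    """Return True if the compound's syntactic head (the rightmost part
    in Dutch) is an animal/plant word, i.e. the compound is annotated."""
    return parts[-1] in LEXICON

head_is_entity(["hazen", "pad"])  # "hazenpad": head is "pad" -> False
head_is_entity(["pad", "haas"])   # "padhaas": head is "haas" -> True
```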

Co-references Co-references to entities are not tagged (in “de wolf is blij, hij eet graag
     haas”, “hij” should not be annotated). Likewise, words that in a specific context refer
     to plants/animals should not be annotated, unless the plant/animal aspect is inherent:
     “veulen” and “kalf” are names for young animals and should be annotated, whereas
     “jong”, “wijfje”, “mannetje”, “wederhelft”, and “lichaam” are not.

Adjectives As a general rule, adjectives are not annotated. There are a few exceptions: 1. If
     the adjective is part of the name of a plant/animal, it should be annotated (e.g. “blauwe”
     in “blauwe vinvis” and “kruipende” in “kruipende boterbloem”). 2. Sometimes a word
     looks like an adjective, but it is used as a substantive. In that case, annotate it.




Foreign languages When plants/animals/nature-locations are in a non-Dutch language, they
     should still be annotated. There are two exceptions: 1. If the whole text is in a different
     language, it should not be annotated; 2. If the entity's name is in a non-Latin script (e.g.,
     Arabic, Greek, Hebrew), it should not be annotated.


B. Used Prompts
A ChatPromptTemplate is used. “Chat messages” are therefore given between
parentheses; inside the parentheses, the “sender” of the message and the message itself
are separated by a comma. Parts in italics depend on the text that is annotated;
here, it is only indicated that these parts exist.

B.1. Pre-filtering prompt
(
System,
You are a helpful assistant. You'll get a historical Dutch text.
It's your task to tell whether (non-human)
animals or plants are directly present in this text. You do this
by reasoning step by step, and then end by completing: 'I deem
the statement that literal animals are present in this text to
be:' with True or False. I know you can do it!
),
(
User,
*Text to pre-filter*
)
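In code, the pre-filtering step amounts to sending the two messages above and checking the model's final verdict. A minimal sketch under the stated prompt contract (the function names are ours, and the system prompt is abridged; the study itself uses a ChatPromptTemplate):

```python
# Abridged version of the pre-filtering system prompt quoted above.
SYSTEM_PREFILTER = (
    "You are a helpful assistant. You'll get a historical Dutch text. ... "
    "end by completing: 'I deem the statement that literal animals are "
    "present in this text to be:' with True or False."
)

def prefilter_messages(text):
    """Assemble the (sender, message) pairs described above."""
    return [("system", SYSTEM_PREFILTER), ("user", text)]

def parse_verdict(completion):
    """The prompt forces the completion to end with 'True' or 'False'."""
    return completion.rstrip().rstrip(".").endswith("True")
```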


B.2. Few-shot direct annotation prompt
(
System,

You are a highly intelligent and accurate nature domain information extraction
system. I'll provide a small text, written in historical Dutch. Your task is
to recognize and extract all entities related to plants or animals. If you
have found anything that falls into that category, you should annotate it on
three levels: 1. Category; 2. Type; 3. Usage.

For Category, there are two possibilities: Plants and Animals.

* Animals: A living thing that can move around to search for food. It usually
has ways to see, hear, smell, taste, and feel the world around it.
* Plants: A living thing that usually stays in one place. It creates its own




food using sunlight, water, and air.

For Type, there are four possibilities: Organisms, Parts, Products, Collective.

* Organisms: A whole, living animal or plant. Think of it like one complete cat,
  or one whole oak tree.

* Parts: A piece of an animal or plant. Things like a bird's wing, a flower
  petal, or a bear's claw.

* Products: Something we get from a plant or animal that we use. Only
  first-order products count (i.e. it's the first 'product' that comes from the
  plant/animal, not a product of an earlier product). Examples are milk from a
  cow, honey from bees, or apples from a tree.

* Collective: Something is collective if the word refers to a heterogeneous
  multitude of plants/animals. Nature explicitly and inherently is a prominent
  part of, but it is not 100% clear what kinds of nature. If the collective
  might belong to both categories (you choose the best or least-wrong category).
  Examples are: weide, grastapijt, bos, woud, vee, kudde.

For Usage there are three possibilities: Literal, Symbolical, Petrified.

* Literal: When the word means exactly the animal, plant, part, or product
  itself. If you envision the text, you should see it. ('The bear ate a fish.')

* Symbolical: When the word is used as a symbol or metaphor, representing
  something else. If you envision the text, you should not see it. ('His heart
  was as cold as a snake.') Pictures are symbolic. Nicknames are probably
  symbolical.

* Petrified: if the plants/animals word is the name of something or someone.

To summarize, you should detect all plant and animal related words and tag them
according to this schema. So, for each found entity you annotate its category
(Plant/Animal), its Type (Organisms/Parts/Products/Collective), and its Usage
(Literal/Symbolical/Petrified).

It is extremely important that you work precise. Therefore, you should explain
step by step why you make a choice. Also extremely important: the annotation you
do should be in the form of a list with dictionaries. You should also do an
explanation, but your ultimate annotation should be in that format. So you
should always have output like this:




[{"span": span, "type": Category-Type-Usage}, ...]

Very important: if you don't find any entities, your annotation should be an
empty dictionary in a list:

[{}]

otherwise the postprocess script will get in trouble.

Good luck, I count on you!
)
(
System,
  The span must be exactly the same as in the original text, including white
  spaces.
)
(
User,
Here are some examples:
*Example1, Example2, Example3, Example4, Example5*

Please now annotate the following input:
Input: *Text to annotate.*
)
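The output contract above (a list of dictionaries, or `[{}]` when nothing is found) implies a small postprocessing step. A hedged sketch of what such a script might look like (the function name and regex approach are illustrative, not the study's actual code):

```python
import json
import re

def parse_annotations(completion, source_text):
    """Pull the list of dicts out of the model output, honour the [{}]
    'no entities' sentinel, and keep only spans that occur verbatim
    (whitespace included) in the source text."""
    match = re.search(r"\[.*\]", completion, flags=re.DOTALL)
    if match is None:
        return []
    entities = json.loads(match.group(0))
    if entities == [{}]:
        return []
    return [e for e in entities if e.get("span", "") in source_text]
```

Spans that the model paraphrases rather than copies are silently dropped here; a real pipeline would want to log them for inspection.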


B.3. Zero-shot direct annotation prompt
Same prompt as above, but without the examples.


C. Model Settings
OpenAI API parameters temperature = 1; top_p = 1; frequency_penalty = 0;
     presence_penalty = 0; gpt-4o version: gpt-4o-2024-05-13; gpt-3.5-turbo version:
     gpt-3.5-turbo-0125; API version: ‘2023-03-15-preview’.

GysBERT parameters architecture: BertForTokenClassification; optimizer: Adam;
     learning_rate: 2e-5.
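For reproducibility, the OpenAI parameters above can be collected into a single request template. A sketch: the helper name is ours, and only the listed parameter values are taken from the settings above:

```python
def chat_request(messages, model="gpt-4o-2024-05-13"):
    """Chat-completion request body with the sampling parameters
    listed in the model settings."""
    return {
        "model": model,
        "messages": messages,
        "temperature": 1,
        "top_p": 1,
        "frequency_penalty": 0,
        "presence_penalty": 0,
    }
```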


D. Held-out Test Set
   1. Doch in kommerlyke tyden word dit kruid, een weinig geroost, door de menschen ten
      spyze gebruikt.
   2. Al de gedroogde visch, die zich toen op het eiland bevond, werd daar van geheel zwart
      en onbruikbaar, ja in de twe naastvolgende jaren werden door die assche, of veeleer door




    de ’er mede vermengde scherpachtige rotsbrokjes of zand, gelyk boven by den brand op
    Jan Mayen eiland aangemerkt is, zo verre het
 3. Als men het Varken, in ’t midden aan weeder zyden van de rugge-graad, doorgesne-
    den heeft, zo laat men ieder helft, even onder de schouwder nog eens doorsnyden in de
    breedte.
 4. Het gerookt vleesch moet ook acht dagen in het zout leggen, en dan in zakken genaait
    in de rook gehangen worden, en moet drie of wel vier maanden rooken.
 5. Zouten van Spek, Hammen en Ossen-Vleesch, hoe daar mede te handelen.
 6. Dan legt men alles aldus in de kuip om in order te gebruiken: 1. de 6 klapstukken
    van de buik onder in, want ze konnen het langste duuren: 2. de staartstukken: 3. de
    schouwderbladeren: 4. de twee borststukken: 5. de twee beste ribben: 6. de vier an-
    dere ribben: de twee ongeschikte ribben die by de schouwders zitten: 7. de huspot zo
    men wil boven op; maar men moet zorg dragen dat de stukken wel vast in malkanderen
    sluiten, en de openingen moeten met zout gevuld worden, en wat zout ’er boven op, en
    eerst onder op den bodem gespreid; ook moet de kuip eerst schoon uitgebroeid en met
    kruidnagels gedroogt worden.
 7. Dit alles te zaamen in een groote pan of styfsel-kom of hakkebord gedaan, en 6 tinne
    kommetjes met Osse-vleesch-nat of ander vleesch-nat, of warm water daar op gegooten
    en digt toegedekt en altemets eens omgeroert. en zo een nagt over, op de warme plaat
    laaten staan weken; en dan stopt men ze gelyk Leverbeulingen; dog om dat de gort sterk
    zwelt maar half vol, en dan zynze half vol als men ze plat duuwt: Als ze gestopt zyn laat
    men ze zeer zagt kooken dat het water maar even beweegt omtrent een half uurtje, en
    men prikt ze ondertussen met een doorntje om niet te barsten en uittekooken, en dan
    zyn ze heel goed. 4.
 8. En schoon veele staande houden dat het eeten van dit vleesch geen quaad aan de men-
    schen doedt, zo zyn fatsoendelyke lieden nogtans beschroomd om het te gebruiken: om
    dit met zeekerheid te weeten zo kan men daar deeze proeve van neemen.
 9. §. XXXI. De Koemelk word tot artzeny gebruikt. De Melk is de voornaamste artzeny
    der Yslanders, en word daarom ook, zodra zy van de koe koomt, door gene anderen, dan
    alleen kranken, genoten.
10. Die het beter willen maken, en ’er de middelen toe hebben, kopen een weinig zout, sny-
    den, als het ge-slagt dier noch onafgehakt hangt, op drie of vier plaatsen een diepe snede
    in het vleesch, en doen in iedere opening een kleine hand vol zout, zich verbeeldende, dat
    het dus zelf, zo veel nodig is, door het gantsche beest trekt, en het vleesch, wanneer ’er
    vervolgens wind en rook by koomt, zeer wel bewaard word Op de beide gezegde wyzen
    handelen de ingezetenen ook met het schapenvleesch, als zy het voor hun huisgezin
    slagten.
11. Zeeusche Pens en Hoofdvleesch, hoe men die maaken zal.
12. Ossen en Koeyen vallen niet groter dan het kleinst geestvee in Duitsland; hebben, gelyk
    bereids gezegt is, gene Hoornen, en genieten alleen het voorrecht, door de huis lieden
    in den winter mede onder ’t dak genomen en met het zo kommerlyk gewonnen hooy,
    of, by mangel van het zelve, met het gedroogd zeegewas Zeenestel spaarzaam gevoed te
    worden.




13. Men stopt de beulingen maar half vol om dat die anders te ligt uitkooken of barsten; en
    men bind ze met een touwtje onder en boven toe, en dan wordenze op een schootel plat
    nedergelegt, tot dat ze gekookt worden: Voor al moet men niet vergeeten genoeg vet
    daar in te doen, want anders zyn de Leverbeulingen te droog.
14. weshalven de boeren ’er aldaar meer acht op geven. Dezen jagen alleen de Hamels in ’t
    gebergte; doch houden de Oyen zo veel by huis, als doenlyk is.
15. Als men zo veel moeiten niet doen wil om Rolpens en Hoofd-vleesch te maaken, zo snyd
    men de pens in stukken, en men kookt het met de kop tot dat alles gaar is, en dan legt
    men het vleesch met de pens door een, met wat zout en heele peper, in den azyn, in een
    keulse aarde pot, is heel goed om met appelen des winters gebakken te eeten. 17.
16. De Boter kaarnen de meesten voor en na zo hairig, als zy uit ongereinigde melk in een
    zamengenaaide schapenvacht gemolken is, en leggen dezelve dus op; weshalven een
    vreemdeling die Boter niet ligtelyk door de keel zoude konnen krygen.
17. Dan neemt men een groote vleesch keetel en men hangt ze vol regen water over het
    vuur, en als het water kookt doet men de beulingen daar in, dat die regt uit en niet op
    malkanderen leggen, daarom mag men niet meer als anderhalf douzyn beulingen te gelyk
    kooken; en ze moeten heel zagtjes kooken, omtrent een half uur lang.
18. Afhakken van ’t vleesch in de Slacht-tyd, en hoe men de stukken best en ten meesten
    voordeelen zal gebruiken, en hoe men verder met alles in de Slacht-tyd, moet handelen.
    1.
19. Neemt by de 20 ponden, gehakt redelyk vet, varkens vleesch, anderhalf loot of twee
    loot nootemuscaten; twee loot nagelen; twee loot zwarte peeper, dit alles ter deegen fyn
    gestooten zynde, zo roert men het onder anderhalf vierendeel zout, en men kneed het
    door het gekapte Varkensvleesch heen; en men laat het zo een nacht met een schoone
    doek bedekt staan doortrekken.
20. Hunne vellen vallen in den winter, als zy het meeste en vastste hair hebben, het best;
    weshalven de Yslanders dezelve dan naarstig vangen, en wel, uit aangebore afschuuw
    van schietgeweer, met uitgezette netten of vangyzers, die gelyk een kleermakersschaar
    gevormt, en met een dood lam ten lokaas voorzien zyn.
21. geweld der uitbrekende en uitgezette lucht een groot gedeelte van den berg, ’t geen te
    zwaar was, om opgeligt te worden, op zyde en niet slegts een gantsche myl wegs langs het
    eiland tot aan het strand, maar zelfs noch een myl verr’ in zee voortgeschoven, en aldaar
    neder gezet wierd, alwaar het, onaangezien de diepte, in den beginne wel 60 vademen
    boven het water uitstak, en aldaar merendeels noch staat e.
22. Neemt voor het vleesch, het geen men daar in legt, het vleesch van de schouwder van
    een Os dat het malste is; of anders een van de platte billen.
23. Ja zy zyn het zelven, die gemeenlyk het begin der aardbranden veroorzaken.
24. §. XXXIV. Hebben geen Zwynen, maar wel Honden en Katten.
25. Doch wat de eigentlyke en natuurlyke oorzaak dezer zeldzaamheid zyn mag, is niet zeer
    ligt te beseffen w.
26. Reusel, hoe men die wel zal smelten.
27. Van harde of Coraalachtige Zeegewassen wist myn berichter te zeggen, dat enigen van
    dezelven op de gronden gevonden wierden; doch konde hen niet noemen of beschryven,
    nadien hy, volgens zyne eigen belydenis, ’er nooit naar gezien had.




28. Dezen zyn de Snoriper op de lappische Alpen, die zich a steeds op het land houden, meer
    lopen dan vliegen, en mitsdien niet bezwaarlyk te vangen zyn.
29. Men moet zich verwonderen, wat zy konnen uitstaan; doch zy worden wel degelyk door
    de ongemakken verhard, nadien zy jaar uit jaar in in het open veld onder den bloten
    Hemel blyven, en ’s winters onder de sneeuw zowel, als ’s zomers, hun voeder zelven
    moeten zoeken, waar toe zy alleen de weldaad van de natuur genieten, dat zy met byzon-
    dere styve, lange en dikke hairen, allermeest tegen den wintertyd, bedekt zyn.
30. Vervol-gens bragt men het zieke volk aan land, ’t geen, ofschoon het, behalven enig Lep-
    elblad, niet als Zuring in warme Melk en een weinig Schapenvleesch nuttigde, nochtans
    velen binnen acht en de anderen binnen veertien dagen zo fris en gezond werden, dat zy
    huppelden en sprongen, en in minder dan vier weken na hun komst weder scheep gaan,
    zelven hun anker lichten, en die lange en bezwaarlyke reize voorts vrolyk voleinden
    konden.
31. Het vleesch snyd men eerst aan stukken als Ossekarbenaden; en dan snyd men het aan
    lange reepen omtrent een vinger dik en vierkant; men snyd het vet ook aan zulke langw-
    erpige stukken; en dan bestrooid men de pens met wat geprepareerd zout en kruit, gelyk
    ik boven gezegt heb.
32. Men neemt 3 loot bruine peper, en een halfvierendeel nagelen; dit te zaamen eerst fyn
    gestoten en in een aarde schootel gedaan, en een hand vol gedroogde Saly, die men op
    den haart wat te droogen legt en die klein gewreven is, en een hand vol of vier zout daar
    onder geroert, tot men denkt dat men genoeg zal hebben; want den een doet het wel wat
    hartiger dan den ander.
33. Men behoefd ’er geen Sukade nog Amandelen in te doen als men niet wil, en is evenwel
    goed maar zo lekker niet.
34. De Harsten laat men een dag of vyf in het zout leggen en men moet ze niet te groot laaten
    hakken, om dat ze anders te ongeschikt zyn, en ieder een doet dit naa de groote van zyn
    huisgezin, ook zyn die Harsten dus zeer goed om in den Oven gezet en gebraaden te
    worden.
35. Het vleesch om in de Kuip in te zouten, daar neemt men toe de zes klapstukken van
    de buyk, de twee staartstukken, de schouwderbladeren, de twee borststukken, de vier
    andere ribben, als men de twee beste ribben wil in de rook hangen, anders kan men ook
    de Paterstukken inzouten, en dan nog de twee ongeschikte ribben die by de schouders
    zitten; en men laat die stukken groot of klein hakken naa dat men het wil hebben en het
    huisgezin groot is.
36. Mitsdien ziet men zelden op Ysland andere, dan uitgebrande bergen, aan en om welke
    men bequaam de werkingen en overgebleven tekenen van een vorigen brand bespeuren
    kan.
37. Buiten dien tyd leggen de inwoonders, nadien de Vossen de schapen zeer schadelyk
    zyn, kraanogen (nuces vomicae) in honig geweekt, die zy, anders niets zoets te eten
    bekomende, zeer begerig inzwelgen.
38. Neemt 4 kop Gort schoon afgewasschen: 4 pond korenten die wel verlezen en schoon
    gewassen zyn: 8 loot gestoote kaneel: 1 loot gestoote nagelen: 3 loot gestoote notemus-
    caaten; 1/2 pond poeijer-zuiker: 1 pond gepelde amandelen in stukjes gesneden: 6 sukade




    schellen aan stukjes gesneden: Een hand vol zout: 10 pond of daar omtrent Osse-niervet
    aan dobbelsteentjes gesneden.
39. Het zoude gezwellen verwekken, en, als men ’er veel van eet, sterk openende zyn.
40. de Ravens verjagen; doch het Lam, vermits het, zyn voeder niet konnende zoeken,
    elendig omkomen moet, slagten, en het het zachte vel afstropen, ’t geen de peltery geeft,
    die in Denmarken en Holstein onder den naam van Schmaaskin of Schmaasken x verkogt
    en zeer veel door lieden van een middelbaar vermogen gedragen word.
41. Neemt een van de grootste Kalfskoppen, en reinigt die, en wascht ze vier of vyfmalen
    ter degen schoon af, en laatze een nacht in schoon regen water staan te trekken, dat ’er
    de slym en het bloedige wel schoon af is, en hangt de kop met schoon regen-water over
    het vuur; en doet ook in de keetel, de nek van het Varken, en de twee ooren met wat
    veel zwoort dat ’er genoeg is om het vleesch in het hakkebord van booven en onderen te
    bedekken; en als men te veel zwoort en niet genoeg vleesch heeft, zo doet men ’er wel
    een of twee van de vleesigste stukken van het varken by, en men laat het te zaamen een
    uur of drie kooken, na dat men het alvorens wel schoon geschuimt heeft, en het moet
    zeer gaar zyn tot dat het vleesch van de beenen af valt, en dan schept men het uit op een
    aarde vergiettest of doorslag.
42. Hunne manier, om het Rundvee te slagten, heeft ook iets byzonders, Zy kollen het niet
    voor den kop, menende, dat daar door het bloed in ’t vleesch stremt, en mitsdien niet
    lopen kan; maar steken het een dun penmes diep in den nek, waar door het ter aarde
    valt; als dan trekken zy de poten gezwind met strikken zamen, en openen de keel, op
    dat al het bloed zoude uitvlieten Het ingewand word door de Yslanders allereerst, zonder
    veel te reinigen, genuttigt, en het dier zelf afgehakt.
43. Neemt de Lever van het Varken en wascht die schoon, en laat die op een aarde schotel
    leggen; doet daar zo raauw de vellen en spieren met een mes ter degen schoon uit, en
    doet het in een schoon tobbetje.
44. Voor een geheele pens heeft men omtrent 20 pond vleesch noodig, behalven het vet dat
    men daar by gebruikt, dat nog omtrent 10 ponden is.
45. Laat dan een ketel of twee met regenwater kooken en laat het Koud worden; en als het
    koud is neemt dan schaars drie kommetjes van dat water tegen ruim een kommetje wyn
    azyn, en mengt dat te zaamen onder malkanderen zo veel tot dat de pens, als he daar over
    gegooten is, kan onderleggen, en zet ze dan zo open weg daar ze niet te vogtig staan, is
    heel goed om met appelen gebakken, of gestooft met wyn te eeten. 10.
46. Saucysen of Worst van Varkenvleesch, hoe men die maaken zal.
47. De stukken worden niet met zout gewreven, maar slegts twemaal door zeewater gehaalt,
    en dan in de lucht, op dat zy winddroog zouden worden, en vervolgens in hunne hutten
    over hunne haardsteden gehangen, om dezelve te roken, en te meer te doen drogen Dus
    behandelen zy hun geslagt half verrot en half stinkend vleesch, tot zy het voorts opeten.
48. Het vleesch om in de rook te hangen daar toe neemt men de Paterstukken, de twee andere
    platte billen; en de twee beste ribben; en de spieren, die achter tusschen de beenen van
    de ribben inzitten, moeten daar schoon uitgedaan worden, om dat daar door ligt verderf
    ontstaan kan.




  49. Dan doet men twee geraspte nootemuscaten, met wat gestoote foelie en met wat zout
      daar in, en men hakt het te zaamen onder een tot het redelyk klein, maar niet al te klein
      is.
  50. Alsdan begeeft een harder zich met de afgerichte honden op een heuvel, en geeft met zyn
      hoorn een teken, waarop de honden zich verdelen, en de Schapen van alle kanten uit de
      klippen en wildernissen in een zekere omtuining of staketzel dryven, ’t geen vooraan
      wyd uitgezet is; doch, op dat zy niet zouden konnen ontvluchten, naar achter allengs
      enger word.


E. Cost
In the humanities, costs are often an important consideration. For all strategies, there is the
cost of establishing annotation guidelines and making something of a test set. After that point:

    • Human annotation costs €0.70 per sentence;
    • Direct annotation costs €0.007 per sentence (for GPT-4o, directly via OpenAI), with zero-
      shotting being slightly cheaper because the examples are omitted from the prompt;
    • Indirect annotation costs €4.50 to train the model, and nothing per sentence afterwards.

  It is important to emphasize that the cost of each strategy is likely to change over time, since
GenAI models are getting cheaper, and that human annotation costs may differ significantly per
country or institution. Other factors should also be considered, such as available hardware
and environmental effects.
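Under the per-sentence figures above, break-even points are easy to compute. A sketch with the paper's numbers (real costs will drift, as noted):

```python
def human(n):    return 0.70 * n   # € for a corpus of n sentences
def direct(n):   return 0.007 * n  # GPT-4o, directly via OpenAI
def indirect(n): return 4.50       # one-off training cost, free per sentence

# Indirect becomes cheaper than direct beyond 4.50 / 0.007 ≈ 643 sentences.
break_even = 4.50 / 0.007
```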


F. Online Resources
Code and data used in this study can be found here:

    • Data: https://www.dbnl.org/letterkunde/pd/index.php;
    • Code and annotations: GitHub Repository.



