
Benchmarking Historical Phase Recognition from Text and Events

Fabio Celli

Marco Rovera

Fondazione Bruno Kessler, Trento, Italy; Maggioli Research, Santarcangelo di Romagna, Italy

2025

This paper presents preliminary studies on a benchmark for the Historical Phase Recognition task, which explores the application of computational linguistics to the study of long-term historical dynamics. We compare the utility of Event Tagging and BERT embeddings for classifying the phases of secular cycles defined by the Structural-Demographic Theory. We explore this task both as a five-class classification (crisis, growth, population immiseration, elite overproduction, state stress) and as a binary classification (rise, decline), on the basis of human- and LLM-annotated labels. Our findings reveal that Event Tagging, when aligned with human annotations, yields good performance in multi-class classification, but not in binary classification. Conversely, using BERT to extract features directly from text yields better performance with LLM-generated labels, in particular on the binary classification task. We also report higher inter-annotator agreement between LLMs than between humans when labeling historical phases.

Keywords: Historical Phase Recognition, Cultural Analytics, Structural-Demographic Theory, Large Language Models

CLiC-it 2025: Eleventh Italian Conference on Computational Linguistics, September 24–26, 2025, Cagliari, Italy
* Corresponding author.
fabio.celli@maggioli.it (F. Celli); m.rovera@fbk.eu (M. Rovera)
© 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction and Motivation

Historical Phase Recognition is a novel task that aims at classifying the phases of past societies according to existing theoretical frameworks. This task, based on the idea that history, like language [2], is a complex adaptive system [1], can be useful for exploring and comparing societal adaptation processes in their long-term trends [3], in order to find replicable patterns. Societies have historical and structural dimensions [4] and evolve through dynamics that create cycles [5], following irreversible developmental paths that eventually cause them to break down [6] or recover. Crucially, much historical information is expressed in natural language [7] and is available from open sources like Wikipedia [8, 9]; hence, computational linguistics tasks such as event detection [10] can offer a great contribution to this line of research.

A theoretical framework in this area that has proven to be suitable for computational analysis is the Structural-Demographic Theory (SDT) [11]. By integrating this theory with data modeling techniques, researchers were able to make remarkably accurate predictions about the global crises that unfolded in the 2020s [12]. This predictive power underscores the value of SDT as a tool for analyzing complex socio-political dynamics within historical datasets [13]. Specifically, SDT posits that historical cycles are characterized by five distinct phases:

• 0. Crisis (widespread conflict that results in a restructuring of the socio-political order);
• 1. Growth (a new order creates social cohesion, triggering high productivity and increasing competition for social status);
• 2. Population immiseration (increased competition for status and resources leads to rising inequality);
• 3. Elite overproduction (inequalities lead to radical factionalism and frustrated individuals who may become agents of instability);
• 4. State stress (the rising instability brings fiscal distress, and both lead the State towards potential crises with widespread conflicts, restarting the cycle).

SDT has proven to be a valuable framework for understanding a diverse array of historical occurrences. For instance, it has been applied to analyze the underlying causes of the French Revolution, the elite rivalries that fueled the American Civil War [14], and the factors contributing to the collapse of the Qing Dynasty [15]. Furthermore, SDT is also employed to analyze contemporary events, ranging from the Egyptian revolution of 2011 [16] to the political instability experienced in the US in 2021 [17]. Previous work in Historical Phase Recognition [18] released the Chronos dataset, annotated by humans, and demonstrated that systems can learn models with performance above chance, although far from perfect. Recent research in the field reports that LLMs can reach human performance in Historical Phase labeling and that the intra-annotator agreement of LLMs is consistent [19].

Still, there is no benchmark for Historical Phase Recognition, and several research questions about this task remain unanswered, for instance:

• (RQ1) Can Event Tagging provide a generalization that helps Historical Phase Recognition?
• (RQ2) Can LLMs-as-annotators reach a higher consensus than humans in SDT labeling?
• (RQ3) Which kind of label is easier to model, the one made by humans or the one made by LLMs?
• (RQ4) Is it easier to perform Historical Phase Recognition as a 5-class or as a binary classification task?

To answer RQ1 we use EventNet-ITA, a frame parser¹ trained on a large Italian corpus annotated with semantic frames of events². This tool provides a fast and effective method for extracting Event Frames in Italian, achieving an F1-score of 0.9 for Frame Identification and 0.72 for Frame Element Identification on the original dataset [20]. To answer RQ2 we employ GPT4 [21] and Llama 3.1 405B [22] as annotators, producing a new SDT annotation of the data. To answer RQ3 we adopt a perspectivist approach [23], running the classification task on different label sets and even on combinations of labels.

Lastly, to answer RQ4, we aggregate phases 1 and 2 under the label "rise" and phases 3, 4, and 0 under the label "decline", and then perform a binary classification task.

The paper is structured as follows: in Section 2 we describe how we created a benchmark from the Chronos dataset to promote the reproducibility of future experiments. In Section 3 we describe our experimental design, with annotation guidelines, prompts, analysis of labels and the results of the classification experiments. Finally, in Section 4, we draw our conclusions.

2. Data

Previous work on the Historical Phase Recognition task made a huge effort to produce annotated data [18], but the results of the previous classifications are not fully replicable. Hence, we decided to develop a benchmark with fixed training and test sets out of the Chronos dataset.

The Chronos dataset, built upon the Seshat historical databank [24] and augmented with Wikipedia content, provides time-series data, in Italian and English, of historical events for 366 polities across 18 sampling zones, spanning from the Neolithic to the 2010s CE. Each row in the dataset represents a historical decade of a polity in a sampling zone. Textual descriptions of the selected events that happened in the decade include information about wars, reforms, rulers, population, elites, disasters, alliances, socio-economic context, famines, protests, elite changes, and religions. Descriptions are summarized to an average of 400 characters per decade, with source references when available. Each entry includes a timestamp, historical age, sampling zone, world region, and a standardized Polity ID encoding origin, name, societal type, and periodization. The dataset contains more than 9000 rows, but most of them have no textual description, especially those in remote times. Moreover, there are duplicates, as some polities expanded over more than one sampling zone and were sampled more than once. The dataset also contains a flag indicating whether the historical information reported is recorded or supposed. Using this information, we created a benchmark.

¹ https://huggingface.co/mrovera/eventnet-ita
² https://huggingface.co/datasets/mrovera/eventnet-ita

2.1. Annotation and Agreement

First, we extracted event tags from the historical descriptions in Italian with EventNet-ITA. Then we removed duplicates and selected the rows with tags, text and recorded information. We obtained 1422 rows with data spanning from antiquity to the 2010s. The data also included the original SDT labels, annotated manually following these guidelines:

1. Read the textual description to identify key events: wars, reforms, rulers, population, elites, disasters, epidemics, alliances or treaties, socio-economic context, famines or financial stress, protests or movements, religions.
2. Use polity identifiers to find the start and end points of cultures. The end of a culture represents a crisis period.
3. Starting from the beginning of a culture, initially assign the sequence of labels of a standard secular cycle model: 1,1,2,2,3,3,4,4,4,0, and then evaluate whether to keep or change the labels in each decade. It is possible to have longer or shorter cycles. There can be only one label 0 (crisis) per cycle. A polity can have one or more cycles.
4. Having in mind the key events in the textual description, select one of the following labels to describe the decade:
   1=growth. A society is generally poor when it experiences renewal or change followed by demographic (but not always territorial or economic) growth. Reforms, alliances, wars won or similar events are potential indicators of this phase.
   2=impoverishment of the population. Potential economic and/or territorial expansion slows while demography continues to expand. The elite takes much of the wealth and defines the status symbols. Stability and external attacks are potential indicators of this phase.
   3=overproduction of the elites. The wealthy seek to translate their wealth into positions of authority and prestige. The population becomes poor. Movements, protests, and wars are potential indicators of this phase.
   4=state stress. The elites want to institutionalize their advantages in the form of low taxes and privileges that lead the state into fiscal difficulties. Wars, protests and changes in the elite are potential indicators of this phase.
   0=crisis. A triggering event such as a war, revolt, famine or disaster that the state is unable to manage leads to a new configuration of society. Emigration of elites, subjugation to other societies, civil wars or profound reforms are potential indicators of this phase.
5. Use the progressive order of the phases if no textual description is available for the decade.
6. Make sure there is a progressive order of the labels (e.g. phase 3 must follow phase 2). All labels can be repeated in the following decade except the crisis phase, which conventionally lasts one decade.
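As a rough illustration of guideline steps 2, 3 and 5, the initial draft an annotator starts from can be generated mechanically. The sketch below is hypothetical (the helper `initial_labels` and its `end_with_crisis` flag are not part of the Chronos tooling): it only repeats the standard pattern 1,1,2,2,3,3,4,4,4,0 over the decades of a polity, and the draft would then be revised decade by decade against the descriptions.

```python
from itertools import cycle, islice

# Standard secular-cycle template from the annotation guidelines (one label per decade).
STANDARD_CYCLE = (1, 1, 2, 2, 3, 3, 4, 4, 4, 0)


def initial_labels(n_decades: int, end_with_crisis: bool = False) -> list[int]:
    """Hypothetical helper: draft SDT labels for a polity spanning n_decades.

    The standard cycle is repeated as often as needed (guideline 3). Optionally,
    the last decade is marked as crisis, since the end of a culture represents a
    crisis period (guideline 2). Annotators then revise each decade.
    """
    if n_decades < 0:
        raise ValueError("n_decades must be non-negative")
    labels = list(islice(cycle(STANDARD_CYCLE), n_decades))
    if end_with_crisis and labels:
        labels[-1] = 0
    return labels


print(initial_labels(12))  # [1, 1, 2, 2, 3, 3, 4, 4, 4, 0, 1, 1]
```

The draft is only a starting point: forcing a final crisis can create transitions that the annotator still has to smooth out.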

The annotation in the Chronos dataset was validated with three human annotators, who independently labeled a sample of 93 examples from the data. The initial agreement was low (Fleiss' κ = 0.206) because a single disagreement has a cascading impact on the rest of the sequence, but after a training session and the use of a standard pattern to start with (the sequence of secular cycle labels 1,1,2,2,3,3,4,4,4,0), the agreement between humans rose to Fleiss' κ = 0.455.
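For reference, Fleiss' κ compares the observed proportion of agreeing rater pairs per item with the agreement expected from the overall label proportions. The following is a minimal from-scratch sketch, assuming every item is labeled by the same number of raters; it is not the implementation used to obtain the values above. The toy ratings are invented for illustration: the first three rows mirror the human, GPT4 and Llama 3.1 labels of the example rows in Section 2.2, the fourth is made up.

```python
from collections import Counter


def fleiss_kappa(ratings: list[list[int]]) -> float:
    """Fleiss' kappa for items that were all labeled by the same number of raters.

    ratings[i] holds the labels that the raters assigned to item i.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    counts = [Counter(r) for r in ratings]  # n_ij: raters putting item i in category j
    categories = set().union(*counts)
    # Observed agreement: share of agreeing rater pairs per item, averaged over items.
    p_bar = sum(
        (sum(c[j] ** 2 for j in categories) - n_raters) / (n_raters * (n_raters - 1))
        for c in counts
    ) / n_items
    # Chance agreement from the overall proportion of each category.
    total = n_items * n_raters
    p_e = sum((sum(c[j] for c in counts) / total) ** 2 for j in categories)
    if p_e == 1:
        raise ValueError("kappa is undefined when every rating uses one category")
    return (p_bar - p_e) / (1 - p_e)


toy = [[4, 4, 4], [2, 2, 2], [4, 1, 3], [1, 1, 2]]
print(round(fleiss_kappa(toy), 3))  # 0.412
```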

In order to answer RQ3 (whether it is easier to predict labels annotated by humans or by LLMs), we produced new labels using GPT4 (reportedly 1.8 trillion parameters) and Llama 3.1 (405 billion parameters) with the prompt reported in Figure 1 and a temperature of 0.5. We provided the input data in chunks containing sequential decades of one or two polities per run. Although the prompt explicitly required the models to assume that the sequence of labels follows a standard secular cycle model like the one used by humans (1,1,2,2,3,3,4,4,4,0), the LLMs sometimes produced unordered labels.
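Because the LLM outputs could break the ordering required by the prompt, a simple check can flag such cases. This is a sketch under our reading of the transition rules stated in the prompt (no turning back, no skipped phases, phase 1 after a crisis, a crisis lasting one decade); `ALLOWED_NEXT` and `invalid_transitions` are hypothetical names, and the check should be applied separately to the decades of each polity, in chronological order.

```python
# Allowed label for the next decade, following the transition rules in the prompt:
# no turning back, no skipped phases, and phase 1 always follows a crisis (0).
ALLOWED_NEXT = {0: {1}, 1: {1, 2}, 2: {2, 3}, 3: {3, 4}, 4: {4, 0}}


def invalid_transitions(labels: list[int]) -> list[tuple[int, int, int]]:
    """Return (position, previous, current) for each decade that breaks the rules."""
    return [
        (i, prev, curr)
        for i, (prev, curr) in enumerate(zip(labels, labels[1:]), start=1)
        if curr not in ALLOWED_NEXT.get(prev, set())
    ]


print(invalid_transitions([1, 1, 2, 4, 0, 0, 1]))  # [(3, 2, 4), (5, 0, 0)]
```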

In order to create a benchmark, we split the data into a training set (1222 instances) and a test set (200 instances). The labels have comparable distributions in the training and test sets, as reported in Figure 2. While human and LLM labels approximate a log-normal distribution, the averaged labels approximate a normal distribution. This is because averaging labels with large misalignments (such as label "1" and label "4") tends to produce more "2" labels, which became a wastebasket label.
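The wastebasket effect can be reproduced with a toy calculation. The sketch assumes that averages are turned into integers with Python's `round()`, which rounds halves to the nearest even integer; the paper does not state the rounding rule, although this choice is consistent with the third example row in Section 2.2. The uniform-disagreement scenario is purely illustrative and is not a property of the Chronos data.

```python
from collections import Counter


def average_label(labels: list[int]) -> int:
    """Average several SDT labels and turn the result into an integer.

    Assumption: Python's round() (round-half-to-even) is used; the paper only
    says that averages were "turned into integer values".
    """
    return round(sum(labels) / len(labels))


# Third example row in Section 2.2 (human 4, GPT4 1, Llama 3.1 3):
print(average_label([4, 1, 3]), average_label([1, 3]))  # 3 2

# Toy illustration: two annotators whose labels are independent and uniform.
pairs = Counter(average_label([a, b]) for a in range(5) for b in range(5))
print(sorted(pairs.items()))  # [(0, 3), (1, 3), (2, 13), (3, 3), (4, 3)]
```

Under this scenario more than half of the averaged labels collapse onto phase 2, since distant pairs such as (1, 4) and (0, 4) both land there.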

Figure 1: Prompt used to annotate the SDT phases with LLMs.

Act as an expert historian and consider the Structural Demographic Theory (SDT). Given a set of descriptions of historical decades for different polities, label each description with one of the following secular cycle phases (sdtphase):
0=crisis (in this phase may happen societal collapse patterns, power transitions, conflicts, administrative or social structure changes, and external influences. Look for signs of civil wars, military coups, environmental factors, population movements, reform of tax systems, trade network disruptions, class conflicts, and foreign invasions).
1=growth (a society recovers from a crisis finding a new fresh culture that creates social cohesion. to recognize this phase examine the power structure patterns, legitimacy of rule, social organization, cultural elements, military aspects, and social changes. Look for the presence of strong elite classes, religious legitimation of power, centralized administrative systems, trade networks, cultural practices, territorial expansion, and population movements);
2=population impoverishment (growth slows and inequalities begin to emerge. to recognize this phase evaluate the power dynamics, economic patterns, military aspects, cultural/religious elements, administrative features, and infrastructure development. Look for succession struggles, trade route development, territorial conquests, religious tolerance, bureaucratic reforms, and construction projects);
3=elite overproduction (the number elite aspirants rises and the social lift mechanisms deteriorate. To recognize this phase assess power dynamics, governance, economic patterns, social structures, cultural and technological development, and common catalysts for change. Look for power struggles, trade system developments, social unrest between elite and population, religious developments, and military conflicts),
4=state stress (elites struggle to institutionalize their advantages. to recognize this phase review political instability, power struggles, economic challenges, military conflicts, administrative changes, and social/religious tensions. Look for succession disputes, financial crises, territorial loss, reforms to advantage specific elite groups, social unrest and religious conflicts).
Initially assume that the sequence of labels follows a standard secular cycle model: 1,1,2,2,3,3,4,4,4,0 and then evaluate whether to keep or change the labels in each decade. Evaluate each label on the basis of the preceding and following ones. It is possible to have longer or shorter cycles. A cycle cannot turn back and cannot skip phases. So if in 1940 there is a phase 0, in 1950 there should be a phase 1, in 1960 there can be a phase 1 or phase 2. If in 1960 there is a phase 2, in 1970 there can be a phase 2 or phase 3, not a phase 4. If in 1970 there is a phase 3, in 1980 there can be a phase 3 or 4, and if in 2000 there is phase 4, in 2010 there can be a phase 0 or another phase 4. The decade after phase 0 the cycle restarts from phase 1.
This is an example of the input (json): ⟨⟩ and this is the desired output (csv): ⟨⟩ set of descriptions to label (json): ⟨⟩

We computed the inter-annotator agreement over all 1422 examples and all pairs of annotators, greatly expanding the experiments presented in the literature. We evaluated results with κ statistics and Krippendorff's α [25]. Although pairs that mix human and LLM annotations have an agreement comparable to previous results, here GPT4 […] 0.5; moreover, these findings closely match the results obtained when both humans and LLMs received identical instructions and the temperature was set to zero [19].

The evaluation with Krippendorff's α, which could better capture the importance of label order, shows results similar to the ones computed with Fleiss' and Cohen's κ, suggesting that there might be disagreements on distant labels, like 0 and 4. Results are reported in Table 1.

Table 1: Results of inter-annotator agreement between pairs of Histor[…]
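To make the difference between distance metrics concrete, the sketch below computes Krippendorff's α for complete data (every unit coded by all annotators) with two distance functions: the nominal metric, which treats every disagreement alike, and the interval metric (squared difference), which penalizes distant labels more. The metric actually used for Table 1 is not stated, so using the interval metric here is an assumption. Note that with the interval metric phases 0 and 4 are maximally distant, even though phase 0 directly follows phase 4 in the secular cycle.

```python
from collections import Counter
from itertools import permutations
from typing import Callable

Metric = Callable[[int, int], float]


def nominal(a: int, b: int) -> float:
    return 0.0 if a == b else 1.0


def interval(a: int, b: int) -> float:
    return float((a - b) ** 2)


def krippendorff_alpha(units: list[list[int]], delta: Metric) -> float:
    """Krippendorff's alpha for units coded by several annotators (no missing values)."""
    units = [u for u in units if len(u) >= 2]  # only pairable units count
    values = [v for u in units for v in u]
    n = len(values)
    # Observed disagreement: ordered pairs of values within each unit.
    d_o = sum(
        sum(delta(a, b) for a, b in permutations(u, 2)) / (len(u) - 1) for u in units
    ) / n
    # Expected disagreement: ordered pairs of values across the whole dataset.
    # delta(c, c) == 0 for both metrics, so same-value pairs add nothing.
    counts = Counter(values)
    d_e = sum(
        n_c * n_k * delta(c, k) for c, n_c in counts.items() for k, n_k in counts.items()
    ) / (n * (n - 1))
    return 1.0 - d_o / d_e


near_miss = [[1, 1], [3, 3], [3, 4]]
print(round(krippendorff_alpha(near_miss, nominal), 3))   # 0.545
print(round(krippendorff_alpha(near_miss, interval), 3))  # 0.889
```

On the same toy data, a near-miss disagreement (3 vs 4) costs much less under the interval metric than under the nominal one; α is undefined when all values are identical.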

2.2. Contents

The final dataset contains the following features:

• a decade ID formatted with a standard method: 2 letters to indicate the area of origin of the culture, 3 letters to indicate the name of the polity, 1 letter to indicate the type of society (c=culture/community; n=nomads; e=empire; k=kingdom; r=republic), 1 letter to indicate the periodization (t=terminal; l=late; m=middle; e=early; f=formative; i=initial; *=any) and a number corresponding to the decade. For example, "EgPdyk*-2960" is the pre-dynastic kingdom of Egypt in the 2960s BC, "ItRomrm-220" is the middle Roman Republic in the 220s BC, and "TrOttet1850" is the terminal phase of the Ottoman Empire in the 1850s;
• a short Italian textual description of the decade (the one used for the experiments);
• a short English textual description of the decade;
• the list of tags extracted from the text;
• human-annotated SDT labels;
• SDT labels annotated with GPT4;
• SDT labels annotated with Llama3.1;
• the average of all the SDT labels, turned into integer values;
• the average of the SDT labels generated with LLMs, turned into integer values;
• the binary labels annotated by humans, obtained from SDT labels (1,2=rise; 3,4,0=decline);
• the binary labels annotated by LLMs, obtained from SDT labels (1,2=rise; 3,4,0=decline).
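The ID convention and the binary mapping can be sketched as follows. The regular expression is a hypothetical pattern inferred from the example IDs above (it is not distributed with the dataset), and the encoding rise = 1, decline = 0 is taken from the example rows below, where phase 4 maps to 0 and phase 2 maps to 1.

```python
import re
from dataclasses import dataclass

SOCIETY_TYPES = {"c": "culture/community", "n": "nomads", "e": "empire",
                 "k": "kingdom", "r": "republic"}
PERIODS = {"t": "terminal", "l": "late", "m": "middle", "e": "early",
           "f": "formative", "i": "initial", "*": "any"}

# Hypothetical pattern inferred from the example IDs: area (2 letters), polity
# (3 letters), society type, periodization, then the decade (negative for BC).
DECADE_ID = re.compile(r"([A-Za-z]{2})([A-Za-z]{3})([cnekr])([tlmefi*])(-?\d+)")


@dataclass(frozen=True)
class DecadeId:
    area: str
    polity: str
    society_type: str
    period: str
    decade: int


def parse_decade_id(text: str) -> DecadeId:
    match = DECADE_ID.fullmatch(text)
    if match is None:
        raise ValueError(f"unrecognised decade ID: {text!r}")
    area, polity, kind, period, decade = match.groups()
    return DecadeId(area, polity, SOCIETY_TYPES[kind], PERIODS[period], int(decade))


def to_binary(phase: int) -> int:
    """Rise (phases 1, 2) -> 1; decline (phases 3, 4, 0) -> 0, as in the example rows."""
    if phase not in (0, 1, 2, 3, 4):
        raise ValueError(f"not an SDT phase: {phase}")
    return 1 if phase in (1, 2) else 0


print(parse_decade_id("ItRomrm-220"))
# DecadeId(area='It', polity='Rom', society_type='republic', period='middle', decade=-220)
print([to_binary(p) for p in (4, 2, 0)])  # [0, 1, 0]
```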

Examples of data follow³:

1. JpKamk*1290, "al tempo del reggente Hōjō Sadatoki (r. 1284–1301) per il principe Hisaaki il clan Hōjō era alleato del clan Adachi. Tuttavia un complotto di Adachi Yasumori per usurpare gli Hōjō portò al colpo di stato noto come incidente Shimotsuki. vinse Hojo.", "at the time of Regent Hōjō Sadatoki (r. 1284–1301) for Prince Hisaaki the Hōjō clan was allied of the Adachi clan. However a plot by Adachi Yasumori to usurp the Hōjō resulted in the coup known as Shimotsuki incident. the Hōjō won.", process*PROCESS_START activists*POLITICAL_ACTIONS invader*INVADING PROCESS_START POLITICAL_ACTIONS INVADING,4,4,4,4,4,0,0
2. IqBabke-1750, "possibile apertura di una rotta commerciale per beni di lusso e minerale di stagno verso il Levante (Caanan) e l'Anatolia orientale (occupata dagli Assiri).", "possible opening of a commercial route for luxury goods and tin ore towards the Levant (Caanan) and eastern Anatolia (occupied by Assyrians).", land*OCCUPANCY occupier*OCCUPANCY OCCUPANCY,2,2,2,2,2,1,1
3. EgMamke1340, "peste nera ad Alessandria nel 1347. Serie di sultani di breve durata.", "black death in Alexandria in 1347. Series of short lived Sultans.", old*TAKE_PLACE_OF killer*KILLING cause*DEATH place*DEATH time*DEATH TAKE_PLACE_OF KILLING DEATH,4,1,3,3,2,0,1

Example 1 describes the Japanese Kamakura period in the 1290s and is a case where all the annotations agree on phase 4 (or 0, "decline", in the case of binary labels). Example 2 reports a description of Kassite Babylon in the 1750s BC and is a case where all annotations agree on phase 2 (or 1, "rise"). Example 3 describes Mamluk Egypt in the 1340s, and it is a case of disagreement between annotations.

We ordered the data alphabetically using the text column, thus obtaining a pseudo-randomization of the instances and breaking the temporal sequences. We dubbed this dataset "Chronos benchmark"; it is freely available on Huggingface⁴.

³ EVENT_FRAMES are shown in uppercase, frame_elements in small caps.
⁴ https://huggingface.co/datasets/facells/chronos-historical-sdt-benchmark

3. Analysis and Discussion

In order to answer RQ1 (whether Event Tagging is useful to recognize different phases), we performed an analysis of events per label. To do so, we extracted wordclouds including only the examples where all annotators agreed on the same label. Figure 3 reports the wordclouds for the binary classification task. As introduced in Section 2, Event Frames are shown in uppercase, while Frame Elements are shown in small caps, along with their Frame, in the format frame_element*EVENT_FRAME. The larger and bolder a word, the more strongly it is associated with that particular phase. The wordclouds show that there are overlapping Event Frames between the two phases (e.g., CONQUERING, WAR, CHANGE_OF_LEADERSHIP, BEAT_OPPONENT), while the same Frame Elements seem to have different frequencies in the two phases.

Figure 3: Wordclouds of Event tags in the binary classification task. The wordclouds include only the examples where all annotations agreed on the same label. Event frames are represented in uppercase, frame elements in lowercase.

Things are much more complicated in the multi-class classification task, depicted in Figure 4. In summary, the wordclouds show a progression where there are many overlaps of Event Frames between phases, in particular the BEAT_OPPONENT and CONQUERING events. However, Frame Elements help distinguish between phases: theme*CONQUERING clearly appears in the growth and crisis phases, while other low-frequency elements, such as process*PROCESS_START and goal*ATTEMPT, are distinctive of phases 3 and 4 respectively. In general, wordclouds with smaller words, like the ones for phases 2, 3 and 4, highlight the need to capture weak signals for the classification tasks.

Overall, the similarity of the tags between phases illustrates well how difficult the Historical Phase Recognition task is.
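The per-phase counts behind such wordclouds can be approximated as below. This is a sketch, assuming that word size reflects raw tag frequency (the exact weighting used for Figures 3 and 4 is not stated) and that each row carries the human, GPT4 and Llama 3.1 labels; the function name is hypothetical, and the three rows are the examples from Section 2.2.

```python
from collections import Counter, defaultdict


def tag_counts_by_phase(rows):
    """Count event tags per SDT phase, keeping only unanimously labeled rows.

    rows: iterable of (tag_string, labels), where tag_string is the space-separated
    EventNet-ITA output (frame_element*EVENT_FRAME tokens and bare EVENT_FRAME tokens)
    and labels holds one SDT label per annotator.
    """
    counts = defaultdict(Counter)
    for tag_string, labels in rows:
        if len(set(labels)) != 1:
            continue  # disagreement: excluded, as for the wordclouds
        counts[labels[0]].update(tag_string.split())
    return counts


rows = [  # the three example rows from Section 2.2 (human, GPT4, Llama 3.1 labels)
    ("process*PROCESS_START activists*POLITICAL_ACTIONS invader*INVADING "
     "PROCESS_START POLITICAL_ACTIONS INVADING", (4, 4, 4)),
    ("land*OCCUPANCY occupier*OCCUPANCY OCCUPANCY", (2, 2, 2)),
    ("old*TAKE_PLACE_OF killer*KILLING cause*DEATH place*DEATH time*DEATH "
     "TAKE_PLACE_OF KILLING DEATH", (4, 1, 3)),
]
by_phase = tag_counts_by_phase(rows)
# Frame elements carry an asterisk; bare tokens are event frames.
frame_elements = {p: [t for t in c if "*" in t] for p, c in by_phase.items()}
print(sorted(by_phase))   # [2, 4]
print(frame_elements[2])  # ['land*OCCUPANCY', 'occupier*OCCUPANCY']
```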

3.1. Experiments

In order to answer the research questions listed in Section 1, we performed two distinct tasks: a multi-class classification and a binary classification. Both tasks have comparable settings, with 768 features extracted with a token frequency matrix from the EventNet-ITA tags (events) and 768 features extracted with BERT-ItalianXXL (bert). To ensure replicability, we used Learnipy [26], a suite of algorithms for data science and machine learning in Colab Notebooks available online⁵.
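A token frequency matrix over event tags can be built in a few lines. The following is a minimal sketch, not the Learnipy implementation: it assumes that each whitespace-separated tag is one token, that the vocabulary is fitted on the training set only, and that the 768 columns are the most frequent tags; the helper names are hypothetical. BERT features are not reproduced here.

```python
from collections import Counter


def build_vocabulary(train_tag_strings: list[str], max_features: int = 768) -> dict[str, int]:
    """Map the max_features most frequent tags in the training set to column indices."""
    freq = Counter(tok for s in train_tag_strings for tok in s.split())
    # Highest frequency first; ties broken alphabetically so the columns are deterministic.
    ranked = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return {tok: i for i, (tok, _) in enumerate(ranked[:max_features])}


def to_count_matrix(tag_strings: list[str], vocab: dict[str, int]) -> list[list[int]]:
    """One row per decade description, one column per vocabulary tag (raw counts)."""
    matrix = []
    for s in tag_strings:
        row = [0] * len(vocab)
        for tok in s.split():
            col = vocab.get(tok)
            if col is not None:  # tags unseen in training are ignored
                row[col] += 1
        matrix.append(row)
    return matrix


train = ["WAR WAR theme*CONQUERING", "WAR DEATH"]
vocab = build_vocabulary(train, max_features=2)
print(vocab)                                      # {'WAR': 0, 'DEATH': 1}
print(to_count_matrix(["DEATH KILLING"], vocab))  # [[0, 1]]
```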

Table 2 reports the balanced accuracy of different classification models: Naive Bayes (nb), Gradient Boosting (xgb), and Linear Discriminant Analysis (lda), using the two feature extraction methods (events, bert) to predict the 5 SDT phases. The models were trained and evaluated on different sets of labels: human-annotated (human), an average of LLM annotations (llms), and an average of all annotations (all). The baseline for this task is 0.2.
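Balanced accuracy is the mean of per-class recall, so a classifier that always predicts the same class scores 1/k on k classes regardless of class imbalance; this is why 0.2 is the baseline for five phases (and 0.5 for the binary task). A minimal sketch, comparable to scikit-learn's `balanced_accuracy_score` without chance adjustment:

```python
from collections import defaultdict


def balanced_accuracy(y_true: list[int], y_pred: list[int]) -> float:
    """Mean per-class recall over the classes that occur in y_true."""
    hits: dict[int, int] = defaultdict(int)
    totals: dict[int, int] = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


y_true = [0, 1, 2, 3, 4] * 2
print(balanced_accuracy(y_true, [2] * 10))  # 0.2, a constant guess on five classes
skewed = [2, 2, 2, 2, 1]
print(balanced_accuracy(skewed, [2] * 5))   # 0.5, although plain accuracy is 0.8
```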

Interestingly, the combination of human labels, event tags and an algorithm that captures weak signals (Gradient Boosting) yields good performance, suggesting that, for the 5-class classification, the event-based features align well with the human understanding of the SDT phases. However, the most robust results are achieved using event tags on the average of all labels, possibly because of the near-normal distribution that results from averaging the labels. In contrast, BERT struggles with human labels: the results show an average balanced accuracy lower than the baseline.

This might indicate that the contextual embeddings from BERT, while powerful, do not directly capture the nuances of the SDT phases as effectively as the event-based features when aligned with human annotations. However, the best performance when using labels averaged from LLMs is achieved with BERT features and Linear Discriminant Analysis. This hints that the patterns captured by BERT might be more consistent with the way LLMs interpret and label the SDT phases, although less transparent.

An interesting point is that event tags show consistent performance across different label sets (human, all, llms). The event tagger features consistently provide competitive results, often outperforming or closely matching BERT, with the advantage of being transparent. This highlights the value of explicit event information for the Historical Phase Recognition task. Overall, performance still needs improvement. While some results surpass the baseline of 0.2, the balanced accuracy scores indicate that accurately classifying the 5 SDT phases remains a challenging task.

Table 3 presents the results of the binary classification task, where the 5 SDT phases were aggregated into "rise" (phases 1 and 2) and "decline" (phases 0, 3, and 4). The same feature extraction methods and classification algorithms were used on human-derived binary labels (human) and LLM-averaged binary labels (llms).

⁵ https://colab.research.google.com/drive/1G1VNHUCoDTso6wIWmrdvM21Z6D1PC6nL?usp=sharing

4. Conclusion

In conclusion, this study has taken initial steps in leveraging computational linguistics for the complex task of Historical Phase Recognition within the Structural-Demographic Theory framework. Our investigation into the utility of Event Tagging revealed its promise, particularly when aligned with human-annotated data, achieving the most robust performance in the 5-class classification task. This suggests that explicitly identified event structures resonate with human understanding of SDT's nuanced phases. Conversely, while powerful, BERT embeddings struggled to capture these nuances as effectively on human labels, hinting at a potential mismatch between their learned representations and the human interpretation of SDT.

Interestingly, BERT showed better performance with LLM-generated labels, indicating a possible alignment in their interpretation patterns, albeit with a loss of transparency compared to event tags. Answering RQ1 (whether Event Tagging is useful): our results show that event tags help Historical Phase Recognition when coupled with human annotations. Instead, with LLM-generated labels, transformer models seem the best choice. In general, our results show similar improvements over the baseline with the multi-class and binary classification tasks. Hence, answering RQ3 (which kind of label

Declaration on Generative AI

F.C.: conceptualization, experiments and main manuscript text; M.R.: data enrichment with Event Tagging, manuscript editing. All authors edited and reviewed the manuscript.

Acknowledgments

This research was supported by the European Commission, grant 101120657: European Lighthouse to Manifest Trustworthy and Green AI-ENFIELD.

References

[1] C. E. Maldonado, History as an increasingly complex system, History and Cultural Identity: Retrieving the Past, Shaping the Future (2011) 129–152.
[2] Lund, Basso Fossali, Mazur, M. Ollagnier-Beldame, Language is a complex adaptive system: Explorations and evidence, Language Science Press, 2022.
[3] Toynbee's, A study of history, Munich: List. Henry, William P., Greek Historical Writing: A Historiographical Essay (1991).
[4] Luhmann, Baecker, Gilgen, Introduction to systems theory, Polity Cambridge, 2013.
[5] Dalio, Principles for dealing with the changing world order: Why nations succeed or fail, Simon and Schuster, 2021.
[6] I. Wallerstein, Historical systems as complex systems, European Journal of Operational Research 30 (1987) 203–207.
[7] Lai, J. R. Porter, Amodeo, D. Miller, Marston, Armal, A natural language processing approach to understanding context in the ex[…] & Management 59 (2022) 102735.
[8] Fisichella, Ceroni, Event detection in Wikipedia edit history improved by documents web […], […]tive Computing 5 (2021) 34.
[9] M. Rovera, A knowledge-based framework for events representation and reuse from historical archives, in: European Semantic Web Conference, Springer, 2016, pp. 845–852.
[10] Sprugnoli, Tonelli, One, no one and one hun[…] events in an inter-disciplinary perspective, Natural Language Engineering 23 (2017) 485–506.
[11] J. A. Goldstone, Demographic structural theory: 25 years on, Cliodynamics 8 (2017).
[12] Turchin, Political instability may be a contributor in the coming decade, Nature 463 (2010) 608–608.
[13] Turchin, Korotayev, The 2010 structural-demographic forecast for the 2010–2020 decade: A retrospective assessment, PLoS ONE 15 (2020).
[14] Turchin, Structural-Demographic Analysis of American History, Beresta Books Chaplin, 2016.
[15] Orlandi, Hoyer, Zhao, J. S. Bennett, M. Benam, K. Kohn, P. Turchin, Structural-demographic analysis of the Qing dynasty (1644–1912) collapse in China, PLoS ONE 18 (2023) e0289748.
[16] Korotayev, Zinkina, Egypt's 2011 revolution: A […] revolutions in the 21st century: The new waves of revolutions, and the causes and effects of disruptive political change, Springer, 2022, pp. 651–683.
[17] Turchin, End times: elites, counter-elites, and the path of political disintegration, Penguin, 2023.
[18] Celli, Basile, History repeats: Historical phase recognition from short texts, Proceedings of CLiC-it 2024 (2024).
[19] Celli, Basile, Large language models rival human performance in historical labeling, in: Proceedings of ARDUOUS 2025, co-located with ECAI, 2025.
[20] Rovera, EventNet-ITA: Italian frame parsing for events, in: Proceedings of the 8th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2024), 2024, pp. 77–90.
[21] J. A. Baktash, Dawodi, GPT-4: A review on advancements and opportunities in natural language processing, arXiv preprint arXiv:2305.03195 (2023).
[22] Deroy, Maity, Code generation and algorithmic problem solving using Llama 3.1 405B, arXiv preprint arXiv:2409.19027 (2024).
[23] Frenda, Abercrombie, Basile, A. Pedrani, […] processing: a survey, Language Resources and Evaluation (2024) 1–28.
[24] Turchin, Whitehouse, François, Hoyer, J. Bennet, et al., An introduction to Seshat: Global history databank, Journal of Cognitive Historiography 5 (2020) 115–123.
[25] Krippendorff, Computing Krippendorff's alpha-reliability (2011).
[26] Celli, Casadei, Learnipy: a Repository […] Coding, Technical Report, 2022. URL: https://github.com/facells/fabio-celli-publications/blob/main/docs/2022_learnipy_techreport.pdf.