How to Compress Categorical Variables to Visualize
                         Historical Dynamics
                         Fabio Celli1,*
                         1
                             Research & Development, Maggioli SpA, via Bornaccino 101, 47822 Santarcangelo di Romagna


                                        Abstract
                                        This paper explores innovative Knowledge Discovery and Representation techniques for historical data within
                                        the field of Digital Humanities. In particular this research introduces Time-Resolved Variables, an information
                                        compression technique, to represent the evolution of categorical variables through time. This technique proves
                                        to be more effective than One-Hot Encoding with Principal Component Analysis in explaining the increase of
                                        social complexity from historical data. Moreover, this work highlights the potential of Time-Resolved Variables
                                        to enhance model explanation by means of correlation analysis and graph visualization. The result of this work,
                                        the Chronos dataset, is available online in a shared Spreadsheet for collaborative research and paves the way
                                        towards more transparent and trustworthy use of AI in history and cliodynamics.

                                        Keywords
                                        Knowledge Discovery and Representation, Data Compression, Cliodynamics, Model Explanation


                         1. Introduction and Related Work
                         Knowledge Discovery and Representation (KDR) is crucial for the scientific method in Digital Humanities
                         [1] as it bridges the gap between unstructured data and theory, enabling data-driven hypothesis testing
                         [2]. However, missing records [3], which are particularly prevalent in historical data, introduce
                         data integrity issues [4], because information becomes increasingly scarce as we look
                         deeper into the past. Additionally, crowdsourcing historical data annotation is challenging [5] due
                         to the subjective nature of historical interpretation. Nevertheless, technological advancements have led
                         to the development of KDR techniques that address such challenges. Specifically, there are
                         three generations of Knowledge Discovery systems [6]. The first generation focused on collecting
                         and querying data through large databases. The primary challenges in this phase were related to
                         Knowledge Organization and Information Retrieval [7]. The second generation introduced powerful
                         tools for extracting and visualizing patterns within data, enabling the reconstruction of events from
                         unstructured sources and their presentation in the form of timelines and maps. For instance, it is
                         possible to generate maps based on data about medieval trade routes [8] [9] or timelines illustrating
                         the evolution of linguistic events derived from textual data [10]. The third generation, which has
                         recently emerged, leverages Artificial Intelligence (AI) and Large Language Models (LLMs) to tackle
                         data integrity issues. One application of this kind is the restoration of ancient inscriptions using AI
                         [11]. However, a significant challenge remains: ensuring that AI models are transparent to humans.
                         The ultimate goal is to create systems that are trustworthy, human-readable, computationally efficient,
                         and capable of self-maintenance within a human-data-AI continuum [12].
                            Crucially, the field of cliodynamics faces the same challenges and, among other research goals,
                         has addressed the problem of crowdsourcing by producing Seshat, a valuable expert-compiled historical dataset
                         that is suitable for computational analysis [13]. The basic concept of Seshat is to provide quantitative
                         and semi-structured data about past societies, defined as political units (polities). It contains data from 35


                         IRCDL 2025: 21st Conference on Information and Research Science Connecting to Digital and Library Science, February 20-21 2025,
                         Udine, Italy
                         *
                           Corresponding author.
                           fabio.celli@maggioli.it (F. Celli)
                          https://github.com/facells/fabio-celli-publications (F. Celli)
                          0000-0002-7309-5886 (F. Celli)
                                       © 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


CEUR Workshop Proceedings, ceur-ws.org, ISSN 1613-0073
sampling points equally distributed across the globe in a time window from roughly 10000 BC to 1900 CE
and sampled with a time-step of 100 years. Seshat is designed for hypothesis testing. For example it has
been used for comparing competing hypotheses about the evolution of social complexity, like theories
that see social complexity as a product of organizational challenges due to environmental changes [14]
or a product of strong normative beliefs in moralizing Gods [15]. A data-driven analysis on Seshat
with dynamic regression models revealed a strong causal role played by a combination of increasing
agricultural productivity and adoption of new military technologies [16]. The Seshat databank provides
many dimensions that report the presence or absence of a cultural trait in a polity at a specific point
in time, for example the presence/absence of smelting copper skills, fortifications, firearms, written
literature, coins and many others. In Data Mining these fine-grained cultural dimensions can be treated
computationally with One-Hot Encoding (OHE), a widely used approach that converts the presence
or absence of categorical features into numerical values, represented as 1 or 0. However, OHE results
in very sparse feature values [17], which makes it difficult to extract patterns and generalize models
from data for Knowledge Discovery. Moreover, OHE does not solve the problem of missing data. A
common approach is to employ a compression technique like Principal Component Analysis (PCA) [18]
to combine the one-hot encoded features into more general variables. However, this process converts
the categorical variables into aggregated dimensions that are very difficult to interpret. Among compression
techniques for time series there are some based on dictionaries, some based on function approximation,
some on sequential algorithms and more recent techniques based on autoencoders, but there is no
compression technique for time series designed to be human-readable [19].
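
As a minimal sketch of the OHE-PCA pipeline described above, the following Python code one-hot encodes presence/absence traits and projects them onto a first principal component obtained by power iteration. The polity records, the trait names and all helper functions are illustrative assumptions, not Seshat data or the code used in this work; in practice a library implementation of PCA would normally be preferred.

```python
import math

# Hypothetical presence/absence records for four polities (not Seshat data).
polities = [
    {"copper": 1, "bronze": 0, "iron": 0},
    {"copper": 1, "bronze": 1, "iron": 0},
    {"copper": 1, "bronze": 1, "iron": 1},
    {"copper": 0, "bronze": 0, "iron": 0},
]
traits = ["copper", "bronze", "iron"]


def one_hot_matrix(records, columns):
    """Turn presence/absence records into rows of 1.0 / 0.0 values."""
    return [[float(record[c]) for c in columns] for record in records]


def center(matrix):
    """Subtract the column mean from every value."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[v - m for v, m in zip(row, means)] for row in matrix]


def covariance(centered):
    """Sample covariance matrix of an already centered data matrix."""
    n, d = len(centered), len(centered[0])
    return [
        [sum(row[i] * row[j] for row in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def first_component(cov, iterations=200):
    """Approximate the leading eigenvector of cov by power iteration."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


X = center(one_hot_matrix(polities, traits))
pc1 = first_component(covariance(X))
# One aggregated score per polity: a weighted mixture of all traits.
scores = [sum(x * w for x, w in zip(row, pc1)) for row in X]
```

Each score is a weighted mixture of all the encoded traits, which illustrates why the aggregated dimension cannot be read back directly as a specific category.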


2. Scope of this work
This paper addresses the challenges of data integrity and knowledge representation in historical research
by introducing a novel, human-readable information compression technique for time series: Time-
Resolved Variables (TRVs). In essence, TRVs represent sequences of categories, ordered chronologically
by their first known historical appearance, as numerical values corresponding to their position in
the scale. For instance, a time-resolved variable for military technologies might assign the values 0.1
to stone/wood, 0.2 to copper, 0.3 to bronze, 0.4 to iron and so on. Interpolation can be incorporated
using decimal numbers between these primary scale steps, signifying transitional periods between
one level and the next. The values assigned to the steps in TRVs are numerical indexes that represent
interpretable labels in a sequence. TRVs can encode two semantic dimensions: steps and duration
between steps. Steps (e.g. 0.1 is stone, 0.2 is copper, 0.3 is bronze, etc.) return an equidistributed sequence,
which is suited to visualization with timeline charts and to the comparison of different scales (e.g. the cultural
evolution of different societies). Encoding both the sequence and the duration between steps (e.g. -10
is stone/wood, -0.5 is copper, -0.3 is bronze, etc.) instead returns the inherent structure of historical events and,
depending on the presence of patterns, could lead to a low-discrepancy sequence or to a purely random
sequence, suited to scatterplot visualizations. This paper aims to discover the characteristics of
TRVs with equidistributed sequences in regression, correlation and visualization tasks, by comparing
them against OHE-PCA. Experiments can provide evidence to understand the difference between the
information extracted with the two techniques. It is possible to compare TRVs also against other
compression techniques, but PCA has been selected because principal components are uncorrelated
and reduce redundant information, which is optimal for the selected tasks, in particular for correlation
analysis and data visualization. Further experiments with TRVs are left to future work.
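
The encoding of steps and transitional values described above can be sketched in Python as follows. The scale reuses the military technology example from the text; the function names, the linear interpolation between known points and the rounding to three decimals are assumptions made for illustration only.

```python
# Steps taken from the military-technology example in the text.
MILITECH_STEPS = {"stone/wood": 0.1, "copper": 0.2, "bronze": 0.3, "iron": 0.4}


def encode_trv(label, scale):
    """Return the numerical index of a category on its scale."""
    return scale[label]


def fill_gaps(series):
    """Linearly interpolate None entries that lie between two known values.

    Assumption: transitional periods are filled linearly; leading and
    trailing gaps are left untouched.
    """
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for a, b in zip(known, known[1:]):
        for i in range(a + 1, b):
            fraction = (i - a) / (b - a)
            filled[i] = round(series[a] + (series[b] - series[a]) * fraction, 3)
    return filled


def decode_trv(value, scale):
    """Return the label of the highest step already reached, or None."""
    reached = [label for label, step in scale.items() if step <= value + 1e-9]
    return max(reached, key=scale.get) if reached else None


# A polity observed at copper, then unobserved for two time steps, then at bronze.
timeline = fill_gaps([encode_trv("copper", MILITECH_STEPS), None, None,
                      encode_trv("bronze", MILITECH_STEPS)])
```

Because every value maps back to a label with `decode_trv`, an interpolated value such as 0.25 remains readable as a transition between copper and bronze.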
   TRVs can be created in two ways: manually, by using information from digital libraries, or auto-
matically, by prompting LLMs to extract knowledge [20]. Both methods have drawbacks. Manual
annotation is time-consuming, while LLM-generated TRVs can contain inaccuracies or "hallucinations".
To effectively use TRVs in experiments and gain an advantage over OHE-PCA, it is possible to adopt
a three-step approach: First, manually annotate a small dataset to create a reliable ground truth. Sec-
ond, use LLMs to generate a larger dataset of TRV annotations. Finally, evaluate the accuracy of the
LLM-generated data by comparing it to the ground truth using correlations or other statistical tests. In
general, TRVs offer a transparent approach to compressing historical data into a numerical format that
is machine-readable and human-readable at the same time, while also enabling interpolation of missing
data points. This is a step towards a trustworthy and transparent human-data-AI continuum.
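
The third step of this approach, comparing LLM-generated TRVs with a manually annotated ground truth, can be sketched with a rank correlation. The two annotation lists below are hypothetical, and the choice of Spearman's coefficient with average ranks for ties is an assumption; the text only states that correlations or other statistical tests can be used.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical TRV annotations for the same six polities.
manual_trv = [0.1, 0.2, 0.3, 0.3, 0.4, 0.5]
llm_trv = [0.1, 0.2, 0.3, 0.4, 0.4, 0.5]
agreement = spearman(manual_trv, llm_trv)
```

A coefficient close to 1 would indicate that the LLM-generated scale preserves the ordering of the manual annotation; the acceptance threshold is left to the analyst.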
   This paper focuses on the following research questions (RQs):

    • RQ1: Are TRVs more informative than OHE-PCA features to compress historical information?
    • RQ2: Are TRVs useful for visualizing and interpreting the results of Knowledge Discovery?

To answer RQ1, this paper presents a comparison of the same independent variables, encoded with TRVs
and OHE-PCA, to predict a set of dependent variables, and evaluates the results with the coefficient of
determination 𝑅². To answer RQ2, this paper presents a comparison of correlation graphs extracted
from the same features, but treated as OHE-PCA and TRVs, providing an explanation of the results.
Prompt engineering for the generation of TRVs with LLMs is outside of the scope of this paper, as it is
covered in Celli and Mingazov 2024 [20]. However, it is important to report here that Gemini 1.5 flash
is capable of grounding its output with real references, thanks to the Google search engine [21]. This
has a great potential for digital libraries, as it is possible to automatically link TRVs and existing text
sources by means of LLMs.
   The paper is structured as follows: Section 3 describes the TRVs annotated on data and
the experiments that answer RQ1 and RQ2. Finally, Section 4 discusses the results, draws
conclusions and outlines directions for future work.


3. Method and Experiments
A schema of the method adopted in this paper is depicted in Figure 1. First there is a data preparation
process where Seshat is integrated and compressed into a structured dataset, processed with OHE
and PCA, dubbed “Seshat-pca” (Section 3.1). Then there is the TRVs annotation phase, where the
original categorical variables of Seshat are grouped and transformed into scales according to first time
of appearance found in scholarly literature. This operation creates a new dataset, named “Chronos”
(Section 3.2), that is compared against Seshat-pca to test the predictive power of TRVs and OHE-PCA
compression methods and answer RQ1 (Section 3.3). Then there is a second comparison, with the
extraction of a correlation matrix and visualization of correlation graphs from the two datasets (Section
3.4). Finally there is the analysis and discussion of the graph model to answer RQ2 (Section 4).


Figure 1: Schema of the method adopted.
  The experiments are designed to test the usefulness of TRVs in the historical domain, but in principle
this technique can be applied to any categorical data with a sequence, not necessarily temporal, and
can preserve transparency while compressing information. Potential applications outside the historical
domain include the analysis of music genres, where emotions can be expressed as scales [22], and
hypothesis testing for theories that provide predictions which can be turned into sequences.

3.1. Compression of categorical variables in Seshat
The Seshat project released many replication datasets containing data about polities sampled from 35
Natural Geographic Areas (NGAs). The present work employed the categorical dimensions in the Social
complexity dataset and the Axial age dataset [23]. The two datasets were structured and integrated,
and the original variables were processed with OHE and PCA. The variables are grouped following the
original macro-categories of Seshat:
   military technology (militech): the PCA of presence/absence of copper, bronze, iron, steel, gunpowder
siege artillery in a polity;
   warfare tools and tactics (warfare): the PCA of presence/absence for each polity of small vessels,
plate armor, laminar armor, settlement in defensive position, earth ramparts, moats, scaled armor, stone
walls, horses, camels, spears, wood barks, leather cloth, shields, helmets, breastplates, limb protection,
donkeys, composite bows, battle axes, daggers, swords, modern fortifications, bows, javelins, slings,
crossbows, handheld firearms, complex fortifications, fortified camps, chain mail, tension siege engines,
war clubs, elephants, ditch, pole arms, specialized military vessels, stone walls, atlatl;
   agriculture technology (agritech): the PCA of presence/absence of cropping, field rotation, irrigation
and fertilizers for each polity;
   morality in religion (morality): the PCA of presence/absence of moral features in the polity:
moral enforcement in this life, moralizing enforcement is agentic, moral religion is adopted by elites,
moralizing is certain, broad moralizing norms, moral concern is primary, rulers are gods;
   economy management (economy): the PCA of presence/absence of tokens, articles, precious metals,
indigenous coins, foreign coins, paper currency;
   information management (infomedia): the PCA of presence/absence of lists and tables, calendars,
sacred text, religious literature, scientific literature, fiction, philosophy, practical literature, history;
   writing system (alphabeth): the PCA of presence/absence of phonetic writing, non-phonetic writing,
script, mnemonic devices, non-written records;
   communication systems: the PCA of presence/absence of couriers, postal stations, general postal
service in the polity;
   infrastructure levels (infrastructure): the PCA of presence/absence of roads, mines and quarries,
ports, canals, drinking water supply, irrigation and production systems, bridges, markets, food storage
sites;
   political system (politics): the PCA of presence/absence of constraints on executive by non-
government (population representatives), constraints on executive by government (aristocracy), legal
impeachment for each polity.
   Finally, the three dependent variables of social complexity (also called target variables)
were processed in the same way as defined by Turchin [16]:

    • Social scale (SCALE) is the PCA of the log-transformed polity population, polity territory and
      the population of the largest settlement.
    • Hierarchical complexity (HIER) is the PCA of the raw count of levels in administrative, military
      and settlement hierarchies.
    • Specialization of governance (GOV) is the PCA of the following 11 One-Hot encoded variables:
      presence or absence of professional officers, soldiers, priests, lawyers, full-time bureaucrats,
      specialized buildings for government, examination system, merit promotion, formal legal code,
      full-time judges, courts.
   It is very important to point out that Social scale and Hierarchical complexity are originally
homogeneous sets of numerical variables, while Specialization of governance is a heterogeneous set
of One-Hot encoded categorical variables. All the features in Seshat were normalized in order to align
them with the target variables.
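
The preprocessing of numerical variables mentioned above can be sketched as follows. The text states only that the variables are log-transformed and normalized; the base-10 logarithm, the min-max scaling to the [0, 1] range and the population figures are assumptions chosen for illustration.

```python
import math


def log_transform(values):
    """Base-10 logarithm of strictly positive values (assumed base)."""
    return [math.log10(v) for v in values]


def min_max(values):
    """Rescale values linearly to the [0, 1] range (assumed normalization)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical polity populations spanning several orders of magnitude.
populations = [1_000, 10_000, 1_000_000]
scaled = min_max(log_transform(populations))
```

After the log transform the three populations are spaced by orders of magnitude rather than by raw counts, so the largest polity no longer dominates the scale.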

3.2. Annotation of Time-Resolved Variables in Chronos
To ensure that the categories represented by TRVs are compatible with the ones in Seshat, original
categories were arranged in a chronological order according to their earliest historical appearance.
Wikipedia and scientific literature have been taken as reference. A random subset of 186 polities from
Seshat with information available for social complexity variables has been selected for the annotation.
The TRV scales were defined as follows:
   Technology for military purposes (militech) encodes the level of military technology of a polity
through time and is related to the military use of metals and technologies. It encodes the following steps:
0.1 is stone/wood/clay, 0.2 is copper, 0.3 is bronze, 0.4 is iron, 0.5 is steel, 0.6 is gunpowder and 0.7 is uranium/nuclear.
   Warfare tactics (warfare) encodes the level of military strategy of a polity. It encodes the following
steps: 0.1 hunters with projectiles, 0.2 armies and fortifications, 0.3 armies with animals and chariots, 0.4
armies with naval and siege forces, 0.5 armies with industrial forces including machine guns,
logistics and tanks, and 0.6 cyberwarfare.
   Type of agriculture (agritech) encodes the stage of agriculture development through history. At
level 0 there is spontaneous cropping, at 0.1 there is swidden/slash-and-burn agriculture, at 0.2 is fallow
agriculture with irrigation, at 0.3 there are two fields/crop rotation, at 0.4 there are nitrogen-fix/fertilizers
and at 0.5 there are GMOs (Genetically Modified Organisms).
   Religious system (religion) places the types of religion along a scale: not single religions, but religion
clusters. In Seshat this dimension is restricted to features about moralizing Gods but, since Savage [24]
pointed out that complex societies precede moralizing Gods, this dimension has been changed, in order
to test whether it captures social complexity better than morality. At the first level (0.1) there is the
cult of the dead and spirits, 0.2 is the cult of ancestors/family/totem, 0.3 are the fertility cults, 0.4 are the
polytheisms, 0.5 are monotheisms and 0.6 are philosophies, such as Buddhism, Confucianism and Taoism.
At 0.7 there is humanism, which includes atheism, agnosticism and ideologies (like communism and
capitalism).
   Economic level (economy) encodes the economic advancements in history. At level 0 there are
subsistence and exogamy, at 0.1 tokens and barter, at 0.2 precious metals and weights, at 0.3 there are
coins, at 0.4 paper currency and at 0.5 there is the stock market.
   Information management (infomedia) encodes the level of informative content that a culture
needs to mediate and diffuse to keep its stability. At level 0 there are mnemonic devices and oral tradition,
at 0.1 there is symbolism, at 0.2 there are calendars and lists, at 0.3 religious, philosophical and scientific
texts; at 0.4 there is fiction literature and at 0.5 news and opinions.
   Writing system (alphabeth) encodes the chronological order of appearance of alphabet types. At
level 0 there is no writing, at 0.1 script and logographic writing, at 0.2 syllabic and non-phonetic writing
and at 0.3 phonetic alphabets.
   Communication systems (communication) encodes the evolution of long-distance communication.
At level 0.1 there are couriers, at level 0.2 networks of courier stations, at 0.3 a centralized postal
service appears, at 0.4 electric narrowcasting like the telegraph, at 0.5 electric broadcasting like radio and TV,
and at level 0.6 there is computer mediated communication.
   Infrastructure Level (infrastructure) encodes the capability of a polity to extract, transport, and
preserve goods. At level 0.1 there are routes and quarries, at 0.2 storage sites and special buildings, at 0.3
there are irrigation and production systems, at 0.4 there are urban markets, at 0.5 port systems and at
level 0.6 logistics and telecommunications. Level 0.7 is about space stations.
   Political system (politics) encodes the development of the limits that are imposed on the rulers. At
the base level there is a sole ruler, and no limits to the ruler’s power. At level 0.1 decisions were taken in
a collective assembly, at level 0.2 there are representatives of the population, that are needed in large
societies to put a constraint on the decisions taken by aristocratic assemblies and are the foundation of
democracies. At 0.3 there is legal impeachment, that allows an authority to legally remove the powers
from a ruler without physically killing him.
   In the context of polities, TRVs represent generalized increasing stages of social complexity, and
by design tend to reduce variety into broad classes. For example Christianity, Islam and Judaism are
very different religions, but all fall under the umbrella of monotheism. The same for the beginning of
symbolism: symbols produced in 9000 BC in Göbekli Tepe and symbols produced in Jiahu around 6000
BC are totally different things, but functionally they both represent the transition to a new level of
complexity in information management. These TRVs are all considered independent variables in the
experiments.

3.3. Experiments
The first experiment aims to predict the same dependent variables (SCALE, HIER, GOV) using the
independent variables encoded with TRVs and OHE-PCA in the two different datasets. The polities
were aligned, obtaining the same 186 polities in Seshat and Chronos.
   The experiment design is an 80% training and 20% test split with 𝑅² as the evaluation metric. Basically, the
higher the score, the more variance can be explained. The regression is performed with linear modeling
(linear regression) and non-linear modeling (random forest regression).
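
The evaluation protocol can be illustrated with a minimal Python sketch: an 80/20 split, a least-squares fit and the coefficient of determination computed on the held-out data. The data are synthetic, the random seed is arbitrary, and a single-feature fit stands in for the multi-feature linear regression and random forest models used in the experiments, so this is an assumption-laden illustration rather than the experimental code.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows with a fixed seed and split them into train and test."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * (1 - test_fraction)))
    return shuffled[:cut], shuffled[cut:]


def fit_line(pairs):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    slope = sxy / sxx
    return slope, my - slope * mx


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Synthetic (feature, target) pairs: a linear trend with a small alternating offset.
data = [(i / 20, 2 * (i / 20) + (0.02 if i % 2 == 0 else -0.02)) for i in range(20)]
train, test = train_test_split(data)
slope, intercept = fit_line(train)
predictions = [slope * x + intercept for x, _ in test]
score = r_squared([y for _, y in test], predictions)
```

Computing 𝑅² on the held-out 20% rather than on the training data is what makes the scores of the two encodings comparable as estimates of generalization.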

Table 1
Regression of target variables with all features in Chronos-trv and Seshat-pca. The best results are marked in
bold.
                               dataset        algorithm            target    𝑅²
                               chronos-trv    linear regression    GOV       0.747
                               chronos-trv    linear regression    SCALE     0.744
                               chronos-trv    linear regression    HIER      0.829
                               chronos-trv    random forest        GOV       0.721
                               chronos-trv    random forest        SCALE     0.804
                               chronos-trv    random forest        HIER      0.860
                               seshat-pca     linear regression    GOV       0.789
                               seshat-pca     linear regression    SCALE     0.719
                               seshat-pca     linear regression    HIER      0.796
                               seshat-pca     random forest        GOV       0.853
                               seshat-pca     random forest        SCALE     0.697
                               seshat-pca     random forest        HIER      0.798

   The results, reported in Table 1, show that TRVs explain more variability than OHE-PCA only in the
case of SCALE and HIER dependent variables, both with linear and non-linear modeling. In particular,
the difference is greater with non-linear modeling. It is interesting to note that OHE-PCA yields better
models of the GOV dependent variable. This suggests that TRVs are better predictors of data expressed
as numerical scales, while OHE-PCA are better in the case of prediction of data that is originally one-hot
encoded.
   These findings raise the question of how the independent variables are distributed with respect to the
dependent variables. Figure 2 reports the distribution of all variables as histograms, and reveals that
the GOV dependent variable has a multimodal distribution, while SCALE and HIER have log-normal
distribution. This confirms that TRVs are best suited for predicting homogeneous and continuous
variables.
   These results also raise the question of how much each independent variable contributes to the
prediction. Ablation studies were done to answer this question. Given that the distribution of the
variables is not normal, Spearman correlations were used for the analysis.
   Results, reported in Table 2, show at least four interesting phenomena:
Figure 2: Histograms representing the distribution of all variables.


Table 2
Spearman correlations between the features in the two datasets and the target variables. *=p-value <0.005;
**=p-value <0.001. The best correlations are marked in bold.
               feature                        dimensionality      SCALE      GOV        HIER
               chronos-trv-warfare            5                   0.857**    0.765**    0.811**
               chronos-trv-infomedia          5                   0.763**    0.793**    0.815**
               chronos-trv-infrastructure     6                   0.806**    0.766**    0.822**
               chronos-trv-communication      4                   0.774**    0.732**    0.831**
               chronos-trv-economy            5                   0.826**    0.746**    0.801**
               chronos-trv-militech           6                   0.748**    0.679**    0.739**
               chronos-trv-religion           6                   0.764**    0.766**    0.760**
               chronos-trv-alphabeth          3                   0.596**    0.499**    0.653**
               chronos-trv-agritech           4                   0.312**    0.298**    0.318**
               chronos-trv-politics           4                   -0.013**   -0.000**   -0.020**
               seshat-pca-warfare             38                  0.868**    0.778**    0.831**
               seshat-pca-infomedia           9                   0.763**    0.790**    0.826**
               seshat-pca-infrastructure      9                   0.735**    0.796**    0.809**
               seshat-pca-communication       3                   0.755**    0.721**    0.814**
               seshat-pca-economy             6                   0.741**    0.653**    0.719**
               seshat-pca-militech            5                   0.745**    0.654**    0.714**
               seshat-pca-morality            7                   0.672**    0.695**    0.731**
               seshat-pca-alphabeth           5                   0.705**    0.753**    0.754**
               seshat-pca-agritech            3                   0.119*     0.200*     0.175
               seshat-pca-politics            3                   0.204*     0.199      0.274**


    • dimensionality has an impact on the predictive power of aggregated variables, but the improvement
      is not linear, as evidenced by the warfare variable, where
      seshat-pca-warfare has just a slightly higher performance with a much larger dimensionality;
    • the p-values of the TRVs in the Chronos dataset are always below 0.001, unlike in Seshat-pca;
    • the TRVs of the Chronos dataset yield most of the best correlations even though the dimensionality of
      the time-resolved variables is, on average, lower than that of the OHE-PCA variables in Seshat;
    • religion type is more informative than morality.

It is interesting to note that, given roughly the same dimensionality, TRVs tend to have higher corre-
lation strength to the target variables, like in the case of chronos-economy, chronos-militech, chronos-
communication, chronos-religion and chronos-agritech. This means that TRVs are able to summarize the
information better than OHE-PCA on historical data.

3.4. Correlation Analysis and Interpretation
A correlation analysis is performed on the Chronos-TRV and Seshat-PCA datasets to answer RQ2 and
test whether TRVs are useful for the interpretation of historical data. This is a qualitative evaluation
from data visualization.


Figure 3: Filtered Spearman correlation matrices of the independent variables in the two datasets.


  First of all, correlation matrices are extracted from the two datasets. Only strong and significant
correlations between the independent variables are kept, filtering out self-correlations, the dependent
variables (GOV, SCALE, HIER), and the correlations with 𝜌 between -0.5 and 0.5 or with p-value > 0.001.
These matrices, reported in Figure 3, reveal that

    • there are no inverse correlations;
    • correlation strength is similar with the two information compression techniques;
    • different variables were filtered out in the two datasets.

   In particular, in Seshat-PCA the infomedia and alphabeth dimensions are missing, while in Chronos-
TRV there are no politics and agritech variables. It is clear that the two datasets, compressed with
different techniques, are showing different knowledge representations and narratives.
   The interpretation of Seshat-PCA is not straightforward. PCA transforms correlated OHE variables
into a set of linearly uncorrelated principal components, hence strong correlations between components
mean they contribute similarly to the direction of maximum variance, likely having similar loadings
on the first principal component. In other words, high correlation between OHE-PCA transformed
variables means they are pulling together in a similar direction and, in the context of polities, this can
be interpreted as different variables contributing to a specific social goal.
   On the contrary, the interpretation of TRVs in Chronos is more transparent since stages of increasing
complexity are represented in each dimension. Under this perspective, high correlation between two or
more TRV-compressed variables means that they tend to increase complexity together.
   In order to have a better overview of the relations between dimensions, the matrices were visualized
as correlation graphs, reported in Figure 4.
Figure 4: Correlation graphs of Chronos and Seshat-PCA. Only variables with 𝜌 > 0.5 and p-value <0.001 are
displayed. The thickness and darkness of the edges is proportional to the correlation strength.
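
The filtering and graph construction described in this section can be sketched in Python as follows. The correlation values and p-values are hypothetical and do not reproduce Figure 3 or Figure 4; the function names and the linear mapping from correlation strength to edge width are assumptions made for illustration.

```python
# Hypothetical (rho, p-value) results for pairs of variables; not values from the paper.
correlations = {
    ("militech", "economy"): (0.82, 0.0001),
    ("militech", "politics"): (0.10, 0.2),
    ("economy", "warfare"): (0.78, 0.0002),
    ("agritech", "economy"): (0.55, 0.01),
    ("militech", "militech"): (1.0, 0.0),
    ("GOV", "economy"): (0.75, 0.0001),
}
TARGETS = {"GOV", "SCALE", "HIER"}


def keep_edge(a, b, rho, p_value, min_abs_rho=0.5, max_p=0.001):
    """Keep only strong, significant correlations between independent variables."""
    if a == b:
        return False  # self-correlation
    if a in TARGETS or b in TARGETS:
        return False  # dependent variables are excluded
    return abs(rho) > min_abs_rho and p_value < max_p


def build_graph(results):
    """Undirected weighted graph as a dict of dicts: graph[a][b] = rho."""
    graph = {}
    for (a, b), (rho, p_value) in results.items():
        if keep_edge(a, b, rho, p_value):
            graph.setdefault(a, {})[b] = rho
            graph.setdefault(b, {})[a] = rho
    return graph


def edge_width(rho, min_width=1.0, max_width=5.0, min_abs_rho=0.5):
    """Map |rho| in (0.5, 1] linearly onto a drawing width (assumed mapping)."""
    return min_width + (max_width - min_width) * (abs(rho) - min_abs_rho) / (1 - min_abs_rho)


graph = build_graph(correlations)
```

Variables whose correlations are all filtered out, such as the hypothetical politics and agritech entries here, do not appear as nodes at all, which is how dimensions can go missing from a correlation graph.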


   The Seshat-PCA graph shows a general model of dimensions that cooperate towards a goal. For
example this graph presents a strong pattern of interplay between military technology, economy,
morality and agriculture. An explanation for this comes from the Structural Demographic Theory (SDT)
[25], which states that war, distributed wealth, agricultural productivity and strong morality in religion
can contribute to gain and keep societal stability.
   Chronos-TRV shows a different pattern, only partially overlapping with the one in Seshat-PCA.
Chronos-TRV shows strong correlations between the stages of increasing complexity in military
technology, economy, warfare tactics, information management and – with less strength – infrastructure
control and religion. This rich pattern can be interpreted in the light of the “Ratchet Effect” theory of
cultural transmission [26]. This theory posits that societal advancements, once achieved, are rarely
lost. As societies develop, they build upon existing knowledge and technology, leading to a cumulative
process that increases complexity, and the Chronos-TRV graph plots the strength of dimensions involved
in the process. New military technologies often lead to economic growth, through war or deterrence; a
strong economy can boost military research as well as fund warfare [27]. At the same time effective
warfare tactics often rely on superior information management and intelligence. Moreover, religion can
provide a unifying ideology, legitimizing state authority to invest effort in warfare or infrastructure
development, and this process is supported by technologies for information management, like media
[28].


Figure 5: Timeline of Chronos-TRV variables comparing the average advancements in politics, agritech and
militech. Values are averaged over all polities.


  However, not all the variables considered here present a ratchet effect pattern. The politics and
agritech variables are missing in the Chronos-TRV graph, because they have a correlation coefficient
below 0.5. This means that advancements in these dimensions are more subject to reversal than others.
A timeline (Figure 5) comparing the average advancements in politics, agritech and militech from the
Chronos-TRV dataset confirms this.
  In practice, the more a line tends upward, the more the variable is likely to follow a ratchet effect.
Societies that evolve in military technology, economy, warfare, information management or control of
infrastructure often gain a competitive advantage and tend to increase complexity in the other
dimensions as well in order to keep it.
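
The reading of the timeline can be made concrete with a small sketch. It averages TRV levels over polities at each time step, as in Figure 5, and then computes the share of steps in which the average does not decrease. The `ratchet_score` measure is an assumption introduced here for illustration and is not part of the original analysis; the polity names, time steps and levels are invented.

```python
def average_timeline(polity_series):
    """Average TRV levels over all polities at each time step.

    polity_series maps a polity name to {time_step: TRV level}.
    Polities without a value at a time step are skipped for that step.
    """
    sums, counts = {}, {}
    for series in polity_series.values():
        for step, level in series.items():
            sums[step] = sums.get(step, 0) + level
            counts[step] = counts.get(step, 0) + 1
    return [(step, sums[step] / counts[step]) for step in sorted(sums)]


def ratchet_score(timeline):
    """Fraction of consecutive steps where the average does not decrease.

    Hypothetical measure: 1.0 means the line never turns downward.
    """
    values = [value for _, value in timeline]
    steps = list(zip(values, values[1:]))
    if not steps:
        return 1.0
    return sum(later >= earlier for earlier, later in steps) / len(steps)


# Invented toy data: two polities, time steps given as years.
militech = {
    "polity_a": {-3000: 1, -2000: 2, -1000: 3},
    "polity_b": {-2000: 1, -1000: 3, 0: 4},
}
politics = {
    "polity_a": {-3000: 2, -2000: 3, -1000: 1},
    "polity_b": {-2000: 1, -1000: 2, 0: 1},
}

militech_line = average_timeline(militech)
politics_line = average_timeline(politics)
print(ratchet_score(militech_line), ratchet_score(politics_line))
```

In this toy case the militech line never turns downward, while the politics line does so in two of its three steps.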


4. Conclusion and Future Work
This paper addressed the issue of Knowledge Discovery and Representation for historical data in the
field of Digital Humanities, dealing in particular with the problem of data integrity. The experiments
presented in this paper suggest findings in the following areas:

    • Data Compression. From a technical point of view, TRV is an interpretable information com-
      pression technique, based on sequences and scholarly literature. It is suitable to explain variance
      of numerical variables with non-linear modeling, and to visualize complex phenomena through
      time. To gain an advantage over existing automated compression techniques, TRV-annotated data
      can be produced by prompting LLMs. The suggested experimental design is to manually annotate
      a small dataset as ground truth, generate TRV-annotated data with LLMs, and then measure the
      error of the generated data against the ground truth.
    • Hypothesis testing in Digital Humanities. Results reported in Table 2 show that religion type
      is more informative than morality in the prediction of social complexity. This is a hypothesis
      test not only of different data mining techniques, but also of different theories. Consider that
      the variable seshat-pca-morality derives its categories from a systematic assessment of literature
      about Axial age theory [29], while the chronos-trv-religion variable lines up religion types by
      historical appearance, and can be subsumed under the umbrella of evolutionary theories of
      religion [30]. This means that, in principle, theories can be encoded and tested with compression
      techniques, and TRVs are suitable to encode theories that can provide scales or meaningful orders
      between categories. Moreover, order is not necessarily related to time. For example it is possible
      to use a cold-to-hot scale to encode categorical variables such as climate or geography into
      interpretable TRVs.
    • Knowledge Discovery. TRV encoding showed that politics and agritech variables do not follow a
      ratchet effect pattern. This suggests that it is necessary to distinguish between evolutive variables,
      that are linear over time (and more likely to show a ratchet effect), and adaptive variables, that
      show non-linear behavior over time (and more subject to reversal). Evolutive variables, like
      advances in military technology or economy, determine a competitive advantage that a society
      can no longer give up. No known human society that adopted iron weapons ever abandoned
      them for bronze weapons. By contrast, political systems evolve and regress easily. For
      example, the Roman Republic and the egalitarian societies of the Neolithic introduced collective
      assemblies and population representatives, but in later times these were replaced by
      many polities governed by sole rulers.
    • Model Explanation. Results of the correlation analysis suggest that TRVs can have a positive
      impact on interpretability, because we know in advance the semantics of each level of the scale
      and its relation to the following one. It is important to stress the fact that correlation is not a
      cause-effect relation and sequentiality is not causality (post hoc ergo propter hoc fallacy) but rather
      an interplay of Granger causality. Hence it is possible to associate different semantic relations to
      the steps of the scale. For example a semantic relation between the copper and the bronze level of
      the militech variable is enablement. In other words the ability to work copper enables the ability to
      work bronze, that is an alloy of copper and tin. In the same way the ability to manipulate symbols
      like seal stamps paves the way to the use of cuneiform writings for administrative information
      management. Under this perspective, the interpolations provided by TRVs can be useful to
         advance hypotheses on unknown things. For example we do not know what religion there was
         in Çatalhöyük, but it was something in-between a cult of the ancestors and a cult of fertility. The
         same for Göbekli Tepe, where TRVs predict a transition between a shamanic cult of the dead and
         a cult of ancestors. Moreover, in a network structure it is possible to generalize semantic relations
         to the edges between different dimensions: for example infrastructure control can facilitate the
         circulation of economic capital, and media can improve the dissemination of religious practices.
         Hence, it is possible to build a knowledge graph from historical data compressed with TRVs.
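
Two points from the list above, the encoding of ordered categories and the comparison of generated annotations with a ground truth, can be sketched as follows. This is a minimal sketch under stated assumptions: the cold-to-hot category names are invented, levels are assumed to start at 1, an undetermined stage is assumed to sit at the midpoint of its two neighbouring levels, and mean absolute error is only one possible error measure. None of these choices is taken from the Chronos dataset.

```python
# Hypothetical cold-to-hot scale; the category names are illustrative only.
CLIMATE_SCALE = ["polar", "cold", "temperate", "warm", "hot"]


def encode_trv(category, scale):
    """Map a category to its position on an ordered scale (assumed 1-based)."""
    return scale.index(category) + 1


def encode_between(lower, upper, scale):
    """Encode an undetermined stage as the midpoint of two known levels."""
    return (encode_trv(lower, scale) + encode_trv(upper, scale)) / 2


def mean_absolute_error(generated, ground_truth):
    """Average distance, in scale levels, between two annotations."""
    if len(generated) != len(ground_truth):
        raise ValueError("annotations must have the same length")
    if not ground_truth:
        raise ValueError("annotations must not be empty")
    total = sum(abs(g - t) for g, t in zip(generated, ground_truth))
    return total / len(ground_truth)


# Invented labels: a manual ground truth and a machine-generated annotation.
truth_labels = ["cold", "temperate", "hot", "warm"]
generated_labels = ["cold", "warm", "hot", "warm"]

truth = [encode_trv(label, CLIMATE_SCALE) for label in truth_labels]
generated = [encode_trv(label, CLIMATE_SCALE) for label in generated_labels]
error = mean_absolute_error(generated, truth)
print(truth, generated, error)
```

Because the levels are ordered, the error is expressed in steps of the scale, so a confusion between adjacent categories costs less than one between distant categories.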

   The present work paves the way towards a new model of compressed knowledge representation
for historical data that is at the same time interpretable and machine-readable. Future work in this
direction includes:

       • testing prompts to generate annotation of TRVs with LLMs;
       • testing LLM-grounding to link TRV annotation to existing scholarly literature, that can be useful
         for digital libraries;
       • testing the inter-annotator agreement in the use of TRVs, both between humans and between
         LLMs;
       • experimenting with TRV step values encoding the sequence of events, like in the present paper,
         or both the sequence and the time between events;
       • comparing TRVs against other compression techniques;
       • applying TRVs in domains other than history.
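
The difference between step values that encode only the sequence of events and step values that also encode the time between events can be illustrated with a short sketch. The linear rescaling used in `time_weighted_steps` is an assumption made here, one of several possible designs, and the dates attached to the events are rough illustrative values rather than data from the paper.

```python
def sequence_steps(events):
    """Assign step values 1, 2, 3, ... by order of first appearance.

    events is a list of (label, year) pairs; only the order is used.
    """
    ordered = sorted(events, key=lambda event: event[1])
    return {label: index + 1 for index, (label, _) in enumerate(ordered)}


def time_weighted_steps(events):
    """Assign step values that also reflect the time between events.

    Assumed design: years are rescaled linearly so that the first event
    keeps level 1 and the last event keeps level len(events).
    """
    ordered = sorted(events, key=lambda event: event[1])
    first, last = ordered[0][1], ordered[-1][1]
    span = last - first
    top = len(ordered)
    if span == 0:
        return {label: 1.0 for label, _ in ordered}
    return {
        label: 1 + (year - first) / span * (top - 1)
        for label, year in ordered
    }


# Illustrative events with rough, invented dates (negative = BCE).
militech_events = [("copper", -5000), ("bronze", -3300), ("iron", -1200)]

by_sequence = sequence_steps(militech_events)
by_time = time_weighted_steps(militech_events)
print(by_sequence)
print(by_time)
```

With sequence-only values the intermediate stage sits exactly halfway between its neighbours, whereas the time-weighted variant moves it towards the stage it is closer to in time.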

This work presented a manual annotation of TRVs, whose purpose was to test their efficacy against
OHE-PCA in a controlled setting. The dataset produced is available online in a Google Sheet (footnote 1),
together with all the other datasets produced within the Chronos project for collaborative open science,
under a Creative Commons Attribution-NonCommercial-ShareAlike license.


Acknowledgments
This research was supported by the European Commission grant 101120657: European Lighthouse to
Manifest Trustworthy and Green AI - ENFIELD.
  This research employed data from the Seshat Databank (seshatdatabank.info) under Creative Commons
Attribution-NonCommercial-ShareAlike (CC BY-NC-SA) licensing.


References
    [1] W. J. Frawley, G. Piatetsky-Shapiro, C. J. Matheus, Knowledge discovery in databases: An overview,
        AI magazine 13 (1992) 57–57.
    [2] R. Roller, Theory-driven statistics for the digital humanities: Presenting pitfalls and a practical
        guide by the example of the reformation, Journal of Cultural Analytics 7 (2023).
    [3] N. Horsley, What can a knowledge complexity approach reveal about big data and archival practice?,
        in: 2017 IEEE International Conference on Big Data (Big Data), IEEE, 2017, pp. 2246–2250.
    [4] G. Demartini, K. Roitero, S. Mizzaro, Data bias management, Communications of the ACM 67
        (2023) 28–32.
    [5] D. L. Barbera, E. Maddalena, M. Soprano, K. Roitero, G. Demartini, D. Ceolin, D. Spina, S. Mizzaro,
        et al., Crowdsourced fact-checking: Does it actually work?, Information Processing & Management
        61 (2024).
    [6] E. Hyvönen, Using the semantic web in digital humanities: Shift from data publishing to data-
        analysis and serendipitous knowledge discovery, Semantic Web 11 (2020) 187–193.

1 https://docs.google.com/spreadsheets/d/1OW6CtmUudN3WTJ1VvWRZYZdTWVEjDJGns6Q8_I6EBwk/edit?usp=sharing
 [7] K. Golub, Y.-H. Liu, Information and knowledge organisation in digital humanities: Global per-
     spectives, Taylor & Francis, 2022.
 [8] G. Di Nunzio, J. A. Dyble, F. Giachelle, S. Gialdroni, et al., Linking historical evidence to digital
     maps: The micoll map, in: CEUR WORKSHOP PROCEEDINGS, volume 3536, 2023, pp. 104–112.
 [9] F. Giachelle, J. Dyble, G. M. Di Nunzio, S. Gialdroni, Exploring historical routes and waypoints
     with micoll digital map, in: International Conference on Theory and Practice of Digital Libraries,
     Springer, 2024, pp. 141–150.
[10] D. Lamorte, M. Rovera, A. Ferrara, S. Tonelli, Timeframe: Querying and visualizing event
     semantic frames in time, in: Proceedings of the First Workshop on Reference, Framing, and
     Perspective@ LREC-COLING 2024, ELRA and ICCL, 2024, pp. 13–17.
[11] A. Locaputo, B. Portelli, S. Magnani, E. Colombi, G. Serra, AI for the restoration of ancient
     inscriptions: A computational linguistics perspective, in: Decoding Cultural Heritage: A Critical
     Dissection and Taxonomy of Human Creativity through Digital Tools, Springer, 2024, pp. 137–154.
[12] D. Firmani, F. Leotta, J. G. Mathew, J. Rossi, L. Balzotti, H. Song, D. Roman, R. Dautov, E. J.
     Husom, S. Sen, et al., Intend: Intent-based data operation in the computing continuum, in: CEUR
     WORKSHOP PROCEEDINGS, volume 3692, CEUR-WS, 2024, pp. 43–50.
[13] P. Turchin, H. Whitehouse, P. François, D. Hoyer, A. Alves, J. Baines, D. Baker, M. Bartokiak,
     J. Bates, J. Bennet, et al., An introduction to seshat: Global history databank, Journal of Cognitive
     Historiography 5 (2020) 115–123.
[14] A. W. Johnson, T. K. Earle, The evolution of human societies: from foraging group to agrarian
     state, Stanford University Press, 2000.
[15] H. M. Johnson, Religion in social change and social evolution, Sociological Inquiry 49 (1979)
     313–339.
[16] P. Turchin, H. Whitehouse, S. Gavrilets, D. Hoyer, P. François, J. S. Bennett, K. C. Feeney, P. Pere-
     grine, G. Feinman, A. Korotayev, et al., Disentangling the evolutionary drivers of social complexity:
     A comprehensive test of hypotheses, Science Advances 8 (2022) eabn3517.
[17] I. Ul Haq, I. Gondal, P. Vamplew, S. Brown, Categorical features transformation with compact
     one-hot encoder for fraud detection in distributed environment, in: Data Mining: 16th Australasian
     Conference, AusDM 2018, Bathurst, NSW, Australia, November 28–30, 2018, Revised Selected
     Papers 16, Springer, 2019, pp. 69–80.
[18] M. Greenacre, P. J. Groenen, T. Hastie, A. I. d’Enza, A. Markos, E. Tuzhilina, Principal component
     analysis, Nature Reviews Methods Primers 2 (2022) 100.
[19] G. Chiarot, C. Silvestri, Time series compression survey, ACM Computing Surveys 55 (2023) 1–32.
[20] F. Celli, D. Mingazov, Knowledge extraction from llms for scalable historical data annotation,
     Electronics 13 (2024) 4990.
[21] N. Rane, S. Choudhary, J. Rane, Gemini versus chatgpt: applications, performance, architecture,
     capabilities, and implementation, Performance, Architecture, Capabilities, and Implementation
     (February 13, 2024) (2024).
[22] F. Celli, The wiki music dataset: A tool for computational analysis of popular music, arXiv preprint
     arXiv:1908.10275 (2019).
[23] P. Turchin, R. Brennan, T. Currie, K. Feeney, P. Francois, D. Hoyer, J. Manning, A. Marciniak,
     D. Mullins, A. Palmisano, et al., Seshat: The global history databank, Cliodynamics 6 (2015).
[24] P. Savage, Additional robustness analyses confirm that complex societies precede moralizing gods
     throughout world history, Nature Ecology & Evolution Community (blog) 5 (2019).
[25] J. A. Goldstone, Demographic structural theory: 25 years on, Cliodynamics 8 (2017).
[26] J. C. Landon, World History and the Eonic Effect: Civilization, Darwinism, Xlibris Corporation,
     2010.
[27] C. F. Thies, C. F. Baum, The effect of war on economic growth, Cato Journal 40 (2020).
[28] J. Müller, T. N. Friemel, Dynamics of digital media use in religious communities—a theoretical
     model, Religions 15 (2024) 762.
[29] D. A. Mullins, D. Hoyer, C. Collins, T. Currie, K. Feeney, P. François, P. E. Savage, H. Whitehouse,
     P. Turchin, A systematic assessment of “axial age” proposals using global comparative historical
     evidence, American Sociological Review 83 (2018) 596–626.
[30] P. Boyer, B. Bergstrom, Evolutionary perspectives on religion, Annual review of anthropology 37
     (2008) 111–130.