<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Antwerp, Belgium
a.w.lassche@hum.leidenuniv.nl (A. Lassche); jan.kostkan@cas.au.dk (J. Kostkan); kln@cas.au.dk (K. Nielbo)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Chronicling Crises: Event Detection in Early Modern Chronicles from the Low Countries</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alie Lassche</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jan Kostkan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kristoffer Nielbo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Center for Humanities Computing Aarhus</institution>
          ,
          <addr-line>Jens Chr. Skous Vej 4, Building 1483, DK-8000 Aarhus C</addr-line>
          ,
          <country country="DK">Denmark</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Leiden University, Institute of History</institution>
          ,
          <addr-line>Doelensteeg 16, 2311 VL Leiden</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>Between the Middle Ages and the nineteenth century, many middle-class Europeans kept a handwritten chronicle, in which they reported on events they considered relevant. The topics discussed ranged from records of price fluctuations to local politics, and from weather reports to remarkable gossip. What we do not yet know is to what extent times of conflict and crisis influenced the way in which people dealt with information. We have applied methods from information theory - dynamics in word usage and measures of relative entropy such as novelty and resonance - to a corpus of early modern chronicles from the Low Countries (1500-1820) to provide more insight into the way early modern people coped with information during impactful events. We detect three peaks in the novelty signal, which coincide with times of political uncertainty in the Northern and Southern Netherlands. Topic distributions provided by Top2Vec show that during these times, chroniclers tend to write more, and more extensively, about an increased variety of topics.</p>
      </abstract>
      <kwd-group>
        <kwd>event detection</kwd>
        <kwd>top2vec</kwd>
        <kwd>doc2vec</kwd>
        <kwd>information theory</kwd>
        <kwd>novelty</kwd>
        <kwd>resonance</kwd>
        <kwd>relative entropy</kwd>
        <kwd>early modern chronicles</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Between the Middle Ages and the nineteenth century, many middle-class men (and a handful
of women) in Europe kept a handwritten chronicle, in which they reported on current events
in their communities, and on what they considered interesting or relevant. These texts were
ordered chronologically, providing both the date and a report of a certain event (see Figure 1).
Chronicles were rarely printed in the lifetime of their authors, but, despite their scribal form,
they could still circulate in the localities, be read and continued by other authors, influencing
future generations [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>
        The topics described in chronicles were varied. Chroniclers alternated between
descriptions of political developments and records of price fluctuations of grain, butter, and milk,
weather reports, mentions of their relatives’ birthdays, gossip, religious developments, and
reports on unusual, strange, or marvelous events both nearby and further away. Although
chronicles did change over time, the type of information chroniclers selected to be included
remained fairly stable, which enables us to study the genre across centuries. One subcategory
of chronicle texts emerged out of political crises, wars, and civil conflicts. During such times,
many authors started to record public events that were upsetting their lives and to teach their
readers the lessons to be learned from them [
        <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
        ].
      </p>
      <p>What we do not know is how times of conflict and crisis influenced the chronicler’s way
of reporting and dealing with information. Did chroniclers collectively write about the same
topic during such a crisis, or were they more recipients of broader information? This paper will
demonstrate how chroniclers coped with information during impactful events. We aim
to provide more insight into the way early modern people dealt with crises. We apply methods
from information theory to a corpus of early modern chronicles from the Low Countries to
find an answer to the question formulated above.</p>
      <p>Related work is discussed in Section 2. The corpus is introduced in Section
3, while Section 4 describes the methods used in this study. In Section 5, we discuss the
results, and in Section 6, we draw some conclusions from them
and make some suggestions about the direction future research should take.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <sec id="sec-2-1">
        <title>2.1. The theory of the event</title>
        <p>
          Chronicles can be considered as a collection of chronological events. However, few historians
have studied ‘the event’ as a theoretical category [16, p. 198] – even though the
antievenementalism of social historian Fernand Braudel, who considered the history of events as
the mere froth on the waves of history, was largely replaced by a return to writing about events
in the 1970s. Anthropologist Marshall Sahlins has stated that ‘[e]vents can be distinguished
from uneventful happenings only to the extent that they violate the expectations generated by
cultural structures. The recognition of the event as the event, therefore, presupposes structure’
[16, p. 199][15]. The same distinction is made by [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]: ‘In an abstract sense, every occurrence
can be described as an event’, they state. However, in most contexts, the term event is not used
in that way. There is time that is filled with events, but since these are embedded in routines,
such times are not experienced as events in the narrower sense. Instead, ‘only those incidences
that strike us as noticeable ruptures with expected processes and routines are recognized as
real events’ [8, p. 78].
        </p>
        <p>
          From a linguistic perspective, the question of whether an incident is important enough to be
an event is not relevant: every state, change, or happening is considered an event. In linguistics,
the term event is related to the concept of eventuality, which was introduced by the linguist
Emmon Bach in 1986 and comprised states, processes, and events [2]. Many linguists indeed
understand the term eventuality in the broadest sense, comprising events, processes, states,
happenings, changes, episodes, etc., as is for example the case in a study on event detection
and classification for historical texts [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ].
        </p>
        <p>The difference between the historical and linguistic perspectives on events is that linguists
focus on the linguistic elements that are used, while historians focus on the result of an event.
When linguists use subcategories to distinguish between events, these are based on linguistic
elements. Neither the historical nor the linguistic perspective on events is
completely suitable for the study of events in early modern chronicles. In these texts, both abstract
and emphatic events are included. First and foremost, chroniclers report on happenings in
their surroundings that they consider to be noticeable ruptures of their daily routine. This can
be the threat of war, the visit of a foreign king, a national conflict, or unusual weather
phenomena. These can be considered events in the emphatic sense. At the same time, however,
chroniclers also include reports on daily or weekly routines such as checking the wind
direction and temperature, summarizing the Sunday sermon, reporting on the prices on the market,
or referring to weekly board meetings they attend. These are the abstract kind of events. In the
context of this study, we consider an event to be every description that is linked to a specific
date. For more details on methods see Section 4 and Appendix A.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Information theory</title>
        <p>
          The methods used in this study are based on dynamics in word usage and measures of relative
entropy. We know from previous work that word usage in newspapers is sensitive to the
dynamics of socio-cultural events [
          <xref ref-type="bibr" rid="ref4 ref5">5, 4</xref>
          ]. Methods from complexity science, such as fractal
analysis, have been used to identify distinct domains of newspaper content based on temporal
patterns in word use [18], and to distinguish cultural and social catastrophic events that display
class-specific fractal signatures in, among other things, word usage in newspapers [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
        <p>
          Previous studies have shown that entropy measures can be used to detect fundamental
conceptual differences between distinct periods [7], opposite political movements [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], and the
development of ideational factors in writing with a serial structure [11]. More specifically, several
studies have applied windowed relative entropy to thematic text representations to generate
signals that capture information novelty, which is the content difference from the past, and
information resonance, which is the degree to which future information conforms to novelty.
The methods have been successfully applied to parliamentary debates from the French
Revolution [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], to Dutch newspapers from the second half of the 20th century [18], and to Danish
newspapers from the COVID-19 pandemic [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
        </p>
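        <p>For orientation, novelty and resonance are commonly written as follows in the windowed relative entropy literature cited above [<xref ref-type="bibr" rid="ref3">3</xref>]. This is a sketch under assumptions: the symbols (s^(j) for the representation of document j as a probability distribution, w for the window size, and D_KL for the Kullback-Leibler divergence) are notation introduced here, and this section does not state which divergence or normalization the present study uses.</p>
        <disp-formula><tex-math><![CDATA[
\begin{aligned}
\mathcal{N}_w(j) &= \frac{1}{w}\sum_{d=1}^{w} D_{\mathrm{KL}}\!\left(s^{(j)} \,\middle\|\, s^{(j-d)}\right),\\
\mathcal{T}_w(j) &= \frac{1}{w}\sum_{d=1}^{w} D_{\mathrm{KL}}\!\left(s^{(j)} \,\middle\|\, s^{(j+d)}\right),\\
\mathcal{R}_w(j) &= \mathcal{N}_w(j) - \mathcal{T}_w(j).
\end{aligned}
]]></tex-math></disp-formula>
        <p>Here N_w(j) is the novelty of document j (its mean divergence from the w preceding documents), T_w(j) is its transience (its mean divergence from the w following documents), and the resonance R_w(j) is their difference, so that novelty which persists into the future yields positive resonance.</p>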
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Corpus</title>
      <p>The data set used in this study is a subset of a corpus that was collected and digitized in the
context of the research project ‘Chronicling Novelty. New knowledge in the Netherlands,
1500-1850’.<sup>1</sup> The full corpus consists of about 320 early modern chronicles written in
Dutch between 1500 and 1850. They are chronologically organized, cover events that
happened in the lifetime of the author, and focus on local events more than on national, individual,
or familial ones. About 130 of these chronicles had been published before as a contribution to a
journal, on the initiative of an archive, or in the private domain, and were digitized by the
Digital Library for Dutch Literature (DBNL). The other chronicles are kept in libraries and
archives throughout the Netherlands and Belgium.</p>
      <p>
        Every manuscript page was scanned and transcribed with both the Handwritten Text
Recognition tool Transkribus [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], and the help of volunteers on the online crowdsourcing platform
Vele Handen. Afterwards, both content and layout were annotated by volunteers, using labels
including page number, date, location, and person name.
      </p>
      <p>
        Our data set consists of 191 chronicles that were fully transcribed and annotated. However,
the date tag, which plays a pivotal role in this study, was repeatedly affected by bugs and
crashes. The tag therefore needed a manual inspection. We use all 191 annotated chronicles for
training models (corpus annotated), but only the 96 chronicles in which the date label was
manually checked were used for analysis (corpus corrected). See Table 1 for more statistics
on the used corpus.
      </p>
      <p><sup>1</sup> An overview of the corpus can be found at http://www.chroniclingnovelty.com/kronieken/.</p>
    </sec>
    <sec>
      <title>4. Pipeline</title>
      <p>
        We developed a five-step research pipeline; additional details can be found in Appendix A.<sup>2</sup>
        <list list-type="order">
          <list-item>
            <p>Chunk chronicles into primitives. Since chronicles are chronologically structured and mentions of a date are labeled as such, we make a cut before every date label, to chunk the texts into smaller fragments that can be connected to a date label. This was done for both corpus annotated and corpus corrected. We call the resulting text chunks primitives. We furthermore made a subset, primitives corrected daily, containing only primitives with a fully specified date tag. Table 2 contains statistics of the distinct data sets. The primitives annotated are used in step 2, while the primitives corrected daily are used in the other steps.</p>
          </list-item>
          <list-item>
            <p>Primitive representations. We use a Top2Vec model to create both document representations of the primitives annotated, and topics [1]. The model provides two relevant outputs, which are the doc2vec embeddings of the documents [<xref ref-type="bibr" rid="ref10">10</xref>], and the cosine similarities of the documents toward the estimated topic centroids. We reduced the trained model to 100 topics.</p>
          </list-item>
          <list-item>
            <p>Down-sampling (choosing prototypical primitives). In order to compute novelty and resonance signals, we want only one textual representation per day. In doing so, we avoid calculating novelty over primitives that are in fact descriptions of the same day. Instead, this allows us to keep the temporal dimension of novelty to at least one month. We therefore cluster the primitives corrected daily per day and pick a prototypical primitive, based on cosine similarity: if a day has multiple primitives, the primitive with the shortest distance to the others is picked as a prototype, assuming that this primitive is the most representative one. This method also allows us to calculate the uncertainty of a prototypical primitive, which we express as the standard deviation of the mean distance of a prototypical event to the other primitives on that day.</p>
          </list-item>
          <list-item>
            <p>Diachronic topic analysis. We group primitives corrected daily per year, and take the mean cosine similarities to every topic, to analyze topic fluctuation over time.</p>
          </list-item>
          <list-item>
            <p>Novelty detection. We calculate novelty and resonance of our time series of prototypes.</p>
          </list-item>
        </list>
      </p>
      <p>We expect peaks in the novelty signal to be indications of an event.</p>
      <p><sup>2</sup> Please refer to the git repository for the full code: https://github.com/centre-for-humanities-computing/dutch-chronicles.</p>
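      <p>The prototype selection of step 3 can be sketched as follows. This is a minimal illustration, not the code from the repository: the function names are hypothetical, embeddings are plain lists of floats, and the uncertainty is read here as the (population) standard deviation of the prototype's distances to the other primitives of that day, which is one possible interpretation of the description.</p>
      <preformat><![CDATA[
```python
import math
from statistics import mean, pstdev


def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def pick_prototype(vectors):
    """Return (index, uncertainty) for the embeddings of one day.

    The prototype is the embedding with the lowest mean distance to all
    others. The uncertainty definition is an assumption of this sketch.
    """
    if not vectors:
        raise ValueError("need at least one embedding")
    if len(vectors) == 1:
        return 0, 0.0  # a single primitive is its own prototype
    best_index, best_dists = 0, None
    for i, u in enumerate(vectors):
        dists = [cosine_distance(u, v) for j, v in enumerate(vectors) if j != i]
        if best_dists is None or mean(dists) < mean(best_dists):
            best_index, best_dists = i, dists
    return best_index, pstdev(best_dists)
```
]]></preformat>
      <p>Applied to the embeddings of every day with more than one primitive, this yields one representative document per day together with a rough measure of how well it stands in for the others.</p>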
    </sec>
    <sec id="sec-4">
      <title>5. Results</title>
      <sec id="sec-4-1">
        <title>5.1. General primitive and topic statistics</title>
        <p><sup>3</sup> Given this, it is striking that no peak appears in 1672, the Dutch Rampjaar (Disaster Year) in which the people were
described as redeloos (irrational), the government as radeloos (distraught), and the country as reddeloos (beyond
salvation). However, this absence is mainly due to a lack of data in the full corpus from that period.</p>
        <table-wrap>
          <label>Table 3</label>
          <caption>
            <p>Example topics from the reduced Top2Vec model, one per category.</p>
          </caption>
          <table>
            <thead>
              <tr><th>topic</th><th>words</th></tr>
            </thead>
            <tbody>
              <tr><td>0</td><td>wolkachtig motbuitjes verhelderd betrekt wolkens buiachtig verdund verdonkert betrokken</td></tr>
              <tr><td>1</td><td>overleede hat do overleeden edl weeduwe mevrou weed juffrou haar jaaren niwe oostindise dri</td></tr>
              <tr><td>2</td><td>par la dont dans dernier nouveau un alors le sur avec du francais on il etre une consequence les</td></tr>
              <tr><td>39</td><td>weinig duur daardoor thans hooi mogelijk steeds aanzien vooral prijzen aardappels oorzaak</td></tr>
              <tr><td>42</td><td>spaengiarden ducdalbe vuel ducdalf deestyt brusel vuele prinsche dagelycx scuyten gescut</td></tr>
              <tr><td>48</td><td>plechtigheyd geluyd beyaerd triumph luyster feest autoriteyten bywezen magten musicq</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>Several categories can be discerned from qualitative inspection of the topics in the reduced
Top2Vec model. There are natural, cultural, social, economic, and political topics – and some
topics fall into multiple categories. A sixth category contains topics of words that are not
clustered on semantic similarity, but on linguistic characteristics. In Table 3, we included an
example of every category. Topic 0 (the most dominant topic within the corpus) clearly belongs
to the category of natural topics, containing words related to the weather. Topic 1 falls into the
category of social topics, with words such as ‘overleeden’ (passed away), ‘weeduwe’ (widow)
and ‘juffrou’ (miss). Topic 2 is not really a topic, because its words are clustered on linguistic
characteristics: they are all French. In topic 39, words related to prices and products, such
as ‘duur’ (expensive), ‘hooi’ (hay), ‘aardappels’ (potatoes), and ‘prijzen’ (prices), are clustered,
belonging to the economic category. The words ‘spaengiarden’ (Spaniards), ‘ducdalbe’ (Duke
of Alba), ‘prinsche’ (prince), and ‘gescut’ (artillery) indicate that topic 42 can be considered a
political one. Topic 48 contains words related to festivities (‘plechtigheyd’ (ceremony), ‘beyaerd’
(carillon), ‘feest’ (party), ‘musicq’ (music)) and thus belongs to the cultural category.</p>
        <p>Topical dynamics were represented by plotting the mean cosine similarity of all primitives
in one year towards a certain topic. Some of the topics show a clear trend over time which can
easily be linked to a social, cultural, or political situation at that time. Other topics demonstrate
a repetitive pattern, remaining stable in the long term.</p>
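        <p>The yearly aggregation behind these topical dynamics can be sketched as follows. The function name and the input layout (one pair of year and per-topic cosine similarities for each primitive) are assumptions made for this illustration.</p>
        <preformat><![CDATA[
```python
from collections import defaultdict
from statistics import mean


def yearly_topic_means(records):
    """Average per-topic cosine similarities over all primitives of a year.

    `records` is an iterable of (year, similarities) pairs, where
    `similarities` holds one cosine similarity per topic (hypothetical layout).
    Returns a dict mapping each year to a list of mean similarities.
    """
    by_year = defaultdict(list)
    for year, similarities in records:
        by_year[year].append(similarities)
    return {
        year: [mean(column) for column in zip(*rows)]
        for year, rows in sorted(by_year.items())
    }
```
]]></preformat>
        <p>Plotting one column of the result against the years gives the trajectory of a single topic over time.</p>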
      </sec>
      <sec id="sec-4-2">
        <title>5.2. Chronicling novelty</title>
        <p>The prototype primitives, of which the frequency is visualized in dark green in Figure 2, were
used to compute the novelty signal. Three peaks can be observed in the novelty signal in
Figure 3, which remain stable when we adjust the window size. Peaks are visible around 1568,
1662, and 1789. These fluctuations in the signal indicate that a document vector is surprising
compared to its preceding vectors. The valley the novelty signal shows after a peak means
that the following documents are less surprising because their vectors are more similar to the
previous ones. The slow ascent of the peaks in 1568 and 1789 points to a long period of steadily
increasing surprise. The steep descent that follows indicates a sudden decrease in content
novelty of the documents.</p>
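        <p>How such a signal can be computed with windowed relative entropy is sketched here. The sketch assumes that every prototype has already been turned into a strictly positive probability distribution; how the document vectors are normalized in this study is not specified in this section, and the function names are hypothetical.</p>
        <preformat><![CDATA[
```python
import math


def kl_divergence(p, q):
    """Relative entropy D_KL(p || q) in bits; q must be strictly positive."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def novelty_resonance(series, window):
    """Compute (index, novelty, transience, resonance) per document.

    Novelty is the mean divergence from the `window` preceding documents,
    transience the mean divergence from the `window` following documents,
    and resonance is novelty minus transience. Documents without a full
    window on both sides are skipped.
    """
    results = []
    for j in range(window, len(series) - window):
        novelty = sum(
            kl_divergence(series[j], series[j - d]) for d in range(1, window + 1)
        ) / window
        transience = sum(
            kl_divergence(series[j], series[j + d]) for d in range(1, window + 1)
        ) / window
        results.append((j, novelty, transience, novelty - transience))
    return results
```
]]></preformat>
        <p>A document that differs from its past but resembles its future gets high novelty and positive resonance; a flat resonance signal, as reported here, means that novelty and transience stay roughly in balance.</p>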
        <p>The three peaks mark the earlier mentioned key points in the history of the Low Countries:
the start of the Dutch Revolt (1568), the prelude to the War of Devolution (1662), and the
end of the patriot movement with the return of the Stadtholder (1789). Furthermore, they
approximately coincide with a peaking primitive frequency. This indicates that when more
chroniclers report (or chroniclers report more) in their chronicle, the content of their reports
changes. A peak in the novelty signal can mean several things, for example, (1) a few topics
become more dominant than others, (2) earlier dominant topics become less dominant, or (3)
more diverse topics appear. The yearly mean cosine similarity per topic that was obtained
with Top2Vec is used to gain insight into the topics that dominate at the time of the novelty
peaks. Although one might expect that, during crises, chronicles would become more focused
on a small number of topics directly related to the crisis, it turns out that the topic distribution
is flatter during novelty peaks than in other years: even the topics with the highest mean cosine
similarities are still close to or below 0. This indicates a large variety in described topics during
such years. The top words of the three most dominant topics during novelty peaks are included
in Table 4.</p>
        <p>It must be said that document length is an important driver for the variation between high
and low novelty. Longer documents are more novel than shorter documents, because long
documents contain more varied information and therefore have average similarity to many
topics, while short documents, containing less information, have high similarity with one topic,
but uniformly low similarity with the rest.</p>
        <p>A positive association between novelty and resonance would indicate an innovation bias,
meaning that novelty introduced in the past leaves traces in the future. In our results, the
resonance signal remains flat over time, which suggests that future information does not conform
to the introduced novelty. It means that a newly introduced event in the corpus of chronicles
does not have an effect on the content that is described afterward. In other words: big events
do not impact the writing style and habits of chroniclers in the long term.</p>
        <p>Different interpretations are possible regarding the obtained results. Increasing diversity in
described topics during crises suggests that during times of high uncertainty, people are more
open to new information. Alternatively, it can be an indication of an expanding mediascape, or
a more thorough consumption of new media. Furthermore, it may indicate that early
modern people understood crises in a different way than we expect. A crisis is not only about
soldiers, sieges, and deaths (topics 4 and 5 in 1662); it also matters what kind of food is
available (topic 10 in 1568), how the weather might influence this (topic 0 in 1789), what the
city government decides (topic 9 in 1568), how the situation in a foreign city evolves, and, not
to forget, religious duties (topic 49 in 1789). This variety in topics could also show that not
every chronicler was equally affected by these historical events. Still, they felt the need to start
or intensify their recordings of events happening in their lives. An overload of information
asks for a sufficient approach. After being exposed to an information explosion, people tend to
select what is relevant to them, discarding the other topics. This is what the resonance signal
shows: there is a short period of information overload, but soon, the variety of described topics
decreases again.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Conclusion</title>
      <p>We have presented a method to detect events in early modern Dutch chronicles, which has
provided insight into the way chroniclers cope with information during impactful events, and
how these happenings influence their way of writing. Our main conclusion is that early
modern chroniclers tend to write more, and more extensively, during times of political uncertainty.
However, the topics they describe during such times also become more varied. This is shown by
a peak in the novelty signal and a flatter topic distribution. Furthermore, such an increase
in event density and a change in the novelty signal do not influence future reported events.
Soon, things get back to normal, and the (writing) life of the chronicler continues as it did
before.</p>
      <p>The representativeness of the corpus used for analysis is not unproblematic and is something
that could be improved in future research. We have used about one-third of the full corpus
of chronicles, mainly due to the fact that only this part was digitized and annotated at the
time. Moreover, the focus on only ‘daily events’ introduced a bias in the results. For the date with
the highest frequency of primitives (67), most primitives came from one author, describing how
the Stadtholder Willem V visited his hometown Purmerend. Other frequently reported dates
likewise pointed to events happening on one day, rather than to events spread over a longer
period of time. An exploration of the frequently mentioned dates would gain value if
‘monthly dates’ were also included.</p>
      <p>In this study, we have used the corpus as a whole in computing novelty and resonance
signals, showing that the genre remains stable over time, despite several impactful events.
Future work will focus on the novelty and resonance signals of individual authors, in order to see
whether the elevated topic diversity during crises can also be observed there. This should
furthermore provide more insight into their personal writing style, and into whether certain events they
experience have a lasting influence on their way of chronicling. Other future research should
investigate whether the absence of a positive relationship between novelty and resonance is
distinctive for the genre of chronicles, or something that can be observed more generally in early
modern texts. Early modern pamphlets or newspapers – specifically written for an audience –
would be an interesting case for exploring the difference from chronicles, since the latter
are in the end more often a private matter.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This research was executed at the Center for Humanities Computing Aarhus. The research
stay of Alie Lassche was funded by the Dr. Catharine van Tussenbroek Fonds, and the
Stichting de Fundatie van de Vrijvrouwe van Renswoude te ’s-Gravenhage. Kristoffer Nielbo was
supported by the Nordic e-Infrastructure Collaboration (NeIC) and the Danish e-Infrastructure
Cooperation (DeIC) with grant DeiC-AU1-L-000001.</p>
    </sec>
    <sec id="sec-7">
      <title>A. Methods</title>
      <sec id="sec-7-1">
        <title>Document segmentation</title>
        <p>The corpus does not come reliably segmented into sentences or documents. We use the provided
date tags for our document segmentation. In general, we consider a document to be the text
beginning with a date tag &lt;date A&gt;, which in turn serves to date the document. A document
can span multiple lines and pages, and ends with a date tag &lt;date B&gt;, which indicates the
beginning of the next document. However, we do not segment within lines – a document
always contains an entire line of text. In doing so, we attempt to address cases in which an
event’s dating does not appear at the beginning of the entry, but at some later point (e.g. in the
middle of a line). As a result of this rule, lines with multiple date tags are considered one event.
In such cases, all date tags are recorded for later sanity checks, but the date tag appearing first
is chosen as the dating of the event. Allowing an event to have multiple date tags is useful
when parsing chronicle entries dated with a range of dates. For example, an entry such as
‘between &lt;date A&gt; and &lt;date B&gt;, there was a heavy rainfall which influenced
the harvest in a bad way’ is not split into two, as long as both date tags appear on the
same line.</p>
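        <p>The line-based segmentation rule can be sketched as follows. The tag syntax is an assumption: the sketch uses a hypothetical tag of the form &lt;date value="YYYY-MM-DD"&gt;, since the actual annotation format is not shown here, and it simply skips any lines that precede the first date tag.</p>
        <preformat><![CDATA[
```python
import re

# Hypothetical tag syntax; the real annotation format is not given in the text.
DATE_TAG = re.compile(r'<date value="([^"]+)">')


def segment(lines):
    """Group lines into documents, cutting only at lines that contain a date tag.

    A line with one or more date tags starts a new document. All tags on that
    line are recorded, and the first one dates the document. Lines without a
    tag are appended to the current document, so documents never split lines.
    """
    documents = []
    current = None
    for line in lines:
        dates = DATE_TAG.findall(line)
        if dates:
            current = {"date": dates[0], "all_dates": dates, "lines": [line]}
            documents.append(current)
        elif current is not None:
            current["lines"].append(line)
        # Lines before the first date tag are ignored (assumption of this sketch).
    return documents
```
]]></preformat>
        <p>With this rule, an entry dated with a range of two tags on one line stays a single document, while the extra tag remains available for sanity checks.</p>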
        <p>The chronicles that were digitised by the DBNL pose a challenge here, since their lines do
not match the actual lines in the original manuscript. Instead, their lines are artificially created
when converting them to page-XML in order to upload them in Transkribus. Since chroniclers
were very inconsistent in their use of punctuation, these lines sometimes have the length of one
paragraph. This is not a problem as long as such a chronicle entry corresponds to a single dated
event. However, in some cases one document can consist of multiple dated entries referring to
different events. Luckily, the vast majority of the documents we used to fit the novelty signal
have a single date tag (20,372 out of 22,516 documents) and thus correspond to a single dated
entry. Furthermore, whether a document has a single date tag or multiple does not seem to
predict novelty scores very well (see Figure 4). In summary, the sanity checks we conducted did
not uncover a consequential bias in our analysis resulting from our document segmentation
strategy.</p>
      </sec>
      <sec id="sec-7-2">
        <title>Preprocessing</title>
        <p>Annotators have marked words that span multiple lines with a special character. First, we
concatenate these into a single word and remove all the other special characters used by
annotators. Next, unique IDs are assigned to documents, which can be used interchangeably in
both the annotated and corrected corpus.</p>
        <p>From here on, the processing steps differ for both corpora to reflect the task they are used for.
The annotated corpus is used for training of the Top2Vec model and should as such offer both
sufficient diversity and number of texts. For this reason, we only exclude documents shorter
than 50 characters (mostly OCR artifacts, or non-documents, such as page numbers). In total,
116,023 documents are passed on to Top2Vec training with the average length of 577 characters
(SD=1260).</p>
        <p>On the other hand, the corrected corpus is used to fit the novelty signal and should in
turn be reliably transcribed and dated. First, we keep only events that can be dated
to the day (i.e. having a fully specified date tag in a YYYY-MM-DD format, as opposed to e.g.
YYYY-MM). This undoubtedly has an effect on the results, since events that are not connected
to a specific date by the chronicler, but instead to e.g. a month, are excluded. This is further
discussed in the final section of the paper. Second, we exclude documents shorter than 50 and
longer than 5000 characters. With this additional upper limit on document length we attempt
to exclude events that contain verbatim copies of official documents, and non-events such
as chronicle appendices. These very long documents are outliers in both corpora (less than 1%
of the documents in the corrected corpus are longer than 5000 characters) and a majority of
them contain a single date tag, meaning they are not a concatenation of multiple short events.
Finally, only documents dated in the period of interest (years 1500 through 1820, inclusive) are
kept. In total, 36,147 documents are passed on to the next step of novelty detection with the
average length of 525 characters (SD=635).</p>
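        <p>The three filters applied to the corrected corpus can be summarized in a small predicate. The function name and its parameters are hypothetical; the thresholds are the ones stated above, and dates are assumed to be available as strings.</p>
        <preformat><![CDATA[
```python
import re

# A fully specified date tag: YYYY-MM-DD (a partial date such as YYYY-MM fails).
FULL_DATE = re.compile(r"^(\d{4})-(\d{2})-(\d{2})$")


def keep_for_novelty(text, date, min_chars=50, max_chars=5000,
                     first_year=1500, last_year=1820):
    """Return True if a document passes the filters for novelty detection."""
    match = FULL_DATE.match(date)
    if match is None:
        return False  # not dated to the day
    if not (min_chars <= len(text) <= max_chars):
        return False  # too short (artifact) or too long (verbatim copy, appendix)
    return first_year <= int(match.group(1)) <= last_year
```
]]></preformat>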
      </sec>
      <sec id="sec-7-3">
        <title>Event representations</title>
        <p>We use a Top2Vec model to create both document representations and topics [1]. This model
is based on the assumption that many semantically similar documents are indicative of an
underlying topic. Consecutively, Top2Vec creates jointly embedded document vectors and word
vectors using doc2vec, it creates lower dimensional embeddings of document vectors using
UMAP, and it finds dense areas of documents using HDBSCAN. For each dense area, the centroid
of the document vectors is then calculated, which is assumed to be the topic vector. Finally, it
searches for the n-closest word vectors to the resulting topic vector, in order to create a topic.</p>
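        <p>The last two steps, computing a topic vector as the centroid of a dense area and looking up its closest words, can be illustrated without the library itself. This sketch does not call Top2Vec; the function names and the toy two-dimensional vectors are assumptions, and closeness is measured by cosine similarity.</p>
        <preformat><![CDATA[
```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors (a value in [-1, 1])."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def topic_vector(document_vectors):
    """Centroid of the document vectors that fall in one dense area."""
    count = len(document_vectors)
    return [sum(column) / count for column in zip(*document_vectors)]


def closest_words(topic, word_vectors, n):
    """Return the n words whose vectors are most similar to the topic vector."""
    ranked = sorted(
        word_vectors,
        key=lambda word: cosine_similarity(topic, word_vectors[word]),
        reverse=True,
    )
    return ranked[:n]
```
]]></preformat>
        <p>For example, with toy vectors for ‘wolkachtig’, ‘buiachtig’, and ‘prijzen’, a centroid that lies near the first two returns those two words as the topic.</p>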
        <p>An important difference between Top2Vec and traditional bag-of-words topic modeling
methods such as LDA is that the semantic embedding used in the first method has the advantage of
learning the semantic association between words and documents. We therefore consider Top2Vec
a more suitable method, since the corpus at hand contains large spelling variation, for
which the semantic embedding approach can serve as a solution. Furthermore, LDA models
topics as distributions of words, which are then used to recreate the original document word
distributions with minimal error. This often necessitates uninformative words which are not
topical to have high probabilities in the topics since they make up a large proportion of all text.
A stopword list can be used to solve this problem, but expanding the list in order to get rid of
these non-topical words can be a never-ending iterative process. Top2Vec does not need a stop
list, because high frequency words that occur in all documents will not be particularly close to
any topic vector and thus will not dominate any topic.</p>
        <p>We train a model on all annotated primitives that are longer than 50 characters. The
trained model contains 426 topics, but we reduce it to 100 topics (after using the elbow method
to find the optimum number of topics, see Figure 5), using the hierarchical topic reduction
function in Top2Vec, which finds the representative topics of the corpus by iteratively merging
the smallest topic into its most similar topic until 100 topics remain. The
model provides two relevant outputs: the vector representations of the documents
and the cosine similarities of the documents to the topic centroids. Concerning the latter,
it is important to note that Top2Vec, being a geometrical model, differs here from a probabilistic
model such as LDA. The ‘weight’ is therefore not a probability between 0 and 1, but the cosine
similarity between a document and a topic, which is a value between −1 and +1.</p>
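        <p>The merging loop behind hierarchical topic reduction can be sketched as follows. This is a simplified illustration and not the Top2Vec implementation: the function name <monospace>reduce_topics</monospace> is hypothetical, and the merged topic vector is assumed here to be the size-weighted mean of the two merged vectors, which may differ from how the library recomputes it.</p>

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors (range -1 to +1)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def reduce_topics(topic_vectors, topic_sizes, target):
    """Merge the smallest topic into its most similar topic until `target` remain."""
    if 1 > target:
        raise ValueError("target must be at least 1")
    vectors = [list(v) for v in topic_vectors]
    sizes = list(topic_sizes)
    while len(vectors) > target:
        # Topic with the fewest documents.
        smallest = min(range(len(vectors)), key=lambda i: sizes[i])
        others = [i for i in range(len(vectors)) if i != smallest]
        # Most similar remaining topic by cosine similarity.
        nearest = max(
            others,
            key=lambda i: cosine_similarity(vectors[smallest], vectors[i]),
        )
        total = sizes[nearest] + sizes[smallest]
        # Assumption: the merged vector is the size-weighted mean.
        vectors[nearest] = [
            (sizes[nearest] * x + sizes[smallest] * y) / total
            for x, y in zip(vectors[nearest], vectors[smallest])
        ]
        sizes[nearest] = total
        del vectors[smallest]
        del sizes[smallest]
    return vectors, sizes


# Hypothetical toy input: three topics with 10, 2 and 5 documents.
reduced_vectors, reduced_sizes = reduce_topics(
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]], [10, 2, 5], target=2
)
```

        <p>In the toy input the two-document topic is absorbed by the topic it most resembles, so the reduced model keeps two topics with 12 and 5 documents.</p>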
      </sec>
      <sec id="sec-7-4">
        <title>Choosing prototypical events</title>
        <p>As was mentioned earlier, there is a sharp difference in the number of primitives across years
in our corpora. In order to alleviate this problem, we pick a single ‘prototypical’ document
for each day if there are multiple documents tied to that day. To acquire prototypes, we first
group doc2vec document embeddings (acquired in the previous step) into daily subsets. For
each subset, we then calculate pair-wise distances between embeddings. The embedding with
the lowest average distance to all the others in the subset is then picked as the prototype. The
distance metric used is cosine distance:
<disp-formula><tex-math>d_{\cos}(A, B) = 1 - \frac{A \cdot B}{\|A\| \, \|B\|}</tex-math></disp-formula></p>
        <p>Hereby we aim to capture the document that is most similar to other documents in a daily
subset. For example, if multiple documents refer to the same event on the same day, only one
will be picked to represent it. Furthermore, this step allows us to regularize the interval of
measurement (time elapsed between datapoints). Regular intervals are important for choosing
the window (<inline-formula><tex-math>w</tex-math></inline-formula>) parameter in novelty detection, as well as for interpreting the resulting novelty
values: after choosing prototypical events, a primitive with high novelty can be considered
novel in the context of <inline-formula><tex-math>w</tex-math></inline-formula> or more days, and not just an eventful afternoon with <inline-formula><tex-math>w</tex-math></inline-formula> records (e.g.
Stadtholder Willem V visiting Purmerend).</p>
        <p>Furthermore, a sanity check in which we did not choose prototypical primitives revealed that
the novelty peaks remain practically unchanged. It is therefore very unlikely that the peaks
are driven by the picking of irregular documents as prototypes.</p>
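        <p>The prototype selection described in this section can be sketched in a few lines. The record layout (date, document id, embedding), the example dates and the function names are assumptions made for illustration; the sketch only shows the rule of picking, per day, the embedding with the lowest average cosine distance to the others.</p>

```python
import math
from collections import defaultdict


def cosine_distance(a, b):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def pick_prototype(embeddings):
    """Index of the embedding with the lowest mean distance to all others."""
    if len(embeddings) == 1:
        return 0

    def mean_distance(i):
        distances = [
            cosine_distance(embeddings[i], embeddings[j])
            for j in range(len(embeddings))
            if j != i
        ]
        return sum(distances) / len(distances)

    return min(range(len(embeddings)), key=mean_distance)


def daily_prototypes(records):
    """Map each date to the id of its prototypical document."""
    by_day = defaultdict(list)
    for date, doc_id, embedding in records:
        by_day[date].append((doc_id, embedding))
    prototypes = {}
    for date, docs in by_day.items():
        index = pick_prototype([embedding for _, embedding in docs])
        prototypes[date] = docs[index][0]
    return prototypes


# Hypothetical records: three documents on one day, one on the next.
records = [
    ("1787-09-01", "a", [1.0, 0.0]),
    ("1787-09-01", "b", [0.9, 0.1]),
    ("1787-09-01", "c", [0.0, 1.0]),
    ("1787-09-02", "d", [0.5, 0.5]),
]
prototypes = daily_prototypes(records)
```

        <p>After this step every day contributes at most one document, so the distance between consecutive datapoints is measured in days rather than in records.</p>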
      </sec>
      <sec id="sec-7-5">
        <title>Novelty detection</title>
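        <p>The following measures are calculated on 300-dimensional doc2vec embeddings of the chosen
prototypical events, ordered by date. First, each embedding <inline-formula><tex-math>z</tex-math></inline-formula> with <inline-formula><tex-math>K</tex-math></inline-formula> dimensions is turned into a probability distribution using
the softmax function:
<disp-formula><tex-math>\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}</tex-math></disp-formula>
We then proceed to calculate novelty, transience and resonance. With novelty, we refer to an
event <inline-formula><tex-math>s^{(j)}</tex-math></inline-formula>'s reliable difference from past events <inline-formula><tex-math>s^{(j-1)}, s^{(j-2)}, \dots, s^{(j-w)}</tex-math></inline-formula> in window <inline-formula><tex-math>w</tex-math></inline-formula>:
<disp-formula><tex-math>\mathbb{N}_w(j) = \frac{1}{w} \sum_{d=1}^{w} JSD\left(s^{(j)} \mid s^{(j-d)}\right)</tex-math></disp-formula>
and resonance as the degree to which future events <inline-formula><tex-math>s^{(j+1)}, s^{(j+2)}, \dots, s^{(j+w)}</tex-math></inline-formula> conform to an event
<inline-formula><tex-math>s^{(j)}</tex-math></inline-formula>'s novelty:
<disp-formula><tex-math>\mathbb{R}_w(j) = \mathbb{N}_w(j) - \mathbb{T}_w(j)</tex-math></disp-formula>
where <inline-formula><tex-math>\mathbb{T}_w(j)</tex-math></inline-formula> is the transience of <inline-formula><tex-math>s^{(j)}</tex-math></inline-formula>:
<disp-formula><tex-math>\mathbb{T}_w(j) = \frac{1}{w} \sum_{d=1}^{w} JSD\left(s^{(j)} \mid s^{(j+d)}\right)</tex-math></disp-formula></p>
        <p>This model for novelty and resonance was originally proposed in [<xref ref-type="bibr" rid="ref3">3</xref>], but here we use the
symmetrized and smooth version with the Jensen-Shannon divergence (<inline-formula><tex-math>JSD</tex-math></inline-formula>):
<disp-formula><label>(6)</label><tex-math>JSD\left(s^{(j)} \mid s^{(k)}\right) = \frac{1}{2} D\left(s^{(j)} \mid M\right) + \frac{1}{2} D\left(s^{(k)} \mid M\right)</tex-math></disp-formula>
where <inline-formula><tex-math>M = \frac{1}{2}\left(s^{(j)} + s^{(k)}\right)</tex-math></inline-formula> and <inline-formula><tex-math>D</tex-math></inline-formula>
signifies Kullback–Leibler divergence [
          <xref ref-type="bibr" rid="ref3">3, 18</xref>
          ]:
<disp-formula><label>(7)</label><tex-math>D\left(s^{(j)} \mid s^{(k)}\right) = \sum_{i=1}^{K} s_i^{(j)} \log_2 \frac{s_i^{(j)}}{s_i^{(k)}}</tex-math></disp-formula></p>
        <p>In this case, Jensen-Shannon divergence is preferred over Kullback–Leibler divergence for a
number of reasons. First, we maintain that <inline-formula><tex-math>JSD</tex-math></inline-formula> allows us to relax assumptions about the
temporal order of observations, as it is a symmetric measure (meaning <inline-formula><tex-math>JSD(P \mid Q) = JSD(Q \mid P)</tex-math></inline-formula> for
probability distributions <inline-formula><tex-math>P</tex-math></inline-formula> and <inline-formula><tex-math>Q</tex-math></inline-formula>) [
          <xref ref-type="bibr" rid="ref12">18, 12</xref>
          ]. Information in our dataset is not always presented
in a strictly chronological way: both events happening over a range of dates, and ‘flashbacks’
(recollections of past events presented out of order at a future date) are examples of cases
where attributing a temporal order would be problematic. Second, the calculated JS divergences
are a smoother version of KL divergences, with the maximum possible difference between
probability distributions <inline-formula><tex-math>P</tex-math></inline-formula> and <inline-formula><tex-math>Q</tex-math></inline-formula> being 1 (if a base-2 logarithm is used). This property makes
some downstream tasks such as peak detection easier, because extreme values will not be orders
of magnitude greater than the mean (and therefore an extra normalization step is not required).
        </p>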
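        <p>The measures defined in this section can be written out as a short sketch using only the Python standard library. The function names and the toy embeddings are assumptions for illustration; positions that lack a full window of past or future events are simply skipped here, which is one possible way of handling the boundaries and not necessarily the one used in the study.</p>

```python
import math


def softmax(z):
    """Turn an embedding into a probability distribution."""
    shift = max(z)  # subtracting the maximum leaves the result unchanged
    exps = [math.exp(value - shift) for value in z]
    total = sum(exps)
    return [e / total for e in exps]


def kld(p, q):
    """Kullback-Leibler divergence in bits; zero entries of p contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence with base-2 logarithm (range 0 to 1)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kld(p, m) + 0.5 * kld(q, m)


def novelty_transience_resonance(distributions, w):
    """Return (j, novelty, transience, resonance) for every j with a full window."""
    results = []
    for j in range(w, len(distributions) - w):
        novelty = sum(
            jsd(distributions[j], distributions[j - d]) for d in range(1, w + 1)
        ) / w
        transience = sum(
            jsd(distributions[j], distributions[j + d]) for d in range(1, w + 1)
        ) / w
        results.append((j, novelty, transience, novelty - transience))
    return results


# Hypothetical toy series: the third event breaks with the past and the
# following events stay with it.
embeddings = [[0.0, 0.0], [0.0, 0.0], [5.0, 0.0], [5.0, 0.0], [5.0, 0.0]]
distributions = [softmax(e) for e in embeddings]
results = novelty_transience_resonance(distributions, w=2)
```

        <p>For the third event in the toy series the novelty is positive and the transience is zero, so its resonance equals its novelty: the event differs from what came before and the events after it conform to it.</p>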
      </sec>
      <sec id="sec-7-6">
        <title>Nonlinear adaptive filtering</title>
        <p>
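          Nonlinear adaptive filtering is applied to the information signals because of their inherent
noisiness [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. First, the signal is partitioned into segments (or windows) of length <inline-formula><tex-math>w = 2n + 1</tex-math></inline-formula>
points, where neighboring segments overlap by <inline-formula><tex-math>n + 1</tex-math></inline-formula> points. The time scale is <inline-formula><tex-math>n + 1</tex-math></inline-formula> points, which
ensures symmetry. Then, for each segment, a polynomial of order <inline-formula><tex-math>D</tex-math></inline-formula> is fitted. Note that <inline-formula><tex-math>D = 0</tex-math></inline-formula>
means a piece-wise constant, and <inline-formula><tex-math>D = 1</tex-math></inline-formula> a linear fit. The fitted polynomials for the <inline-formula><tex-math>i</tex-math></inline-formula>-th and <inline-formula><tex-math>(i+1)</tex-math></inline-formula>-th
segments are denoted as <inline-formula><tex-math>y^{(i)}(l_1), y^{(i+1)}(l_2)</tex-math></inline-formula>, where <inline-formula><tex-math>l_1, l_2 = 1, 2, \dots, 2n + 1</tex-math></inline-formula>. Note that the length of the last
segment may be shorter than <inline-formula><tex-math>w</tex-math></inline-formula>. We use the following weights for the overlap of two segments:
<disp-formula><label>(8)</label><tex-math>y^{(c)}(l) = w_1 \, y^{(i)}(l + n) + w_2 \, y^{(i+1)}(l), \quad l = 1, 2, \dots, n + 1</tex-math></disp-formula>
where <inline-formula><tex-math>w_1 = 1 - \frac{l - 1}{n}</tex-math></inline-formula> and <inline-formula><tex-math>w_2 = \frac{l - 1}{n}</tex-math></inline-formula> can be written as <inline-formula><tex-math>1 - d_j / n</tex-math></inline-formula>, <inline-formula><tex-math>j = 1, 2</tex-math></inline-formula>, with <inline-formula><tex-math>d_j</tex-math></inline-formula> denoting the
distance between the point of overlapping segments and the center of <inline-formula><tex-math>y^{(i)}, y^{(i+1)}</tex-math></inline-formula>. The weights
decrease linearly with the distance between point and center of the segment. This ensures that
the filter is continuous everywhere, which ensures that non-boundary points are smooth.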
        </p>
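        <p>A minimal sketch of such a filter is given below, restricted to linear fits (polynomial order 1) so that it needs only the standard library. The function names, the zero-based indexing and the handling of the shorter final segment are assumptions of this sketch; it illustrates the overlap weighting described above and is not the implementation used for the results.</p>

```python
def fit_line(ts, ys):
    """Least-squares line through (ts, ys); returns a function t -> fitted value."""
    count = len(ts)
    mean_t = sum(ts) / count
    mean_y = sum(ys) / count
    var_t = sum((t - mean_t) ** 2 for t in ts)
    slope = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys)) / var_t
    return lambda t: mean_y + slope * (t - mean_t)


def adaptive_filter(signal, n):
    """Smooth `signal` with overlapping windows of length 2n + 1 and linear fits."""
    if 1 > n:
        raise ValueError("n must be at least 1")
    size = len(signal)
    # Segments start every n points; the last one may be shorter than 2n + 1
    # but always keeps at least n + 1 points.
    n_segments = (size - 1) // n
    if n_segments == 0:
        raise ValueError("signal must contain at least n + 1 points")
    fits = []
    for i in range(n_segments):
        start = i * n
        stop = min(start + 2 * n, size - 1)
        ts = list(range(start, stop + 1))
        fits.append(fit_line(ts, [signal[t] for t in ts]))

    trend = []
    for t in range(size):
        k, p = divmod(t, n)  # p is the offset into the overlap (l - 1)
        if k == 0:
            trend.append(fits[0](t))  # only the first segment covers t
        elif k >= n_segments:
            trend.append(fits[-1](t))  # only the last segment covers t
        else:
            w1 = 1 - p / n  # weight of the earlier segment
            w2 = p / n  # weight of the later segment
            trend.append(w1 * fits[k - 1](t) + w2 * fits[k](t))
    return trend


# Hypothetical noisy input: a signal alternating between 0 and 1.
signal = [float(t % 2) for t in range(20)]
trend = adaptive_filter(signal, n=3)
```

        <p>Because the weights sum to one and change linearly across the overlap, the filtered curve passes smoothly from one local fit to the next; a signal that is already a straight line is returned unchanged.</p>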
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>D.</given-names> <surname>Angelov</surname></string-name>.
          “<article-title>Top2Vec: Distributed Representations of Topics</article-title>”.
          In: arXiv:2008.09470 [cs, stat] (<year>2020</year>). arXiv: 2008.09470 [cs, stat].
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>E.</given-names> <surname>Bach</surname></string-name>.
          “<article-title>The Algebra of Events</article-title>”.
          In: <source>Linguistics and Philosophy</source> <volume>9</volume>.<issue>1</issue> (<year>1986</year>), pp. <fpage>5</fpage>-<lpage>16</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><given-names>A. T. J.</given-names> <surname>Barron</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Huang</surname></string-name>,
          <string-name><given-names>R. L.</given-names> <surname>Spang</surname></string-name>, and
          <string-name><given-names>S.</given-names> <surname>DeDeo</surname></string-name>.
          “<article-title>Individuals, Institutions, and Innovation in the Debates of the French Revolution</article-title>”.
          In: <source>Proceedings of the National Academy of Sciences</source> <volume>115</volume>.<issue>18</issue> (<year>2018</year>), pp. <fpage>4607</fpage>-<lpage>4612</lpage>.
          doi: <pub-id pub-id-type="doi">10.1073/pnas.1717729115</pub-id>.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><given-names>J.</given-names> <surname>Daems</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>D'haeninck</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Hengchen</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Zere</surname></string-name>, and
          <string-name><given-names>C.</given-names> <surname>Verbruggen</surname></string-name>.
          “<article-title>'Workers of the World'? A Digital Approach to Classify the International Scope of Belgian Socialist Newspapers, 1885-1940</article-title>”.
          In: <source>Journal of European Periodical Studies</source> <volume>4</volume>.<issue>1</issue> (<year>2019</year>), pp. <fpage>99</fpage>-<lpage>114</lpage>.
          doi: <pub-id pub-id-type="doi">10.21825/jeps.v4i1.10187</pub-id>.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>J.</given-names> <surname>Gao</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Hu</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Mao</surname></string-name>, and
          <string-name><given-names>M.</given-names> <surname>Perc</surname></string-name>.
          “<article-title>Culturomics Meets Random Fractal Theory: Insights into Long-Range Correlations of Social and Natural Phenomena over the Past Two Centuries</article-title>”.
          In: <source>Journal of The Royal Society Interface</source> <volume>9</volume>.<issue>73</issue> (<year>2012</year>), pp. <fpage>1956</fpage>-<lpage>1964</lpage>.
          doi: <pub-id pub-id-type="doi">10.1098/rsif.2011.0846</pub-id>.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>K. L.</given-names>
            <surname>Gray</surname>
          </string-name>
          .
          <article-title>Comparison of trend detection methods</article-title>
          . University of Montana,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name><given-names>J.</given-names> <surname>Guldi</surname></string-name>.
          “<article-title>The Measures of Modernity: The New Quantitative Metrics of Historical Change Over Time and Their Critical Interpretation</article-title>”.
          In: <source>International Journal for History, Culture and Modernity</source> <volume>7</volume>.<issue>1</issue> (<year>2019</year>), pp. <fpage>899</fpage>-<lpage>939</lpage>.
          doi: <pub-id pub-id-type="doi">10.18352/hcm.589</pub-id>.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name><given-names>T.</given-names> <surname>Jung</surname></string-name> and
          <string-name><given-names>A.</given-names> <surname>Karla</surname></string-name>.
          “<article-title>1. Times of the Event: An Introduction</article-title>”.
          In: <source>History and Theory</source> <volume>60</volume>.<issue>1</issue> (<year>2021</year>), pp. <fpage>75</fpage>-<lpage>85</lpage>.
          doi: <pub-id pub-id-type="doi">10.1111/hith.12193</pub-id>.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name><given-names>P.</given-names> <surname>Kahle</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Colutto</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Hackl</surname></string-name>, and
          <string-name><given-names>H.</given-names> <surname>Mühlberger</surname></string-name>.
          “<article-title>Transkribus - A Service Platform for Transcription, Recognition and Retrieval of Historical Documents</article-title>”.
          In: <source>2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR)</source>.
          Vol. <volume>04</volume>. <year>2017</year>, pp. <fpage>19</fpage>-<lpage>24</lpage>.
          doi: <pub-id pub-id-type="doi">10.1109/icdar.2017.307</pub-id>.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name><given-names>Q.</given-names> <surname>Le</surname></string-name> and
          <string-name><given-names>T.</given-names> <surname>Mikolov</surname></string-name>.
          “<article-title>Distributed representations of sentences and documents</article-title>”.
          In: <source>International conference on machine learning</source>. PMLR.
          <year>2014</year>, pp. <fpage>1188</fpage>-<lpage>1196</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name><given-names>K. L.</given-names> <surname>Nielbo</surname></string-name>,
          <string-name><given-names>K. F.</given-names> <surname>Baunvig</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Liu</surname></string-name>, and
          <string-name><given-names>J.</given-names> <surname>Gao</surname></string-name>.
          “<article-title>A Curious Case of Entropic Decay: Persistent Complexity in Textual Cultural Heritage</article-title>”.
          In: <source>Digital Scholarship in the Humanities</source> <volume>34</volume>.<issue>3</issue> (<year>2019</year>), pp. <fpage>542</fpage>-<lpage>557</lpage>.
          doi: <pub-id pub-id-type="doi">10.1093/llc/fqy054</pub-id>.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name><given-names>K. L.</given-names> <surname>Nielbo</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Haestrup</surname></string-name>,
          <string-name><given-names>K. C.</given-names> <surname>Enevoldsen</surname></string-name>,
          <string-name><given-names>P. B.</given-names> <surname>Vahlstrup</surname></string-name>,
          <string-name><given-names>R. B.</given-names> <surname>Baglini</surname></string-name>, and
          <string-name><given-names>A.</given-names> <surname>Roepstorff</surname></string-name>.
          <source>When No News Is Bad News - Detection of Negative Events from News Media Content</source>.
          <year>2021</year>. arXiv: 2102.06505 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name><given-names>J.</given-names> <surname>Pollmann</surname></string-name>.
          “<article-title>Archiving the Present and Chronicling for the Future in Early Modern Europe</article-title>”.
          In: <source>Past &amp; Present</source> <volume>230</volume>.<issue>suppl 11</issue> (<year>2016</year>), pp. <fpage>231</fpage>-<lpage>252</lpage>.
          doi: <pub-id pub-id-type="doi">10.1093/pastj/gtw029</pub-id>.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name><given-names>J.</given-names> <surname>Pollmann</surname></string-name>.
          <source>Catholic Identity and the Revolt of the Netherlands, 1520-1635</source>.
          The Past &amp; Present Book Series. Oxford; New York: Oxford University Press,
          <year>2011</year>.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name><given-names>M.</given-names> <surname>Sahlins</surname></string-name>.
          “<article-title>The Return of the Event, Again: With Reflections on the Beginnings of the Great Fijian War of 1843 to 1855 between the Kingdoms of Bau and Rewa</article-title>”.
          In: <source>Clio in Oceania: Toward a Historical Anthropology</source>. Ed. by
          <string-name><given-names>A.</given-names> <surname>Biersack</surname></string-name>.
          Washington: Smithsonian Institution Press, <year>1991</year>, pp. <fpage>37</fpage>-<lpage>100</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name><given-names>W. H.</given-names> <surname>Sewell</surname> <suffix>Jr</suffix></string-name>.
          <source>Logics of History: Social Theory and Social Transformation</source>.
          Chicago Studies in Practices of Meaning. Chicago: University of Chicago Press,
          <year>2005</year>.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name><given-names>R.</given-names> <surname>Sprugnoli</surname></string-name> and
          <string-name><given-names>S.</given-names> <surname>Tonelli</surname></string-name>.
          “<article-title>Novel Event Detection and Classification for Historical Texts</article-title>”.
          In: <source>Computational Linguistics</source> <volume>45</volume>.<issue>2</issue> (<year>2019</year>), pp. <fpage>229</fpage>-<lpage>265</lpage>.
          doi: <pub-id pub-id-type="doi">10.1162/coli_a_00347</pub-id>.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name><given-names>M.</given-names> <surname>Wevers</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Kostkan</surname></string-name>, and
          <string-name><given-names>K.</given-names> <surname>Nielbo</surname></string-name>.
          “<article-title>Event Flow - How Events Shaped the Flow of the News, 1950-1995</article-title>”.
          In: <source>Computational Humanities Research Conference</source>. Amsterdam,
          <year>2021</year>, pp. <fpage>62</fpage>-<lpage>76</lpage>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>