<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Automatic Detection of Dates in the Corpus of Diaries</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Haralds Matulis</string-name>
          <email>haralds.matulis@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sanita Reinsone</string-name>
          <email>sanita.reinsone@lulfmi.lv</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ilze Ļaksa-Timinska</string-name>
          <email>ilze.laksa-timinska@lulfmi.lv</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Literature, Folklore and Art of the University of Latvia</institution>
          ,
          <addr-line>Mūkusalas 3, Riga, LV1423</addr-line>
          ,
          <country country="LV">Latvia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>1949</year>
      </pub-date>
      <fpage>1</fpage>
      <lpage>0001</lpage>
      <abstract>
        <p>This paper deals with the automatic detection of dates in a corpus of digitized, hand-written diaries in Latvian. Date detection is an important step in processing a diary corpus, as it makes it possible to split the source texts by the dates of entries, carry out diachronic analysis of individual diaries, and compare metrics across authors. The paper describes the data-processing workflow, provides a step-by-step implementation of the date detection algorithm, and evaluates the empirical results, discussing the practical challenges encountered in precise date detection in personal diaries.</p>
      </abstract>
      <kwd-group>
        <kwd>date detection</kwd>
        <kwd>corpus analysis</kwd>
        <kwd>crowdsourcing</kwd>
        <kwd>digitization</kwd>
        <kwd>hand-written texts</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>of diary texts or studies where such a corpus, potentially heterogeneous and yet representative, has been
analyzed with computational methods.</p>
      <p>An analysis of the date entries in the diary text corpus, with the aim of identifying the system of date
notation in each individual diary, revealed that date notation tends to be a highly creative process, the
author's taste, habits, and mood playing important roles. The date may be written at the beginning or at
the end of the entry, sometimes in the middle; sometimes, too, the entry may be without a date because
it is contextualized in the text. There are authors who mention the year only at the beginning of the year,
and the month at the beginning of the month, numbering the days with numerals. The abbreviations of
the year and the months are also very varied, and the use of punctuation and separators (full stop,
comma, slash, colon, semicolon, hyphen) is also varied. Finally, there are also errors in months, days,
or years. The landscape of dates in the diaries is indeed colorful.</p>
      <p>2. Methodological Considerations on Date Detection in Diaries</p>
      <p>For the further analysis of diaries with different computational methods (topic modeling, change
of topics and sentiment over time, and comparison of metrics across different diaries), there was a
need for a more detailed breakdown of the source files. The decision was to slice every diary into
smaller chunks, extracting entries for single days. A single day is a semantically meaningful time unit
for analysis when compared with a larger time period, such as a week or a month, or a finer unit, such as
morning, afternoon, or evening. A single day also coincides with the dominant notation system used by
diary authors to register their writings. In this corpus there are only a few exceptions, where a diary’s
entries are undated or refer to longer time periods, such as a month or a season. Therefore, a day seems a
reasonable choice for the finer partition of diaries.</p>
      <p>As the data after this pre-processing would be used in humanities research, the precision of the data is
crucial. A decision was made to capture initially as many candidate dates as possible, erring on the side of too
many false positives. In the next step, the output file was given to a digital humanities researcher, who
examined and manually corrected the wrong dates and deleted the non-dates so that the final version of
the digitized diaries would be close to perfect accuracy. As the date detection process was conducted
in cooperation between humanities researchers and a data scientist, the workflow had to be
comprehensible and simple for both sides. The source files were plain text files (.txt format), which were
shared on a drive. The data scientist downloaded the files, processed them, and uploaded the processed
files back onto the drive in a separate folder. Both the input and the output files were in .txt format, so
both parties could access and work with them.</p>
      <p>The specific challenge of finding dates in the diary corpus consisted of two parts. The first was to find all
dates occurring in the corpus. This was not an easy task, as the text was primarily meant for personal
use and not for computational analysis. The style of date notation is therefore often elliptic,
sometimes obscure, and varies widely even within one diary written by the same author. The second was to find
all dates that serve as metadates denoting the day, month, and year of the specific entry, and
to distinguish these from false positives, i.e., dates that only refer to some moment in the
narrative but do not indicate when the record was made.</p>
      <p>The wide variety of metadate notation, even by one author, might seem puzzling at first. And it
might provoke a question: are authors deliberately negligent or obscure with date formats? The answer
is that diaries, at least from the start, are written for oneself, and such elliptic ways of metadate
formatting are sufficient for the author and his/her purposes.</p>
      <p>31. 12. 32.</p>
      <p>Another observation is that a change in metadate format usually occurs with a larger time interval
between records — that could be as long as several years or as short as a month, after which the author
chooses to record the metadate in a format which seems more natural at that moment (perhaps not
reviewing the diary to compare the previous format used). Maybe such variety in one diary indicates
that writing is not the main or a prominent part of this person’s day and lifestyle, as daily writing habits
tend to develop more uniform patterns in a person’s writing.</p>
      <p>3. Different Conventions of Date Notation and Placement</p>
      <p>In regard to their metadate notation system, all diaries of this corpus can be divided into two groups:
those with an absolute and those with a relative system of metadate notation. The vast majority of
diaries fall into the absolute group. By absolute we refer to a notation system where the date, in
a full or shortened version, appears in the diary: dd.mm.yyyy – 14.02.1957; dd.mm.yy – 14.02.57; dd.mm
– 14.02. A relative metadate, on the other hand, relies on the overall hierarchical structure of the
metadate notation system in the diary, and the precise date of the entry can be deduced from the position
of that entry in that structure. Only two diaries used a
relative date system; these were addressed separately, by devising a particular search algorithm for each
one. All further discussion of date detection concerns absolute metadate detection.</p>
      <p>The most frequent placement of the metadate adheres to a convention consisting of three rules: (1)
an empty line precedes the entry of a new day, (2) the metadate is written at the beginning of a new line,
and (3) the diary entry for that day starts on the next line. However, there were variations to this system,
which had to be accounted for.</p>
      <p>After some pilot experiments for metadate detection and an evaluation of their results, it soon became
clear that date_and_month is the most important part of the full_date record, as both the day and the
month are needed to determine the precise entry time of the diary record. Without the month we are left
with just the day, wondering in which month the author made that record. Without the day we are left
with just the month, unable to attribute the entry to a specific dd.mm.yyyy. Without a year, however,
we can still usually guess the year from the context and previous records.</p>
      <p>As date_and_month is the minimum information needed to recognize whether a line
contains the metadate, all input files were searched for this group. Although the date_and_month group of
the metadate is usually placed at the beginning of the entry and on a new line, it is not always at the
exact beginning of the line.</p>
      <p>First, there could be a year before date_and_month: 1957. gada 14. februārī. // 14 February 1957.</p>
      <p>There could also be a weekday before: Ceturtdien, 14. februārī. // Thursday, 14 February.</p>
      <p>Or a location of the entry: Stokholmā, ceturtdienā, 14. februārī. // In Stockholm, on Thursday, 14 February.</p>
      <p>In some less common cases the date_and_month is intertwined with the words of the first sentence:
Ir jau pienācis 14. februāris… // And so the 14th of February has already come …</p>
      <p>Therefore, a buffer of 25 characters was allocated at the beginning of the paragraph, allowing
the date_and_month group to start anywhere from the 1st through the 25th character of the paragraph, but
not later. Experiments with larger buffers did not improve date detection quality and
produced more false positives.</p>
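      <p>The effect of the buffer can be sketched in a few lines of Python (the original code was written in JavaScript and is not reproduced here). This is only an illustration under stated assumptions: DATE_AND_MONTH is a simplified stand-in that covers a day number followed by a Latvian month abbreviation, the function name has_date_within is hypothetical, and the third test sentence in the usage note is invented for the example.</p>
      <preformat><![CDATA[
```python
import re

# Simplified stand-in for date_and_month (an assumption, not the paper's
# pattern): day number, optional dot and space, Latvian month abbreviation.
DATE_AND_MONTH = (
    r"(?<!\d)[0-3]?\d\.?\s?"
    r"(?i:jan|feb|mar|apr|mai|jūn|jūl|aug|sep|okt|nov|dec)"
)


def has_date_within(paragraph: str, buffer: int = 25) -> bool:
    """True if date_and_month starts after at most `buffer` leading characters."""
    pattern = r"^.{0,%d}?%s" % (buffer, DATE_AND_MONTH)
    return re.search(pattern, paragraph) is not None
```
]]></preformat>
      <p>With the default buffer, "Stokholmā, ceturtdienā, 14. februārī." is accepted, whereas a sentence such as "Vakar mēs ilgi runājām par to, kas notika 14. februārī." is rejected, because the date starts too far into the paragraph; it would only be accepted with a larger buffer, for example has_date_within(text, 60).</p>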
    </sec>
    <sec id="sec-2">
      <title>4. Date Detection Algorithm</title>
      <p>The Latvian NLP pipeline [9], which was used as a morphological parser of diaries to augment the data
with part-of-speech tags and lemmas, also contains a date recognition feature. However, due to the
above-mentioned irregularities in date notation and the need to distinguish metadates from other
dates in diaries, this feature delivered only partially sufficient
results. Therefore, it was decided to write a custom date detection algorithm using regular expressions.</p>
      <p>Regular expressions allow users to search a text for specific characters or sequences of characters
and then perform operations on them. To account for all the different cases of date_and_month placement,
a general pattern was created consisting of three parts: any 0 to 25 characters at the beginning of the
line (optional) + date_and_month + year (optional). The regular expressions that serve as building blocks
of the date detection pattern are provided in Table 1 below. The original code was written in the
JavaScript programming language, but it is largely compatible with other programming languages.</p>
      <p>The date detection algorithm was tested on diaries and improved until the results were satisfactory.
In general, the improvements followed two lines: (1) including more regular expression patterns of
metadate composition when some dates were left unrecognized in the processed files, and (2) narrowing
down regular expression patterns to exclude false positives. The variety of metadate patterns, as
can be seen in Figure 2, made it impossible to foresee all combinations of symbols beforehand;
therefore, such a trial-and-error method was appropriate for fine-tuning the search pattern.</p>
      <table-wrap>
        <label>Table 1</label>
        <table>
          <thead>
            <tr><th>Building block</th><th>What it matches</th></tr>
          </thead>
          <tbody>
            <tr><td>Date</td><td>A whitespace character (optional), followed by an optional 0, 1, 2, or 3, followed by any digit, followed by a dot (optional), followed by a forward slash / (optional), followed by whitespace (optional).</td></tr>
            <tr><td>Latvian months</td><td>The first three letters of the month, as authors often use abbreviations and not the full month name.</td></tr>
            <tr><td>Roman months</td><td>Roman numerals for months, followed by a period, comma, or whitespace.</td></tr>
            <tr><td>Arabic numerals</td><td>dd.mm.yyyy, dd.mm.yy, and yyyy.dd.mm., separated by . / \ - and followed by optional whitespace.</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>To detect dates, the following algorithm and workflow were used. A modular, step-by-step approach
made it possible to identify clearly what occurred in each phase, to evaluate intermediate output, to see how well
the algorithm performed, and to modify it if necessary. If the date formatting of the input .txt file is
known in advance, the modular approach also allows one to adjust the settings so that the regular expression
search focuses on the relevant patterns, thus decreasing false positives, which might be useful when
working with larger files.</p>
      <p>The .txt-in, .txt-out workflow made it possible to skip building, learning, and testing a user interface.
For larger-scale corpus processing, however, a graphical user interface might be useful: it would allow
one to adjust finer settings according to the input file formats and would minimize the chance of introducing
errors when hand-correcting the double-wrapped &lt;&lt; &gt;&gt; metadates. Table 2 below describes the date
detection algorithm step by step.</p>
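      <p>How such building blocks combine into the general pattern (prefix of 0 to 25 characters + date_and_month + optional year) might look as follows in Python. The expressions below are hypothetical reconstructions from the verbal descriptions in Table 1, not the authors' JavaScript expressions, and all names (PREFIX, DAY, LATVIAN, ROMAN, ARABIC, YEAR, METADATE, find_metadate) are assumptions introduced for this sketch.</p>
      <preformat><![CDATA[
```python
import re

# Hypothetical reconstructions of the building blocks described in Table 1.
PREFIX = r"^.{0,25}?"  # optional lead-in of up to 25 characters
DAY = r"(?<!\d)(?P<day>[0-3]?\d)\.?\s?/?\s?"
# First three letters of a Latvian month name, matched case-insensitively.
LATVIAN = r"(?i:(?P<lv>jan|feb|mar|apr|mai|jūn|jūl|aug|sep|okt|nov|dec))\w*\.?"
# Roman numeral month, followed by a period, comma, whitespace, or line end.
ROMAN = r"(?P<roman>XII|XI|IX|X|VIII|VII|VI|IV|V|III|II|I)(?=[.,\s]|$)[.,]?"
# Arabic numeral month with one of the separators . / \ -
ARABIC = r"(?P<num>1[0-2]|0?[1-9])(?=[./\\\-\s]|$)[./\\\-]?"
YEAR = r"(?:\s?(?P<year>\d{4}|\d{2})(?!\d))?"  # optional yyyy or yy

METADATE = re.compile(
    PREFIX + DAY + "(?:" + LATVIAN + "|" + ROMAN + "|" + ARABIC + ")" + YEAR
)


def find_metadate(paragraph):
    """Return (day, month_token, year_or_None), or None if nothing matches."""
    m = METADATE.search(paragraph)
    if m is None:
        return None
    month = m.group("lv") or m.group("roman") or m.group("num")
    return m.group("day"), month, m.group("year")
```
]]></preformat>
      <p>Keeping each block as a separate string mirrors the modular approach described above: if a diary is known to use only one notation, the other alternatives can be left out of the combined pattern to reduce false positives.</p>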
      <table-wrap>
        <label>Table 2</label>
        <table>
          <thead>
            <tr><th>Step</th><th>Description</th></tr>
          </thead>
          <tbody>
            <tr><td>1</td><td>Input the text file in a .txt format.</td></tr>
            <tr><td>2</td><td>The algorithm checks every paragraph of the input file. If a paragraph’s beginning (first 25 characters + the following pattern) contains a date, then the beginning of the paragraph is copied, wrapped in &lt; &gt;, and pasted above that paragraph, with an empty line added above the wrapped text fragment.</td></tr>
            <tr><td>3</td><td>In the third step, all the lines containing the wrapped &lt; &gt; text fragment are parsed and a metadate is predicted.</td></tr>
            <tr><td>4</td><td>In the fourth step, &lt; &gt; wrapped metadates are checked for inconsistencies: unchronological dates, repeated dates, years outside the legitimate interval of years, etc. Suspicious dates are wrapped in double brackets &lt;&lt; &gt;&gt;. After the fourth step, the output file in a .txt format is returned to a humanities researcher, who manually checks all double-wrapped dates, changes them to the correct date, and removes one pair of &lt; &gt; from the corrected double-wrapped dates.</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>An example of the algorithm at work is given below, showing an excerpt from a sample diary with
the output of every step, plus comments. The fragment is taken from the diary LFK Ak36, written by a
school teacher. It describes three days at the end of 1949 in Soviet Latvia. Below is the English
translation of the text, which follows in Latvian in Table 3:</p>
      <p>17 Dec.</p>
      <p>The class shows dedication to improving discipline and achievements. A class behavior register
has been introduced; this promotes class discipline. On the occasion of comrade Stalin's birthday on
the 20th of December, pioneers Melbārdis E. and Pinka A. will take the solemn pledge. Altogether, 7
pioneers (50%) in the class will raise discipline in their achievements.</p>
      <p>16 Dec.</p>
      <p>Today the class carefully arranged books on the windowsill (there is no other place left) and
paid more attention to the teachers’ desk. I am pleased that today, for the first time in the class, I praised
behavior and gave a mark of 5 for it. Compared with before, one can see the uniformity of the
group taking shape and the sense of responsibility setting in.</p>
      <p>22 XII.</p>
      <p>I have to stop again at Indulis. [..]</p>
      <table-wrap>
        <label>Table 3</label>
        <table>
          <thead>
            <tr><th>Step</th><th>Output</th><th>Comments</th></tr>
          </thead>
          <tbody>
            <tr>
              <td>2</td>
              <td><preformat>&lt;17. dec.&gt;
17. dec.
Klasē vērojama centība labot disciplīnu un sekmes. Ievesta ir
klases uzvedības atzīmju burtnīca - tas sekmē disciplīnu klasē.
Par godu b. Staļina dzimšanas dienai 20. dec. nodos svinīgo
solījumu pionieri Melbārdis E. un Pinka A. Klasē ar to 7
pionieri (50%), kas cels ari sekmēs disciplīnu.

&lt;16. dec.&gt;
16. dec.
Klase šodien rūpīgi sakārtojusi grāmatas uz loga - citur nav
vietas - un uzmanību vairāk pievērsusi arī skolotāju galdam.
Priecīga. ka šodien par uzvedību pirmo reizi klasē izteicos
atzinīgi un novērtēju ar 5. Salīdzinot ar iepriekšējo, var vērot
sastāva vienveidības veidošanos un kolektiva atbildības
sajūtu.

&lt;22 XII.&gt;
22 XII.
Atkal jāapstājas pie Induļa. [..]</preformat></td>
              <td>The algorithm checks every paragraph of the input file. If a paragraph's beginning (first 25 characters + the following pattern) contains a date, then the beginning of the paragraph is copied, wrapped in &lt; &gt;, and pasted above that line, and a modified copy of the input file is saved to a temporary file, which will be processed in the next stage. Also, an empty line is added above the wrapped fragment.</td>
            </tr>
            <tr>
              <td>3</td>
              <td><preformat>&lt;17.12.1949.&gt;
17. dec.
Klasē vērojama centība labot disciplīnu un sekmes. Ievesta ir
klases uzvedības atzīmju burtnīca - tas sekmē disciplīnu klasē.
Par godu b. Staļina dzimšanas dienai 20. dec. nodos svinīgo
solījumu pionieri Melbārdis E. un Pinka A. Klasē ar to 7
pionieri (50%), kas cels ari sekmēs disciplīnu.

&lt;16.12.1949.&gt;
16. dec.
Klase šodien rūpīgi sakārtojusi grāmatas uz loga - citur nav
vietas - un uzmanību vairāk pievērsusi arī skolotāju galdam.
Priecīga. ka šodien par uzvedību pirmo reizi klasē izteicos
atzinīgi un novērtēju ar 5. Salīdzinot ar iepriekšējo, var vērot
sastāva vienveidības veidošanos un kolektiva atbildības
sajūtu.

&lt;22.12.1949.&gt;
22 XII.
Atkal jāapstājas pie Induļa. [..]</preformat></td>
              <td>In the third step, the temporary file is processed again, now parsing all the lines containing the wrapped &lt; &gt; text fragment and predicting a date from that. Here the algorithm converts Latin and Latvian month abbreviations to Arabic numerals. The year, 1949, is correctly guessed from the incomplete date_and_month, as the year was given at the very beginning of the diary.</td>
            </tr>
            <tr>
              <td>4</td>
              <td><preformat>&lt;17.12.1949.&gt;
17. dec.
Klasē vērojama centība labot disciplīnu un sekmes. Ievesta ir
klases uzvedības atzīmju burtnīca - tas sekmē disciplīnu klasē.
Par godu b. Staļina dzimšanas dienai 20. dec. nodos svinīgo
solījumu pionieri Melbārdis E. un Pinka A. Klasē ar to 7
pionieri (50%), kas cels ari sekmēs disciplīnu.

&lt;&lt;16.12.1949.&gt;&gt;
16. dec.
Klase šodien rūpīgi sakārtojusi grāmatas uz loga - citur nav
vietas - un uzmanību vairāk pievērsusi arī skolotāju galdam.
Priecīga. ka šodien par uzvedību pirmo reizi klasē izteicos
atzinīgi un novērtēju ar 5. Salīdzinot ar iepriekšējo, var vērot
sastāva vienveidības veidošanos un kolektiva atbildības
sajūtu.

&lt;22.12.1949.&gt;
22 XII.
Atkal jāapstājas pie Induļa. [..]</preformat></td>
              <td>In the fourth step, the file is checked for inconsistencies in dates: unchronological dates, repeated dates, years outside the legitimate interval of years, etc. Inconsistent dates are wrapped in double brackets &lt;&lt; &gt;&gt;. After the fourth step, the output file in a .txt format is returned to a humanities researcher, who manually checks all double-wrapped dates, changes them to the correct date, and removes one pair of &lt; &gt; from the corrected double-wrapped dates.</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
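      <p>Steps 2 and 3 of the workflow can be illustrated with a short Python sketch. It is not the authors' JavaScript implementation: HEAD is a simplified stand-in pattern (day plus a Latvian month abbreviation or a Roman numeral), the function names wrap_candidates and predict_metadates are hypothetical, and the year is simply passed in as a parameter instead of being read from the beginning of the diary.</p>
      <preformat><![CDATA[
```python
import re

LATVIAN = {"jan": 1, "feb": 2, "mar": 3, "apr": 4, "mai": 5, "jūn": 6,
           "jūl": 7, "aug": 8, "sep": 9, "okt": 10, "nov": 11, "dec": 12}
ROMAN = {"I": 1, "II": 2, "III": 3, "IV": 4, "V": 5, "VI": 6,
         "VII": 7, "VIII": 8, "IX": 9, "X": 10, "XI": 11, "XII": 12}

# Simplified stand-in for the full pattern (an assumption of this sketch).
HEAD = re.compile(
    r"^.{0,25}?(?<!\d)(?P<day>[0-3]?\d)\.?\s?"
    r"(?:(?i:(?P<lv>jan|feb|mar|apr|mai|jūn|jūl|aug|sep|okt|nov|dec))\w*"
    r"|(?P<roman>XII|XI|IX|X|VIII|VII|VI|IV|V|III|II|I)(?=[.,\s]|$))\.?"
)


def wrap_candidates(paragraphs):
    """Step 2: paste the matched paragraph start, wrapped in < >, above it."""
    out = []
    for paragraph in paragraphs:
        m = HEAD.match(paragraph)
        if m:
            out += ["", "<" + m.group(0) + ">"]  # empty line, then wrapped copy
        out.append(paragraph)
    return out


def predict_metadates(lines, year):
    """Step 3: rewrite every wrapped line as <dd.mm.yyyy.> using a known year."""
    out = []
    for line in lines:
        m = None
        if line.startswith("<") and line.endswith(">"):
            m = HEAD.match(line[1:-1])
        if m:
            if m.group("lv"):
                month = LATVIAN[m.group("lv").lower()]
            else:
                month = ROMAN[m.group("roman")]
            line = "<%02d.%02d.%d.>" % (int(m.group("day")), month, year)
        out.append(line)
    return out
```
]]></preformat>
      <p>Applied to the lines "17. dec.", "16. dec.", and "22 XII." of the excerpt with year 1949, the sketch produces the wrapped lines &lt;17.12.1949.&gt;, &lt;16.12.1949.&gt;, and &lt;22.12.1949.&gt;. A narrative mention such as "Par godu b. Staļina dzimšanas dienai 20. dec." is not wrapped when it opens a paragraph, because the date starts beyond the 25-character buffer.</p>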
    </sec>
    <sec id="sec-3">
      <title>5. Evaluation of Results and Discussion</title>
      <p>After running the final iteration of the metadate detection algorithm on the corpus of 36 diaries,
15,303 metadates were detected, of which 456 (2.98%) were wrapped in double brackets &lt;&lt; &gt;&gt; as
suspicious and possibly wrong metadates. Upon further inspection of these 456 double-wrapped
metadates by close reading of the texts, they were classified into the categories listed in Table 4.</p>
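      <p>The kind of check that produces double-wrapped metadates can be sketched as follows. This Python illustration is an assumption about how such a check might look, not the authors' code: the function name flag_suspicious and the parameters first_year and last_year (standing for the interval of years in a diary's metadata) are hypothetical, and each date is compared only with the immediately preceding one. The last line recomputes the share of flagged metadates reported above.</p>
      <preformat><![CDATA[
```python
from datetime import date


def flag_suspicious(metadates, first_year, last_year):
    """Return one boolean per date; True marks a date to wrap in << >>."""
    flags = []
    previous = None
    for current in metadates:
        out_of_range = not (first_year <= current.year <= last_year)
        # A repeated date (==) or a step back in time (<) breaks chronology.
        not_chronological = previous is not None and current <= previous
        flags.append(out_of_range or not_chronological)
        previous = current
    return flags


# The three entries of the sample excerpt, in the order they appear.
sample = [date(1949, 12, 17), date(1949, 12, 16), date(1949, 12, 22)]
flags = flag_suspicious(sample, 1949, 1950)

# Share of double-wrapped metadates: 456 of 15,303.
share = round(100 * 456 / 15303, 2)
```
]]></preformat>
      <p>For the sample, only the second date (16.12.1949 following 17.12.1949) is flagged. Because the comparison always uses the preceding date, a wrong date also makes the next correct one look suspicious, which is the situation described in Table 4 as a wrong unchronological date before the current entry.</p>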
      <p>Evaluation of the suspicious metadates confirmed the importance of human evaluation.
(1) For several categories (1 and 2; also 5, 6, and 7), metadates were correctly formed and looked
identical in the text, and only a close reading could reveal whether a given metadate was correct or a mistake.
Categories 3 and 4 (past and future events) also contain correctly formed dates, but inside the narration, and
only a close reading can distinguish them from metadates.</p>
      <p>(2) Most often, double-wrapped &lt;&lt; &gt;&gt; metadates were non-chronological or impossible
dates introduced in the earlier digitization process. However, sometimes non-chronological dates
were present in the original diaries; this emphasized the need to give a unique identification number to
every entry, so that both the entry’s date and its sequence in the diary could be preserved when splitting
the data and saving separate entries.</p>
      <p>(3) When inspecting years, an additional check was initially performed to accept only
those years falling within the interval of possible years given in the diary’s metadata. However, it
was noticed that authors sometimes included later remarks dated to years outside the stated
time interval of the diary (category 7).</p>
      <p>(4) About a dozen metadate notations were found in the whole corpus where the date was
expressed not with some variation of date_and_month but with the traditional name of a holiday, e.g., Ziemassvētki
(Christmas), Otrās Lieldienas (Easter Monday), Vasarsvētki (Whit Sunday), Jāņi (Summer Solstice), etc.
These date notations were dealt with on a case-by-case basis by humanities researchers, because
they were often not precise enough for a specific date to be extracted
automatically.</p>
      <table-wrap>
        <label>Table 4</label>
        <table>
          <thead>
            <tr><th>No.</th><th>Category</th><th>Description</th></tr>
          </thead>
          <tbody>
            <tr><td>1</td><td>MULTIPLE entries for the same day</td><td>Usually 2 entries for a day, occasionally 3.</td></tr>
            <tr><td>2</td><td>DOUBLED metadate for the same entry</td><td>In one diary, the metadate was written before and after the entry.</td></tr>
            <tr><td>3</td><td>PAST events</td><td>The author writes about past events and starts with a date mention; it is recognized as the metadate for the entry.</td></tr>
            <tr><td>4</td><td>FUTURE events</td><td>The author writes about future events and starts with a date mention; it is recognized as the metadate for the entry.</td></tr>
            <tr><td>5</td><td>A TYPO or a mistake</td><td>A typo or mistake in the day / month / year of the metadate by the author or transcriber, causing the appearance of an unchronological metadate.</td></tr>
            <tr><td>6</td><td>A WRONG YEAR by author or transcriber</td><td>For example, the month in the metadate changes from December to January, but the year remains the same.</td></tr>
            <tr><td>7</td><td>Truly UNCHRONOLOGICAL sequence in a diary</td><td>Several days in mixed order, or longer sequences from another month or year, perhaps added later in the original.</td></tr>
            <tr><td>8</td><td>WRONG unchronological date BEFORE the current entry</td><td>Causes the current correct date to be recognized as wrong and wrapped in &lt;&lt; &gt;&gt;.</td></tr>
            <tr><td>9</td><td>UNRECOGNIZED start of a new YEAR by algorithm</td><td>After a long break (e.g., Nov 1966-Mar 1967) and with no explicit year in the metadate, the algorithm marks it as a possible typo &lt;&lt; &gt;&gt;.</td></tr>
            <tr><td>10</td><td>Date detection algorithm ERRORS</td><td>Incorrectly parsed dates with no obvious reason. Correct dates, wrapped in &lt;&lt; &gt;&gt; as suspicious with no obvious reason.</td></tr>
          </tbody>
        </table>
        <table-wrap-foot>
          <p>Numeric columns as extracted, in reading order (their assignment to rows is not recoverable): 4, 62, 5, 68, 7, 81, 93, 4, 29, 1, 13, 5, 17, 3, 13, 15, 2, –.</p>
        </table-wrap-foot>
      </table-wrap>
      <p>The cleaned metadates wrapped in &lt; &gt; further served as separators for splitting a .txt file into separate
day entries, which were saved into a .json file. An example of one entry text with additional metadata is given in Figure
3. Every entry has the following information: “lfk_number”, the abbreviated number of the diary in
the ILFA archive; “number_of_entry_for_this_author”, which allows one to detect correct
non-chronological entries; “metadate”, a metadate as predicted by the date detection algorithm;
“sql_date”, the metadate transformed to .sql format for later computational analysis;
“number_of_characters”, showing the entry length; and “when_added_to_database”, the date when the entry was added
to the database.</p>
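      <p>The splitting step can be sketched in Python as follows. The field names are those listed above; everything else is an assumption made for illustration: the function names split_entries and to_json, the extra "text" field holding the entry body, and the YYYY-MM-DD layout chosen for "sql_date".</p>
      <preformat><![CDATA[
```python
import json
import re
from datetime import date

# A cleaned metadate line such as <17.12.1949.>
WRAPPED = re.compile(r"^<(\d{2})\.(\d{2})\.(\d{4})\.>$")


def split_entries(lines, lfk_number, added=None):
    """Split cleaned text into day entries, one dict per <dd.mm.yyyy.> line."""
    added = added or date.today().isoformat()
    entries, current = [], None
    for line in lines:
        m = WRAPPED.match(line)
        if m:
            day, month, year = m.groups()
            current = {
                "lfk_number": lfk_number,
                # Running number keeps the original order of entries.
                "number_of_entry_for_this_author": len(entries) + 1,
                "metadate": "%s.%s.%s." % (day, month, year),
                "sql_date": "%s-%s-%s" % (year, month, day),  # assumed layout
                "text": [],  # hypothetical field for the entry body
                "when_added_to_database": added,
            }
            entries.append(current)
        elif current is not None and line.strip():
            current["text"].append(line)
    for entry in entries:
        entry["text"] = "\n".join(entry["text"])
        entry["number_of_characters"] = len(entry["text"])
    return entries


def to_json(entries):
    """Serialize entries, keeping Latvian characters readable."""
    return json.dumps(entries, ensure_ascii=False, indent=2)
```
]]></preformat>
      <p>Because the running entry number is stored next to the date, entries that are genuinely out of chronological order in the original diary keep their position even after the file has been split.</p>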
      <p>In total, 14,364 day entries from 36 diaries were added to the database. In further steps, the data were
enriched by processing the entries with the morphological parser of the Latvian NLP pipeline [9] and by adding
demographic metadata to each entry, such as the author’s gender or the author’s age at the moment of
writing. The division of the corpus into daily entries opens good possibilities for creating different
time-related sub-corpora by combining age groups and gender, e.g., diary texts of women aged 21–35,
diaries of men aged 36–50 in the 1950s, etc.</p>
      <p>Further computational analysis of diaries will require solving several methodological challenges,
such as evaluating the representativeness of the diary corpus. The diaries differ greatly in length,
writing frequency, and style, and it is yet to be determined what computational methods could offer
to the general discourse of diary research, including diachronic research of single diaries and
cross-comparison of different diaries according to similar properties, such as the author’s age, the author’s gender, and
other categories. Date detection carried out on the corpus of Latvian diaries has invited new perspectives
of inquiry for diaries [7], perspectives that have already been applied to periodicals [3], book printing,
and other domains of time-bound written documents.</p>
    </sec>
    <sec id="sec-4">
      <title>6. Acknowledgements</title>
    </sec>
    <sec id="sec-5">
      <title>7. References</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] <string-name><surname>Dennis-Henderson</surname>, <given-names>Ashley</given-names></string-name>, <string-name><surname>Roughan</surname>, <given-names>Matthew</given-names></string-name>, <string-name><surname>Mitchell</surname>, <given-names>Lewis</given-names></string-name>, <string-name><surname>Tuke</surname>, <given-names>Jonathan</given-names></string-name> (<year>2020</year>). <article-title>Life Still Goes on: Analysing Australian WW1 Diaries through Distant Reading</article-title>. <source>Proceedings of the 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature</source>, <publisher-name>International Committee on Computational Linguistics</publisher-name>, pp. <fpage>90</fpage>-<lpage>104</lpage>.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] <string-name><surname>Bandeli</surname>, <given-names>Kiran Kumar</given-names></string-name>, <string-name><surname>Hussain</surname>, <given-names>Muhammed Nihal</given-names></string-name>, <string-name><surname>Agarwal</surname>, <given-names>Nitin</given-names></string-name> (<year>2020</year>). <article-title>A Framework towards Computational Narrative Analysis on Blogs</article-title>. <source>CEUR Workshop Proceedings</source>, vol. <volume>2593</volume>, pp. <fpage>63</fpage>-<lpage>69</lpage>.</mixed-citation>
      </ref>
      <ref id="ref3">
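        <mixed-citation>[3] <string-name><surname>Koncar</surname>, <given-names>Philipp</given-names></string-name>, <string-name><surname>Fuchs</surname>, <given-names>Alexandra</given-names></string-name>, <string-name><surname>Hobisch</surname>, <given-names>Elisabeth</given-names></string-name>, <string-name><surname>Geiger</surname>, <given-names>Bernhard</given-names></string-name>, <string-name><surname>Scholger</surname>, <given-names>Martina</given-names></string-name>, <string-name><surname>Helic</surname>, <given-names>Denis</given-names></string-name> (<year>2020</year>). <article-title>Text sentiment in the Age of Enlightenment: an analysis of spectator periodicals</article-title>. <source>Applied Network Science</source>, No. <volume>5</volume> (<issue>1</issue>).</mixed-citation>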
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] <string-name><surname>Lejeune</surname>, <given-names>Philippe</given-names></string-name> (<year>2009</year>). <source>On Diary</source>. <publisher-loc>Honolulu</publisher-loc>: <publisher-name>University of Hawaii Press</publisher-name>.</mixed-citation>
      </ref>
      <ref id="ref5">
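        <mixed-citation>[5] <string-name><surname>Myers</surname>, <given-names>Victoria</given-names></string-name>, <string-name><given-names>David</given-names> <surname>O'Shaughnessy</surname></string-name>, and <string-name><given-names>Mark</given-names> <surname>Philp</surname></string-name> (eds), <source>The Diary of William Godwin</source> (<publisher-loc>Oxford</publisher-loc>: <publisher-name>Oxford Digital Library</publisher-name>, <year>2010</year>).</mixed-citation>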
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] <string-name><surname>Reinsone</surname>, <given-names>Sanita</given-names></string-name> (<year>2020</year>). <article-title>Searching for Deeper Meanings in Cultural Heritage Crowdsourcing</article-title>. In <source>A History of Participation in Museums and Archives. Traversing Citizen Science and Citizen Humanities</source>, Routledge's research series Museum and Heritage Studies, edited by Per Hetland, Palmyre Pierraux, Line Esborg, pp. <fpage>186</fpage>-<lpage>207</lpage>.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] <string-name><surname>Reinsone</surname>, <given-names>Sanita</given-names></string-name>, <string-name><surname>Matulis</surname>, <given-names>Haralds</given-names></string-name>, <string-name><surname>Ļaksa-Timinska</surname>, <given-names>Ilze</given-names></string-name>. <article-title>Metadatos balstīta dienasgrāmatu teksta korpusa analīze [Metadata Based Analysis of Diary Corpus]</article-title>. <source>Letonica</source>, <year>2022</year> (to be published).</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] <string-name><surname>Thain</surname>, <given-names>Marion</given-names></string-name> (<year>2016</year>). <article-title>Perspective: Digitizing the Diary - Experiments in Queer Encoding (A Retrospective and a Prospective)</article-title>. <source>Journal of Victorian Culture</source>, No. <volume>21</volume> (<issue>2</issue>), pp. <fpage>226</fpage>-<lpage>241</lpage>.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] <string-name><surname>Znotiņš</surname>, <given-names>A.</given-names></string-name> and <string-name><surname>Cīrule</surname>, <given-names>E.</given-names></string-name> <article-title>NLP-PIPE: Latvian NLP Tool Pipeline</article-title>. <source>Human Language Technologies - The Baltic Perspective</source>, <publisher-name>IOS Press</publisher-name>, <year>2018</year>.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>