=Paper=
{{Paper
|id=Vol-3232/paper32
|storemode=property
|title=Automatic Detection of Dates in the Corpus of Diaries
|pdfUrl=https://ceur-ws.org/Vol-3232/paper32.pdf
|volume=Vol-3232
|authors=Haralds Matulis,Sanita Reinsone,Ilze Ļaksa-Timinska
|dblpUrl=https://dblp.org/rec/conf/dhn/MatulisRL22
}}
==Automatic Detection of Dates in the Corpus of Diaries==
<pdf width="1500px">https://ceur-ws.org/Vol-3232/paper32.pdf</pdf>
<pre>
Automatic Detection of Dates in the Corpus of Diaries
Haralds Matulis1, Sanita Reinsone1, and Ilze Ļaksa-Timinska1
1
    Institute of Literature, Folklore and Art of the University of Latvia, Mūkusalas 3, Riga, LV1423, Latvia

                      Abstract
                      This paper deals with the automatic detection of dates in a corpus of digitized, hand-written
                      diaries in Latvian. Date detection is an important step in processing diaries’ corpus, as it allows
                      to split the source texts by dates of entries and carry out diachronic analysis for separate diaries
                      and compare metrics across different authors. This paper describes the workflow of data
                      processing, provides step by step implementation of date detection algorithm, and gives an
                      evaluation of empirical results with discussions of encountered practical challenges for precise
                      date detection in personal diaries.

                      Keywords 1
                      date detection, corpus analysis, crowdsourcing, digitization, hand-written texts.


       1. Introduction: Pilot Corpus of Diaries
    Diary writing tradition is a complex phenomenon [4]. Forms and styles of how personal diaries are
written can differ even within one notebook of a single author. However, it can be assumed that the date
at the beginning of a daily record is a formal element that distinguishes diaries from other forms of
personal autobiographical reflections in written form. Dates are important formal elements that keep
diaries structured and aligned with the narrated time.
    The corpus of diaries was built by the Institute of Literature, Folklore and Art (ILFA), University of
Latvia. It is based on the Autobiography Collection, started in 2018, in which various life writing
materials are archived and made digitally accessible at http://autobiografijas.lv. In collecting materials,
priority was given to previously unpublished autobiographical items and to those not housed in other
cultural heritage institutions: in other words, to those generally inaccessible because they were stored at
home. At the next stage, a crowdsourcing platform at garamantas.lv was used to transcribe diaries from
the scanned manuscripts. The file transformation from hand-written to electronic text was performed
(1) sometimes by the authors of the diaries themselves, (2) occasionally by the project researchers at
ILFA, (3) and mostly by volunteers [6]. The collection covers the 20th century through today. For the
pilot corpus, 36 diaries were selected which were fully transcribed. They are of different lengths (some
diaries consisting of a few entries, others written over as many as 55 years), written by authors of
different age and educational and social backgrounds in the Latvian language from 1917 to 2021.
    Information on the creation of corpora of diaries and the use of digital methods in its analysis is
relatively scarce in the academic literature, and studies on this topic are not many. Existing research
articles have mostly focused on the creation of publicly accessible digital editions of diaries, for
example, reflecting upon the encoding of diaries in a particular format or annotating a transcript with
persons, places, and other named entities ([8] and [5]). A relatively recent study has developed a new
system for the computer-aided identification of narrative threads in diary-like online blogs, using
several natural language processing techniques [2]. A team of researchers at the University of Adelaide
has been building a corpus of World War I diaries, containing over 500 diaries written between August
1914 and November 1918. Applying a number of distant reading methods, the study provides a general
overview, showing thematic trends and the distribution of particular concepts across the corpus [1]. To
date, we have not been able to find studies that have analyzed the creation of a comprehensive corpus


The 6th Digital Humanities in the Nordic and Baltic Countries Conference (DHNB 2022), Uppsala, Sweden, March 15-18, 2022
EMAIL: haralds.matulis@gmail.com (A. 1); sanita.reinsone@lulfmi.lv (A. 2); ilze.laksa-timinska@lulfmi.lv (A. 3)
ORCID: 0000-0002-0142-7677 (A. 1); 0000-0003-1980-5450 (A. 2); 0000-0001-7213-4954 (A. 3)
                             © 2022 Copyright for this paper by its authors.!
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

                            CEUR Workshop Proceedings (CEUR-WS.org)


                                                                                      334
of diary texts or studies where such a corpus, potentially heterogeneous and yet representative, has been
analyzed with computational methods.


Figure 1: Public interface of the manuscript transcription tool at garamantas.lv.

     An analysis of the date entries in the diary text corpus, with the aim of identifying the system of date
notation in each individual diary, revealed that date notation tends to be a highly creative process, the
author's taste, habits, and mood playing important roles. The date may be written at the beginning or at
the end of the entry, sometimes in the middle; sometimes, too, the entry may be without a date because
it is contextualized in the text. There are authors who mention the year only at the beginning of the year,
and the month at the beginning of the month, numbering the days with numerals. The abbreviations of
the year and the months are also very varied, and the use of punctuation and separators (full stop,
comma, slash, colon, semicolon, hyphen) is also varied. Finally, there are also errors in months, days,
or years. The landscape of dates in the diaries is indeed colorful.


    2. Methodological Considerations on Date Detection in Diaries
    For the further analysis of diaries with different computational methods — topic modeling, change
of topics and sentiment over time, and comparison of metrics across different diaries — there was a
need for a more detailed breakdown of the source files. The decision was to slice every diary into
smaller chunks, extracting entries for single days. A single day is a semantically meaningful time unit
for analysis, when compared with a larger time period like a week or a month or a finer unit such as
morning, afternoon, or evening. Single day also coincides with the dominant notation system used by
diary authors to register their writings. In this corpus there are only some exceptions when a diary’s
entries are undated or refer to a longer time periods, like a month or a season. Therefore, a day seems a
reasonable choice for the finer partition of diaries.
    As data after this pre-processing would be used in humanities research, the precision of the data is
crucial; a decision was made to target initially the maximum data, possibly erring on the side of too
many false positives. In the next step, the output file was given to a digital humanities researcher, who
examined and manually corrected the wrong dates and deleted the non-dates so that the final version of
the digitized diaries would be close to perfect accuracy. As the date-detection process was conducted
in a cooperation between humanities researchers and a data scientist, the workflow had to be
comprehensible and simple for both sides. The source files were text files (.txt format), which were
shared on a drive. The data scientist downloaded the files, processed them, and uploaded the processed
files back onto the drive in a separate folder. The input and output files were both .txt format, thus
enabling both parties to access and work with files.


                                                    335
    The specific challenge of finding dates in the diary corpus consisted of two parts. First, to find all
dates occurring in the corpus. This was not an easy task, as the text was primarily meant for personal
use and not for computational analysis. Therefore, the style of date notation is oftentimes elliptic,
sometimes obscure, and varies widely even within one diary written by the same author. Second, to find
all dates which are serving as metadates to denote the day, month, and year of the specific entry — and
to distinguish these dates from false positives, i.e., such dates that only refer to some moment in the
narrative but do not indicate the time when this record was made.
    The wide variety of metadate notation, even by one author, might seem puzzling at first. And it
might provoke a question: are authors deliberately negligent or obscure with date formats? The answer
is that diaries, at least from the start, are written for oneself, and such elliptic ways of metadate
formatting are sufficient for the author and his/her purposes.


      31. 12. 32.       3/I. 33.           2. jūlijā 1933. g.     7. septembra naktī   10. Septemb. 1933. g.    5. Septembrī


  š. g. 11. septm.   1935. g.           1933. gadā 1. janvārī 2. janvārī 1935. g.      31. Oktobrī 1933. gadā      31/V. 36. g.


  12. aug. 1938. g 6. jūlijā 1941. g.      1. 3. 44. g.


Figure 2: Many types of date notation in the diary of Davis Dauvarts, LFK Ak145.

   Another observation is that a change in metadate format usually occurs with a larger time interval
between records — that could be as long as several years or as short as a month, after which the author
chooses to record the metadate in a format which seems more natural at that moment (perhaps not
reviewing the diary to compare the previous format used). Maybe such variety in one diary indicates
that writing is not the main or a prominent part of this person’s day and lifestyle, as daily writing habits
tend to develop more uniform patterns in a person’s writing.


    3. Different Conventions of Date Notation and Placement
    In regard to their metadate notation system, all diaries of this corpus can be divided into two groups:
those with an absolute and those with a relative system of metadate notation. The vast majority of
diaries fall into the group of absolute date. By absolute we refer to a notation system where the date, in
a full or shortened version, appears in a diary: dd.mm.yyyy – 14.02.1957; dd.mm.yy – 14.02.57; dd.mm
– 14.02. A relative metadate, on the other hand, relies on the overall hierarchical structure of the
metadate notation system in the diary, and the precise date of the entry can be deduced from the position
of that entry in the overall structure of the diary. Of all diaries, there were only two diaries using a
relative date system; these were addressed separately, by devising a particular search algorithm for each
one. All further discussion about date detection is about absolute metadate detection.
    The most frequent placement of the metadate adheres to this convention containing three rules: (1)
an empty line before an entry of a new day, (2) the metadate is written at the beginning of a new line,
(3) the diary entry for that day starts on the next line. However, there were variations to this system,
which had to be accounted for.
    After doing some pilot experiments for metadate detection and evaluating the results, it soon became
clear that date_and_month is the most important part of the full_date record, as both the date and the
month are needed to determine the precise entry time of the diary record. Without ‘month’ we are left
just with the day — wondering in which month the author made that record. Without ‘day’ we are left
with just the month, unable to attribute this entry to a specific dd.mm.yyyy. However, without a ‘year,’
we can still usually guess what year it is from the context and previous records.
    As date_and_month is the critical minimum of information needed to find and recognize if that line
contains the metadate, all input files were searched for this group. Although date_and_month group of
the metadate is usually placed at the beginning of the entry and on a new line, it is not always at the
exact beginning of the line.


                                                                336
   First, there could be a year before date_and_month: 1957. gada 14. februārī.
   There could also be a weekday before: Ceturtdien, 14. februārī.
   Or a location of the entry: Stokholmā, ceturtdienā, 14. februārī.
   In some less common cases the date_and_month is intertwined with the words of the first sentence:
   Ir jau pienācis 14. februāris… // And so the 14th of February has already come …
   Therefore, a buffer of 25 characters was allocated to the beginning of the paragraph, allowing for
the date_and_month group to start anywhere from the 1st through 25th character of the paragraph, but
not later. Practical experiments with larger intervals showed not to improve date detection quality while
bringing more false positive results.


    4. Date Detection Algorithm
    The Latvian NLP pipeline [9], which was used as a morphological parser of diaries to augment data
with part-of-speech tags and lemmas, also contains a date recognition feature. However, due to above-
mentioned irregularities in date notation techniques and the need to distinguish metadates from other
dates in diaries, the Latvian NLP pipeline date recognition feature delivered only partially sufficient
results. Therefore, it was decided to write a custom date detection algorithm, using regular expressions.
    Regular expressions allow users to search a text for specific characters or sequences of characters
and then perform operations on them. To account for all different cases of date_and_month placement,
a general pattern was created consisting of three parts: any 0 to 25 characters at the beginning of the
line (optional) + date_and_month + year (optional). Regular expressions which are the building blocks
of the pattern for date detection are provided in Table 1 below. The original code was written in
Javascript programming language, but is largely compatible with other programming languages.
    The date detection algorithm was tested on diaries and improved until the results were satisfactory.
In general, the improvements followed two lines: (1) including more regular expression patterns of
metadate composition, when some dates were found unrecognized in processed files, and (2) narrowing
down regular expression patterns to exclude false positive results. The variety of metadate patterns, as
can be seen in Figure 2, made it impossible to imagine all combinations of symbols beforehand;
therefore, such a trial-and-error method was appropriate to fine-tune the search pattern.

Table 1
Regular Expressions to Find Metadates in Text
        Regular Expression                                      Comment
 ^.{0,25}/                         Beginning – any 0 to 25 characters at the beginning of the
                                   paragraph.
 /(19|20)\d{2}\.?\s?/              Year – a four-digit sequence, the first two digits should be 19
                                   or 20, followed by any 2 numbers, followed by an optional dot,
                                   followed by an optional whitespace.
 /\s?[0123]?\d\.?\/?\s?/           Date – a whitespace character (optional), followed by an
                                   optional 0 or 1 or 2 or 3, followed by any digit, followed by dot
                                   (optional), followed by forward slash / (optional), followed by
                                   whitespace (optional):
 /jan|feb|mar|apr|mai|j[uū]n|      Latvian months – matches the first three letters of the month –
 j[uū]|aug|sep|okt|nov|dec/i       as often authors use abbreviations and not the full month
                                   name.
 /XII|XI|IX|X|VIII|VII|VI|         Roman months – followed by a period, comma, or whitespace.
 IV|V|III|II|I[.,\s]/

 /\d{1,4}[./\\-]\s?\d{1,2}[./\\-     Arab numbers – to match dd.mm.yyyy, dd.mm.yy and
 ]\s?\d{1,4}\.?/                     yyyy.dd.mm., separated by . / \ - and followed by optional
                                     whitespace.


                                                  337
   To detect dates, the following algorithm and workflow were used. A modular, step-by-step approach
helped to clearly identify what occurred in each phase, to evaluate intermediary output, to see how well
the algorithm performed, and if necessary to modify it. If the date formatting of the input .txt file is
clearly known, the modular approach also allows one to adjust settings focusing a regular expressions
search on relevant patterns, thus decreasing false positive findings — which might be useful when
working with larger files.
   The .txt-in .txt-out workflow permitted to skip the building, learning, and testing of User Interface.
Although, for larger-scale corpus processing, a graphical User Interface might be useful — allowing
one to adjust finer settings according to input files formats and minimizing the chance of introducing
errors by hand-correcting the double wrapped << >> metadates. Table 2 below describes the date
detection algorithm step by step.

Table 2
Metadate Detection Algorithm in Five Steps
  Step                                                Action
    1    Input the text file in a .txt format.
    2    The algorithm checks every paragraph of the input file. If a paragraph’s beginning (first
         25 characters + the following pattern) contains a date, then the beginning of a
         paragraph is copied, wrapped in < >, and pasted above that paragraph, with an empty
         line added above the wrapped text fragment.
    3    In the third step, all the lines containing the wrapped < > text fragment are parsed and a
         metadate is predicted.
    4    In the fourth step, < > wrapped metadates are checked for inconsistencies —
         unchronological dates, repeated dates, years out of legitimate years interval, etc.
         Suspicious dates are wrapped in double brackets << >>.
    5    After the fourth step, the output file in a .txt format is returned to a humanities
         researcher, who manually checks all double-wrapped dates, changes them to the
         correct date and removes one pair of < > from corrected double-wrapped dates.

   An example of the algorithm at work is given below, showing an excerpt from a sample diary with
the output of every step, plus comments. The fragment is taken from the diary LFK Ak36, written by a
school teacher. It describes three days at the end of 1949 in Soviet Latvia. Below is the English
translation of the text which follows in Table 3 in Latvian:

   17 Dec.
   There is dedication in the class to improve discipline and achievements. A class behavior register
has been introduced — this promotes class discipline. On the occasion of comrade Stalin's birthday on
the 20th of December, pioneers Melbārdis E. and Pinka A. will take the solemn pledge. Altogether, 7
pioneers (50%) in the class will raise discipline in their achievements.
   16 Dec.
   Today the class carefully arranged books on the windowsill — there is no other place left — and
paid more attention to the teachers’ desk. I am pleased that today, for the first time in the class, I praised
behavior and gave a mark of 5 for it. Comparing with the previous, one can see the uniformity of the
group taking shape and the sense of responsibility setting in.
   22 XII.
   I have to stop again at Indulis. [..]


                                                     338
  Table 3
  Metadate Detection – Workflow Example
Step                               Output                                            Comments
  1   17. dec.                                                         First step: input the text file. Here
      Klasē vērojama centība labot disciplīnu un sekmes. Ievesta ir it is an excerpt from a diary
      klases uzvedības atzīmju burtnīca - tas sekmē disciplīnu klasē. written in 1949. / 1950.
      Par godu b. Staļina dzimšanas dienai 20. dec. nodos svinīgo
      solījumu pionieri Melbārdis E. un Pinka A. Klasē ar to 7
      pionieri (50%), kas cels ari sekmēs disciplīnu.
      16. dec.
      Klase šodien rūpīgi sakārtojusi grāmatas uz loga - citur nav
      vietas - un uzmanību vairāk pievērsusi arī skolotāju galdam.
      Priecīga. ka šodien par uzvedību pirmo reizi klasē izteicos
      atzinīgi un novērtēju ar 5. Salīdzinot ar iepriekšējo, var vērot
      sastāva vienveidības veidošanos un kolektiva atbildības
      sajūtu.
      22 XII.
      Atkal jāapstājas pie Induļa. [..]

 2     <17. dec.>                                                       The algorithm checks every
       17. dec.                                                         paragraph of the input file. If a
       Klasē vērojama centība labot disciplīnu un sekmes. Ievesta ir    paragraph's beginning (first 25
       klases uzvedības atzīmju burtnīca - tas sekmē disciplīnu klasē.  characters + the following pattern)
       Par godu b. Staļina dzimšanas dienai 20. dec. nodos svinīgo      contains a date, then the
       solījumu pionieri Melbārdis E. un Pinka A. Klasē ar to 7         beginning of a paragraph is copied,
       pionieri (50%), kas cels ari sekmēs disciplīnu.                  wrapped in < >, and pasted above
       <16. dec.>                                                       that line and a modified copy of
       16. dec.                                                         input file is saved to a temporary
       Klase šodien rūpīgi sakārtojusi grāmatas uz loga - citur nav file which will be processed in the
       vietas - un uzmanību vairāk pievērsusi arī skolotāju galdam. next stage. Also, an empty line is
       Priecīga. ka šodien par uzvedību pirmo reizi klasē izteicos added above the wrapped
       atzinīgi un novērtēju ar 5. Salīdzinot ar iepriekšējo, var vērot fragment.
       sastāva vienveidības veidošanos un kolektiva atbildības
       sajūtu.
       <22 XII.>
       22 XII.
       Atkal jāapstājas pie Induļa. [..]

 3     <17.12.1949.>                                                In the third step, the temporary file
       17. dec.                                                     is processed again, now parsing all
       Klasē vērojama centība labot disciplīnu un sekmes. Ievesta irthe lines containing the wrapped <
       klases uzvedības atzīmju burtnīca - tas sekmē disciplīnu klasē.
                                                                    > text fragment and predicting a
       Par godu b. Staļina dzimšanas dienai 20. dec. nodos svinīgo  date from that.
       solījumu pionieri Melbārdis E. un Pinka A. Klasē ar to 7     Here the algorithm converts from
       pionieri (50%), kas cels ari sekmēs disciplīnu.              Latin     and      Latvian   month
       <16.12.1949.>                                                abbreviations to Arab numbers.
       16. dec.                                                     And the year, 1949, is correctly
       Klase šodien rūpīgi sakārtojusi grāmatas uz loga - citur nav guessed from the incomplete
       vietas - un uzmanību vairāk pievērsusi arī skolotāju galdam. date_and_month, as the year was
       Priecīga. ka šodien par uzvedību pirmo reizi klasē izteicos


                                                    339
     atzinīgi un novērtēju ar 5. Salīdzinot ar iepriekšējo, var vērot given in the very beginning of the
     sastāva vienveidības veidošanos un kolektiva atbildības diary.
     sajūtu.
     <22.12.1949.>
     22 XII.
     Atkal jāapstājas pie Induļa. [..]

4    <17.12.1949.>                                                    In the fourth step, the file is
     17. dec.                                                         checked for inconsistencies in
     Klasē vērojama centība labot disciplīnu un sekmes. Ievesta ir    dates — unchronological dates,
     klases uzvedības atzīmju burtnīca - tas sekmē disciplīnu klasē.  repeated dates, years out of
     Par godu b. Staļina dzimšanas dienai 20. dec. nodos svinīgo      legitimate years interval, etc.
     solījumu pionieri Melbārdis E. un Pinka A. Klasē ar to 7         Inconsistent dates are wrapped in
     pionieri (50%), kas cels ari sekmēs disciplīnu.                  double brackets << >>.
     <<16.12.1949.>>                                                  After the fourth step, the output
     16. dec.                                                         file in a .txt format is returned to a
     Klase šodien rūpīgi sakārtojusi grāmatas uz loga - citur nav humanities researcher, who
     vietas - un uzmanību vairāk pievērsusi arī skolotāju galdam. manually checks all double-
     Priecīga. ka šodien par uzvedību pirmo reizi klasē izteicos wrapped dates, changes them to
     atzinīgi un novērtēju ar 5. Salīdzinot ar iepriekšējo, var vērot the correct date, and removes one
     sastāva vienveidības veidošanos un kolektiva atbildības pair of < > from corrected double-
     sajūtu.                                                          wrapped dates.

     <22.12.1949.>
     22 XII.
     Atkal jāapstājas pie Induļa. [..]


    5. Evaluation of Results and Discussion
   After running the final iteration of the metadates detection algorithm on the corpus of 36 diaries,
15,303 metadates were detected, of which 456 (2.98%) were wrapped in double brackets << >> as
suspicious and possibly wrong metadates. Upon further inspection of these 456 double-wrapped
metadates by close reading of the texts, they were classified into following categories, see Table 4.
   Evaluation of the suspicious metadates confirmed the importance of human evaluation.
   (1) For several categories (1 and 2; also 5, 6, and 7) metadates were correctly formed and looked
identical in the text, and only a close reading could reveal if it was a correct metadate or a mistake.
Categories 3 and 4 (past and future events) also are correctly formed dates, but inside the narration, and
only a close reading can distinguish them from metadates.
   (2) Most often, double-wrapped <<>> metadates were errors of non-chronological or impossible
dates’ being introduced in the earlier digitization process. However, sometimes non-chronological dates
were present in the original diaries; that emphasized the need to give a unique identification number to
every entry, so that both the entry’s date and its sequence in the diary could be preserved when splitting
data and saving separate entries.
   (3) When inspecting for possible years, initially an additional check was performed to search only
those years occurring within the interval of possible years as denoted in diary’s metadata. However, it
was noticed that sometimes authors have included later remarks in years which are outside the stated
time interval of the diary (category 7).
   (4) There were about a dozen metadate notations found in the whole corpus where date was not
expressed with some variation of date_and_month but with traditional names, e.g.: Ziemassvētki
(Christmas), Otrās Lieldienas (Easter Day), Vasarsvētki (Whit Sunday), Jāņi (Summer Solstice), etc.
These date notations were dealt with on a case-by-case basis by humanities researchers — the reason


                                                   340
being that oftentimes these date notations were not precise enough to extract a specific date
automatically.

Table 4
Ten categories of double wrapped << >> metadates found in close reading.
 Nr       Description of the     Count In how                          Comments
               category                   many
                                         diaries
 1.    MULTIPLE entries for       103      24     Usually 2 entries for a day, occasionally 3.
       the same day
 2.    DOUBLED metadate for        4        1     In one diary, metadate was before and after the
       the same entry                             entry.
 3.    PAST events                62       13     Author writes about past events, starts with a
                                                  date mention; it is recognized as the metadate
                                                  for entry.
 4.    FUTURE events               5        5     Author writes about future events, starts with a
                                                  date mention; it is recognized as the metadate
                                                  for entry.
 5.    A TYPO or a mistake        68       17     A typo or mistake in the day / month / year of
                                                  metadate by author or transcriber, causing the
                                                  appearance of an unchronological metadate.
 6.    A WRONG YEAR by             7        3     For example, month in the metadate changes
       author or transcriber                      from December to January, but the year remains
                                                  the same.
 7.    Truly                      81       13     Several days in mixed order, or longer sequences
       UNCHRONOLOGICAL                            from another month or year, perhaps added
       sequence in a diary                        later in the original.
 8.    WRONG                      93       15     Causes current correct date to be recognized as
       unchronological date                       wrong and wrapped by << >>.
       BEFORE the current
       entry
 9.    UNRECOGNIZED start of       4        2     After a long break (e.g., Nov 1966-Mar 1967) and
       a new YEAR by                              no explicit year in metadate, algorithm marks it
       algorithm                                  as a possible typo << >>.
10. Date detection                29        –     Incorrectly parsed dates with no obvious reason.
       algorithm ERRORS                           Correct dates, wrapped << >> as suspicious with
                                                  no obvious reason.

    The cleaned metadates wrapped in < > further served as a separator to split a .txt file into separate
day entries and save into a .json file. An example of one entry text with additional metadata is in Figure
3. Every entry has the following information: “lfk_number” — the abbreviated number of the diary in
the ILFA archive; “number_of_entry_for_this_author” — allows one to detect correct non-
chronological entries; “metadate” — a metadate as predicted by the date detection algorithm;
“sql_date” — metadate transformed to .sql format for later computational analysis;
“number_of_characters” — showing entry length; “when_added_to_database” — the date when added
to the database.


                                                   341
Figure 3: One entry from the diary of Alvils Kalnietis, LFK Ak26

In total, 14,364 day entries from 36 diaries were added to the database. In further steps, data were
enriched processing entries with a morphological parser of the Latvian NLP pipeline [9] and by adding
demographic metadata to each entry, such as the author’s gender, or author’s age at the moment of
writing. The division of the corpus into daily entries opens good possibilities for creating different time-
related sub-corpora by combining age groups and gender, e.g. diary texts of 21–35 years old women,
diaries of 36–50 years old men in the 1950s, etc.
    Further computational analysis of diaries will require solving several methodological challenges,
such as evaluation of the representativeness of the diary corpus. The diaries differ greatly in length,
writing frequency, and stylistics, and it is yet to be determined what computational methods could offer
to the general discourse of diary research, including diachronic research of single diaries and cross-
comparison of different diaries according to similar properties of author’s age, author’s gender, and
other categories. Date detection carried out on the corpus of Latvian diaries has invited new perspectives
of inquiry for diaries [7], perspectives that have already been applied to periodicals [3], book printing,
and other domains of time-bound written documents.


    6. Acknowledgements
  This paper is supported by the project “Digital Resources for Humanities: Integration and
Development” (No. VPP-IZM-DH-2020/1-0001) funded by Latvian Council of Science.


    7. References
[1] Dennis-Henderson, Ashley, Roughan, Matthew, Mitchell, Lewis, Tuke, Jonathan (2020). Life Still
    Goes on: Analysing Australian WW1 Diaries through Distant Reading. Proceedings of the 4th Joint
    SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences,
    Humanities and Literature, International Committee on Computational Linguistics, pp. 90–104.
[2] Kiran, Kumar Bandeli, Hussain, Muhammed Nihal, Agarwal, Nitin. (2020). A Framework towards
    Computational Narrative Analysis on Blogs. CEUR Workshop Proceedings, vol. 2593, pp. 63–69.
[3] Koncar, Philipp, Fuchs, Alexandra, Hobisch, Elisabeth, Geiger, Bernhard, Scholger, Martina,
    Helic, Denis (2020). Text sentiment in the Age of Enlightenment: an analysis of spectator
    periodicals. Applied Network Science, No. 5(1).
[4] Lejeune, Philippe (2009). On Diary. Honolulu: University of Hawaii Press.
[5] Myers, Victoria, David O'Shaughnessy, and Mark Philp (eds), The Diary of William Godwin,
    (Oxford: Oxford Digital Library, 2010).
[6] Reinsone, Sanita (2020). Searching for Deeper Meanings in Cultural Heritage Crowdsourcing. In
    A History of Participation in Museums and Archives. Traversing Citizen Science and Citizen
    Humanities, Routledge’s research series Museum and Heritage Studies, edited by Per Hetland,
    Palmyre Pierraux, Line Esborg, 2020, pp. 186–207.
[7] Reinsone, Sanita, Matulis, Haralds, Ļaksa-Timinska, Ilze. Metadatos balstīta dienasgrāmatu teksta
    korpusa analīze [Metadata Based Analysis of Diary Corpus]. Letonica, 2022 (to be published).
[8] Thain, Marion (2016). Perspective: Digitizing the Diary – Experiments in Queer Encoding (A
    Retrospective and a Prospective). Journal of Victorian Culture, No. 21 (2), pp. 226–241.
[9] Znotiņš, A. and Cīrule, E. NLP-PIPE: Latvian NLP Tool Pipeline. Human Language Technologies
    - The Baltic Perspective, IOS Press, 2018.


                                                    342

</pre>