=Paper=
{{Paper
|id=Vol-3671/paper4
|storemode=property
|title=Estimating Narrative Durations: Proof of Concept
|pdfUrl=https://ceur-ws.org/Vol-3671/paper4.pdf
|volume=Vol-3671
|authors=Mustafa Ocal,Akul Singh,Mark Finlayson
|dblpUrl=https://dblp.org/rec/conf/ecir/OcalSF24
}}
==Estimating Narrative Durations: Proof of Concept==
Mustafa Ocal, Akul Singh and Mark Finlayson
Knight Foundation School of Computing and Information Sciences Florida International University CASE Building,
Room 362, 11200 S.W. 8th Street, Miami, FL USA 33199
Abstract
The duration of a narrative—that is, how long the events described in the story world take to unfold—is
a feature of general interest to narrative understanding, and has been a topic of interest to theoreticians
of narrative for some time. We combine prior work on timeline extraction (the TLEX algorithm) with
an event duration estimation method that uses temporal pattern mining from large text corpora, to
demonstrate a method for estimating the duration of narratives. We first gathered over 400K event
durations mined from nearly 400M words from two corpora (the iWeb corpus and LexisNexis) using 10
hand-crafted temporal patterns. We then applied our approach to 30 selected stories from two corpora
that already have annotated gold standard events and times (the ProppLearner corpus and the TimeBank
corpus), estimating the duration of each story. We then conducted a preliminary evaluation using
human judgements, showing that the durations extracted by the system are judged reasonable by people
approximately 70% (for folktales) and 28% (for news stories) of the time. The gap between actual and
desired performance reveals several challenges which are of interest, related both to duration estimation
and its evaluation.
Keywords
Temporal Reasoning, Temporal Information Retrieval, TimeML, Duration
1. Introduction
Genette [1] defines a narrative as “the representation of an event or of a sequence of
events”. This captures the notion—common to many definitions of the term—that narratives
involve events and so can be placed in time. Genette himself was well known for examining the
relationship between narrative time and discourse time, i.e., time as it unfolds in the story vs.
the time taken to actually read the narrative [2]. More generally, it has been observed that the
relationship between narrative and temporality is one of the most popular research areas in
narrative theory [3].
Most approaches to narrative time require us to have some sense of the duration of events
in the story world. Unfortunately, even the identification of basic event durations has been
relatively understudied in NLP, and this is even more true of the duration of a complex sequence
of events. Fortunately, several recent results have opened the way for more quantitative studies
of narrative time (and, indeed, event sequences generally). In particular, TLEX, or TimeLine
EXtraction, is a recently developed algorithm for extracting exact timelines from texts annotated
with temporal annotation scheme TimeML [4]. Timelines expose the global order of events in
In: R. Campos, A. Jorge, A. Jatowt, S. Bhatia, M. Litvak (eds.): Proceedings of the Text2Story’24 Workshop, Glasgow
(Scotland), 24-March-2024
Envelope-Open mocal@fiu.edu (M. Ocal); asing118@fiu.edu (A. Singh); markaf@fiu.edu (M. Finlayson)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
a narrative. Beyond this, all that is needed to compute the duration of the narrative are the
durations of individual events and time periods.
We demonstrate a proof of concept of an approach to event duration estimation that uses
temporal pattern mining from large text corpora, which allows us, in combination with TLEX,
to extract durations of specific narratives. The event duration estimation system first extracts
the durations of individual events from large text corpora using a set of 10 manually identified
temporal patterns related to verbal events. This results in a large dataset of event duration
statistics, from which we can compute average event durations for events expressed with
verbs. The corpora used in this work included both news and editorial articles drawn from
LexisNexis [5], as well as a selection of the intelligent Web (iWeb) corpus [6], resulting in
a dataset with over 400,000 specific durations for almost 6,000 types of verbal events. Next,
we used the TLEX algorithm [using provided implementations; 7] to extract the timelines of
15 narratives from a small corpus of folktales [the ProppLearner Corpus; 8], and 15 random
news stories from the TimeBank corpus [9]. All of these narratives have gold-standard TimeML
annotations, which comprise events, times, and temporal relations, but our approach could
just as easily be applied on top of automatically computed TimeML annotations using existing
automatic temporal annotation systems. Finally, our system combines the event durations with
the timeline to estimate the overall duration of the narrative. We then conducted a preliminary
evaluation using human judges, the result of which suggests that the technique has promise. It
also reveals interesting challenges for evaluation as well as useful next steps.
The paper is organized as follows. First, we review TimeML, prior work on event duration
extraction, and approaches for timeline extraction (§2). Next, we present the results of our event
duration estimation and explain the overall narrative duration system in detail, as well as how
we applied it to the TimeML corpora (§3). We next describe our preliminary evaluation using
human raters, which shows reasonable agreement and performance modulo such a small sample
(§4). Finally, we discuss the results and propose future work (§5), and provide a summary of the
contributions (§6).
2. Related Work
2.1. Event Duration
Estimating event durations is a task that has been examined in prior NLP research. Prior
work can be classified into two types. The first is coarse-grained classification, which predicts
whether an event is more or less than some given amount of time, e.g., more or less than a day.
The second type is fine-grained classification, which predicts the specific duration of an event,
possibly grouping durations into bins, e.g., seconds, minutes, hours, days, weeks, months, or years.
Researchers have proposed a number of supervised machine learning-based solutions in both
of these tasks. Pan et al. [10] presented a maximum entropy system that uses tokens, lemmas,
POS tags, and subject-object relations as features, achieving 73.5% accuracy on coarse-grained
classification and 61.9% on fine-grained. Similarly, Gusev et al. [11] also used a maximum
entropy classifier but added named entities, verbs, verb types, and verb dependencies as features,
achieving 74.8% accuracy on coarse-grained classification and 66% on fine-grained. Later,
Vempala et al. [12] used convolutional neural networks with event class, POS tags, named
entities, and dependencies as features, achieving 83.2% accuracy on coarse-grained classification.
The most recent prior work presents rule-based systems instead of machine learning-based
systems. Zhou et al. [13] defined a list of trigger words such as for, since, and on, which was
then used in conjunction with semantic role labeling (SRL) to identify temporal prepositional
phrases related to verbal events, and their system achieves 84.1% accuracy on coarse-grained
classification. Finally, Yang et al. [14] defined temporal patterns involving keywords such as
for, take, spend, last, lasting, duration, and period, and performs temporal pattern extraction,
achieving 76.9% on coarse-grained classification and 76.2% on fine-grained classification.
These approaches have used two existing duration datasets: the McTaco dataset and the
TimeBank Duration dataset. McTaco was built by giving crowdworkers a sentence and asking
them how long each event in the sentence lasted [15], which defined a possible range for
each event. However, it should be noted that the dataset contains only a few hundred event
durations. The TimeBank Duration dataset contains 58 news articles in which each event is
annotated with lower-bound and upper-bound duration [16]. However, the corpus has only a
44% inter-annotator agreement for fine-grained classification, which is quite low; this perhaps
is a result of the fact that determining the duration of an event can be highly subjective. The
reason why prior approaches have high accuracy when the agreement score is low is that they
use relaxed matching to evaluate their accuracy. For example, if the targeted event has [3 hours,
3 months] as lower and upper bound duration, a relaxed matching approach will accept the
system’s prediction as correct if the system’s output is hours, days, weeks, or months.
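The relaxed-matching scheme described above can be sketched as follows; the bin scale and helper name are our own illustration, not the cited systems' code.

```python
# A minimal sketch of relaxed matching over coarse duration bins.
BINS = ["seconds", "minutes", "hours", "days", "weeks", "months", "years"]

def relaxed_match(prediction: str, lower: str, upper: str) -> bool:
    """Accept a predicted bin if it falls anywhere inside the
    annotated [lower, upper] range of bins, inclusive."""
    lo, hi = BINS.index(lower), BINS.index(upper)
    return lo <= BINS.index(prediction) <= hi

# Gold range [3 hours, 3 months]: hours, days, weeks, or months all pass.
assert relaxed_match("days", "hours", "months")
assert not relaxed_match("years", "hours", "months")
```

Because any prediction inside a wide gold range counts as correct, high accuracy is attainable even when annotators themselves agree only 44% of the time.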
In addition to low inter-annotator agreement, the TimeBank Duration dataset contains many
questionable annotations. First, many of the duration ranges seem to depend strongly on the
limited size of the corpus. The following example from the dataset indicates that think takes
from 3 months to 3 years, and that ignorance and fear take from 1 year to 20 years; both ranges are oddly specific.
(1) “I think[3MO,3Y], a sense of, of ignorance[1Y,20Y] about Islam, a fear[1Y,20Y] about who
Muslims are...”
Other problems include that the corpus assigns durations for words that are not events (e.g.,
nouns such as dialect or language). The corpus also contains finite durations for events that
should have an unbounded duration, for example, being dead. Less problematically, the annotators
also assigned durations to events that never happened, though this no doubt introduces an
additional element of subjectivity (e.g., the minister was not attacked).
Since the McTaco dataset only provides the duration for a limited number of events, and the
TimeBank Duration dataset contains numerous questionable annotations, there is a need for a
new dataset that includes reliable and reasonable durations.
2.2. TimeML
We begin with a representation of temporal information in the text. We use TimeML, which is
an annotation scheme for temporal information in text [17]. TimeML provides the ability to
notate events, temporal expressions (i.e., dates, times, durations), and the relations between
them that capture temporal information. TimeML annotations can be used for a variety of
NLP tasks, including information extraction, question answering, summarization, and machine
translation [18, 19, 20].
There are a number of manually and automatically annotated TimeML corpora. We used
two: the ProppLearner corpus and the TimeBank corpus. ProppLearner consists of 15 texts
comprising 18,862 words, with 3,438 events, 142 temporal expressions, and 2,778 temporal
relationships, all of which were manually annotated [8]. On the other hand, the TimeBank
corpus consists of 183 news stories gathered from diverse American news sources [9]. Note
that while using texts with gold-standard event and time annotations allows us to evaluate
our approach independent of noise or inaccuracies in the TimeML layer, our approach could
just as easily be applied on top of automatically computed event and time annotations, using
existing automatic temporal annotation systems, e.g., TARSQI [21], CAEVO [22], ClearTK [23],
or others [24, 25].
2.3. Timeline Extraction
Timelines are almost never explicit in texts and can rarely be extracted directly. There
have been prior ML-based approaches to extracting timelines from TimeML annotations [26,
27, 28, 29]. However, these approaches have limitations, in particular, they do not handle all
possible temporal relations, they ignore subordinated events, and they have less-than-perfect
performance even on perfect TimeML annotations. Unlike prior approaches, Finlayson et al.
[4] presented a Temporal Constraint Satisfaction Problem (TCSP)-based solution called TLEX
(TimeLine EXtraction). TLEX converts TimeML graphs into temporally connected subgraphs,
which are then converted to TCSPs, which can be solved using standard CSP techniques. Such
solutions, by assigning integers to the start and end time points of temporal intervals in the
TimeML graph, reveal a global order of events and times (i.e., timeline). Importantly, TLEX
is deterministic, complete, and provably correct, resulting in 100% performance modulo the
original TimeML graph. Also, later work presented a reference implementation of TLEX in the
form of a Java library jTLEX [7], which we make use of in our implementation.
3. Methodology
Our system implements six steps. First, we preprocess a large dataset of free text by cleaning it
and splitting it into sentences. Second, we use an off-the-shelf temporal parser [SynTime; 30] to
detect temporal expressions (TIMEXes) in each sentence in the corpus. Third, for every sentence
containing at least one TIMEX we identify the verbs and identify their lemmas. Fourth, we apply
a set of temporal patterns to extract the duration expressed for the verb. When applied to the
whole corpus, this data allows us to extract the mode of the duration of each verb lemma. Fifth,
we used jTLEX to extract timelines of the ProppLearner and TimeBank texts and to identify
“real-world” timelines and their events. Finally, we assign durations to the verbal events in the
ProppLearner and TimeBank texts, which are placed in the timeline (the mode as extracted in
step 4), allowing us to estimate the duration of the entire narrative.
3.1. Data Preprocessing
We mined temporal information from a large corpus of web documents, made up of two
different collections. The first collection contains 8,059 news and editorial articles sourced from
LexisNexis [5], which we obtained from the authors. This collection ranges widely across
topics. The second collection was a selection from the iWeb corpus, which is a web-based corpus
and contains nearly 14 billion words from 22 million web pages [6]. Because iWeb is quite
large, and we were only trying to develop a proof of concept, we used only the first 14 million
sentences of the corpus. We chose these corpora not because they were particularly special
or especially suitable for the task but rather because they were already available to us; other
large collections of texts would no doubt have served equally well, and larger collections would
have also potentially improved the overall accuracy of the duration estimations. We removed
irrelevant content such as headlines, usernames, XML tags, phone numbers, and website names.
For each text, we split the sentences using Spacy [31] and removed the duplicate sentences.
This process resulted in a total of over 13 million unique sentences.
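As a rough illustration of this step (the actual pipeline uses spaCy's sentence segmenter; the naive regex splitter below merely keeps the sketch dependency-free):

```python
import re

# Hypothetical stand-in for the preprocessing step: split documents
# into sentences and drop exact duplicates. The real system relies
# on spaCy; this regex splitter is a deliberate simplification.
def split_sentences(text):
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def unique_sentences(documents):
    seen, out = set(), []
    for doc in documents:
        for sent in split_sentences(doc):
            if sent not in seen:
                seen.add(sent)
                out.append(sent)
    return out
```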
3.2. Temporal Expression Recognition
We next extracted sentences that contain possible event durations. We extracted temporal
expressions using SynTime, which is a rule-based temporal expression detector [30]. It identifies
the time tokens from raw text, then looks for modifiers and numerals next to time tokens to
form time segments, and finally merges the time segments into time expressions. SynTime
also identifies specific durations and normalizes them; for example, the temporal expression
six minutes would result in a normalized duration of PT6M, which is a time duration expressed
in ISO-8601. After the temporal expression recognition, we removed sentences without any
temporal expressions, leaving 510,379 sentences.
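SynTime itself is a Java tool; purely as an illustration of the normalization it performs, a toy normalizer for simple duration phrases might look like this (the word list, unit table, and function name are our own assumptions):

```python
import re

# Toy ISO-8601 duration normalizer, e.g. "six minutes" -> "PT6M".
# SynTime handles far more expression types; this shows only the idea.
WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
         "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}
# ISO-8601 uses a "T" marker before time-of-day units (H, M, S).
UNITS = {"second": ("T", "S"), "minute": ("T", "M"), "hour": ("T", "H"),
         "day": ("", "D"), "week": ("", "W"), "month": ("", "M"),
         "year": ("", "Y")}

def normalize_duration(phrase):
    m = re.match(r"(\w+)\s+(second|minute|hour|day|week|month|year)s?$",
                 phrase.strip().lower())
    if not m:
        return None
    num = WORDS.get(m.group(1)) or int(m.group(1))
    time_marker, letter = UNITS[m.group(2)]
    return f"P{time_marker}{num}{letter}"

assert normalize_duration("six minutes") == "PT6M"
```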
3.3. Verb Detection
For each sentence with a temporal duration, we then identified verbs using the POS tagger
found in the Stanford CoreNLP library [32]. We also identified lemmas using CoreNLP.
3.4. Temporal Pattern Mining
Although every remaining sentence contains a time expression, not every time expression is
relevant to the verb duration in the sentence. For instance, in the sentence He called me after 5
pm., the temporal expression “5 pm” is present but does not reveal anything about how long
the call lasted. Indeed, there are many ways in which a TIMEX can be related to a verb. We
defined 10 temporal patterns to express what we found were the most common relationships
expressed between temporal durations and verbs, shown in Table 1. Note that the first five
patterns provide the exact duration of an event, while patterns 6–9 provide an upper bound and
pattern 10 a lower bound. Based on the number of durations extracted from the text, we estimate
these patterns account for approximately 90–95% of the verb-duration relevant statements. A
summary of the temporal patterns found is shown in Table 2.
We estimate the duration of any particular verbal event as follows. First, we extracted the
duration of events from the candidates which match patterns 1–5. For 1, 4, and 5 the event
duration will be the normalized TIMEX value. For 2 and 3, the event duration will be TIMEX2
value minus TIMEX1 value. For example, for “He played soccer from 1 pm to 2.30 pm.” the
duration of playing soccer will be 2.30 - 1 = 1.5 hours (PT1.5H). When the boundaries of an event
1. VERB + for + TIMEX
   He walked for 45 minutes.
2. VERB + between + TIMEX + and + TIMEX
   The power went out between 4 pm and 6 pm.
3. VERB + from? + TIMEX1 + “to” + TIMEX2
   He played soccer from 1 pm to 2.30 pm.
4. VERB + (lasts|lasted) + TIMEX
   The drive lasted 40 minutes.
5. VERB + PP? + (took|takes) + TIMEX
   Flying to NYC from Miami takes 3 hours.
6. VERB + (on|in|at) + TIMEX
   I’m flying on Wednesday.
7. VERB + daily
   He works out daily.
8. VERB + (each|every) + TIMEX
   He ate donuts every morning in Paris.
9. VERB + (per|once|twice|NUM times) + TIMEX
   Muslims pray five times a day.
10. VERB + since + TIMEX
   She’s been doing the puzzle since 2 pm.
Table 1
List of Temporal Patterns
Pattern #:        1        2        3      4      5        6        7        8          9      10     Total
               For  Between  ..to..   Last   Take  on/in/at    daily  each/every per/times  since
LexisNexis   3,126      239     825     28    243    18,523      351       745       426     962    25,468
iWeb        72,943    4,229   8,228    860  5,439   268,247   10,038    17,739    13,923  18,121   419,767
Combined    76,069    4,468   9,053    888  5,682   286,770   10,389    18,484    14,349  19,083   445,235
Table 2
Results of temporal pattern mining for categories 1–10. The Combined row gives the statistics for the
duration dataset that we created.
have different units (e.g., days versus hours), we first convert both boundaries into seconds,
apply subtraction to ascertain the duration, and then convert the result back into the highest
relevant time unit for ease of interpretation and consistency.
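The boundary subtraction for patterns 2 and 3 can be sketched as follows (the helper names are our own; the real system operates over normalized TIMEX values rather than raw clock strings):

```python
# Sketch of boundary subtraction: convert both clock times to a
# common unit (seconds), subtract, and report the span in the
# largest convenient unit.
def clock_to_seconds(t):
    """'2.30 pm' -> seconds since midnight."""
    time_part, meridiem = t.lower().split()
    h, _, m = time_part.partition(".")
    h, m = int(h) % 12, int(m or 0)
    if meridiem == "pm":
        h += 12
    return h * 3600 + m * 60

def span_duration(start, end):
    secs = clock_to_seconds(end) - clock_to_seconds(start)
    hours = secs / 3600
    return f"PT{hours:g}H" if hours >= 1 else f"PT{secs // 60}M"

# "He played soccer from 1 pm to 2.30 pm." -> PT1.5H
assert span_duration("1 pm", "2.30 pm") == "PT1.5H"
```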
If the sentence does not match patterns 1–5, we checked for patterns 6–10 to extract the lower
and upper-bound durations. For categories 6 and 8, the upper-bound duration is the normalized
TIMEX value. For category 7, the upper-bound duration is always PT24H because the TIMEX is
daily. Category 9 provides the upper-bound duration as the normalized TIMEX value divided
by the repeating number. For example, once a day: PT24H / 1 = PT24H, twice a day: PT24H /
2 = PT12H, three times a day: PT24H / 3 = PT8H, and so on. On the other hand, category 10
provides the lower-bound duration, which is obtained by subtracting the normalized TIMEX
value from the document creation time (DCT). For instance, in the sentence “The kid has been
watching the cartoon since 8 AM” (DCT = 9 AM), the lower-bound duration is 9 AM - 8 AM = PT1H.
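The bound rules for patterns 7, 9, and 10 reduce to simple arithmetic over a common unit (seconds here; the helper names are our own):

```python
# Bound rules sketched in seconds.
DAY = 24 * 3600

def upper_bound_repetition(period_seconds, times):
    """Pattern 9: 'N times a <period>' -> at most period / N each."""
    return period_seconds // times

def lower_bound_since(dct_seconds, since_seconds):
    """Pattern 10: duration so far = DCT minus the 'since' time."""
    return dct_seconds - since_seconds

assert upper_bound_repetition(DAY, 3) == 8 * 3600     # three times a day -> PT8H
assert lower_bound_since(9 * 3600, 8 * 3600) == 3600  # since 8 AM, DCT 9 AM -> PT1H
```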
Once all the sentences have been processed, and each extracted duration is associated with
its verb lemma, we extract the most frequent duration (the mode), which is then passed on to
the next stage as the duration of the event. To illustrate this, let’s consider the event walking.
Our duration dataset contains 54 event durations from 37 sentences that fall under categories
1 to 5. Figure 1 shows a bar graph representation of these 54 event durations. Based on our
duration reasoning system, we can conclude that walking takes between 3 minutes and 3 days,
with the most likely duration being 30 minutes, which is the most frequent value. Importantly,
we used the mode rather than the average because many verbs had very long upper tails (e.g.,
the verb go had a maximum duration of 40 years), which skewed the judgements.
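The aggregation step is then a simple mode over each lemma's duration multiset (the counts below are illustrative, not the actual figures from our dataset):

```python
from collections import Counter

# Mode-based aggregation: pick the most frequent duration per lemma,
# since long upper tails (e.g. 'go' up to 40 years) would skew a mean.
def duration_mode(durations):
    return Counter(durations).most_common(1)[0][0]

# Illustrative (not actual) counts for 'walking':
walking = ["PT30M"] * 8 + ["PT45M"] * 5 + ["P3D", "PT3M"]
assert duration_mode(walking) == "PT30M"
```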
Figure 1: The bar graph of exact duration candidates for the event walking. The X-axis is the exact
duration (M, minutes; H, hours; D, days), and the Y-axis is the number of instances of the duration.
3.5. Applying TLEX
We used jTLEX to generate timelines for the ProppLearner texts¹ and 15 randomly selected
TimeBank texts. The TLEX method differentiates events that happen on the “real-world” timeline and
events that occur on possible, conditional, or modal world timelines (also called subordinated
timelines). For instance, in the sentence “I went to the supermarket but forgot to buy milk.” the
going to the supermarket is represented as occurring in the timeline of the narrative, but the
buying of the milk was forgotten, and so appears on a subordinated and hypothetical (negated)
timeline. To compute the duration of the narrative, we use only the main, real-world timeline.
From the timeline, we can then extract a sequence of events that “covers” the full timeline (i.e.,
time spans are not duplicated in different events). Combining event durations with the main
timeline enables us to estimate the duration of a narrative. For each event (indicated by a verb
lemma) in a text, we used the mode of duration as described above.
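jTLEX itself is a Java library; assuming the covering event sequence and per-lemma duration modes are already in hand, the final combination step amounts to a sum (all names below are our own illustration):

```python
# Final step sketch: sum the modal durations (in seconds) of the
# events that cover the main, real-world timeline.
def narrative_duration(event_lemmas, mode_seconds):
    return sum(mode_seconds[lemma] for lemma in event_lemmas)

modes = {"walk": 30 * 60, "eat": 20 * 60, "sleep": 8 * 3600}
assert narrative_duration(["walk", "eat", "sleep"], modes) == 31800
```

Because the covering sequence avoids duplicated time spans, summing event durations does not double-count overlapping events.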
4. Evaluation
We recruited judges, all undergraduate and graduate NLP researchers, to rate the
reasonableness of the system’s output.² We assigned nine judges sections of stories paired with
the overall duration of the section as predicted by the system and asked them to judge whether
or not the system’s estimation was reasonable. We defined “reasonable” as being within roughly 10%
of what the judge would consider the “true” duration. When a judge marked an estimate as
unreasonable, we asked them to explain why. We assigned sections of stories rather than whole
stories because many of the folktales contained highly ambiguous starting or ending sections,
such as “Once upon a time a man and a woman lived together in the woods.” or “They lived
happily ever after.” Judges varied dramatically in their judgements of these types of events, and
so we truncated stories to start and end with the first and last “short” event (as judged by us).
For 3 out of 15 sections from the ProppLearner stories, all nine judges marked the system’s
output reasonable. For 11 texts, the judges partially agreed, and for one text, all nine judges
¹ As described elsewhere [33], we used versions of ProppLearner in which inconsistent annotations were corrected.
² While we specifically chose judges with a background in narrative research, it’s crucial to acknowledge that there’s no method to select judges in a manner that eliminates subjectivity in evaluating event durations, given their inherently subjective nature.
judged the estimation unreasonable. The overall agreement score for our system’s narrative
duration estimation on ProppLearner stories was 0.696. Notably, the inter-annotator agreement
for event duration in the fine-grained task within the TimeBank Duration corpus was 0.44
[10]. While a direct comparison between our narrative-level duration agreement and their
event-level duration agreement is not feasible, it is evident that our system’s performance was
commendable.
On the other hand, for the TimeBank corpus, none of the stories received unanimous agree-
ment from all nine judges in favor of the system’s estimation. Surprisingly, seven out of the 15
stories had two or fewer judges marking the estimation as unreasonable. Consequently, the
overall agreement score for our system’s estimations on TimeBank stories was notably lower at
0.281. The main reason is that 7 out of 15 stories exhibited duration inconsistency, which is a
novel concept that we describe in more detail in Section 5.
5. Discussion & Future Work
We have introduced here a novel, yet simple, methodology for estimating the temporal duration
of narratives. Consequently, we opted for an experimental evaluation approach, yielding
agreement scores of 0.696 and 0.281. We investigated the lower agreement scores and identified
five primary challenges and avenues for future research.
Notably, we initially attempted to use Large Language Models (LLMs), such as ChatGPT [34]
or BARD [35], to evaluate our approach. We prompted the LLMs with identical story segments
and solicited estimations for the duration of events. We observed that the LLMs struggled to
predict temporal durations and often refused to produce an answer, many times requesting
additional context. Human evaluation was therefore the more reliable and explainable method.
First, in our error analysis, we uncovered instances of what we term duration inconsistency
in the temporal annotation. In such cases, the temporal annotation suggests that event 𝐴
includes event 𝐵, yet the duration of event 𝐴 is shorter than that of event 𝐵. Notably, this
phenomenon was observed in 7 out of the 15 TimeBank stories. Interestingly, certain events
lasting mere seconds sometimes included multiple events spanning days, weeks, months, or
even years. For instance, in one story (“The Pentagon said today it will re-examine the question
of whether the remains inside the Tomb of the Unknown from the Vietnam War...”), the temporal
annotation indicates that the event said includes the event re-examine. However, our dataset reflects
that the action of saying can be accomplished within seconds, while the action of re-examining
may extend over weeks or months.
One of the most persistent difficulties we encountered is the highly ambiguous nature of
duration judgements. In prior work by others, even where annotators were asked to judge the
specific durations of events in context, the agreement was quite low, and it dropped further
as additional annotators were added. In our own evaluation, we noted similar problems. In
an earlier version of our evaluation, we asked our judges to give us the duration of whole
stories rather than relatively constrained sections. This resulted in hugely variable duration
judgements, from as little as 1 day to as much as (the oddly specific) 751 years 3 weeks 3 hours
and 15 minutes. The sources of these ambiguities are numerous and include different senses
of the same word, differing patients and agents of the events, cultural or other commonsense
context, or the level of detail provided.
Another place with room for improvement is how to pick the specific duration to use
from the duration distribution. Here, we used the mode of the distribution, but as can be seen
in Figure 1, the duration distributions are not necessarily normal or described by a common
distribution function, and so more sophisticated means of selecting the appropriate duration in a
specific context would be useful. In addition, it’s important to note that the duration of activities,
such as walking, can be highly subjective. For instance, if one were to consider the duration
of walking as indicated by the mode of 30 minutes in folktales, it becomes evident that such
narratives often depict characters embarking on journeys spanning days or even weeks. These
divergent values underscore the domain-dependency of duration metrics, indicating that data
from one domain may not necessarily generalize well to another. In the case of walking, it might
be more prudent to consider multiple values associated with this activity, each representing a
different granularity (e.g., minutes, hours, days), and then select the appropriate duration based
on the contextual requirements.
Event duration is heavily influenced by the event arguments (i.e., subject and object). While
walking takes minutes, walking with God can take years. Another example is that sometimes a
narrative can contain information about an action being repeated multiple times, or a single
action being applied to multiple items, and so implying a longer duration. For instance, in the
following example, the seizing of 12 cows potentially takes longer than the seizing of a single
cow (perhaps not 12 times as long though, depending on how the dragon does it). This was
one of the reasons why two annotators disagreed with the system’s duration estimation in this
instance.
(2) The dragon flew into a rage and instead of six, seized twelve cows …
Finally, a clear deficiency of the current work is that we focus only on durations of events
expressed with verbs. Future work should expand the system to include non-verb events as
well, which presumably will involve additional patterns for duration mining.
6. Contributions
Our contributions in this paper are three-fold. First, we presented a proof of concept method
for narrative duration extraction. Second, we presented a large event duration dataset that
consists of 445,235 durations for over 8,000 individual verbal event lemmas. Third, we note a
number of interesting problems, especially with regard to challenges of evaluation, and note a
number of directions for future improvements of the approach.
References
[1] G. Genette, A. Sheridan, M.-R. Logan, Figures of literary discourse, Columbia University
Press New York, 1982.
[2] G. Genette, Narrative discourse: An essay in method, volume 3, Cornell University Press,
1983.
[3] M. Fludernik, Time in narrative, in: D. Herman, M. Jahn, M.-L. Ryan (Eds.), Routledge
Encyclopedia of Narrative Theory, Routledge, New York, NY, 2005, pp. 608–612.
[4] M. A. Finlayson, A. Cremisini, M. Ocal, Extracting and aligning timelines, Computational
Analysis of Storylines: Making Sense of Events (2021) 87.
[5] W. V. H. Yarlott, A. Ochoa, A. Acharya, L. Bobrow, D. C. Estrada, D. Gomez, J. Zheng,
D. McDonald, C. Miller, M. A. Finlayson, Finding trolls under bridges: Preliminary work
on a motif detector, arXiv preprint arXiv:2204.06085 (2022).
[6] M. Davies, J.-B. Kim, The advantages and challenges of “big data”: Insights from the 14
billion word iWeb corpus, Linguistic Research 36 (2019) 1–34.
[7] M. Ocal, A. Singh, J. Hummer, A. Radas, M. A. Finlayson, jTLEX: A Java Library for
TimeLine EXtraction, in: Proceedings of the 17th Conference of the European Chapter of
the Association for Computational Linguistics, Dubrovnik, Croatia, 2023, p. 27–34.
[8] M. A. Finlayson, Propplearner: Deeply annotating a corpus of russian folktales to enable
the machine learning of a russian formalist theory, Digital Scholarship in the Humanities
32 (2017) 284–300. URL: http://dx.doi.org/10.1093/llc/fqv067. doi:10.1093/llc/fqv067.
[9] J. Pustejovsky, P. Hanks, R. Saurí, A. See, R. Gaizauskas, A. Setzer, D. Radev, B. Sundheim,
D. Day, L. Ferro, M. Lazo, The timebank corpus, Proceedings of Corpus Linguistics (2003).
[10] F. Pan, R. Mulkar-Mehta, J. R. Hobbs, Learning event durations from event descriptions,
in: Proceedings of the 21st International Conference on Computational Linguistics and
44th Annual Meeting of the Association for Computational Linguistics, 2006, pp. 393–400.
[11] A. Gusev, N. Chambers, D. R. Khilnani, P. Khaitan, S. Bethard, D. Jurafsky, Using query
patterns to learn the duration of events, in: Proceedings of the Ninth International
Conference on Computational Semantics (IWCS 2011), 2011, pp. 145–154.
[12] A. Vempala, E. Blanco, A. Palmer, Determining event durations: Models and error analysis,
in: Proceedings of the 2018 Conference of the North American Chapter of the Association
for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers),
2018, pp. 164–168.
[13] B. Zhou, Q. Ning, D. Khashabi, D. Roth, Temporal common sense acquisition with minimal
supervision, arXiv preprint arXiv:2005.04304 (2020).
[14] Z. Yang, X. Du, A. Rush, C. Cardie, Improving event duration prediction via time-aware
pre-training, arXiv preprint arXiv:2011.02610 (2020).
[15] B. Zhou, D. Khashabi, Q. Ning, D. Roth, "Going on a vacation" takes longer than "going for a
walk": A study of temporal commonsense understanding, arXiv preprint arXiv:1909.03065
(2019).
[16] F. Pan, R. Mulkar-Mehta, J. R. Hobbs, Annotating and learning event durations in text,
Computational Linguistics 37 (2011) 727–752.
[17] J. Pustejovsky, J. Castaño, R. Ingria, R. Saurí, R. Gaizauskas, A. Setzer, G. Katz, TimeML:
robust specification of event and temporal expressions in text, in: Fifth International
Workshop on Computational Semantics (IWCS-5), 2003, pp. 1–11. URL: http://citeseerx.ist.
psu.edu/viewdoc/download?doi=10.1.1.161.8972&rep=rep1&type=pdf.
[18] B. Boguraev, R. K. Ando, TimeML-compliant text analysis for temporal reasoning, in:
IJCAI, volume 5, 2005, pp. 997–1003.
[19] E. Saquete, P. Martínez-Barco, R. Muñoz, J. L. Vicedo, Splitting complex temporal ques-
tions for question answering systems, in: Proceedings of the 42nd Annual Meeting
on Association for Computational Linguistics, ACL '04, Association for Computational
Linguistics, Stroudsburg, PA, USA, 2004. URL: https://doi.org/10.3115/1218955.1219027.
doi:10.3115/1218955.1219027.
[20] S. Liu, M. X. Zhou, S. Pan, Y. Song, W. Qian, W. Cai, X. Lian, TIARA: Interactive, topic-based vi-
sual text summarization and analysis, ACM Trans. Intell. Syst. Technol. 3 (2012) 25:1–25:28.
URL: http://doi.acm.org/10.1145/2089094.2089101. doi:10.1145/2089094.2089101.
[21] M. Verhagen, I. Mani, R. Saurí, J. Littman, R. Knippen, S. B. Jang, A. Rumshisky, J. Phillips,
J. Pustejovsky, Automating Temporal Annotation with TARSQI, in: ACL, 2005, pp. 81–84.
URL: http://aclweb.org/anthology/P/P05/P05-3021.pdf.
[22] N. Chambers, T. Cassidy, B. McDowell, S. Bethard, Dense event ordering with a multi-
pass architecture, Transactions of the Association for Computational Linguistics 2 (2014)
273–284. URL: https://doi.org/10.1162/tacl_a_00182. doi:10.1162/tacl_a_00182.
[23] S. Bethard, ClearTK-TimeML: A minimalist approach to TempEval 2013, in: Second joint
conference on lexical and computational semantics (*SEM), volume 2: proceedings of the
seventh international workshop on semantic evaluation (SemEval 2013), 2013, pp. 10–14.
[24] P. Mirza, S. Tonelli, CATENA: Causal and temporal relation extraction from natural language
texts, in: Proceedings of the 26th International Conference on Computational Linguistics, ACL, 2016,
pp. 64–75.
[25] M. Ocal, A. Perez, A. Radas, M. Finlayson, Holistic evaluation of automatic TimeML annota-
tors, in: Proceedings of the Thirteenth Language Resources and Evaluation Conference,
2022, pp. 1444–1453.
[26] I. Mani, M. Verhagen, B. Wellner, C. M. Lee, J. Pustejovsky, Machine learning of temporal
relations, in: Proceedings of the 21st International Conference on Computational Lin-
guistics and the 44th Annual Meeting of the Association for Computational Linguistics
(ICCL-ACL’06), 2006, pp. 753–760. Sydney, Australia.
[27] Q. X. Do, W. Lu, D. Roth, Joint inference for event timeline construction, in: Proceedings
of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and
Computational Natural Language Learning (EMNLP-CoNLL’12), 2012, pp. 677–687.
[28] O. Kolomiyets, S. Bethard, M.-F. Moens, Extracting narrative timelines as temporal de-
pendency structures, in: Proceedings of the 50th Annual Meeting of the Association for
Computational Linguistics (ACL’12), 2012, pp. 88–97.
[29] A. Leeuwenberg, M. F. Moens, Towards extracting absolute event timelines from english
clinical reports, IEEE/ACM Transactions on Audio, Speech, and Language Processing 28
(2020) 2710–2719. doi:10.1109/TASLP.2020.3027201.
[30] X. Zhong, A. Sun, E. Cambria, Time expression analysis and recognition using syntactic
token types and general heuristic rules, in: Proceedings of the 55th Annual Meeting of the
Association for Computational Linguistics (Volume 1: Long Papers), 2017, pp. 420–429.
[31] Y. Vasiliev, Natural Language Processing with Python and SpaCy: A Practical Introduction,
No Starch Press, 2020.
[32] C. D. Manning, M. Surdeanu, J. Bauer, J. R. Finkel, S. Bethard, D. McClosky, The Stanford
CoreNLP natural language processing toolkit, in: Proceedings of 52nd annual meeting of
the association for computational linguistics: system demonstrations, 2014, pp. 55–60.
[33] M. Ocal, M. Finlayson, Evaluating information loss in temporal dependency trees, in: Pro-
ceedings of The 12th Language Resources and Evaluation Conference, 2020, pp. 2148–2156.
[34] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan,
P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in
neural information processing systems 33 (2020) 1877–1901.
[35] S. Narang, A. Chowdhery, Pathways Language Model (PaLM): Scaling to 540 billion
parameters for breakthrough performance, Google AI Blog (2022).