=Paper= {{Paper |id=None |storemode=property |title=Recognition and Normalization of Temporal Expressions in Serbian Texts |pdfUrl=https://ceur-ws.org/Vol-920/p97-jacimovic.pdf |volume=Vol-920 |dblpUrl=https://dblp.org/rec/conf/bci/Jacimovic12 }} ==Recognition and Normalization of Temporal Expressions in Serbian Texts== https://ceur-ws.org/Vol-920/p97-jacimovic.pdf

Recognition and Normalization of Temporal Expressions
in Serbian Texts
Jelena Jaćimović
University of Belgrade
Faculty of Dental Medicine
Dr. Subotića 8, 11000 Belgrade, Serbia
+381646144435
jjacimovic@rcub.bg.ac.rs

BCI’12, September 16–20, 2012, Novi Sad, Serbia.
Copyright © 2012 by the paper’s authors. Copying permitted only for private and academic purposes. This volume is published and copyrighted by its editors.
Local Proceedings also appeared in ISBN 978-86-7031-200-5, Faculty of Sciences, University of Novi Sad.

ABSTRACT
This paper presents a system for recognition and normalization of temporal expressions (TEs) in Serbian texts according to the TimeML specification language. Based on a finite-state transducers methodology, local grammars are designed to recognize calendar dates, times of day, periods of time and durations, to determine the extension of detected expressions, as well as to normalize their values, interpreted in ISO format. The results of a preliminary evaluation demonstrate the usefulness of this method in both the recognition and the normalization phases.

Categories and Subject Descriptors
I.2.7 [Artificial Intelligence]: Natural Language Processing – Text analysis; F.1.1 [Computation by Abstract Devices]: Models of Computation – Automata

General Terms
Human factors, Performance

Keywords
Temporal expression recognition, temporal expression normalization, ISO-8601, finite-state transducers

1. INTRODUCTION
Processing of temporal expressions (TEs) has received increasing attention in the Natural Language Processing research community over the past fifteen years. The Message Understanding Conferences (MUCs) in 1996 and 1998 played a significant role, but their evaluations covered only recognition of TEs, while a novel contribution towards the normalization of TEs was made in 2000 [1]. The first exercise evaluating system performance on both recognition and normalization of TEs was the Time Expression Recognition and Normalization (TERN) 2004 competition. With the rapid increase of electronic information and the very frequent occurrence of TEs, precise temporal representation of a text has become extremely important for many applications, such as machine translation, information extraction, question answering, etc.
Due to various existing interpretations of time within free text, recognition and normalization of TEs in narrative text represent a particular challenge that is far from being a simple task for automatic text processing systems. For example, in Serbian the same temporal information can be written in different forms: 13:45 časova ‘13:45 hours’, 1:45 popodne ‘1:45 in the afternoon’, 15 do dva popodne ‘quarter to two in the afternoon’, and many others. Furthermore, lexical variants, such as sat and čas for the temporal unit ‘hour’, are also widespread. Along with some other Slavic languages, Serbian is a highly inflected, free word order language with a complex number system in which, beside singular and plural, paucal also exists. Since numerals agree in gender and number with the nouns they modify, the temporal expressions jedan sat ‘one hour’, dva sata ‘two hours’ and pet sati ‘five hours’ use three different inflected forms of the noun sat ‘hour’ – nominative singular, paucal and genitive plural, respectively.
A number of resources have been developed for the processing of temporal information in English texts, as well as in other languages, such as French, Italian, Spanish, German, Chinese, etc. Previous efforts regarding recognition of TEs in Serbian have achieved quite promising results [2]. This article details the ongoing development of a system for recognition and normalization of TEs in Serbian texts.

2. TASK DEFINITION
Any natural language possesses several mechanisms for expressing temporal information, which can be grouped into three large categories: TEs, events and the temporal relations that hold among times and events [3]. As natural language phrases, TEs give information about when something happened, how long something lasted, or how often something occurred. The present task is limited to recognition of TEs denoting calendar dates (1), times of day (2), periods of time (3) and durations (4) in newspaper texts and normalization of their values, interpreted in ISO format:
(1) 8. decembra dve hiljade jedanaeste ‘8th December two thousand eleventh’ → “2011-12-08”
7. VIII 2008. godine → “2008-08-07”
leta 1995 ‘summer of 1995’ → “1995-SU”
(2) pet minuta do ponoći ‘five minutes to midnight’ → “T23:55”
12:55 časova ‘12:55 hours’ → “T12:55”
17. februara, u 7 časova i 55 minuta uveče ‘17th February, at 7 hours and 55 minutes in the evening’ → “XXXX-02-17T19:55”
subota u dva sata popodne ‘Saturday at two hours in the afternoon’ → “XXXX-WXX-6T14:00”
(3) od podneva do pet časova popodne ‘from noon to five hours in the afternoon’ → “T12/17”
između devet i 12 meseci ‘between nine and 12 months’ → “P9/12M”
(4) tri nedelje ‘three weeks’ → “P3W”
This work also covers modified or quantified TEs (5), such as:
(5) početkom godine 2009. ‘early in the year 2009’ → START 2009
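
The date and duration mappings shown in examples (1), (2) and (4) can be illustrated with a minimal sketch. It is not the system described in this paper, which relies on finite-state transducers and local grammars; it swaps in Python regular expressions instead. All names (MONTHS, ROMAN, NUMERALS, UNIT_STEMS, normalize_date, normalize_duration) are hypothetical, and the sketch assumes numeric days and years, genitive or Roman-numeral months, and a small hand-written numeral list.

```python
import re

# Genitive month names as they follow a day number (e.g. "8. decembra").
MONTHS = {
    "januara": 1, "februara": 2, "marta": 3, "aprila": 4, "maja": 5,
    "juna": 6, "jula": 7, "avgusta": 8, "septembra": 9, "oktobra": 10,
    "novembra": 11, "decembra": 12,
}
# Roman-numeral months (e.g. "7. VIII 2008.").
ROMAN = {"I": 1, "II": 2, "III": 3, "IV": 4, "V": 5, "VI": 6, "VII": 7,
         "VIII": 8, "IX": 9, "X": 10, "XI": 11, "XII": 12}
# Small illustrative numeral list; multi-word numerals are not covered.
NUMERALS = {"jedan": 1, "jedna": 1, "dva": 2, "dve": 2, "tri": 3,
            "četiri": 4, "pet": 5, "šest": 6, "sedam": 7, "osam": 8,
            "devet": 9, "deset": 10}
# Stems absorb the inflected forms (dan/dana, mesec/meseca/meseci, ...).
UNIT_STEMS = (("godin", "Y"), ("mesec", "M"), ("nedelj", "W"), ("dan", "D"))

DATE_RE = re.compile(
    r"\b(?P<day>\d{1,2})\.\s*(?P<month>[^\W\d_]+)(?:\s+(?P<year>\d{4})\b)?"
)


def normalize_date(text):
    """Return YYYY-MM-DD for the first day-month(-year) pattern, else None."""
    for m in DATE_RE.finditer(text):
        token = m.group("month")
        month = MONTHS.get(token.lower()) or ROMAN.get(token.upper())
        if month is None:
            continue  # the word after the day number is not a month
        year = m.group("year") or "XXXX"  # unknown year -> X placeholders
        return f"{year}-{month:02d}-{int(m.group('day')):02d}"
    return None


def normalize_duration(text):
    """Return PnU for the first numeral followed by a known unit, else None."""
    words = text.lower().split()
    for number_word, unit_word in zip(words, words[1:]):
        if number_word.isdecimal():
            n = int(number_word)
        else:
            n = NUMERALS.get(number_word)
        if n is None:
            continue
        for stem, designator in UNIT_STEMS:
            if unit_word.startswith(stem):
                return f"P{n}{designator}"
    return None


print(normalize_date("7. VIII 2008. godine"))
print(normalize_date("17. februara, u 7 časova i 55 minuta uveče"))
print(normalize_duration("tri nedelje"))
```

Hours and minutes are left out on purpose: they would need the time designator T and the sat/čas lexical variants, and stem matching of this kind can misfire on unrelated words that share a stem.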
and . Those recognized units become variables $a$, $b$ and $c$, respectively, and are used in the output as values of the tag attribute value.
The recognition and normalization of multi-word numerals that often appear in duration expressions was done by dictionary FSTs [5], in order to correctly tag numerals composed using both digit and alphabetic representation (e.g. 5 milijardi i 70 miliona ‘5 billion and 70 million’). Their lemmas could be retrieved from those applied dictionaries and used in output as values of the tag attribute value.
Therefore, rules applied in this system were grouped into possible combination of an expression rule, a normalization function and

3.3 Description of the Annotation Scheme
Each detected TE was marked up with the tag, which may contain the following attributes: type, value and mod. At this moment, other optional attributes described in the TimeML annotation guidelines [7] remain beyond the scope of the current version of the system. The convention of indicating tag names and attribute values in all upper case (e.g. DATE, APPROX) and attribute names in lower or mixed case (e.g. type, mod) was respected.
As the non-optional attribute, type may have the following values: DATE (calendar time), TIME (time of the day or a combination of calendar date and time of the day) or DURATION (explicit durations).
The attribute value contains the normalized form of the detected date, time or duration, derived from the ISO 8601 standard format for representing time values [8]. Points of time were expressed as string patterns YYYY-MM-DDThh:mm:ss (year-month-dayThour:minute:seconds) and may be truncated from the right (e.g. March 2002 was interpreted as 2002-03). For the unknown or vague parts of the value a placeholder character X was used (see the third example in (2)). For the representation of the normalized values, the week-based format was also used – YYYY-Www-D (year-Wweek_number-day_of_the_calendar_week; see the last line in (2)). In order to separate components in the representation of time intervals, a solidus [/] was used (see example (3)). Durations were expressed as string patterns Pn, where P is used as a duration designator and n indicates one or more digits (see example (4)).
Combinations of calendar dates and times of day were also represented with values in the ISO format. In case the text includes some reference to the specific date (7), the value attribute must also contain the date, like the following:
(7) u 9:30, 3. januara 2007. ‘at 9:30, 3rd January 2007’ → type="TIME" value="2007-01-03T09:30".
The optional mod attribute was used together with other attributes in order to capture temporal modifiers that change or clarify the
at this moment are illustrated in (8).
(8) početkom 2007. ‘early 2007’ → type="DATE" value="2007" mod="START"
polovinom februara ‘mid-February’ → type="DATE" value="XXXX-02" mod="MID"
manje od dva meseca ‘less than two months’ → type="DATE" value="2009-11" mod="END"
petak oko 9 časova ‘Friday about 9 o’clock’ → type="TIME" value="XXXX-WXX-5T09" mod="APPROX"
više od dve godine ‘more than two years’ → type="DURATION" value="P2Y" mod="MORE_THAN"
skoro deset dana ‘nearly ten days’ → type="DURATION" value="P10D" mod="LESS_THAN".
The XML output of local grammars designed for recognition and normalization of TEs is presented in the following tagged text:
Nakon 23 dana bezuspešnog patroliranja, U-24 se 16. decembra vraća u Konstancu, gde će ostati do 18. januara 1943. godine, kada polazi u novo patroliranje. ‘After 23 days of fruitless patrolling, on December 16th U-24 returns to Constance, where it will stay until January 18th 1943, when a new patrol starts.’

4. EVALUATION RESULTS
The previously described system for normalization of TEs has been evaluated on a set of 6 articles from Serbian Wikipedia: German submarine U-24, German submarine U-28, German submarine U-13, German submarine U-29, German submarine U-19 and German submarine U-558. These texts were not used in the development of the FSTs and represent completely unseen material containing many occurrences of TEs (Table 1).

Table 1: Articles Used for Evaluation
Text    Words   Date   Time   Duration
U-24    1,115     60     12          1
U-28    1,574     51     19          1
U-13    1,010     36     10          1
U-29    1,846     51     19          3
U-19    1,303     58     16          4
U-558   2,639     30     34         13
Total   9,487    286    110         23

The performance of the FSTs has been evaluated with respect to the recognition, bracketing and normalization tasks. For that reason, a new attribute provera ‘check’ has been added to each XML tag. The possible values of this attribute were the following:
OK/OK: TE was correctly recognized, full extent was correctly determined, normalized value was correctly assigned.
OK/NOK: TE was correctly recognized, full extent was correctly determined, but normalized value was not correct.
UOK: TE was partially recognized correctly, full extent was not correctly determined (e.g. longer patterns denoting temporal ranges that were not included in the FSTs - 7. april 1944. - jul 1944. ‘7th April 1944. – July 1944.’).
UOK/E: TE was partially recognized correctly, full extent was not correctly determined, because of the incorrect input.
NOK: TE was partially recognized correctly, full extent was not correctly determined for some other reasons.
MISS/E: TE was not recognized, because of the incorrect input.
MISS: TE was not recognized for some other reasons.
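
The paper does not state how the check values are turned into precision and recall, so the following is only a sketch under an assumption: OK/OK counts as correct, every other tagged TE (OK/NOK, UOK, UOK/E, NOK) counts against precision, and only the missed TEs (MISS/E, MISS) count against recall. With the counts reported in Table 2, this assumption reproduces the figures of Table 3. The names TABLE_2, scores and total_row are hypothetical.

```python
# Counts per check value, copied from Table 2.
TABLE_2 = {
    "DATE": {"OK/OK": 260, "OK/NOK": 0, "UOK": 11, "UOK/E": 7,
             "NOK": 0, "MISS/E": 1, "MISS": 7},
    "TIME": {"OK/OK": 100, "OK/NOK": 0, "UOK": 5, "UOK/E": 1,
             "NOK": 1, "MISS/E": 1, "MISS": 2},
    "DURATION": {"OK/OK": 16, "OK/NOK": 0, "UOK": 3, "UOK/E": 0,
                 "NOK": 0, "MISS/E": 0, "MISS": 4},
}

# Assumption: tagged but not fully correct items lower precision only,
# untagged items lower recall only.
TAGGED_IMPERFECT = ("OK/NOK", "UOK", "UOK/E", "NOK")
NOT_TAGGED = ("MISS/E", "MISS")


def scores(counts):
    """Return (precision, recall, F-measure), each rounded to 3 decimals."""
    correct = counts["OK/OK"]
    tagged = correct + sum(counts[k] for k in TAGGED_IMPERFECT)
    expected = correct + sum(counts[k] for k in NOT_TAGGED)
    precision = correct / tagged
    recall = correct / expected
    f_measure = 2 * precision * recall / (precision + recall)
    return tuple(round(x, 3) for x in (precision, recall, f_measure))


def total_row(table):
    """Sum the per-type counts into one row covering all TEs."""
    keys = next(iter(table.values())).keys()
    return {k: sum(row[k] for row in table.values()) for k in keys}


for te_type, counts in TABLE_2.items():
    print(te_type, scores(counts))
print("Total", scores(total_row(TABLE_2)))
```

Because OK/NOK is zero for every type, the data cannot show whether wrongly normalized TEs should also enter the recall denominator; the sketch leaves them out.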

The overall evaluation of the system is presented in Table 2 and Table 3.

Table 2: Evaluation Data
Check    DATE   TIME   DURATION   Total
OK/OK     260    100         16     376
OK/NOK      0      0          0       0
UOK        11      5          3      19
UOK/E       7      1          0       8
NOK         0      1          0       1
MISS/E      1      1          0       2
MISS        7      2          4      13
Total     286    110         23     419

An error analysis shows that the main source of errors and missed TEs (rows UOK and MISS in Table 2) was the occurrence of combined temporal expressions that were not included among the FST rules. There were no false recognitions (row NOK), except for one case regarding time expressions. All correctly recognized expressions also received correct normalized values (row OK/OK, with no OK/NOK cases), which indicates that this method could be useful for both the recognition and the normalization phases. Overall, the FSTs achieved an F-measure of 0.946, with precision of 0.931 and recall of 0.962 (Table 3). Although duration expressions achieved the lowest F-measure, priority is given to precision over recall.

Table 3: Performance Measures for Recognition of TEs
TEs        Precision   Recall   F-measure
DATE           0.935    0.970       0.952
TIME           0.935    0.971       0.952
DURATION       0.842    0.800       0.821
Total          0.931    0.962       0.946

5. CONCLUSIONS AND FUTURE WORK
This paper presented a system for recognition and normalization of TEs in Serbian natural language texts, based on a finite-state transducers methodology. This approach is effective and competitive with respect to other techniques and makes it possible to include further knowledge easily [9-11]. In the ACE TERN 2004 evaluation, the rule-based temporal tagger Chronos [10] achieved the highest F-measure, 0.926, with precision and recall rates of 0.976 and 0.880, respectively. The system presented in this work can be compared with Chronos, since it is also based on a single module that performs both the recognition and normalization tasks.
The evaluation of the presented system was conducted on a small set of articles, but the results are quite good and correspond to the results obtained in the previous evaluation of the TE recognition task [2]. Nevertheless, improvements are needed in order to increase precision, which may affect further processes of temporal analysis. Future research in temporal processing is also needed to complete the tagger, in particular for recognition of relative expressions and sets of time, as well as events and temporal relations that hold between temporal entities. In order to improve the performance of this system, it would be very useful to apply transducers on the text in a precise order, as a cascade. This simple and effective way of organizing FSTs may greatly increase the precision and speed of the system, as well as its ability to manage priority between patterns.

6. REFERENCES
[1] Mani, I. and Wilson, G. Robust Temporal Processing of News. In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics (Hong Kong, 2000), 69-76.
[2] Krstev, C., Vitas, D., Obradović, I. and Utvić, M. E-Dictionaries and Finite-State Automata for the Recognition of Named Entities. In Proceedings of the 9th International Workshop on Finite State Methods and Natural Language Processing (Blois, France, July 2011). Association for Computational Linguistics, 48-56.
[3] Marsic, G. Temporal processing of news: annotation of temporal expressions, verbal events and temporal relations. Doctoral Thesis. University of Wolverhampton, 2011.
[4] Paumier, S. Unitex 2.1 User manual. 2011.
[5] Krstev, C. Processing of Serbian - Automata, Texts and Electronic Dictionaries. Faculty of Philology, University of Belgrade, Belgrade, 2008.
[6] ISO/DIS 24617-1 Language Resources Management - Semantic Annotation Framework (SemAF) - Part 1: Time and Events (SemAF-Time, ISO-TimeML). International Organization for Standardization, Geneva, Switzerland, 2009.
[7] Pustejovsky, J., Bunt, H., Lee, K. and Romary, L. ISO-TimeML: an International Standard for Semantic Annotation. In Proceedings of the 7th International Conference on Language Resources and Evaluation, LREC 2010 (Paris, France, 2010). ELRA, 394-397.
[8] ISO 8601 Data Elements and Interchange formats - Information interchange - Representation of Dates and Times. International Organization for Standardization, Geneva, Switzerland, 2004.
[9] Kolomiyets, O. and Moens, M.-F. KUL: Recognition and Normalization of Temporal Expressions. In Proceedings of the 5th International Workshop on Semantic Evaluation (Uppsala, Sweden, 2010). Association for Computational Linguistics, 325-328.
[10] Negri, M. and Marseglia, L. Recognition and Normalization of Time Expressions: ITC-irst at TERN 2004. Technical Report WP3.7. Information Society Technologies, 2005.
[11] Friburger, N. and Maurel, D. Finite-state transducer cascades to extract named entities in texts. Theoretical Computer Science, 313, 1 (Feb 2004), 93-104.
