Recognition and Normalization of Temporal Expressions in Serbian Texts

Jelena Jaćimović
University of Belgrade
Faculty of Dental Medicine
Dr. Subotića 8, 11000 Belgrade, Serbia
+381646144435
jjacimovic@rcub.bg.ac.rs

BCI’12, September 16–20, 2012, Novi Sad, Serbia. Copyright © 2012 by the paper’s authors. Copying permitted only for private and academic purposes. This volume is published and copyrighted by its editors. Local Proceedings also appeared in ISBN 978-86-7031-200-5, Faculty of Sciences, University of Novi Sad.

ABSTRACT
This paper presents a system for recognition and normalization of temporal expressions (TEs) in Serbian texts according to the TimeML specification language. Based on finite-state transducer methodology, local grammars are designed to recognize calendar dates, times of day, periods of time and durations, to determine the extension of detected expressions, and to normalize their values, interpreted in ISO format. The results of a preliminary evaluation demonstrate the usefulness of this method in both the recognition and the normalization phases.

Categories and Subject Descriptors
I.2.7 [Artificial Intelligence]: Natural Language Processing – Text analysis; F.1.1 [Computation by Abstract Devices]: Models of Computation – Automata

General Terms
Human factors, Performance

Keywords
Temporal expression recognition, temporal expression normalization, ISO-8601, finite-state transducers

1. INTRODUCTION
Processing of temporal expressions (TEs) has received increasing attention in the Natural Language Processing research community over the past fifteen years. The Message Understanding Conferences (MUCs) in 1996 and 1998 played a significant role, but their evaluations covered only recognition of TEs, while a novel contribution towards the normalization of TEs was made in 2000 [1]. The first exercise evaluating system performance on both recognition and normalization of TEs was the Time Expression Recognition and Normalization (TERN) 2004 competition. With the rapid increase of electronic information and the very frequent occurrence of TEs, precise temporal representation of a text has become extremely important for many applications, such as machine translation, information extraction, question answering, etc.

Due to the various existing interpretations of time within free text, recognition and normalization of TEs in narrative text represent a particular challenge that is far from being a simple task for automatic text processing systems. For example, in Serbian the same temporal information can be written in different forms: 13:45 časova ‘13:45 hours’, 1:45 popodne ‘1:45 in the afternoon’, 15 do dva popodne ‘quarter to two in the afternoon’, and many others. Furthermore, lexical variants, such as sat and čas for the temporal unit ‘hour’, are also widespread. Along with some other Slavic languages, Serbian is a highly inflected, free word order language with a complex number system in which, beside singular and plural, paucal also exists. Since numerals agree in gender and number with the nouns they modify, the temporal expressions jedan sat ‘one hour’, dva sata ‘two hours’ and pet sati ‘five hours’ use three different inflected forms of the noun sat ‘hour’: nominative singular, paucal and genitive plural, respectively.

A number of resources have been developed for the processing of temporal information in English texts, as well as in other languages, such as French, Italian, Spanish, German, Chinese, etc. Previous efforts regarding recognition of TEs in Serbian have achieved quite promising results [2]. This article details the ongoing development of a system for TE recognition and normalization in Serbian texts.

2. TASK DEFINITION
Any natural language possesses several mechanisms for expressing temporal information, which can be grouped into three large categories: TEs, events and the temporal relations that hold among times and events [3]. As natural language phrases, TEs give information about when something happened, how long something lasted, or how often something occurred. The present task is limited to recognition of TEs denoting calendar dates (1), times of day (2), periods of time (3) and durations (4) in newspaper texts and normalization of their values, interpreted in ISO format:

(1) 8. decembra dve hiljade jedanaeste ‘8th December two thousand eleventh’ → "2011-12-08"
    7. VIII 2008. godine → "2008-08-07"
    leta 1995 ‘summer of 1995’ → "1995-SU"
(2) pet minuta do ponoći ‘five minutes to midnight’ → "T23:55"
    12:55 časova ‘12:55 hours’ → "T12:55"
    17. februara, u 7 časova i 55 minuta uveče ‘17th February, at 7 hours and 55 minutes in the evening’ → "XXXX-02-17T19:55"
    subota u dva sata popodne ‘Saturday at two hours in the afternoon’ → "XXXX-WXX-6T14:00"
(3) od podneva do pet časova popodne ‘from noon to five hours in the afternoon’ → "T12/17"
    između devet i 12 meseci ‘between nine and 12 months’ → "P9/12M"
(4) tri nedelje ‘three weeks’ → "P3W"

This work also covers modified or quantified TEs (5), such as:

(5) početkom godine 2009. ‘early in the year 2009’ → START 2009
    manje od dva meseca ‘less than two months’ → [...]

[...] and [...]. Those recognized units become variables $a$, $b$ and $c$, respectively, and will be used in the output as values of the tag attribute value.

The recognition and normalization of multi-word numerals that often appear in duration expressions was done by dictionary FSTs [5], in order to correctly tag numerals composed using both digit and alphabetic representation (e.g. 5 milijardi i 70 miliona ‘5 billion and 70 million’). Their lemmas could be retrieved from the applied dictionaries and used in the output as values of the tag attribute value.

Therefore, rules applied in this system were grouped into possible combinations of an expression rule, a normalization function and [...]

3.3 Description of the Annotation Scheme
Each detected TE was marked up with the tag, which may contain the following attributes: type, value and mod. At this moment, other optional attributes described in the TimeML annotation guidelines [7] remain beyond the scope of the current version of the system. The convention of indicating tag names and attribute values in all upper case (e.g. DATE, APPROX) and attribute names in lower or mixed case (e.g. type, mod) was respected.

As the non-optional attribute, type may have the following values: DATE (calendar time), TIME (time of the day or a combination of calendar date and time of the day) or DURATION (explicit durations).

The attribute value contains the normalized form of the detected date, time or duration, derived from the ISO 8601 standard format for representing time values [8]. Points of time were expressed as string patterns of the form YYYY-MM-DDThh:mm:ss (year-month-dayThour:minute:seconds) and may be truncated from the right (e.g. March 2002 was interpreted as 2002-03). For the unknown or vague parts of the value, a placeholder character X was used (see the third example in (2)). For the representation of the normalized values, the week-based format YYYY-Www-D (year-Wweek_number-day_of_the_calendar_week) was also used (see the last line in (2)). In order to separate components in the representation of time intervals, a solidus [/] was used (see example (3)). Durations were expressed as string patterns of the form Pn, where P is used as a duration designator and n indicates one or more digits (see example (4)).

Combinations of calendar dates and times of day were also represented with values in the ISO format. In case the text includes some reference to a specific date (7), the value attribute must also contain the date, as in the following:

(7) u 9:30, 3. januara 2007. ‘at 9:30, 3rd January 2007’ → type="TIME" value="2007-01-03T09:30"

The optional mod attribute was used together with other attributes in order to capture temporal modifiers that change or clarify the [...] at this moment are illustrated in (8).

(8) početkom 2007. ‘early 2007’ → type="DATE" value="2007" mod="START"
    polovinom februara ‘mid-February’ → type="DATE" value="XXXX-02" mod="MID"
    [...] → type="DATE" value="2009-11" mod="END"
    petak oko 9 časova ‘Friday about 9 o’clock’ → type="TIME" value="XXXX-WXX-5T09" mod="APPROX"
    više od dve godine ‘more than two years’ → type="DURATION" value="P2Y" mod="MORE_THAN"
    skoro deset dana ‘nearly ten days’ → type="DURATION" value="P10D" mod="LESS_THAN"

The XML output of local grammars designed for recognition and normalization of TEs is presented in the following tagged text:

Nakon 23 dana bezuspešnog patroliranja, U-24 se 16. decembra vraća u Konstancu, gde će ostati do 18. januara 1943. godine, kada polazi u novo patroliranje.
‘After 23 days of fruitless patrolling, on December 16th U-24 returns to Constance, where it will stay until January 18th 1943, when a new patrol starts.’

4. EVALUATION RESULTS
The previously described system for normalization of TEs has been evaluated on a set of 6 articles from Serbian Wikipedia: German submarine U-24, German submarine U-28, German submarine U-13, German submarine U-29, German submarine U-19 and German submarine U-558. These texts were not used in the development of the FSTs and represent completely unseen material containing many occurrences of TEs (Table 1).

Table 1: Articles Used for Evaluation
Text    Words   Date   Time   Duration
U-24    1,115     60     12          1
U-28    1,574     51     19          1
U-13    1,010     36     10          1
U-29    1,846     51     19          3
U-19    1,303     58     16          4
U-558   2,639     30     34         13
Total   9,487    286    110         23

The performance of the FSTs has been evaluated with respect to the recognition, bracketing and normalization tasks. For that reason, a new attribute provera ‘check’ has been added to each XML tag. Possible values of this attribute were the following:

    OK/OK   TE was correctly recognized, full extent was correctly determined, normalized value was correctly assigned
    OK/NOK  TE was correctly recognized, full extent was correctly determined, but normalized value was not correct
    UOK     TE was partially recognized correctly, full extent was not correctly determined (e.g. longer patterns denoting temporal ranges that were not included in the FSTs: 7. april 1944. - jul 1944. ‘7th April 1944 – July 1944’)
    UOK/E   TE was partially recognized correctly, full extent was not correctly determined, because of the incorrect input
    NOK     TE was partially recognized correctly, full extent was not correctly determined for some other reasons
    MISS/E  TE was not recognized, because of the incorrect input
    MISS    TE was not recognized for some other reasons

The overall evaluation of the system is presented in Table 2 and Table 3.

Table 2: Evaluation Data
Check    DATE   TIME   DURATION   Total
OK/OK     260    100         16     376
OK/NOK      0      0          0       0
UOK        11      5          3      19
UOK/E       7      1          0       8
NOK         0      1          0       1
MISS/E      1      1          0       2
MISS        7      2          4      13
Total     286    110         23     419

An error analysis shows that the main source of errors and missed TEs (lines UOK and MISS in Table 2) was the occurrence of combined temporal expressions that were not included among the FST rules. There were no false recognitions (line NOK), except for one case regarding time expressions. All correctly recognized expressions were also assigned correct normalized values (line OK/OK; line OK/NOK is empty), which indicates that this method could be useful for both the recognition and the normalization phases.

Overall, the FSTs achieved an F-measure of 0.946, with precision of 0.931 and recall of 0.962 (Table 3). Although duration expressions achieved the lowest F-measure, priority is given to precision over recall.

Table 3: Performance Measures for Recognition of TEs
TEs        Precision   Recall   F-measure
DATE           0.935    0.970       0.952
TIME           0.935    0.971       0.952
DURATION       0.842    0.800       0.821
Total          0.931    0.962       0.946

5. CONCLUSIONS AND FUTURE WORK
This paper presented a system for recognition and normalization of TEs in Serbian natural language texts, based on finite-state transducer methodology. This approach is effective and competitive with respect to other techniques and makes it possible to include further knowledge easily [9-11]. In ACE TERN 2004, the rule-based temporal tagger Chronos [10] achieved the highest F-measure of 0.926, with precision and recall rates of 0.976 and 0.880, respectively. The system presented in this work can be compared with Chronos, since it is also based on a single module that performs both the recognition and normalization tasks. The evaluation of the presented system was conducted on a small set of articles, but the results are quite good and correspond to the results obtained in the previous system evaluation of the TE recognition task [2].

Nevertheless, improvements are needed in order to increase precision, which may affect further processes of temporal analysis. Future research in temporal processing is also needed to complete the tagger, in particular for recognition of relative expressions and sets of time, as well as events and temporal relations that hold between temporal entities. In order to improve the performance of this system, it would be very useful to apply transducers on the text in a precise order, as a cascade. This simple and effective way of organizing FSTs may greatly increase the precision and speed of the system, as well as its ability to manage priority between patterns.

6. REFERENCES
[1] Mani, I. and Wilson, G. Robust Temporal Processing of News. In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics (Hong Kong, 2000), 69-76.
[2] Krstev, C., Vitas, D., Obradović, I. and Utvić, M. E-Dictionaries and Finite-State Automata for the Recognition of Named Entities. In Proceedings of the 9th International Workshop on Finite State Methods and Natural Language Processing (Blois, France, July 2011). Association for Computational Linguistics, 48-56.
[3] Marsic, G. Temporal processing of news: annotation of temporal expressions, verbal events and temporal relations. Doctoral Thesis. University of Wolverhampton, 2011.
[4] Paumier, S. Unitex 2.1 User manual. 2011.
[5] Krstev, C. Processing of Serbian - Automata, Texts and Electronic Dictionaries. Faculty of Philology, University of Belgrade, Belgrade, 2008.
[6] ISO/DIS 24617-1 Language Resources Management - Semantic Annotation Framework (SemAF) - Part 1: Time and Events (SemAF-Time, ISO-TimeML). International Organization for Standardization, Geneva, Switzerland, 2009.
[7] Pustejovsky, J., Bunt, H., Lee, K. and Romary, L. ISO-TimeML: an International Standard for Semantic Annotation. In Proceedings of the 7th International Conference on Language Resources and Evaluation, LREC 2010 (Paris, France, 2010). ELRA, 394-397.
[8] ISO 8601 Data Elements and Interchange formats - Information interchange - Representation of Dates and Times. International Organization for Standardization, Geneva, Switzerland, 2004.
[9] Kolomiyets, O. and Moens, M.-F. KUL: Recognition and Normalization of Temporal Expressions. In Proceedings of the 5th International Workshop on Semantic Evaluation (Uppsala, Sweden, 2010). Association for Computational Linguistics, 325-328.
[10] Negri, M. and Marseglia, L. Recognition and Normalization of Time Expressions: ITC-irst at TERN 2004. Technical Report WP3.7. Information Society Technologies, 2005.
[11] Friburger, N. and Maurel, D. Finite-state transducer cascades to extract named entities in texts. Theoretical Computer Science, 313, 1 (Feb 2004), 93-104.
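
The relation between the counts in Table 2 and the measures in Table 3 is not spelled out in the text. The following is a minimal sketch of one reading that appears to reproduce Table 3; it is an assumption rather than the paper's stated formula. It treats OK/OK as correct recognitions, UOK, UOK/E and NOK as errors counted against precision, and MISS/E and MISS as errors counted against recall. The OK/NOK row is zero in every column, so its role cannot be inferred and it is left out. The names TABLE_2, recognition_scores and total_counts are illustrative only.

```python
# Counts copied from Table 2 (keys = TE type, inner keys = check value).
TABLE_2 = {
    "DATE":     {"OK/OK": 260, "OK/NOK": 0, "UOK": 11, "UOK/E": 7, "NOK": 0, "MISS/E": 1, "MISS": 7},
    "TIME":     {"OK/OK": 100, "OK/NOK": 0, "UOK": 5,  "UOK/E": 1, "NOK": 1, "MISS/E": 1, "MISS": 2},
    "DURATION": {"OK/OK": 16,  "OK/NOK": 0, "UOK": 3,  "UOK/E": 0, "NOK": 0, "MISS/E": 0, "MISS": 4},
}


def recognition_scores(counts):
    """Return (precision, recall, F-measure), each rounded to 3 decimals.

    Assumption (not stated in the paper): partial and false recognitions
    lower precision, missed expressions lower recall.
    """
    correct = counts["OK/OK"]
    wrong = counts["UOK"] + counts["UOK/E"] + counts["NOK"]
    missed = counts["MISS/E"] + counts["MISS"]
    precision = correct / (correct + wrong)
    recall = correct / (correct + missed)
    f_measure = 2 * precision * recall / (precision + recall)
    return round(precision, 3), round(recall, 3), round(f_measure, 3)


def total_counts(table):
    """Sum the per-type counts to obtain the 'Total' column of Table 2."""
    totals = {}
    for counts in table.values():
        for check, n in counts.items():
            totals[check] = totals.get(check, 0) + n
    return totals


if __name__ == "__main__":
    for te_type, counts in TABLE_2.items():
        print(te_type, recognition_scores(counts))
    print("Total", recognition_scores(total_counts(TABLE_2)))
```

Under this reading, each row of Table 3 follows from the corresponding column of Table 2; for example, DATE precision is 260 / (260 + 11 + 7 + 0) and DATE recall is 260 / (260 + 1 + 7).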