=Paper=
{{Paper
|id=None
|storemode=property
|title=Information Extraction from the Weather Reports in Serbian
|pdfUrl=https://ceur-ws.org/Vol-920/p105-vujicic-stankovic.pdf
|volume=Vol-920
|dblpUrl=https://dblp.org/rec/conf/bci/StankovicP12
}}
==Information Extraction from the Weather Reports in Serbian==
Staša Vujičić Stanković, University of Belgrade, Faculty of Mathematics, Studentski trg 16, Belgrade, Serbia, +381 11 202 78 01, stasa@matf.bg.ac.rs

Vesna Pajić, University of Belgrade, Faculty of Agriculture, Nemanjina 6, Zemun, Serbia, +381 64 2977630, svesna@agrif.bg.ac.rs

===ABSTRACT===
In this paper, we describe a process of extracting information from meteorological texts in Serbian. The text corpus consists of almost 46000 sentences. Having in mind the specifics of Serbian and the characteristics of the meteorological sublanguage, we develop a classification schema for structuring extracted information and transducers for annotating pieces of information in the text corpus. We describe the transducer for extracting information about daily temperatures and give some evaluation parameters for all other transducers used in the information extraction process.

===Categories and Subject Descriptors===
I.2.7 [Artificial Intelligence]: Natural Language Processing – Text analysis, Language parsing and understanding; H.3.0 [Information Storage and Retrieval]: General

===General Terms===
Algorithms, Experimentation, Languages, Performance

===Keywords===
Information extraction, transducers, Serbian language, language resources

BCI’12, September 16–20, 2012, Novi Sad, Serbia. Copyright © 2012 by the paper’s authors. Copying permitted only for private and academic purposes. This volume is published and copyrighted by its editors. Local Proceedings also appeared in ISBN 978-86-7031-200-5, Faculty of Sciences, University of Novi Sad.

===1. INTRODUCTION===
Weather forecast reports are interesting for natural language processing because of their properties and the possibility of various uses of the extracted data. These texts have been studied over the years in areas such as information extraction, text mining, and text understanding, and the obtained data were used for machine translation from one language to another (the TAUM-METEO system developed in Canada for machine translation from English to French and vice versa, [2] and [14]), data visualization described in [5], web information extraction using extraction ontologies presented in [11], creating a dialogue manager system as in [1], summarization of data from multiple sources ([6] and [7]), etc.

In this paper we present the process of extracting information about weather conditions from meteorological texts in Serbian, which can be used for different purposes (for example, for automatic creation of a lexicon or annotation of texts). The main goal of this research was to provide foundations for developing electronic resources in Serbian, constructing sublanguages and ontologies, building a machine translation system from Serbian to English and vice versa, and supporting different kinds of linguistic research in the domain of weather forecasts.

Some specifics of Serbian that are important for this research are presented in Section 2. The corpus of meteorological texts in Serbian, collected during 2010, 2011, and 2012 from several sources, is described in Section 3.

The main goal of the extraction process was to annotate information contained in a text description. Three types of information were of interest: location, time, and meteorological phenomena. Semantic classes of information used to structure the data are described in Section 4.

The process of information extraction is presented in Section 5. The extraction rules are defined by finite state transducers (FST) ([4] and [15]) and recursive transition networks (RTN) with output ([4] and [16]), both referred to as transducers in this paper. We used the UNITEX software system [12] for the creation and application of the transducers.

Finally, we evaluate the information extraction process and give directions for future research.

===2. THE SPECIFICS OF SERBIAN===
The specific features of Serbian determine, to a great extent, the approach and method used for information extraction from texts written in Serbian.

Serbian is a language with rich morphology. For example, most adjectives in Serbian may take more than 40 different forms. There are algorithms for different Natural Language Processing (NLP) tasks that give excellent results when applied to texts in English, but perform poorly on texts in a language with rich morphology, such as Serbian. The rich morphological system of Serbian requires the use of additional linguistic resources, such as electronic dictionaries and grammars, for text processing. With such resources, it is possible to develop information extraction systems that are efficient when applied to texts in Serbian.

This paper describes a process of extracting information from texts in Serbian, in which the electronic dictionary for Serbian ([8] and [9]) was used. This dictionary was written in the DELA format [13]. It contains 125269 lemmas of simple words and 4378245 simple word forms, as well as 5251 lemmas of compounds and 106731 forms of compounds [10].

===3. THE CHARACTERISTICS OF THE TEXT CORPUS===
Meteorological texts were collected during 2010, 2011, and 2012 from several sources: the Republic Hydrometeorological Service of Serbia (http://www.hidmet.gov.rs), the Meteos agency (http://www.meteos.rs), the Politika daily news (http://www.politika.rs), B92 (http://www.b92.net), SMedia (http://www.smedia.rs), and the Internet portal Krstarica (http://www.krstarica.com). The created text corpus contains 13705 text descriptions, which consist of a total of 45862 sentences.

====3.1 Weather Forecast Sublanguage====
The language used for describing weather conditions in textual reports is very specific and easily recognizable. The limited set of natural language words used to describe meteorological phenomena can be treated as a sublanguage with the following characteristics:
* limited vocabulary: the same words are used to describe a meteorological phenomenon in almost every weather report;
* irregular syntax: sentences in meteorological reports typically do not contain an auxiliary verb, and often have no predicate (“Vetar slab, jugoistočni.”, “Wind weak, southeast.”) or adverbs;
* text structure: it is not possible to distinguish different statements based only on punctuation, since a sentence often contains multiple statements, and several sentences are sometimes merged into one, separated by commas.

On the one hand, the existence of such a sublanguage facilitates text processing, since many syntactic rules are simplified in comparison to natural language. On the other hand, it is precisely this disregard for natural language syntax rules that prevents the use of existing electronic grammars, developed and available for a given natural language.

====3.2 The Structure of Textual Meteorological Descriptions====
The descriptions of weather conditions consist of smaller fragments (sentences and parts of sentences), which carry three types of information (meteorological phenomenon, location, and time), combined together in a statement. Therefore, every semantic unit of the text structure (a particular statement) can be treated as a triple <location, time, meteorological phenomenon>. The ideal information extraction process applied to the following description in Serbian, “Ujutru i pre podne u nižim delovima grada magla ili sumaglica.” (“In the morning and before noon in the lower parts of the city fog or haze.”), would extract the following triples:

<“niži delovi grada”, “ujutru”, “magla ili sumaglica”>

<“niži delovi grada”, “pre podne”, “magla ili sumaglica”>

(<“the lower parts of the city”, “in the morning”, “fog or haze”>

<“the lower parts of the city”, “before noon”, “fog or haze”>)

The statements mutually overlap in the textual descriptions, usually with no clear boundary between two different statements. This semantic structure requires a special, semantically oriented approach in order to resolve coreferences between different parts. However, the first steps in this process are the detection and isolation of the values of individual features. This paper describes exactly this process, while merging isolated pieces of information and their values into statements will be the subject of future research.

===4. SEMANTIC CLASSES FOR INFORMATION STRUCTURING===
The pieces of information contained in the textual descriptions of weather conditions that were of interest in this research are grouped into semantic classes of different levels. A semantic class, together with possible additional classification, should be assigned to each separate fragment of the text. The hierarchical classes are shown in Table 1.

{| class="wikitable"
|+ Table 1: Class Hierarchy Used to Structure Information Extracted from the Text
! Type !! Element !! Feature !! Value examples
|-
| Meteo || Padavine (Precipitation) || TipPadavina (PrecipitationType) || kiša, sneg ... (rain, snow ...)
|-
| Meteo || Padavine (Precipitation) || ObimPadavina (PrecipitationAmount) || slaba, jaka ... (weak, strong ...)
|-
| Meteo || Oblačnost (Cloudiness) || PrisustvoOblaka (CloudPresence) || sunčano, oblačno (sunny, cloudy)
|-
| Meteo || Oblačnost (Cloudiness) || ObimOblačnosti (CloudAmount) || promenljivo, potpuno ... (variable, fully ...)
|-
| Meteo || Vetar (Wind) || PravacVetra (WindDirection) || jugoistočni, severni ... (southeast, north ...)
|-
| Meteo || Vetar (Wind) || JačinaVetra (WindAmount) || jak, slab ... (strong, weak ...)
|-
| Meteo || Vetar (Wind) || BrzinaVetra (WindSpeed) || 16 m/s
|-
| Meteo || Temperatura (Temperature) || Temperatura (Temperature) || 12 stepeni, 12 C, dva stepena, ispod nule ... (12 degrees, 12 C, two degrees, below zero ...)
|-
| Meteo || Temperatura (Temperature) || KatTemperature (TemperatureCategory) || najviša, jutarnja ... (maximum, morning ...)
|-
| Meteo || Temperatura (Temperature) || OpisTemperature (TemperatureDescription) || hladno, toplije, porast ... (cold, warmer, rising ...)
|-
| Meteo || Pojava (Phenomenon) || TipPojave (PhenomenonType) || magla, oluja ... (fog, storm ...)
|-
| Location || Teritorija (Teritory) || ImeTeritorije (TeritoryName) || Srbija, Evropa, Beograd ...
|-
| Location || Teritorija (Teritory) || DeoTeritorije (TeritoryPart) || severoistok, južni delovi (northeast, southern parts)
|-
| Location || Lokalitet (Locality) || Lokalitet (Locality) || na planinama, lokalno ... (in the mountains, locally)
|-
| Time
| colspan="2" | Datum (Date)
| 15. januar (January 15th)
|-
| Time || Dan (Day) || ImeDana (DayName) || ponedeljak, utorak ... (Monday, Tuesday ...)
|-
| Time || Dan (Day) || DeoDana (DayPart) || ujutru, posle podne (in the morning, in the afternoon)
|-
| Time || Period (Period) || Period (Period) || sledeće nedelje, tokom februara (next week, during February ...)
|}

The names of the features, given in Table 1, are used for annotating pieces of information in the text. The annotations had the following syntax:

<Feature>text segment</Feature>

Hence, the example sentence “U većem delu zemlje promenljivo oblačno, mestimično slaba kiša, pljuskovi i grmljavina.” (“In most of the country variable cloudiness, with areas of light rain, showers, and thunder.”) should be annotated as follows:

<Lokalitet>U većem delu zemlje</Lokalitet> <ObimOblačnosti>promenljivo</ObimOblačnosti> <PrisustvoOblaka>oblačno</PrisustvoOblaka>, <Lokalitet>mestimično</Lokalitet> <ObimPadavina>slaba</ObimPadavina> <TipPadavina>kiša</TipPadavina>, <TipPadavina>pljuskovi</TipPadavina> i <TipPojave>grmljavina</TipPojave>.

(<Lokalitet>In most of the country</Lokalitet> <ObimOblačnosti>variable</ObimOblačnosti> <PrisustvoOblaka>cloudiness</PrisustvoOblaka>, <Lokalitet>with areas of</Lokalitet> <ObimPadavina>light</ObimPadavina> <TipPadavina>rain</TipPadavina>, <TipPadavina>showers</TipPadavina> and <TipPojave>thunder</TipPojave>.)

===5. INFORMATION EXTRACTION PROCESS===
We used transducers (FST and RTN) as extraction rules. A transducer describing the rule for extracting a particular piece of information was created for each feature given in Table 1. The rules were applied through the software system UNITEX, where the structuring of data was done by annotating the text segments that carry information. The transducers were applied sequentially, one by one. The application order was not important for the majority of the created transducers, although it is possible to organize the information extraction process so that the successive application of transducers improves the efficiency of the process (a cascade of transducers, one operating after the other using the results of previously applied transducers [3]). In this section, we present one of the transducers, which extracts information related to the temperature.

Temperature data are presented in the texts either as values (12 stepeni, 12 degrees; 12°C; 12 C; dva stepena, two degrees; ispod nule, below zero; minus 5 ...) or descriptively (hladno, cold; hladnije, colder; toplo, warm; toplije, warmer; pad temperature, the temperature drop; temperatura u porastu, the temperature rising ...). For each way of representing temperature, a special extraction rule has been created. Figure 1 shows the main transducer (temperatura.grf) in the RTN for extracting information related to the temperature.

Figure 1: The main transducer temperatura.grf within the RTN, for extracting information about the temperature.

Subgraph calls are marked in gray. Subgraph vrednost.grf recognizes different expressions for the specific value (number of degrees) of the temperature. This subgraph is shown in Figure 2.

Figure 2: Subgraph vrednost.grf that recognizes numeric values written as numbers or text.

The lexical mask <NB> recognizes successive digits. The lexical mask <NUM> recognizes all the words in the dictionary that are marked with the code NUM (jedan, dva, tri; one, two, three, ...). Thus, this subgraph recognizes, among others, the following expressions: 10, minus dva (minus two), +5, or jedanaest (eleven).

The main transducer temperatura.grf (Figure 1) also contains a call to the subgraph stepen.grf. This graph is intended to recognize expressions that describe degrees on the Celsius scale, the common unit of temperature measure in texts in Serbian. Subgraph stepen.grf is shown in Figure 3.

Figure 3: Subgraph stepen.grf that recognizes phrases for marking degrees on the Celsius scale.

The lexical mask that refers to a dictionary word (<stepen>) recognizes any form of the word stepen, degree (stepena, stepeni, stepenima, etc.). Graph temperatura.grf recognizes the following phrases: oko +8 °C (approximately +8 °C), - 1C, - 30 ° C, - 4 stepena (- 4 degrees), od -1 C do 1 C (from -1 C to 1 C), -1 do +3 stepena (-1 to +3 degrees), -12 do -8 (-12 to -8), od 11 do 15 stepeni (from 11 to 15 degrees), 11 stepeni (11 degrees), od pet do devet stepeni (from five to nine degrees), oko četiri (about four), ispod 0 (below 0), etc.

Similarly, for each feature in Table 1 an extraction rule was created for annotating the text segments that carry the specific information.

====5.1 Analysis of Extracted Information and Process Efficiency====
The process of information extraction from the meteorological texts is in its initial phase. During this phase, the texts from the described corpus were analyzed and the transducers for extracting simple features were created. Since the extraction rules are still evolving, and the text corpus over which the extraction is carried out is fairly large (45862 sentences with more than one million tokens), a comprehensive evaluation of the system’s efficiency, which would accurately assess precision and recall, is not currently possible. However, an initial analysis of the created transducers, which would determine the directions for further development, is possible.

Table 2 lists the transducers that were used to extract information, in order of their implementation. The number of extracted text segments is shown in the third column of the table, while the evaluation of precision is presented in the fourth.

{| class="wikitable"
|+ Table 2: Performance Evaluation of Graphs Used for the Extraction of Information
! Transducer !! Features !! Number of extracted text segments !! Evaluation of precision
|-
| opisTemp || OpisTemperature (TemperatureDescription) || 11518 || 100%
|-
| temperature || Temperatura (Temperature) || 25618 || 99.6%
|-
| katTemp || KatTemperature (TemperatureCategory) || 14817 || 100%
|-
| vetarPre || JacinaVetra (WindAmount) and PravacVetra (WindDirection) || 7720 || 100%
|-
| vetarPost || JacinaVetra (WindAmount) and PravacVetra (WindDirection) || 1559 || 100%
|-
| padavine || TipPadavina (PrecipitationType) and ObimPadavina (PrecipitationAmount) || 18878 || 100%
|-
| oblacnost || ObimOblacnosti (CloudAmount) and PrisustvoOblaka (CloudPresence) || 18875 || 98%
|-
| deoTeritorije || DeoTeritorije (TeritoryPart) || 4918 || 99.8%
|-
| imeTeritorije || ImeTeritorije (TeritoryName) || 6036 || 95%
|-
| lokalitet || Lokalitet (Locality) || 7623 || 98%
|-
| pojava || Pojava (Phenomenon) || 3737 || 100%
|}

===6. CONCLUSION===
The high precision of the transducers is expected, given that this is an early stage of the system design and of the process of creating extraction rules. Further development of the process, in order to extract a larger number of individual pieces of information (i.e. to increase recall), will likely reduce the precision. However, the transducers are expected to maintain high efficiency.

We would like to emphasize that the next step in the process, after the extraction of simple features, is merging the extracted data into classes of a higher semantic level. During that process, it will be possible to further improve efficiency by resolving ambiguities or correcting wrongly interpreted text segments.

===7. ACKNOWLEDGMENTS===
This research was conducted through the projects 178006 and III 47003, financed by the Serbian Ministry of Science.

===8. REFERENCES===
* [1] Brkić, M., and Matetić, M. 2007. Modeling Natural Language Dialogue for Croatian Weather Forecast System. In Proceedings of the 18th International Conference on Information and Intelligent Systems (Varaždin, Croatia, 2003), 391–396.
* [2] Chevalier, L., Dansereau, J., and Poulin, G. 1978. TAUM-METEO: Description du Système. Universite de Montreal, Canada.
* [3] Friburger, N. and Maurel, D. 2004. Finite-state transducer cascades to extract named entities in texts. Theoretical Computer Science 313, 1 (2004), 93–104.
* [4] Jurafsky, D. and Martin, J. H. 2008. Speech and language processing, 2nd edition. Prentice-Hall Inc.
* [5] Kerpedjiev, S. and Noncheva, V. 1990. Intelligent Handling of Weather Forecasts. In Proceedings of the 13th International Conference on Computational Linguistics COLING-90, 3 (Helsinki, Finland, August 20–25, 1990), 379–381.
* [6] Kononenko I., Popov I., and Zagorulko Yu. 1999. Approach to Understanding Weather Forecast Telegrams with Agent-Based Technique. In Perspectives of System Informatics, Third International Andrei Ershov Memorial Conference, PSI'99 (Novosibirsk, Russia, July 6–9, 1999), 511–516.
* [7] Kononenko I., Kononenko S., Popov I., and Zagorulko Yu. 2000. Information extraction from non-segmented text (on the material of weather forecast telegrams). In Proceedings of the 6th International Conference, RIAO 2000 (College de France, France, April 12–14, 2000), 1069–1088.
* [8] Krstev, C. 2008. Processing of Serbian Automata, Texts and Electronic dictionaries. Faculty of Philology, University of Belgrade, Belgrade, Serbia.
* [9] Krstev, C. and Vitas, D. 2005. Corpus and Lexicon – Mutual Incompleteness. In Proceedings from the Corpus Linguistics Conference Series, 1, 1, ISSN 1747-939 (Birmingham University, UK, July 14–17, 2005).
* [10] Krstev, C., Vitas, D., Obradović, I., and Utvić, M. 2011. E-Dictionaries and Finite-State Automata for the Recognition of Named Entities. In Proceedings of the 9th International Workshop on Finite State Methods and Natural Language Processing (Blois, France, July 12–15, 2011), 48–56.
* [11] Labsky, M., Nekvasil, M., and Svatek, V. 2007. Towards web information extraction using extraction ontologies and (indirectly) domain ontologies. In Proceedings of the 4th international conference on Knowledge capture K-CAP '07 (Whistler, BC, Canada, October 28–31, 2007), ACM New York, NY, USA, 201–202.
* [12] Paumier, S. 2008. Unitex 2.1 User Manual. http://www-igm.univmlv.fr/~unitex/UnitexManual2.1.pdf.
* [13] Silberztein, M. 1993. Dictionnaires électroniques et analyse automatique de textes: le système INTEX. Edition Masson, Paris.
* [14] Slocum, J. 1985. A Survey of Machine Translation: its History, Current Status, and Future Prospects. In: Computational Linguistics 11, 1 (1985), 1–17.
* [15] Vitas, D. 2006. Prevodioci i interpretatori: Uvod u teoriju i metode kompilacije programskih jezika. Faculty of Mathematics, University of Belgrade, Belgrade, Serbia.
* [16] Woods, W. 1970. Transition network grammars for natural language analysis. In Communications of the ACM 13, 10 (1970), 591–606.
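As an illustration of the temperature extraction rule described in Section 5, the following Python sketch approximates the behaviour of temperatura.grf with regular expressions. It is not the authors' UNITEX graph: the short word list `NUM_WORDS` stands in for the dictionary lookup by the code NUM, the pattern `stepen\w*` stands in for the lemma-based mask for the word stepen, and the names `find_temperatures` and `annotate`, as well as the exact tag spelling, are assumptions made for this example.

```python
import re

# Hypothetical stand-in for the dictionary mask on code NUM: a few number words.
NUM_WORDS = ["jedan", "dva", "tri", "četiri", "pet", "šest", "sedam",
             "osam", "devet", "deset", "jedanaest"]
# Longest alternatives first so that "jedanaest" is not cut short at "jedan".
_WORDS = "|".join(sorted(NUM_WORDS, key=len, reverse=True))

# Optional sign: "+", "-" or the word "minus", possibly followed by spaces.
SIGN = r"(?:(?:[+-]|minus)\s*)?"
# Rough analogue of vrednost.grf: digits or a number word, with optional sign.
VALUE = rf"{SIGN}(?:\d+|(?:{_WORDS})\b)"
# Rough analogue of stepen.grf: "°C", "C", or any form starting with "stepen".
UNIT = r"(?:\s*°\s*C\b|\s*C\b|\s+stepen\w*)"

# "od X do Y" ranges, "oko X" / "ispod X" / "ispod nule", and plain "X unit".
RANGE = rf"(?:od\s+)?{VALUE}(?:{UNIT})?\s+do\s+{VALUE}(?:{UNIT})?"
APPROX = rf"(?:(?:oko|ispod)\s+{VALUE}(?:{UNIT})?|ispod\s+nule\b)"
SIMPLE = rf"{VALUE}{UNIT}"

TEMPERATURE = re.compile(rf"(?<!\w)(?:{RANGE}|{APPROX}|{SIMPLE})")


def find_temperatures(text):
    """Return every text segment that the sketch treats as a temperature."""
    return [m.group(0) for m in TEMPERATURE.finditer(text)]


def annotate(text, tag="Temperatura"):
    """Wrap each recognized segment in a tag named after the feature."""
    return TEMPERATURE.sub(lambda m: f"<{tag}>{m.group(0)}</{tag}>", text)


if __name__ == "__main__":
    print(annotate("Jutarnja temperatura od -1 do +3 stepena, najviša oko 8 °C."))
```

Like the graph it imitates, the range pattern accepts values without a unit (for example "-12 do -8"), so on arbitrary text it would also match non-temperature ranges; a dictionary-backed transducer applied to a weather sublanguage is less exposed to this than a bare regular expression.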