=Paper= {{Paper |id=None |storemode=property |title=Information Extraction from the Weather Reports in Serbian |pdfUrl=https://ceur-ws.org/Vol-920/p105-vujicic-stankovic.pdf |volume=Vol-920 |dblpUrl=https://dblp.org/rec/conf/bci/StankovicP12 }} ==Information Extraction from the Weather Reports in Serbian== https://ceur-ws.org/Vol-920/p105-vujicic-stankovic.pdf
Information Extraction from the Weather Reports in Serbian

Staša Vujičić Stanković
University of Belgrade, Faculty of Mathematics
Studentski trg 16, Belgrade, Serbia
+381 11 202 78 01
stasa@matf.bg.ac.rs

Vesna Pajić
University of Belgrade, Faculty of Agriculture
Nemanjina 6, Zemun, Serbia
+381 64 2977630
svesna@agrif.bg.ac.rs

ABSTRACT
In this paper, we describe a process of extracting information from meteorological texts in Serbian. The text corpus consists of almost 46000 sentences. Having in mind the specifics of Serbian and the characteristics of the meteorological sublanguage, we develop a classification schema for structuring extracted information and transducers for annotating pieces of information in the text corpus. We describe the transducer for extracting information about daily temperatures and give some evaluation parameters for all other transducers used in the information extraction process.

Categories and Subject Descriptors
I.2.7 [Artificial Intelligence]: Natural Language Processing – Text analysis, Language parsing and understanding; H.3.0 [Information Storage and Retrieval]: General

General Terms
Algorithms, Experimentation, Languages, Performance

Keywords
Information extraction, transducers, Serbian language, language resources

BCI’12, September 16–20, 2012, Novi Sad, Serbia.
Copyright © 2012 by the paper’s authors. Copying permitted only for private and academic purposes. This volume is published and copyrighted by its editors.
Local Proceedings also appeared in ISBN 978-86-7031-200-5, Faculty of Sciences, University of Novi Sad.

1. INTRODUCTION
Weather forecast reports are interesting for natural language processing because of their properties and the many possible uses of the extracted data. These texts have been studied over the years in areas such as information extraction, text mining, and text understanding, and the obtained data have been used for machine translation from one language to another (the TAUM-METEO system developed in Canada for machine translation from English to French and vice versa, [2] and [14]), data visualization [5], web information extraction using extraction ontologies [11], the creation of a dialogue manager system [1], summarization of data from multiple sources ([6] and [7]), etc.

In this paper we present the process of extracting information about weather conditions from meteorological texts in Serbian, which can be used for different purposes (for example, for the automatic creation of a lexicon or the annotation of texts). The main goal of this research was to provide foundations for developing electronic resources in Serbian, construction of sublanguages, ontologies, a machine translation system from Serbian to English and vice versa, and different kinds of linguistic research in the domain of weather forecasts. Some specifics of Serbian that are important for this research are presented in Section 2. The corpus of meteorological texts in Serbian, collected during 2010, 2011, and 2012 from several sources, is described in Section 3.

The main goal of the extraction process was to annotate the information contained in a text description. Three types of information were of interest: location, time, and meteorological phenomena. The semantic classes of information used to structure the data are described in Section 4.

The process of information extraction is presented in Section 5. The extraction rules are defined by finite state transducers (FST) ([4] and [15]) and recursive transition networks (RTN) with output ([4] and [16]), both referred to as transducers in this paper. We used the UNITEX software system [12] for the creation and application of the transducers.

Finally, we evaluate the information extraction process and give directions for future research.

2. THE SPECIFICS OF SERBIAN
The specific features of Serbian determine, to a great extent, the approach and method used for information extraction from texts written in Serbian.

Serbian is a language with rich morphology. For example, most adjectives in Serbian may take more than 40 different forms. There are algorithms for different Natural Language Processing (NLP) tasks that give excellent results when applied to texts in English, but very poor results on texts in a language with rich morphology, such as Serbian. The rich morphological system of Serbian requires the use of additional linguistic resources, such as electronic dictionaries and grammars, for text processing. With such resources, it is possible to develop information extraction systems that are efficient when applied to texts in Serbian.

This paper describes a process of extracting information from texts in Serbian in which the electronic dictionary for Serbian ([8] and [9]) was used. This dictionary is written in the DELA format [13]. It contains 125269 lemmas of simple words and 4378245 simple word forms, as well as 5251 lemmas of compounds and 106731 forms of compounds [10].

3. THE CHARACTERISTICS OF THE TEXT CORPUS
Meteorological texts have been collected during 2010, 2011, and 2012 from several sources (Republic Hydrometeorological
Service of Serbia,1 the Meteos agency,2 the Politika daily news,3 B92,4 SMedia5 and Internet portal Krstarica6). The created text corpus contains 13705 text descriptions, which consist of a total of 45862 sentences.

1 http://www.hidmet.gov.rs
2 http://www.meteos.rs
3 http://www.politika.rs
4 http://www.b92.net
5 http://www.smedia.rs
6 http://www.krstarica.com

3.1 Weather Forecast Sublanguage
The language used for describing weather conditions in textual reports is very specific and easily recognizable. The limited set of natural-language words used to describe meteorological phenomena can be treated as a sublanguage with the following characteristics:
–   limited vocabulary – the same words are used to describe a meteorological phenomenon in almost every weather report;
–   irregular syntax – sentences in meteorological reports typically do not contain an auxiliary verb, and often do not have a predicate (“Vetar slab, jugoistočni.” – “Wind weak, southeast.”) or adverbs;
–   text structure – it is not possible to distinguish different statements based only on punctuation, since a sentence often contains multiple statements, and several sentences are sometimes merged into one, separated by commas.

On the one hand, the existence of such a sublanguage facilitates text processing, since many syntactic rules are simplified in comparison to natural language. On the other hand, it is precisely this disregard for natural language syntax rules that prevents the use of existing electronic grammars, developed and available for a given natural language.

3.2 The Structure of Textual Meteorological Descriptions
The descriptions of weather conditions consist of smaller fragments (sentences and parts of sentences), which carry three types of information (meteorological phenomenon, location and time), combined together in a statement. Therefore, every semantic unit of the text structure (a particular statement) can be treated as a triple <location, time, meteorological phenomenon>. The ideal information extraction process applied to the following description in Serbian, “Ujutru i pre podne u nižim delovima grada magla ili sumaglica.” (“In the morning and before noon in the lower parts of the city fog or haze.”), would extract the following triples:

     <“niži delovi grada”, “ujutru”, “magla ili sumaglica”>
     <“niži delovi grada”, “pre podne”, “magla ili sumaglica”>
(<“the lower parts of the city”, “in the morning”, “fog or haze”>
<“the lower parts of the city”, “before noon”, “fog or haze”>)

The statements mutually overlap in the textual descriptions, usually with no clear boundary between two different statements. This semantic structure requires a special, semantically oriented approach in order to resolve coreferences between different parts. However, the first steps in this process are the detection and isolation of the values of individual features. This paper describes exactly this process, while merging the isolated pieces of information and their values into statements will be the subject of future research.

4. SEMANTIC CLASSES FOR INFORMATION STRUCTURING
The information contained in the textual descriptions of weather conditions that was of interest in this research is grouped into semantic classes of different levels. A semantic class, together with possible additional classification, should be assigned to each separate fragment of the text. The hierarchical classes are shown in Table 1.

Table 1: Class Hierarchy Used to Structure Information Extracted from the Text

Type      Element                     Feature                                     Value examples
Meteo     Padavine (Precipitation)    TipPadavina (PrecipitationType)             kiša, sneg ... (rain, snow...)
                                      ObimPadavina (Precipitation-Amount)         slaba, jaka, ... (weak, strong...)
          Oblačnost (Cloudiness)      PrisustvoOblaka (CloudPresence)             sunčano, oblačno (sunny, cloudy)
                                      ObimOblačnosti (CloudAmount)                promenljivo, potpuno... (variable, fully...)
          Vetar (Wind)                PravacVetra (WindDirection)                 jugoistočni, severni ... (southeast, north...)
                                      JačinaVetra (WindAmount)                    jak, slab... (strong, weak...)
                                      BrzinaVetra (WindSpeed)                     16 m/s
          Temperatura (Temperature)   Temperatura (Temperature)                   12 stepeni, 12 C, dva stepena, ispod nule ... (12 degrees, 12 C, two degrees, below zero...)
                                      KatTemperature (Temperature-Category)       najviša, jutarnja ... (maximum, morning...)
                                      OpisTemperature (Temperature-Description)   hladno, toplije, porast ... (cold, warmer, rising...)
          Pojava (Phenomenon)         TipPojave (PhenomenonType)                  magla, oluja ... (fog, storm...)
Location  Teritorija (Teritory)       ImeTeritorije (TeritoryName)                Srbija, Evropa, Beograd ...
                                      DeoTeritorije (TeritoryPart)                severoistok, južni delovi (northeast, southern parts)
          Lokalitet (Locality)        Lokalitet (Locality)                        na planinama, lokalno... (in the mountains, locally)
Time      Dan (Day)                   Datum (Date)                                15. januar (January 15th)
                                      ImeDana (DayName)                           ponedeljak, utorak ... (Monday, Tuesday...)
                                      DeoDana (DayPart)                           ujutru, posle podne (in the morning, in the afternoon)
          Period (Period)             Period (Period)                             sledeće nedelje, tokom februara (next week, during February...)

The names of the features, given in Table 1, are used for annotating pieces of information in the text.

The annotations had the following syntax:
     text segment
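The class hierarchy of Table 1 and the feature-based annotation of text segments can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the names CLASS_HIERARCHY, locate, and annotate are hypothetical, and the XML-like tag form <Feature>segment</Feature> is an assumption about the annotation syntax.

```python
# Nested mapping: type -> element -> features (names taken from Table 1).
CLASS_HIERARCHY = {
    "Meteo": {
        "Padavine": ["TipPadavina", "ObimPadavina"],
        "Oblačnost": ["PrisustvoOblaka", "ObimOblačnosti"],
        "Vetar": ["PravacVetra", "JačinaVetra", "BrzinaVetra"],
        "Temperatura": ["Temperatura", "KatTemperature", "OpisTemperature"],
        "Pojava": ["TipPojave"],
    },
    "Location": {
        "Teritorija": ["ImeTeritorije", "DeoTeritorije"],
        "Lokalitet": ["Lokalitet"],
    },
    "Time": {
        "Dan": ["Datum", "ImeDana", "DeoDana"],
        "Period": ["Period"],
    },
}


def locate(feature):
    """Return the (type, element) pair that a feature belongs to."""
    for type_name, elements in CLASS_HIERARCHY.items():
        for element, features in elements.items():
            if feature in features:
                return type_name, element
    raise KeyError(f"unknown feature: {feature}")


def annotate(segment, feature):
    """Wrap a text segment in a tag named after its feature.

    The tag form is an assumption made for this sketch.
    """
    locate(feature)  # raises KeyError for names that are not in Table 1
    return f"<{feature}>{segment}</{feature}>"


print(locate("TipPadavina"))
print(annotate("kiša", "TipPadavina"))
```

Keeping the hierarchy in one mapping means that an annotation carrying only a feature name can later be lifted to its element and type, which is the kind of lookup that merging features into higher-level statements would need.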
Hence, the example sentence “U većem delu zemlje promenljivo oblačno, mestimično slaba kiša, pljuskovi i grmljavina.” (“In most of the country variable cloudiness, with areas of light rain, showers, and thunder.”) should be annotated as follows:
U većem delu zemlje
promenljivo
oblačno,
mestimično
slaba
kiša ,
pljuskovi i
grmljavina.

(In most of the country
variable
cloudiness,
with areas of
light
rain,
showers and
thunder.)

5. INFORMATION EXTRACTION PROCESS
We used transducers (FST and RTN) as extraction rules. A transducer that describes the rule for extracting a particular piece of information was created for each feature given in Table 1. The rules were applied through the software system UNITEX, where the structuring of data was done by annotating text segments that carry information. The transducers were applied sequentially, one by one. The application order was not important for the majority of the created transducers, although it is possible to organize the information extraction process so that the successive application of transducers improves the efficiency of the process (a cascade of transducers, one operating after the other using the results of previously applied transducers [3]). In this section, we present one of the transducers, which extracts information related to temperature.

Temperature data are presented in the texts either as values (12 stepeni – 12 degrees, 12°C, 12 C, dva stepena – two degrees, ispod nule – below zero, minus 5 ...) or descriptively (hladno – cold, hladnije – colder, toplo – warm, toplije – warmer, pad temperature – the temperature drop, temperatura u porastu – the temperature rising ...). For each way of representing temperature, a special extraction rule has been created. Figure 1 shows the main transducer (temperatura.grf) in the RTN for extracting information related to the temperature.

Figure 1: The main transducer temperatura.grf within the RTN, for extracting information about the temperature.

Subgraph calls are marked in gray. Subgraph vrednost.grf recognizes different expressions for the specific value (number of degrees) of the temperature. This subgraph is shown in Figure 2.

Figure 2: Subgraph vrednost.grf that recognizes numeric values written as numbers or text.

The lexical mask <NB> recognizes successive digits. The lexical mask <NUM> recognizes all the words in the dictionary that are marked with the code NUM (jedan, dva, tri – one, two, three, ...). Thus, this subgraph recognizes, among others, the following expressions: 10, minus dva (minus two), +5, or jedanaest (eleven).

The main transducer temperatura.grf (Figure 1) contains a call to the subgraph stepen.grf. This graph is intended to recognize expressions that denote degrees on the Celsius scale, the common unit of temperature measurement in texts in Serbian. Subgraph stepen.grf is shown in Figure 3.

Figure 3: Subgraph stepen.grf that recognizes phrases for marking degrees on the Celsius scale.

The lexical mask that refers to a dictionary word (<stepen>) recognizes any form of the word stepen – degree (stepena, stepeni, stepenima etc.). Graph temperatura.grf recognizes the following phrases: oko +8 °C (approximately +8 °C), - 1C, - 30 ° C, - 4 stepena (- 4 degrees), od -1 C do 1 C (from -1 C to 1 C), -1 do +3 stepena (-1 to +3 degrees), -12 do -8 (-12 to -8), od 11 do 15 stepeni (from 11 to 15 degrees), 11 stepeni (11 degrees), od pet do devet stepeni (from five to nine degrees), oko četiri (about four), ispod 0 (below 0) etc.

Similarly, for each feature in Table 1, an extraction rule is created for annotating the text segments that carry the specific information.

5.1 Analysis of Extracted Information and Process Efficiency
The process of information extraction from the meteorological texts is in its initial phase. During this phase, the texts from the described corpus were analysed and the transducers for extracting simple features were created. Since the extraction rules are still evolving, and the text corpus over which the extraction is carried out is fairly large (45862 sentences with more than one million tokens), a comprehensive evaluation of the system’s efficiency, which would accurately assess precision and recall, is not currently possible. However, an initial analysis of the created transducers, which can determine the directions for further development, is possible.

Table 2 lists the transducers that were used to extract information, in the order of their implementation. The number of extracted text segments is shown in the third column of the table, and the evaluation of precision in the fourth.
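To make the notion of precision used in the evaluation concrete, the following Python sketch approximates a small part of the temperature rule with a regular expression and computes precision on a hand-checked sample. It is an illustration under stated assumptions, not the authors' Unitex graphs: the number-word list is a tiny hypothetical stand-in for the NUM entries of the e-dictionary, the pattern stepen\w* is a crude substitute for a dictionary lookup of the forms of stepen, and the sample sentence and its gold labels are invented for the example.

```python
import re

# Hypothetical stand-in for dictionary words with the code NUM (longest first).
NUMBER_WORDS = sorted(
    ["jedan", "dva", "tri", "četiri", "pet", "devet", "jedanaest"],
    key=len,
    reverse=True,
)

# Optional sign, then digits or a number word (rough analogue of vrednost.grf).
VALUE = r"(?:(?:[+-]|minus)\s*)?(?:\d+|(?:" + "|".join(NUMBER_WORDS) + r")\b)"

# Degree marker: "°C", "C", or a word starting with "stepen".
# The last alternative only approximates a lookup of the forms of "stepen".
UNIT = r"(?:\s*°\s*C\b|\s*C\b|\s+stepen\w*)"

TEMPERATURE = re.compile(rf"(?<!\w){VALUE}{UNIT}", re.IGNORECASE)


def extract(text):
    """Return all text segments matched by the simplified temperature rule."""
    return [m.group(0) for m in TEMPERATURE.finditer(text)]


def precision(extracted, correct):
    """Share of extracted segments that a human judged to be correct."""
    if not extracted:
        return 0.0
    return sum(1 for seg in extracted if seg in correct) / len(extracted)


# Invented sample; the last sentence mentions stairs ("stepenica"), not degrees.
SAMPLE = (
    "Jutarnja temperatura oko +8 °C, najviša dnevna 12 C, "
    "na planinama - 4 stepena. Do vidikovca vodi 8 stepenica."
)
GOLD = {"+8 °C", "12 C", "- 4 stepena"}  # hand-checked correct segments

segments = extract(SAMPLE)
print(segments)
print(precision(segments, GOLD))
```

On this sample the rule extracts four segments, of which three are correct, so the precision is 0.75. The false positive "8 stepenica" comes from the prefix pattern; a dictionary-based mask that matches only inflected forms of stepen would avoid it, which illustrates why the extraction rules rely on the electronic dictionary rather than on surface patterns.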
Table 2: Performance Evaluation of Graphs Used for the Extraction of Information

Transducer     Features                                                                  Number of extracted text segments   Evaluation of precision
opisTemp       OpisTemperature (Temperature-Description)                                 11518                               100%
temperature    Temperatura (Temperature)                                                 25618                               99.6%
katTemp        KatTemperature (Temperature-Category)                                     14817                               100%
vetarPre       JacinaVetra (WindAmount) and PravacVetra (WindDirection)                  7720                                100%
vetarPost      JacinaVetra (WindAmount) and PravacVetra (WindDirection)                  1559                                100%
padavine       TipPadavina (PrecipitationType) and ObimPadavina (Precipitation-Amount)   18878                               100%
oblacnost      ObimOblacnosti (CloudAmount) and PrisustvoOblaka (CloudPresence)          18875                               98%
deoTeritorije  DeoTeritorije (TeritoryPart)                                              4918                                99.8%
imeTeritorije  ImeTeritorije (TeritoryName)                                              6036                                95%
lokalitet      Lokalitet (Locality)                                                      7623                                98%
pojava         Pojava (Phenomenon)                                                       3737                                100%

6. CONCLUSION
The high precision of the transducers is expected, given that this is an early stage of the system design and of the creation of the extraction rules. Further development of the process, in order to extract a larger number of individual pieces of information (i.e. to increase recall), will surely reduce the precision. However, it is expected that the transducers will still maintain high efficiency.

We would like to emphasize that the next step in the process, after the extraction of simple features, is merging the extracted data into classes of a higher semantic level. During that process, it will be possible to further improve efficiency by resolving ambiguities or correcting wrongly interpreted text segments.

7. ACKNOWLEDGMENTS
This research was conducted through the projects 178006 and III 47003, financed by the Serbian Ministry of Science.

8. REFERENCES
[1] Brkić, M., and Matetić, M. 2007. Modeling Natural Language Dialogue for Croatian Weather Forecast System. In Proceedings of the 18th International Conference on Information and Intelligent Systems (Varaždin, Croatia, 2003), 391–396.
[2] Chevalier, L., Dansereau, J., and Poulin, G. 1978. TAUM-METEO: Description du Système. Universite de Montreal, Canada.
[3] Friburger, N. and Maurel, D. 2004. Finite-state transducer cascades to extract named entities in texts. Theoretical Computer Science 313, 1 (2004), 93–104.
[4] Jurafsky, D. and Martin, J. H. 2008. Speech and language processing, 2nd edition. Prentice-Hall Inc.
[5] Kerpedjiev, S. and Noncheva, V. 1990. Intelligent Handling of Weather Forecasts. In Proceedings of the 13th International Conference on Computational Linguistics COLING-90, 3 (Helsinki, Finland, August 20–25, 1990), 379–381.
[6] Kononenko, I., Popov, I., and Zagorulko, Yu. 1999. Approach to Understanding Weather Forecast Telegrams with Agent-Based Technique. In Perspectives of System Informatics, Third International Andrei Ershov Memorial Conference, PSI'99 (Novosibirsk, Russia, July 6–9, 1999), 511–516.
[7] Kononenko, I., Kononenko, S., Popov, I., and Zagorulko, Yu. 2000. Information extraction from non-segmented text (on the material of weather forecast telegrams). In Proceedings of the 6th International Conference, RIAO 2000 (College de France, France, April 12–14, 2000), 1069–1088.
[8] Krstev, C. 2008. Processing of Serbian Automata, Texts and Electronic dictionaries. Faculty of Philology, University of Belgrade, Belgrade, Serbia.
[9] Krstev, C. and Vitas, D. 2005. Corpus and Lexicon – Mutual Incompleteness. In Proceedings from the Corpus Linguistics Conference Series, 1, 1, ISSN 1747-939 (Birmingham University, UK, July 14–17, 2005).
[10] Krstev, C., Vitas, D., Obradović, I., and Utvić, M. 2011. E-Dictionaries and Finite-State Automata for the Recognition of Named Entities. In Proceedings of the 9th International Workshop on Finite State Methods and Natural Language Processing (Blois, France, July 12–15, 2011), 48–56.
[11] Labsky, M., Nekvasil, M., and Svatek, V. 2007. Towards web information extraction using extraction ontologies and (indirectly) domain ontologies. In Proceedings of the 4th international conference on Knowledge capture K-CAP '07 (Whistler, BC, Canada, October 28–31, 2007), ACM New York, NY, USA, 201–202.
[12] Paumier, S. 2008. Unitex 2.1 User Manual. http://www-igm.univmlv.fr/~unitex/UnitexManual2.1.pdf.
[13] Silberztein, M. 1993. Dictionnaires électroniques et analyse automatique de textes: le système INTEX. Edition Masson, Paris.
[14] Slocum, J. 1985. A Survey of Machine Translation: its History, Current Status, and Future Prospects. In Computational Linguistics 11, 1 (1985), 1–17.
[15] Vitas, D. 2006. Prevodioci i interpretatori: Uvod u teoriju i metode kompilacije programskih jezika. Faculty of Mathematics, University of Belgrade, Belgrade, Serbia.
[16] Woods, W. 1970. Transition network grammars for natural language analysis. In Communications of the ACM 13, 10 (1970), 591–606.