<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Terminology acquisition and description using lexical resources and local grammars</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>cvetana@matf.bg.ac.rs</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>ranka@rgf.bg.ac.rs</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Biljana Lazić University of Belgrade</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Cvetana Krstev University of Belgrade</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Ivan Obradović University of Belgrade</institution>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Ranka Stanković University of Belgrade</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>81</fpage>
      <lpage>90</lpage>
      <abstract>
        <p>Acquisition of new terminology from specific domains and its adequate description within terminological dictionaries is a complex task, especially for morphologically complex languages such as Serbian. In this paper we present an approach to solving this task semi-automatically on the basis of lexical resources and local grammars developed for Serbian. Special attention is given to automatic inflectional class prediction for simple adjectives and nouns and to the use of syntactic graphs for the extraction of Multi-Word Unit (MWU) candidates for termbases, their lemmatization and the assignment of inflectional classes.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>In this paper we present a semi-automatic
procedure for terminology acquisition in Serbian.
Rapid changes in many knowledge domains mean
that new terms are continuously being created
and introduced into Serbian, which makes it important
to automate their retrieval and incorporation into
Serbian terminological dictionaries. Due to
specific features of Serbian grammar, especially its
rich morphology, this is a complex task that requires
corresponding language resources in the form of
morphological e-dictionaries and grammars
(Vitas et al., 2012). For that reason,
in the case of Serbian, it is not enough to extract
terminology from domain texts; the extracted terms
also have to be adequately described, for instance, in the form
of e-dictionary entries.</p>
      <p>
        Terminology research is closely related to
research on multiword terms and, more generally, on
multiword expressions (MWEs)
        <xref ref-type="bibr" rid="ref2 ref8">(Baldwin &amp; Kim, 2010;
Frantzi et al., 2000)</xref>
        . An analysis of terms from
technical dictionaries for different domains (fiber
optics, medicine, physics and mathematics,
psychology) showed that 97% of multi-word terms in
these sources consist of nouns and adjectives
only, and more than 99% consist only of nouns,
adjectives, and a preposition
        <xref ref-type="bibr" rid="ref10">(Justeson &amp; Katz,
1995)</xref>
        . Identifying the adjectives and the
prepositional phrase is thus important for terminology
acquisition
        <xref ref-type="bibr" rid="ref5">(Daille, 2000)</xref>
        .
      </p>
      <p>
        There are two mainstream approaches
        <xref ref-type="bibr" rid="ref3 ref7">(Enguehard &amp; Pantera, 1995; Cerbah &amp; Daille,
2007)</xref>
        to terminology acquisition. One relies on
using statistical measures
        <xref ref-type="bibr" rid="ref12 ref14 ref15 ref21">(Nakagawa &amp; Mori,
2003; Ramisch et al., 2012; Quochi et al., 2012;
Zhang et al., 2006)</xref>
        and the other is based on
linguistic rules. A rule-based approach to the
extraction of terms, based on a cascade of
transducers run with the CasSys tool incorporated in
the Unitex<sup>1</sup> corpus processing platform, together with
the use of the TMF standard for the representation of
terms, is proposed in
        <xref ref-type="bibr" rid="ref1">(Ammar et al., 2015)</xref>
        and
applied to an Arabic scientific and technical corpus.
In
        <xref ref-type="bibr" rid="ref20">(Savary et al., 2012)</xref>
        terminology extraction in
the domain of economy is presented for Polish.
The system has two modules: a grammatical lexicon of
terminological MWEs and a fully lexicalized
shallow grammar, obtained by an automatic
conversion of the lexicon.
        <xref ref-type="bibr" rid="ref13">Przepiorkowski and
associates (2007</xref>
        ) present results on the automatic
extraction of term definitions from unstructured
texts in Bulgarian, Czech and Polish using
regular grammars.
      </p>
      <p>
        There are also combinations of the two
approaches (
        <xref ref-type="bibr" rid="ref17">Rodríguez et al., 2007</xref>
        ). Sag et al.
reported that modern statistical Natural Language
Processing (NLP) is in great need of better
language models and that linguistic tools must come to
grips with problems of disambiguation and
MWUs
        <xref ref-type="bibr" rid="ref18">(Sag et al., 2002)</xref>
        .
      </p>
      <p>1 Corpus processing System Unitex:
http://www-igm.univmlv.fr/~unitex/
      </p>
    </sec>
    <sec id="sec-2">
      <title>Process description</title>
      <p>
        In our approach, new terms from a specific domain
are integrated into terminological dictionaries using
lexical resources and local grammars in the following
processing steps (Fig. 1):
1. Linguistic preprocessing of the input plain
text file from the chosen domain using
Unitex.
2. Analysis of unrecognized words as the most
probable source of terminology and
expanding the dictionary of simple words:
2.1 Retrieval of unrecognized words;
2.2 Manual filtering, preparation of a list of
extracted terms in canonical forms (for
instance, nominative singular for nouns) and
annotating with semantic labels (e.g.
human) and some grammatical categories
(e.g. adding the gender for the nouns);
2.3 Automatic prediction of the inflectional
class and the production of dictionary
entry in DELA format (detailed description
of the algorithm is given in section 3);
2.4 Compiling the dictionaries of newly
acquired terms and integrating them with
other resources for linguistic text
processing;
2.5 Repeated linguistic preprocessing with
expanded dictionaries for verification of
recognition of new lemmas.
3. MWU extraction
3.1. Application of syntactic graphs to extract
MWUs with different syntactic structures
from the same text (detailed description of
the algorithm is given in section 4);
3.2. Removing duplicate extractions: if a
sequence of words is recognized with
different graphs as having different syntactic
structures the most probable candidate is
chosen according to the pre-established
order of precedence;
3.3. Two-step generation of MWU canonical
forms: in the first step the simple words
that form the MWU are lemmatized, while
in the second step the lemma of the MWU
is produced from the results of the first
step.
4. Selection of terms from new MWUs
4.1. Frequency calculation for all forms of
MWUs and their basic forms with ranking
of results;
4.2. Removing MWUs already in
e-dictionaries and those with rank under the
specified threshold;
4.3. Linguistic evaluation of grammatical
correctness of remaining MWUs;
4.4. Assessment of domain relevance of each
MWU by comparing its frequency in the
domain text with its frequency in the
Corpus of Contemporary Serbian (Utvić,
2014).
5. Expanding MWU dictionaries
5.1. Creation of complete MWU lemmas in
compliance with DELAC format
        <xref ref-type="bibr" rid="ref19">(Savary,
2009)</xref>
        ;
5.2. Compiling the dictionaries of newly
acquired multi-word terms and integrating
them with other resources for linguistic
text processing;
5.3. Linguistic pre-processing with expanded
dictionaries for verification of recognition
of new MWU lemmas.
      </p>
      <p>The newly acquired terms, both simple words and
MWUs, can be exported to termbases, TBX and
other standard formats for terminological
resources. In this paper we focus on the steps marked
in gray in Figure 1: inflectional class prediction
(step 2.3) and extraction of MWU candidates for
termbases using syntactic graphs (step 3).</p>
      <p>3 Prediction of inflectional class for simple words</p>
      <p>
        Predicting the inflectional class of a new word in
Serbian is not an easy task because of its complex
inflectional grammar with numerous rules and
exceptions. Morphological electronic dictionaries
of Serbian for NLP have been under development for
many years. Their development follows the
methodology and format (known as
DELAS/DELAF) presented for French in
        <xref ref-type="bibr" rid="ref4">(Courtois, 1990)</xref>
        . E-dictionaries in the same
format have been produced for many other
languages.
      </p>
      <p>In the dictionary of lemmas (DELAS) each
lemma is described in full detail, so that a dictionary
of forms containing all necessary grammatical
information (DELAF) can be generated from it
and subsequently used in various NLP tasks.</p>
      <p>
        Serbian e-dictionaries have
reached a considerable size: they contain about
135,000 simple lemmas generating more than 5 million
forms, and 13,000 compound lemmas, that is,
multi-word units
        <xref ref-type="bibr" rid="ref11">(Krstev, 2008)</xref>
        . The number of
simple lemmas by Part-Of-Speech (POS) is
depicted in Figure 2 (left).
      </p>
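      <table-wrap>
        <caption>
          <p>Figure 2: Statistics of lemmas and inflectional FSTs</p>
        </caption>
        <table>
          <thead>
            <tr><th>POS</th><th>Lemmas</th><th>% of lemmas</th><th>FSTs</th><th>% of FSTs</th></tr>
          </thead>
          <tbody>
            <tr><td>Nouns</td><td>81,866</td><td>61%</td><td>372</td><td>44%</td></tr>
            <tr><td>Verbs</td><td>17,071</td><td>13%</td><td>372</td><td>44%</td></tr>
            <tr><td>Adjectives</td><td>31,071</td><td>23%</td><td>69</td><td>8%</td></tr>
            <tr><td>Other</td><td>4,632</td><td>3%</td><td>41</td><td>5%</td></tr>
            <tr><td>Total</td><td>134,640</td><td></td><td>854</td><td></td></tr>
          </tbody>
        </table>
      </table-wrap>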
      <p>Inflectional classes are described with
metadata that record the features most important for
distinguishing classes; for nouns, for example, these are
grammatical gender, number, case and animateness.</p>
      <p>Grammatical inflectional rules are encoded by
854 inflectional Finite-State Transducers (FSTs).
Inflectional FSTs are a special kind of FST used
for modeling inflectional paradigms, that is,
inflectional classes. Each FST of this kind
produces all inflected forms of all
lemmas belonging to the same class. The number of
inflectional FSTs by POS is depicted in Figure 2
(right).</p>
      <p>Not all inflectional classes are equally
productive: some classes cover a large
number of regular cases, while others pertain to
(rare) exceptions. Our approach addresses the
first group, bearing in mind that terminology
usually inflects regularly. Figure 3 presents the
number of inflectional classes and the percentage of
lemmas that belong to them. For example, 10
classes for adjectives account for 98% of
lemmas, 10 classes for nouns account for 61.8% of
lemmas, and 10 classes for verbs account for
59.6% of lemmas.
FST class prediction can be divided into two
parts: the extraction of implicit knowledge
and the actual prediction of the FST class for
a new lemma. The extraction of implicit knowledge
in the form of a dataset with word endings,
grammatical categories and FST classes proceeds
as follows:
1. Calculate frequencies for each POS and
relative frequencies for each FST class within
POS in the current dictionary of simple
lemmas.
2. Create a dataset from DELAS lemma
endings of length 3, 4, 5 and 6 characters with
corresponding grammatical categories
retrieved from DELAF (e.g. for nouns in that
dataset: POS, lemma, FST, gender,
animateness, pronunciation).
3. Create another dataset with frequencies for
each combination of FST code and
grammatical category and for each ending of length
3, 4, 5 and 6, as an estimate of the probability that
the FST class is the appropriate one. The
dataset includes: ending, POS, gender,
animateness, pronunciation, FST and
probability (chance rank 0-100) for FST (Table 1,
column Rel. freq.).</p>
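      <p>The construction of the ending-frequency dataset described in steps 1-3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function name build_ending_table, the tuple layout of the entries and the class codes N_A, N_B and N_C are hypothetical, and animateness and pronunciation marks are left out for brevity.</p>
      <preformat><![CDATA[
```python
from collections import Counter, defaultdict

# Hypothetical miniature DELAS-like sample: (lemma, POS, FST class, gender).
# The class codes are illustrative placeholders, not real Serbian FST codes.
SAMPLE = [
    ("rudar", "N", "N_A", "m"),
    ("bravar", "N", "N_A", "m"),
    ("pekar", "N", "N_A", "m"),
    ("dar", "N", "N_B", "m"),
    ("voda", "N", "N_C", "f"),
]


def build_ending_table(entries, lengths=(3, 4, 5, 6)):
    """Map (ending, POS, gender) to a ranked list of (FST class, rel. freq. in %)."""
    counts = defaultdict(Counter)
    for lemma, pos, fst, gender in entries:
        for n in lengths:
            if len(lemma) >= n:  # skip endings longer than the lemma itself
                counts[(lemma[-n:], pos, gender)][fst] += 1
    table = {}
    for key, fst_counts in counts.items():
        total = sum(fst_counts.values())
        # most_common() ranks the classes by frequency, highest first
        table[key] = [(fst, 100.0 * c / total) for fst, c in fst_counts.most_common()]
    return table


TABLE = build_ending_table(SAMPLE)
```
]]></preformat>
      <p>In this toy sample the ending "var" points to a single class with a relative frequency of 100, while the ending "dar" is split evenly between two classes, which mirrors the observation that longer endings tend to be less ambiguous.</p>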
      <p>Analysis of the relation between word
endings and inflectional FST classes shows that
predicting the inflectional class by the
above-mentioned statistical analysis of existing dictionaries
is justified. Figure 4 illustrates this relation for
word endings of length 3, 4, 5 and 6. For
example, in the case of word endings of length 3, for
33% of words from the existing dictionary there
is only one corresponding FST class, for
approximately 20% of words there are two classes, and
so on, whereas for word endings of length 6
there is a single class for as many as 90% of
words.</p>
      <p>In order to facilitate prediction of FST class, a
set of rules based on inflectional class metadata
is used. Distinction between inflectional classes
based on grammatical categories can be done
only to some extent, so implicit knowledge from
the existing dictionary of simple words is used to
improve prediction.</p>
      <p>The process of automatic prediction of the
inflectional FST class for a new entry follows a hybrid
approach: one part is rule-driven, with explicit
codification of knowledge about FST classes, and
the other is statistical, based on the existing
dictionary of simple word lemmas with its implicit
knowledge about the dependence between FST
classes and dictionary entries.</p>
      <p>After preparing the list of new entries in the
form lemma, POS, Grammatical_Categories
(e.g. grabuljar,N,Rud ‘rake’), the following
procedure is applied:
1. For each candidate lemma, filter the dataset
prepared in the previous step as follows:
1.1. if the lemma has specific marks for
pronunciation, retain only dataset
members with the same mark and remove the
rest;
1.2. if the grammatical gender or animateness
is assigned, retain only dataset members
with the same grammatical category and
remove the rest;
1.3. if the first letter of the lemma is in upper
case, additional filtering can take place,
taking into account FST classes which have
only inflected singular forms.
2. After filtering and ranking the dataset,
prediction (FST assignment) for the lemma is
repeated with the relative-frequency threshold
decreasing from 99 to 95, for suffixes of length
6, 5, 4 and 3 respectively;
3. For thresholds under 95 and over 80 the lemma
prefix (if longer than 2 characters) is used: if
the prefix is in the dictionary of prefixes and
the remainder of the lemma is a word in
DELAS, then the inflectional
class of the corresponding DELAS word is
assigned to the lemma.
4. For thresholds of 80 and less only steps 1 and 2
are repeated.</p>
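      <p>One possible reading of the procedure above is sketched below in Python. It is an illustration only: the names predict_fst and best_by_ending, the contents of TABLE, DELAS and PREFIXES, and the class codes are hypothetical, and the exact interleaving of thresholds and ending lengths is an assumption. Filtering by grammatical category (step 1) is reduced here to a lookup keyed by ending, POS and gender.</p>
      <preformat><![CDATA[
```python
# Illustrative data only: endings, class codes and relative frequencies are made up.
TABLE = {
    ("uljar", "N", "m"): [("N_A", 100.0)],
    ("jar", "N", "m"): [("N_A", 60.0), ("N_B", 40.0)],
}
DELAS = {"provodnik": "N_C"}  # known lemma -> FST class (hypothetical code)
PREFIXES = {"super"}          # stand-in for the dictionary of prefixes


def best_by_ending(lemma, pos, gender, table, threshold):
    """Return the top-ranked class of the longest ending that reaches the threshold."""
    for n in (6, 5, 4, 3):  # longest ending first
        if len(lemma) < n:
            continue
        candidates = table.get((lemma[-n:], pos, gender), [])
        if candidates and candidates[0][1] >= threshold:
            return candidates[0][0]
    return None


def predict_fst(lemma, pos, gender, table=TABLE, delas=DELAS, prefixes=PREFIXES):
    """Return (fst_class, rule_used), or (None, 'unresolved') if nothing applies."""
    # Step 2: strict relative-frequency thresholds 99, 98, ..., 95.
    for threshold in range(99, 94, -1):
        fst = best_by_ending(lemma, pos, gender, table, threshold)
        if fst is not None:
            return fst, "ending"
    # Step 3: a prefix longer than 2 characters followed by a known DELAS word.
    for k in range(3, len(lemma)):
        if lemma[:k] in prefixes and lemma[k:] in delas:
            return delas[lemma[k:]], "prefix"
    # Step 4: relaxed thresholds, from 80 downwards.
    for threshold in range(80, 0, -1):
        fst = best_by_ending(lemma, pos, gender, table, threshold)
        if fst is not None:
            return fst, "ending-relaxed"
    return None, "unresolved"
```
]]></preformat>
      <p>With this toy data, predict_fst("grabuljar", "N", "m") is resolved by its 5-character ending, while a lemma such as superprovodnik falls through to the prefix rule and inherits the class of provodnik.</p>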
      <p>From a sample of domain texts and
dictionaries we manually filtered 623 new terms from
the domains of mining, geology and e-learning and
applied the described procedure for FST class
prediction: the correct FST class was assigned to 582 (93%) of them,
27 (4%) were assigned a partly correct
class (for instance, the inflection is correct
but falsely allows plural forms), and 14 (2%)
were assigned an incorrect class.</p>
      <p>4 Syntactic graphs for MWU recognition</p>
    </sec>
    <sec id="sec-3">
      <title>Structure of terms in termbases</title>
      <p>
        In order to analyze the structure of terms in
different domains, primarily the number of
components they consist of, we used samples from
three terminological resources for Serbian. Two
of them, GeolISSTerm (http://geoliss.mprrpp.gov.rs/term) and
RudOnto (http://rudonto.rgf.bg.ac.rs/), have been developed at the University of
Belgrade, Faculty of Mining and Geology.
GeolISSTerm is a bilingual thesaurus of
geological terms in Serbian and their English
equivalents (Stankovic et al., 2011), divided into several
subdomains: petrology, mineralogy,
hydrogeology, geophysics, structural geology, etc. RudOnto
covers the larger area of mining engineering
and mine safety terminology (Stankovic et al.,
2012). The third termbase used is the Dictionary
of Library and Information Sciences (RNBS, http://rbi.nb.rs/en/home.html),
developed by the National Library of Serbia. It
contains terminology in Serbian, English and
German, related to the theory and practice of
librarianship and information sciences and a wide
range of close or related fields.
      </p>
      <p>
        Table 2 presents the distribution of terms
consisting of 1, 2, 3, 4 and more components for the
three termbases. These results are consistent with
the results presented in
        <xref ref-type="bibr" rid="ref10">(Justeson &amp; Katz, 1995)</xref>
        , at
least for GeolISSTerm and RNBS, and show that
terms with 5 or more components are much less
frequent than the shorter ones. The results are
somewhat different for RudOnto, as it contains
very specific terms, such as causes of injuries,
employee positions, types of injuries, or
technical characteristics of machines, which are often
longer MWUs than the less specific terminology
of the two other termbases. Two examples from
RudOnto illustrate this: a term for an employee
position, “Geologist for mineralogy, petrology,
sedimentology and geochemical research”, and a
term for a technical characteristic of machines,
“Length of the caterpillar transporting device
measured from the vertical excavator rotation
axis to the front edge of the caterpillar”.
      </p>
      <p>
        The extraction of MWUs from a text is preceded
by the retrieval of new simple word terms from it
and their incorporation into the existing system of
morphological e-dictionaries, as MWU extraction
relies heavily on existing lexical resources.
      </p>
      <p>In the Serbian e-dictionary of MWUs, all
entries are distributed in classes according to their
syntactic structure, or more precisely, according
to the information needed for their inflection.
The names of classes correspond to the names of
special FSTs that are used for MWU inflection.
For instance, the class AXN pertains to MWUs
with the syntactic structure: an adjective (A)
followed by a noun (N), where the two components
agree in gender, number, case and animateness.
In class names X stands for a component that
does not inflect when an MWU inflects or for a
component separator. In the case of AXN, X
stands for the separator, usually a space.
Sometimes, MWUs with different syntactic structures
belong to the same class. For instance, the class
N4X implies that MWUs belonging to it consist
of a noun followed by two other components
(separated by two separators) that do not inflect.
The syntactic structure of these components can
be a noun followed by two adjectives/nouns in
the genitive case (e.g. eksploatacija mineralnih
sirovina ‘exploitation of mineral resources’) but
also a noun followed by a prepositional phrase
(e.g. bager na šinama ‘excavator on rails’).</p>
      <p>There are 29 such classes for Serbian nominal
MWUs. (The number of FSTs, 80, is greater than the number of
classes because the FSTs also deal with other details of inflection:
whether the MWU inflects in number, whether some components
are optional, etc.) However, 10 of these classes are used for the
inflection of more than 98% of all nominal
MWUs. Four of them are used for the
inflection of 2-component MWUs, four for the
inflection of 3-component MWUs and two for
the inflection of 4-component MWUs. Given that
they cover the large majority of MWUs, we have
developed syntactic FSTs for the extraction of
MWUs belonging to these 10 classes. They are
listed below in descending order of frequency:
1. AXN – an adjective followed by a noun; the
adjective and the noun have to agree in all
four grammatical categories; e.g. zemni gas
‘natural gas’.
2. 2XN – a noun preceded by a word that does
not inflect in the MWU. Usually it is a word
used only in one or a few MWUs, a prefix or
an adverb derived from an adjective, while
the separator is usually a hyphen; e.g.
ankermreža ‘anchor network’.
3. N2X – a noun followed by a word that does
not inflect in the MWU. Usually this word is
a noun in the genitive or in the instrumental
case; e.g. patrona eksploziva ‘explosive
cartridge’ and upravljanje krovinom ‘roof
control’.
4. N4X – a noun followed by two words that do
not inflect in the MWU. Two syntactic
structures are possible:
a. NNgi - A noun followed by two
adjectives/nouns in the genitive case or in the
instrumental case; e.g. otkopavanje širokim
čelom ‘broad forehead excavation’.
b. NprepNp - A noun followed by a
prepositional phrase; e.g. lanac sa grabuljama
‘chain with a rake’.
5. AXN2X – a noun preceded by an adjective
that agrees with it in gender, number, case
and animateness and followed by a word that
does not inflect in the MWU, usually a noun
in the genitive or instrumental case; e.g.
geološko kartiranje terena ‘geological field
mapping’.
6. NXN – a noun followed by a noun that agrees
with it in number and case, where the
separator can be a hyphen; e.g. bager kašikar
‘shovel excavator’.
7. AXAXN – a noun preceded by two adjectives
that agree with it in gender, number, case and
animateness; e.g. površinski istražni radovi
‘surface exploration works’.
8. N6X - a noun followed by three words that do
not inflect in the MWU. Three syntactic
structures are possible:
a. NNgiPrepNp - a noun followed by a noun
in the genitive case and a prepositional
phrase (as in case 4b); e.g. priprema ležišta
za otkopavanje ‘deposit preparation for
mining’.
b. NNgiNgiNgi - a noun followed by three
nouns/adjectives in the genitive case; e.g.
istraživanje ležišta mineralnih sirovina
‘exploration of mineral deposits’.
c. NprepNpNgi - a noun followed by a
prepositional phrase; e.g. bakar sa primesama
zlata ‘copper with a sprinkling of gold’.
9. AXN4X – a noun preceded by an adjective
that agrees with it in gender, number, case
and animateness and followed by two words
that do not inflect in the MWU. Two syntactic
structures are possible:
a. ANPrepNp - A noun preceded by an
adjective and followed by a prepositional phrase
(as in case 4b); e.g. gravitacijska
koncentracija u vodi ‘gravity concentration
in water’.
b. ANNgiNgi - a noun preceded by an
adjective and followed by two adjectives/nouns
in the genitive case or in the instrumental
case (as in case 4a); e.g. površinska
eksploatacija mineralnih sirovina ‘surface
exploitation of mineral resources’.
10. 2XAXN - an adjective followed by a noun
that agrees with it in all four grammatical categories
and preceded by a word that does not inflect
in the MWU; e.g. magmatsko-eruptivni masiv
‘magmatic-igneous massif’.</p>
      <p>The FST for the extraction of MWUs of type AXN, with
two paths from one of the subgraphs that
illustrate the agreement between adjectives and nouns,
is depicted in Figure 5. A dictionary variable used
in the FST output in the form $a.LEMMA$
retrieves the lemma of the recognized word form $a$,
thus performing simple word lemmatization.</p>
      <p>Due to the high homography of word forms,
the same sequence of words may be
recognized by two or more graphs; naturally,
only one recognition can be correct. For
instance, if the MWU bager kašikar (case 6, NXN)
is detected in the analyzed text in the genitive
case, bagera kašikara, it may be erroneously
interpreted as an MWU of the form NNg (case 3) in
the genitive case. Consequently, all NNg
constructions in an analyzed text that appear in the
genitive case (which happens very frequently)
will also be interpreted as an NXN case. For that
reason, in the case of ambiguous recognition we
always give precedence to the more probable
case. For instance, for 2-component MWUs the
precedence is: AXN, 2XN, N2X, NXN.</p>
      <p>As a rule, we look for the longest
match for an MWU, that is, if a text matches an
AXAXN pattern, then we ignore the subsumed
AXN match. However, in certain
cases we take the shorter matches into consideration
as well. For instance, a sequence recognized as
NNgNgNg may well not be a multi-unit term,
but rather consist of two multi-unit terms of the
form NNg or contain as its part an AXAXN term;
e.g. sprečavanje zagađenja životne sredine
‘prevention of environmental pollution’ may not be
considered a term, while zagađenje životne
sredine ‘environmental pollution’ is. For that
reason, the order of term candidate extraction is:
1. AXAXN, 2XAXN, AXN2X, AXN4X, AXN
2. N6X
3. N4X
4. 2XN, N2X, NXN</p>
      <p>At the end of each round duplicates are
eliminated according to the priority, and the union of
all results is formed.</p>
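      <p>The round-based selection with precedence can be sketched in Python as follows. This is a hedged illustration rather than the actual implementation: the function select_candidates and the representation of matches as (start, end, class) token spans are hypothetical, and the rule that a span selected in an earlier round is not reassigned in a later one is an assumption. Shorter spans nested in longer ones are deliberately kept, since the text above notes that they may be terms in their own right.</p>
      <preformat><![CDATA[
```python
# Extraction rounds in the order given above; within a round,
# classes listed earlier take precedence for the same span.
ROUNDS = [
    ["AXAXN", "2XAXN", "AXN2X", "AXN4X", "AXN"],
    ["N6X"],
    ["N4X"],
    ["2XN", "N2X", "NXN"],
]


def select_candidates(matches):
    """matches: iterable of (start, end, cls) token spans reported by the graphs.

    Returns one (start, end, cls) triple per distinct span, in selection order.
    """
    matches = list(matches)
    taken = {}  # span -> class; dicts preserve insertion order
    for classes in ROUNDS:
        rank = {cls: i for i, cls in enumerate(classes)}
        best = {}
        for start, end, cls in matches:
            span = (start, end)
            if cls not in rank or span in taken:
                continue  # wrong round, or span already decided earlier
            if span not in best or rank[cls] < rank[best[span]]:
                best[span] = cls
        for span in sorted(best):
            taken[span] = best[span]
    return [(start, end, cls) for (start, end), cls in taken.items()]
```
]]></preformat>
      <p>For a span matched by both the N2X and the NXN graph, only the N2X reading survives, which corresponds to the precedence AXN, 2XN, N2X, NXN stated for 2-component MWUs.</p>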
      <p>The output of processing by transducers is the
initial version of the normalized MWU, which
results from simple word lemmatization: inflected
parts of an MWU are replaced by their lemmas, as
they are recorded in e-dictionaries. The list of
produced normalized MWUs is then additionally
processed by a new set of transducers in order to
obtain correct MWU lemmas. The following
adjustments have to be performed:
1. For MWUs with syntactic structure AXN,
AXAXN, AXN2X, AXN4X, and 2XAXN
the form of the adjectives has to be
corrected so that the right gender is selected to
correspond to the gender of the noun (simple
word lemmas are always in the masculine
gender). For example, when simple word
lemmatization offers the lemma minski<sub>m</sub>
bušotina<sub>f</sub> ‘blasting boreholes’ it has to be
corrected to minska<sub>f</sub> bušotina<sub>f</sub>.
2. For all MWUs, the right number of the
MWU has to be selected: if it appeared in a
text only in singular form or only in plural
form, then the lemma will be in the
respective form (e.g. only singular form jamski
vazduh ‘air in the underground mine’, only
plural form atmosferske padavine
‘atmospheric precipitation’); if it appeared in both
plural and singular forms, then both forms
of lemmas will be offered.</p>
      <p>Production of correct MWU lemmas is a
prerequisite for successful evaluation. Moreover,
entries for the morphological e-dictionary of MWUs
can be produced only from correct MWU
lemmas. Finally, as a byproduct of the whole process,
MWU inflectional classes for newly retrieved
MWUs are obtained: they are derived directly
from the local grammars used for their extraction.</p>
      <p>4.3 Evaluation of performance of MWU extraction</p>
      <p>In order to evaluate our approach, we applied it
to a collection of 74 papers in Serbian from the
journal Infotheca (Infotheca - Journal for Digital Humanities,
http://infoteka.bg.ac.rs/index.php/en/infoteca). The size of the corpus is
272,557 simple word forms. Our procedure
extracted 65,279 MWUs from it, 86.9% of them
occurring only once, 7.9% occurring twice, 3.8%
occurring 3 to 5 times and 1.9% with more than
5 occurrences.</p>
      <p>Graph 3 (N2X) extracted 31% of all
MWUs with frequency greater than 1. It is
followed by graph 6 (NXN) with 26% of MWUs,
graph 4 (N4X) with 22%, graph 1 (AXN) with
16%, and the remaining six graphs with 6%. As
for MWUs with frequency greater than 5, graph 1
(AXN) covers 31%, graph 3 (N2X) 25%, graph 6
(NXN) 22%, graph 4 (N4X) 17%, and the
remaining six graphs 5%.</p>
      <p>Extracted MWUs were manually evaluated on
a subset of 690 entries. The evaluators checked
1) whether the proposed lemmas were grammatically
correct and 2) whether the MWU terms belong to
domain terminology, in this case library and
information science, or to the general lexicon.</p>
      <p>
        For candidate ranking three measures were
used: frequency, C-Value (Frantzi et al., 2000)
and log-likelihood
        <xref ref-type="bibr" rid="ref6 ref9">(Dunning, 1993; Gelbukh et
al., 2010)</xref>
        .
      </p>
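      <p>As an illustration of the C-Value measure, the following Python sketch applies the basic formula of Frantzi et al. (2000): a candidate that is not nested in a longer candidate scores log2 of its length times its frequency, while for a nested candidate the mean frequency of the longer candidates containing it is first subtracted from its own frequency. The function names and the input layout (tuples of word lemmas mapped to frequencies) are hypothetical, the frequencies are invented, and the iterative bookkeeping of nested counts from the original algorithm is omitted.</p>
      <preformat><![CDATA[
```python
import math


def is_nested(short, long):
    """True if `short` occurs as a contiguous sub-sequence of the longer `long`."""
    n, m = len(short), len(long)
    return n < m and any(long[i:i + n] == short for i in range(m - n + 1))


def c_values(freq):
    """freq: {tuple_of_lemmas: frequency}. Returns {tuple_of_lemmas: C-value}."""
    scores = {}
    for a, f_a in freq.items():
        # frequencies of the longer candidates that contain `a`
        containing = [f_b for b, f_b in freq.items() if is_nested(a, b)]
        if containing:
            f_a = f_a - sum(containing) / len(containing)
        scores[a] = math.log2(len(a)) * f_a
    return scores


# Invented frequencies, for illustration only.
FREQ = {
    ("životna", "sredina"): 10,
    ("zagađenje", "životna", "sredina"): 4,
}
SCORES = c_values(FREQ)
```
]]></preformat>
      <p>Note that with this basic formula single-word candidates always score zero, because log2(1) = 0; the measure is meant for multi-word candidates.</p>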
      <p>For grammatical correctness, the precision at
rank n (P@n) measure is very high (Figure 6) and
independent of the ranking (the trend is flat).</p>
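      <p>Precision at rank n is the share of correct candidates among the first n entries of a ranked list. A minimal Python sketch follows; the function name and the representation of evaluator judgements as a list of booleans are assumptions made for illustration.</p>
      <preformat><![CDATA[
```python
def precision_at_n(ranked_labels, n):
    """ranked_labels: judgements (True = correct) ordered from best-ranked down."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    # Candidates missing beyond the end of the list count as incorrect.
    return sum(1 for label in ranked_labels[:n] if label) / n


# Invented judgements for four ranked candidates.
JUDGEMENTS = [True, True, False, True]
```
]]></preformat>
      <p>A flat P@n curve, as reported for grammatical correctness, means that this ratio stays roughly the same whatever n is chosen.</p>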
      <p>In order to calculate the log-likelihood
measure we used an excerpt from the general Corpus
of Contemporary Serbian<sup>7</sup> that consists of 22
million simple word forms.</p>
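      <p>A common way of comparing the frequency of a candidate in a domain corpus with its frequency in a reference corpus is the log-likelihood ratio over a 2x2 contingency table (Dunning, 1993). The Python sketch below shows that general statistic; whether it matches the exact variant used in this evaluation is an assumption, and the function name, argument names and example frequencies are hypothetical. Only the two corpus sizes are taken from the text.</p>
      <preformat><![CDATA[
```python
import math


def log_likelihood(a, b, c, d):
    """G2 statistic for a term occurring a times in a domain corpus of size c
    and b times in a reference corpus of size d."""
    n = c + d
    # (observed count, row total, column total) for the four cells of the table
    cells = [
        (a, a + b, c),          # term, domain corpus
        (b, a + b, d),          # term, reference corpus
        (c - a, n - a - b, c),  # other units, domain corpus
        (d - b, n - a - b, d),  # other units, reference corpus
    ]
    g2 = 0.0
    for observed, row_total, col_total in cells:
        expected = row_total * col_total / n
        if observed > 0:  # a cell with zero observations contributes nothing
            g2 += observed * math.log(observed / expected)
    return 2.0 * g2


DOMAIN_SIZE = 272_557         # simple word forms in the evaluated corpus
REFERENCE_SIZE = 22_000_000   # simple word forms in the reference excerpt
# Invented frequencies: 50 occurrences in the domain text, 10 in the reference.
EXAMPLE = log_likelihood(50, 10, DOMAIN_SIZE, REFERENCE_SIZE)
```
]]></preformat>
      <p>The statistic is zero when the relative frequencies in the two corpora are equal and grows with the difference; since it does not indicate the direction of the difference, the relative frequencies themselves still have to be compared to confirm that a candidate is over-represented in the domain text.</p>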
      <p>Figure 7 presents the precision at rank n (P@n) for
the 690 evaluated term candidates with respect to domain
affiliation, measured on a sorted
list of candidates; log-likelihood gave the best results.</p>
      <p>7 The Corpus of Contemporary Serbian
(http://www.korpus.matf.bg.ac.rs/)</p>
      <p>The research outlined in this paper tackles the
extraction of domain terminology and its
integration into terminological dictionaries using lexical
resources and local grammars. Results obtained
by following this approach justify its basic
assumption that the task of term extraction, both in
the case of simple words and multi-word units,
can be successfully accomplished by combining
existing e-dictionaries and FSTs. Moreover,
lexical resources and local grammars ease the
task of integrating the newly discovered terms
into terminological dictionaries by simplifying
the task of defining the proper inflectional class
for new terms, a task that is extremely complex in the case
of morphologically rich languages such as
Serbian. By implementing the procedure proposed
in this paper we have considerably sped up
the development of terminological dictionaries
for Serbian.</p>
      <p>
        Further research will address the integration
of inflectional class prediction into the existing
software tools for handling dictionaries
developed at the University of Belgrade and the creation of a
web tool that would support the entire procedure
described in this paper. Production of dictionary
entries in DELA format for verbs, akin to the one
described for nouns, is also under consideration.
A detailed evaluation will follow with the aim of
further refining the presented procedure in
order to reduce as far as possible the
need for human intervention in the
process of terminology acquisition and description.
Our future work will also be oriented towards the use
of web sites for the evaluation of new term
candidates
        <xref ref-type="bibr" rid="ref16">(Robitaille et al., 2006)</xref>
        .
      </p>
      <p>Acknowledgement. This research was supported by
the Serbian Ministry of Education and Science under
the grant #47003 and Parseme COST action IC1207.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Ammar</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haddar</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Romary</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          (
          <year>2015</year>
          ).
          <article-title>Automatic Construction of a TMF Terminological Database Using a Transducer Cascade</article-title>
          .
          <source>Proc. of Recent Advances in Natural Language Processing</source>
          . (pp.
          <fpage>17</fpage>
          -
          <lpage>23</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Baldwin</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>S. N.</given-names>
          </string-name>
          (
          <year>2010</year>
          ).
          <article-title>Multiword expressions</article-title>
          . In
          <source>Handbook of Natural Language Processing, second edition</source>
          (pp.
          <fpage>267</fpage>
          -
          <lpage>292</lpage>
          ). CRC Press.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Cerbah</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Daille</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          (
          <year>2007</year>
          ).
          <article-title>A Service Oriented Architecture for Adaptable Terminology Acquisition</article-title>
          . In Z. Kedad,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lammari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Métais</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Meziane</surname>
          </string-name>
          &amp; Y. Rezgui (Eds.),
          <source>Natural Language Processing and Information Systems</source>
          (Vol.
          <volume>4592</volume>
          , pp.
          <fpage>420</fpage>
          -
          <lpage>426</lpage>
          ): Springer Berlin Heidelberg.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Courtois</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Silberztein</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>1990</year>
          ).
          <article-title>Dictionnaires électroniques du français</article-title>
          .
          <source>Larousse</source>
          , Paris.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Daille</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          (
          <year>2000</year>
          ).
          <article-title>Morphological rule induction for terminology acquisition</article-title>
          .
          <source>Proc. of the 18th Conference on Computational Linguistics</source>
          (Vol.
          <volume>1</volume>
          , pp.
          <fpage>215</fpage>
          -
          <lpage>221</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Dunning</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          (
          <year>1993</year>
          ).
          <article-title>Accurate methods for the statistics of surprise and coincidence</article-title>
          .
          <source>Comput. Linguist.</source>
          ,
          <volume>19</volume>
          (
          <issue>1</issue>
          ),
          <fpage>61</fpage>
          -
          <lpage>74</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Enguehard</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Pantera</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          (
          <year>1995</year>
          ).
          <article-title>Automatic natural acquisition of a terminology</article-title>
          .
          <source>Journal of quantitative linguistics</source>
          ,
          <volume>2</volume>
          (
          <issue>1</issue>
          ):
          <fpage>27</fpage>
          -
          <lpage>32</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Frantzi</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ananiadou</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Mima</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          (
          <year>2000</year>
          ).
          <article-title>Automatic recognition of multi-word terms: the C-value/NC-value method</article-title>
          .
          <source>International Journal on Digital Libraries</source>
          ,
          <volume>3</volume>
          (
          <issue>2</issue>
          ):
          <fpage>115</fpage>
          -
          <lpage>130</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Gelbukh</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sidorov</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lavin-Villa</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Chanona-Hernandez</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          (
          <year>2010</year>
          ).
          <article-title>Automatic Term Extraction Using Log-Likelihood Based Comparison with General Reference Corpus</article-title>
          . In C. Hopfe,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Rezgui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Métais</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Preece</surname>
          </string-name>
          &amp; H. Li (Eds.),
          <source>Natural Language Processing and Information Systems</source>
          (Vol.
          <volume>6177</volume>
          , pp.
          <fpage>248</fpage>
          -
          <lpage>255</lpage>
          ): Springer Berlin Heidelberg.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Justeson</surname>
            ,
            <given-names>J. S.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Katz</surname>
            ,
            <given-names>S. M.</given-names>
          </string-name>
          (
          <year>1995</year>
          ).
          <article-title>Technical terminology: some linguistic properties and an algorithm for identification in text</article-title>
          .
          <source>Natural Language Engineering</source>
          ,
          <volume>1</volume>
          (
          <issue>01</issue>
          ):
          <fpage>9</fpage>
          -
          <lpage>27</lpage>
          . doi:10.1017/S1351324900000048
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Krstev</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          (
          <year>2008</year>
          ).
          <source>Processing of Serbian. Automata, Texts and Electronic Dictionaries</source>
          : Faculty of Philology of the University of Belgrade.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Nakagawa</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Mori</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          (
          <year>2003</year>
          ).
          <article-title>Automatic term recognition based on statistics of compound nouns and their components</article-title>
          .
          <source>Terminology</source>
          ,
          <volume>9</volume>
          (
          <issue>2</issue>
          ),
          <fpage>201</fpage>
          -
          <lpage>219</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Przepiórkowski</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Degórski</surname>
            ,
            <given-names>Ł.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Wójtowicz</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          (
          <year>2007</year>
          ).
          <article-title>On the evaluation of Polish definition extraction grammars</article-title>
          .
          <source>Proc. of the 3rd Language &amp; Technology Conference.</source>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>Quochi</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frontini</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Rubino</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>A MWE Acquisition and Lexicon Builder Web Service</article-title>
          .
          <source>Proc. of COLING 2012</source>
          (pp.
          <fpage>2291</fpage>
          -
          <lpage>2306</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>Ramisch</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Araujo</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Villavicencio</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>A broad evaluation of techniques for automatic acquisition of multiword expressions</article-title>
          .
          <source>Proc. of ACL 2012 Student Research Workshop</source>
          (pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Robitaille</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sasaki</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tonoike</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sato</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Utsuro</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          (
          <year>2006</year>
          ).
          <article-title>Compiling French-Japanese Terminologies from the Web</article-title>
          .
          Paper presented at the
          <source>11th Conference of the European Chapter of the Association for Computational Linguistics (EACL)</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <surname>Rodríguez</surname>
            ,
            <given-names>F. M. B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Noya</surname>
            ,
            <given-names>E. D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Otero</surname>
            ,
            <given-names>P. G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martínez</surname>
            ,
            <given-names>M. L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mato</surname>
            ,
            <given-names>E. M. M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rojo</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Docío</surname>
            ,
            <given-names>S. S.</given-names>
          </string-name>
          (
          <year>2007</year>
          ).
          <article-title>A Corpus and Lexical Resources for Multi-word Terminology Extraction in the Field of Economy in a Minority Language</article-title>
          .
          <source>Proc. of 3rd Language &amp; Technology Conference.</source>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <surname>Sag</surname>
            ,
            <given-names>I. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baldwin</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bond</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Copestake</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Flickinger</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2002</year>
          ).
          <article-title>Multiword expressions: A pain in the neck for NLP</article-title>
          . In
          <source>Computational Linguistics and Intelligent Text Processing</source>
          (pp.
          <fpage>1</fpage>
          -
          <lpage>15</lpage>
          ): Springer.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <surname>Savary</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2009</year>
          ).
          <article-title>Multiflex: A Multilingual Finite-State Tool for Multi-Word Units</article-title>
          . In S. Maneth (Ed.),
          <source>Implementation and Application of Automata</source>
          (Vol.
          <volume>5642</volume>
          , pp.
          <fpage>237</fpage>
          -
          <lpage>240</lpage>
          ): Springer Berlin Heidelberg.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <surname>Savary</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zaborowski</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krawczyk-Wieczorek</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Makowiecki</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>SEJFEK - a Lexicon and a Shallow Grammar of Polish Economic Multi-Word Units</article-title>
          .
          <source>Proc. of Cognitive Aspects of the Lexicon (COGALEX-III)</source>
          . (pp.
          <fpage>195</fpage>
          -
          <lpage>214</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kordoni</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Villavicencio</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Idiart</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2006</year>
          ).
          <article-title>Automated multiword expression prediction for grammar engineering</article-title>
          .
          <source>Proc. of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties</source>
          . (pp.
          <fpage>36</fpage>
          -
          <lpage>44</lpage>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>