Challenges in Extracting Terminology from Modern Greek Texts
Aristomenis Thanopoulos, Katia Kermanidis and Nikos Fakotakis¹

Abstract. This paper describes the automatic extraction of economic terminology from Modern Greek texts as a first step towards creating an ontological thesaurus of economic concepts. Unlike previous approaches, the domain-specific corpus utilized varies in genre, and is therefore rich in vocabulary and linguistic structure, while the pre-processing level is relatively low (basic morphological tagging and the detection of elementary, non-overlapping chunks) and fully automatic. The idiosyncratic properties of Modern Greek noun phrases are taken into account: the freedom in word ordering and the richness in morphology. The peculiarity of the available corpora is also dealt with: the large size of the economic corpus compared to the balanced corpus. A combination of statistical filters (relative frequency ratios and log likelihood) and smoothing is employed in order to deal with the aforementioned challenges when filtering out non-terms.

1 INTRODUCTION

Terms are the linguistic expression of concepts. Domain-specific terms capture the knowledge of a given domain and reflect it in the form of words that are commonly accepted by the members of the domain community, enabling the latter to interact and exchange information. In contrast to the use of static dictionaries, acquiring terminology automatically from domain texts leads to a list of extracted terms that may be dynamically updated and ranked according to usage. Term extraction is a first step towards acquiring a domain ontology. An ontology is a thesaurus that provides the relationships among the terms, and sorts them in a hierarchical structure, based on their semantic specificity and their properties.
    Several methods have been employed for the extraction of domain terms. Regarding the linguistic pre-processing of the text corpora, approaches vary from simple tokenization and part-of-speech tagging ([1],[2]), to the use of shallow parsers and higher-level linguistic processors ([4],[8]). The latter aim at identifying syntactic patterns, like noun phrases, and their structure (e.g. head-modifier), in order to rule out tokens that cannot grammatically constitute terms (e.g. adverbs, verbs, pronouns, articles, etc).
    The statistical filters that have been employed in previous work to filter out non-terms also vary. Using corpus comparison, the techniques try to identify words/phrases that present a different statistical behavior in the corpus of the target domain, compared to their behavior in the rest of the corpora. Such words/phrases are considered to be terms of the domain in question. In the simplest case, the observed frequencies of the candidate terms are compared ([1]). Kilgarriff in [6] experiments with various other metrics, like the χ² score, the t-test, mutual information, the Mann-Whitney rank test, the Log Likelihood, Fisher’s exact test and the TF.IDF (term frequency-inverse document frequency). Frantzi et al. in [2] present a metric that combines statistical (frequencies of compound terms and their nested sub-terms) and linguistic (context words are assigned a weight of importance) information.
    In this paper we present the first phase of the ongoing work towards the creation of an ontology hierarchy of economic concepts. This phase includes the automatic extraction of economic terms from a Modern Greek phrase-analyzed corpus by corpora comparison in combination with applying a threshold to the relative frequency ratios.
    An important aspect of the present approach is the stylistic nature of the domain-specific (economic) corpus. In most of the previous work, the domain corpus is to a large extent restricted in the vocabulary it contains and in the variety of syntactic structures it presents. Our economic corpus does not consist of syntactically standardized taglines of economic news. On the contrary, it presents a very rich variety in vocabulary, syntactic formulations, idiomatic expressions and sentence length, making the process of term extraction an interesting challenge.
    In addition to this, the employed pre-processing tools (shallow phrase chunker) make use of limited resources (see section 2.2) and the question arises whether the resulting low-level information is sufficient to deal with the linguistic complexity of the corpus.
    Another challenge that has been faced by the present work is the language itself. In Modern Greek the ordering of the constituents of a sentence or a phrase is loose and determined primarily by the rich morphology. As a result, the extraction of compound terms, as well as the identification of nested terms, is not straightforward and cannot be treated as a case of simple string concatenation, as in English. Section 2.3 describes an approach for extracting the counts of candidate terms, which takes into account the freedom in word ordering.
    Finally, a peculiar trait of the current work is the corpora that are available to us. While the economic corpus is sufficiently large, the balanced corpus is relatively small. As a result, the terms (especially bi-grams) that occur in both corpora are few, while many valid terms appear in the domain-specific corpus alone. This makes it impossible to use the traditional methodology of corpora comparison alone (which presupposes the appearance of a candidate term in both corpora) in order to filter out non-terms. A smoothing technique, described in section 3, is applied to overcome this problem.


¹ Wire Communications Laboratory, University of Patras, Greece. Email: {aristom, kerman, fakotaki}@wcl.ee.upatras.gr

2 LINGUISTIC PROCESSING

A set of linguistic processing tools has been employed in order to parse the textual corpora. The first goal is to detect nouns (e.g. τράπεζα - bank), nominal compounds (αύξηση κεφαλαίου - capital increase) and named entities (Τράπεζα της Ελλάδος - Bank of Greece). All the above structures appear in the noun and prepositional phrases of a sentence. These types of phrases need to be detected, non-content words that appear in them have to be disregarded, and the candidate economic terms need to be formed. This process is described in detail in the rest of this section.

2.1 Modern Greek

Regarding the properties of the language that are strongly related to the current task, it has to be taken into account that Modern Greek is highly inflectional. The rich morphology allows for a larger degree of freedom in the ordering of the constituents of a phrase (headword and modifiers), compared to other languages such as English or German. More specifically, modifiers like adjectives, numerals and pronouns may precede or follow the head noun.
    Another common property of noun phrases is the presence of nominal modifiers in the genitive case that denote possession, quality, quantity or origin. They are nouns and usually follow the head noun they modify.
    The following two examples show the aforementioned freedom. The two phrases have exactly the same meaning (bank account). The first phrase is an adjective-noun construction, while the second is a noun-genitive modifier construction.

    τραπεζικός λογαριασμός           bank[ADJECTIVE] account[NOUN]
    λογαριασμός τράπεζας             account[NOUN] bank[NOUN-GENITIVE]

2.2 Corpora and processing tools

The corpora used in our experiments were:
    1. The ILSP/ELEFTHEROTYPIA ([3]) and ESPRIT 860 ([9]) Corpora (a total of 300,000 words). Both these corpora are balanced and manually annotated with complete morphological information. Further (phrase structure) information is obtained automatically.
    2. The DELOS Corpus ([5]), a collection of economic domain texts of approximately five million words and of varying genre. It has been automatically annotated from the ground up. Morphological tagging on DELOS was performed by the analyzer of [10]. Accuracy in part-of-speech and case tagging reaches 98% and 94% respectively. Further (phrase structure) information is again obtained automatically.
    All of the above corpora (including DELOS) are collections of newspaper and journal articles. More specifically, regarding DELOS, the collection consists of texts taken from the financial newspaper EXPRESS, reports from the Foundation for Economic and Industrial Research, research papers from the Athens University of Economics and several reports from the Bank of Greece. The documents are of varying genre, like press reportage, news, articles, interviews and scientific studies, and cover all the basic areas of the economic domain, i.e. microeconomics, macroeconomics, international economics, finance, business administration, economic history, economic law, public economics etc. Therefore, the corpus presents a richness in vocabulary, in linguistic structure, and in the use of idiomatic expressions and colloquialisms, which is not encountered in the highly domain- and language-restricted texts normally used for term extraction (e.g. medical records, technical articles, tourist site descriptions). To indicate the linguistic complexity of the corpus, we mention that the length of noun phrases varies from 1 to 53 word tokens.
    All the corpora have been phrase-analyzed by the chunker described in detail in [11]. Noun, verb, prepositional, adverbial phrases and conjunctions are detected via multi-pass parsing. From the above phrases, only noun and prepositional phrases are taken into account for the present task, as they are the only types of phrases that may include terms. Regarding the phrases of interest, precision and recall reach 85.6% and 94.5% for noun phrases, and 99.1% and 93.9% for prepositional phrases respectively. The robustness of the chunker and its independence from extravagant information make it suitable to deal with a corpus like DELOS, which varies in style and is complicated in linguistic structure.
    It should be noted that phrases are non-overlapping. Embedded phrases are flatly split into distinct phrases. Nominal modifiers in the genitive case are included in the same phrase with the noun they modify; nouns joined by a coordinating conjunction are grouped into one phrase. The chunker identifies basic phrase constructions during the first passes (e.g. adjective-nouns, article-nouns), and combines smaller phrases into longer ones in later passes (e.g. coordination, inclusion of genitive modifiers, compound phrases). As a result, named entities, proper nouns and compound nominal constructions are identified during chunking among the rest of the noun phrases.
    The most significant sources of error during the automatic chunking process, which also affect the performance of the term extraction process, are:

1. Excessive phrase cut-up, usually due to erroneous part-of-speech tagging of a word (the word πλήρες - full - in the following example is erroneously tagged as a noun and not as an adjective)

    NP[To πλήρες] NP[κείμενο της ανακοίνωσης] instead of
    NP[To πλήρες κείμενο της ανακοίνωσης]
    (NP[the full text of the announcement])

2. Erroneous NP tagging (unidentifiable adverbs, like όντως - in fact - in the following example, are marked as nouns)

    NP[όντως]         instead of        ADV[όντως]

    In order to detect simple phrases inside larger coordination constructions, we applied the following simple empirical grammar to every noun and prepositional phrase extracted by the chunker. The grammar, which directly identifies conjunctive expressions and produces a list of simple noun phrases, employs the following rules:

    conjunctive_phrase:   phrase       conjunction      phrase
    conjunctive_phrase:   phrase       comma            conjunctive_phrase

Figure 1.    The rules for splitting coordinated phrases.

2.3 Candidate terms

As mentioned before, the noun and prepositional phrases of the two corpora are selected, as only these phrases are likely to contain
terms. Words of no semantic content (i.e. introductory articles, adverbs, prepositions, punctuation marks and symbols) are removed from the phrases.
    Coordination schemes are detected within the phrases, and the latter are split into smaller phrases according to the grammar depicted in Figure 1. The occurrences of words and N-grams, pure as well as nested, are counted. Longer candidate terms are split into smaller units (tri-grams into bi-grams and uni-grams, bi-grams into uni-grams).
    Regarding the bi-grams, in order to overcome the freedom in the word ordering, as discussed in section 2.1, we considered bi-gram A B (A and B being the two lemmata forming the bi-gram) to be identical to bi-gram B A, if the bi-gram is not a named entity. Their joint count in the corpora is calculated and taken into account. The resulting uni-grams and bi-grams are the candidate terms. The candidate term counts in the corpora are then used in the statistical filters described in the next section.
    Figure 2 shows the count calculation for the nested candidate terms. The two tri-grams, A B C and C B D, occur in a corpus three and four times respectively. The accumulative counts of the nested terms are shown in parentheses.

          A B C (3)                                 C B D (4)

    A B (3)           B C (3)                  C B (4)           B D (4)

                                   B C (3+4)

    A (3)         B (3)         C (3)          B (4)         C (4)        D (4)

    A (3)             B (4+3)                 C (3+4)                    D (4)

Figure 2.    Calculation of n-gram frequencies, given the phrase-chunked corpus. The finally extracted n-gram frequencies are indicated in bold.

3 TERM FILTERING

In this section we describe the statistical filters that have been used to filter out non-terms. With D we denote DELOS and with B the balanced corpus. As a first step, the occurrences of each candidate term w ($c_w(D)$ and $c_w(B)$) are counted in the two corpora separately.
    A particularity of the present work is that, unlike in most previous approaches to term extraction, the domain-specific corpus available to us is quite large compared to the balanced corpus. As a result, several terms that appear in DELOS do not appear in the balanced corpus, making it impossible for the LLR statistic to detect them. In other words, these terms cannot be identified by traditional corpora comparison.
    In order to deal with this phenomenon, we applied a smoothing technique to take into account terms that do not appear in the balanced corpus. More specifically, we applied Lidstone’s law ([7]) to our candidate terms, i.e. we augmented each candidate term count by a value of λ=0.5 in both corpora. Thereby, terms that do not appear in the balanced corpus at all end up having $c_w(B)=0.5$. This value was chosen for λ because, due to the small size of the balanced corpus, the probability of coming across a previously unseen word is significant.
    Filtering was then performed in two stages. First, the relative frequency ratio is calculated for each candidate term w, as

    $$RF_w = \frac{f_w(D)}{f_w(B)} \qquad (1)$$
    $$f_w(D) = \frac{c_w(D)}{N} \qquad (2)$$
    $$f_w(B) = \frac{c_w(B)}{M} \qquad (3)$$

N and M denote the counts of all candidate terms in D and B respectively.
    In the next step, for those candidate terms that present an $RF_w > 1$, LLR is calculated (according to the formula of [6]) as

    $$\begin{aligned}
    LLR_w = 2\,\bigl(\, & c_w(D)\log c_w(D) + c_w(B)\log c_w(B) \\
    & + (N - c_w(D))\log(N - c_w(D)) + (M - c_w(B))\log(M - c_w(B)) \\
    & - (c_w(D) + c_w(B))\log(c_w(D) + c_w(B)) - M\log M - N\log N \\
    & - (N + M - c_w(D) - c_w(B))\log(N + M - c_w(D) - c_w(B)) \\
    & + (N + M)\log(N + M) \,\bigr) \qquad (4)
    \end{aligned}$$

    The LLR metric detects how surprising (or not) it is for a candidate term to appear in DELOS or in the balanced corpus (compared to its expected appearance count), and therefore whether it constitutes an economic domain term (or not). Unlike other statistics (like the χ² and mutual information), it is an accurate measure even for rare candidate terms, and for this reason it was selected for the present task. It is asymptotically χ² distributed. So, for one degree of freedom, candidate terms that present an LLR value greater than 7.88 (critical value) can be considered as valid terms at a significance level of 0.005 (i.e. with 99.5% confidence).

4 EXPERIMENTAL RESULTS

The final list of extracted terms was evaluated by a group of three experts in economics and finance. The evaluators were in constant contact to agree upon ambiguous cases of terms. The most important factor for this ambiguity is the lack of context information, especially for uni-grams. In other words, there are several cases of words that may or may not be economic terms depending on the context in which they appear.
    Table 1 lists a window from the list of the candidate terms, selected by chance. Their counts in both corpora are also shown (original counts, prior to smoothing), along with their RF value, and the tags that were given to them by the experts. These are terms with either RF<<1 or RF>>1, i.e. terms that present a significant difference between their frequencies in the two corpora, and so they vary from strongly economic (e.g. tax-related) to non-economic (island).
    As the LLR threshold value decreases (the N-best number increases), the number of non-economic and mostly non-economic terms that enter the N-best terms also increases, causing the precision to drop.
    The results cannot be easily compared to those of previous approaches, due to the many differences in resources and pre-processing. Merely as an indication, these results are comparable to the ones reported in [1] (73% to 86% precision, using a threshold on term frequencies in technical corpora on fiber optic networks, depending on the specific domain corpus and the size of the extracted list of candidate terms, which is similar to the list size in the current work).
    Figure 3 shows the percentage of terms that have been correctly labeled as valid terms (y-axis) when taking into account the N-best labeled terms (x-axis) (i.e. for different LLR thresholds). This graph refers to terms that appear in both corpora and for which $RF_w > 1$. Strongly economic are terms that are characteristic of the
  Table 1. The 24 terms with the highest LLR scores along with their                                                   economic term, while “πολιτισμός” (“culture”) is characterised as
                 counts and their domain relevance.                                                                    possibly important to the domain of economics, since it often
                                                                                   Important Possibly
                                                                                                                       involves a financial level.
                                              DILOS IEL Relative Freq.                                   Unimportant
        word                translation
                                              Freq. Freq.   Ratio
                                                                           LLR       to the Important to
                                                                                                          to Domain       Figure 4 shows the precision achieved for the terms appearing
                                                                                    Domain    Domain
 φορολογικός                 tax-related        352    13           4,63    49,0      9          -           -
                                                                                                                       in both corpora that present an RF<1. It is an interesting graph to
    παρών                       present          13    24           0,09    48,5      -          -           9         observe, in combination with Figure 3, as it shows how the method
   γλώσσα                     language           13    24           0,09    48,5      -          -           9         performs for the terms that are more frequent in the balanced
  αριστερός                   left, leftist       7    20           0,06    48,3      -          9           -
                             intra-party
                                                                                                                       corpus in comparison to DELOS.
εσωκομματικός                                    10    22           0,08    48,1      -          9            -
                               (political)
    διάλογος                     dialog         131    68           0,33    47,4      -          -           9
   πετρέλαιο                 oil (petrol)       213     3          12,14    47,2      9          -           -                                        Strongly Economic (LLR)             Economic (LLR)
  κερδοφορία                 profitability      164     0              -    47,1      9          -           -                                        mostly non-Economic (LLR)           non-Economic (LLR)
   πρόβλεψη                   prediction        283     8           6,05    46,9      9          -           -
      νησί                       island          14    24           0,10    46,8      -          -           9                         1
     άγκυρα                     anchor            4    17           0,04    46,2      -          -           9
                                                                                                                                      0.9
       γιεν                        yen          161     0              -    46,1      9          -           -
     στόχος                      target         821    64           2,19    46,1      9          -           -                        0.8
   αστυνομία                     police          45    38           0,20    46,0      -          9           -                        0.7
    εργάτης                factory worker         3    16           0,03    45,9      -          9           -


                                                                                                                         Precision
                                                                                                                                      0.6
  προοπτική                    prospect         446    23           3,32    45,8      9          -           -
                                                                                                                                      0.5
         OTE              HTO (company)         149     0              -              9          -            -                       0.4
                                                                            45,8
  συμφωνία                  agreement           654    45           2,49    45,8      9          -            -                       0.3
  γερμανικός                  German            238     5           8,14    45,7      -          9            -                       0.2
  πολιτισμός                   culture           31    32           0,17    45,6      -          9            -
    δουλειά                  job, work           38    35           0,19    45,6      -          9            -
                                                                                                                                      0.1
                                                                                                                                       0
  διευθύνων               chief (executive)     199     3          11,43              9          -            -
                                                                            45,6
                                                                                                                                            0       100       200         300       400          500         600         700
  διοικητικός              administrative       278     8           5,94    45,6      9          -            -
    ισοτιμία                 currency           182     2          15,68    45,4      9          -            -                                                     N-best candidate non-terms

domain and necessary for understanding domain texts. Economic                                                          Figure 4.                 Precision (y-axis) for the N-best terms (x-axis) that appear in
are terms that function as economic within a context of this                                                                                           both corpora and that present RF<1.
domain, but may also have a different meaning outside this
domain. Regarding the aforementioned labeling, this category includes terms connected both directly and indirectly to the domain. Mostly non-economic are words that are connected to the specific domain only indirectly, or more general terms that normally appear outside the economic domain but may carry an economic sense in certain limited cases. Non-economic are terms that never appear in an economic sense and cannot be related to the domain in any way. For example, referring to Table 1, “φορολογικός” (“tax” [adjective]) is considered a strongly

    Figure 5 depicts comparative results between LLR and term extraction based on simple frequency counts on DELOS only. This experiment was performed to show the importance of corpora comparison for term extraction, as opposed to using only a domain-specific corpus and applying simple frequencies to the candidate terms appearing in it. As expected, corpus comparison (LLR) leads to better results, as shown by the larger distance between the Economic term curves and the non-Economic term curves. Simple frequency counts tend to include many undesired N-grams among the candidate terms with the highest ranks, simply because these N-grams appear frequently in the corpus. As a result, precision inevitably drops when frequencies from a single corpus are used.
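
The contrast between simple frequency counts and corpus comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The counts of the first four words are taken from Table 2; the corpus sizes, the vocabulary size, the Lidstone parameter lam = 0.5 and the counts of the function word "και" are hypothetical values chosen only for illustration. The corpus-comparison score used here is the smoothed relative frequency ratio (RF), a simpler stand-in for LLR.

```python
def lidstone(count, corpus_size, vocab_size, lam=0.5):
    """Lidstone-smoothed probability: (count + lam) / (corpus_size + lam * vocab_size)."""
    return (count + lam) / (corpus_size + lam * vocab_size)


def rf(c_dom, c_bal, n_dom, n_bal, vocab_size, lam=0.5):
    """Smoothed relative frequency ratio: P(word | domain) / P(word | balanced)."""
    return lidstone(c_dom, n_dom, vocab_size, lam) / lidstone(c_bal, n_bal, vocab_size, lam)


# Hypothetical corpus and vocabulary sizes (assumptions, not taken from the paper).
N_DOM, N_BAL, VOCAB = 5_000_000, 1_000_000, 100_000

# word -> (count in the domain corpus, count in the balanced corpus)
COUNTS = {
    "εταιρία": (5396, 0),     # Table 2
    "δρχ": (3003, 1),         # Table 2
    "μετοχή": (2827, 6),      # Table 2
    "τιμή": (1722, 70),       # Table 2
    "και": (150000, 30000),   # hypothetical function word, frequent everywhere
}

# Ranking by raw frequency in the domain corpus only.
by_frequency = sorted(COUNTS, key=lambda w: COUNTS[w][0], reverse=True)

# Ranking by comparison with the balanced corpus.
by_rf = sorted(COUNTS, key=lambda w: rf(*COUNTS[w], N_DOM, N_BAL, VOCAB), reverse=True)

print("frequency only:   ", by_frequency)
print("corpus comparison:", by_rf)
```

Under these assumptions the function word tops the frequency-only ranking but falls to the bottom once the balanced corpus is taken into account. With lam = 0 the ratio for "εταιρία" would be undefined, since the word has no occurrences in the balanced corpus.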

[Figure 3: line plot; y-axis: Precision (0 to 1); x-axis: N-best candidate terms (50 to 650); series: Strongly Economic, Economic, Mostly non-Economic, non-Economic.]
  Figure 3. Precision (y-axis) for the N-best candidate terms (x-axis) that appear in both corpora and that present RF>1.

[Figure 5: line plot; y-axis: Precision (0 to 1); x-axis: N-best candidate terms (50 to 650); series: Strongly Economic, Economic and non-Economic, each for simple frequency (F) and for LLR.]
  Figure 5. Comparative precision between LLR and simple frequency counts on DELOS.

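
Precision over the N-best candidate terms, as plotted in these figures, can be computed from a ranked candidate list and its manual labels. The following is a minimal sketch; the label strings and the ten-item ranking are hypothetical, and the function name `precision_at_n` is introduced here only for illustration.

```python
def precision_at_n(ranked_labels, accepted, n):
    """Fraction of the top-n ranked candidates whose label is in `accepted`."""
    top = ranked_labels[:n]
    if not top:
        return 0.0
    return sum(label in accepted for label in top) / len(top)


# Hypothetical manual labels for ten candidate terms, best-ranked first.
ranked = ["strong", "strong", "economic", "strong", "mostly_non",
          "economic", "non", "strong", "non", "mostly_non"]

# One point of a "Strongly Economic" curve and one of a combined curve.
p5_strong = precision_at_n(ranked, {"strong"}, 5)
p10_any = precision_at_n(ranked, {"strong", "economic"}, 10)
print(p5_strong, p10_any)
```

Evaluating the function at N = 50, 150, ..., 650 for each label category would yield one curve per category.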
    Table 2 shows the RF and LLR scores of the 20 most highly ranked economic terms, ordered by their LLR value. The depicted counts, Cw(D) in DELOS and Cw(B) in the balanced corpus, are the original ones, prior to smoothing. An interesting term is “υψηλός”, the ancient Greek form for “high”, used today almost exclusively to express the degree of performance, growth, rise, profit, cost or drop (i.e. the appropriate form in an economic context), as opposed to its modern form “ψηλός”, which is used for physical height.

Table 2. The 20 most highly ranked economic terms

Rank  Word         Translation        Cw(D)  Cw(B)     RFw    LLR
   1  εταιρία      company             5396      0  1845.9  852.0
   2  δρχ          drachma             3003      1   342.5  465.5
   3  μετοχή       stock               2827      6    74.4  414.0
   4  αγορά        purchase            2330     33    11.9  257.2
   5  αύξηση       growth, rise        2746     66     7.1  247.6
   6  κέρδος       profit              1820     15    20.1  228.2
   7  τράπεζα      bank                1367     11    20.3  171.8
   8  επιχείρηση   enterprise          1969     56     6.0  162.1
   9  κεφάλαιο     capital             1325     14    15.6  157.3
  10  σημαντικός   important           1872     56     5.7  149.3
  11  πώληση       sale                1203     11    17.9  147.3
  12  προϊόν       product             1282     16    13.3  146.0
  13  όμιλος       (company) group     1036      5    32.2  140.0
  14  Α.Ε.         Inc.                 820      0   280.7  126.4
  15  μετοχικός    stock (adj.)         790      2    54.1  112.8
  16  τιμή         price               1722     70     4.2  110.9
  17  επιτόκιο     interest rate        821      4    31.2  110.0
  18  υψηλός       high (old form)      711      0   243.4  109.2
  19  κόστος       cost                1031     19     9.0  103.4
  20  κλάδος       branch               833      7    19.0  103.2

    Figure 6 shows the difference in precision with LLR for the N-best terms with and without the application of smoothing. When smoothing is not applied, the drop in performance is significant (around 20%). The improvement expected from smoothing is amplified by the fact that terms appearing only in DELOS (and not in the balanced corpus) are not taken into account at all when smoothing is not performed.

[Figure 6: line plot; y-axis: Precision (0 to 1); x-axis: N-best candidate terms (50 to 650); series: Strongly Economic and Economic, each with and without smoothing.]
  Figure 6. Comparative precision using the LLR metric with and without smoothing.

5 CONCLUSION

In this paper we have presented the process of automatically extracting economic terminology from Modern Greek texts. The properties of the language are taken into account by utilizing appropriate pre-processing tools. The linguistic complexity of the domain-specific corpus is addressed by adjusting the traditional candidate term formation methodology to deal with the freedom in word ordering. Finally, the unusual size difference between the two corpora (domain-specific and general) leads to a sparse data problem, which is dealt with satisfactorily by applying Lidstone’s smoothing law.

ACKNOWLEDGEMENTS

We thank the European Social Fund (ESF), Operational Program for Educational and Vocational Training II (EPEAEK II), and particularly the Program PYTHAGORAS II, for funding the above work.

REFERENCES

[1] P. Drouin, ‘Detection of Domain Specific Terminology Using Corpora Comparison’, 4th International Conference on Language Resources and Evaluation (LREC), 79-82, Lisbon, (2004).
[2] K. Frantzi, S. Ananiadou, and H. Mima, ‘Automatic Recognition of Multi-word Terms: the C-value/NC-value Method’, International Journal on Digital Libraries, 3 (2), 117-132, (2000).
[3] N. Hatzigeorgiu, M. Gavrilidou, S. Piperidis, G. Carayannis, A. Papakostopoulou, A. Spiliotopoulou, A. Vacalopoulou, P. Labropoulou, E. Mantzari, H. Papageorgiou, and I. Demiros, ‘Design and Implementation of the online ILSP Greek Corpus’, 2nd International Conference on Language Resources and Evaluation (LREC), Athens, 1737-1742, (2000).
[4] A. Hulth, ‘Improved Automatic Keyword Extraction Given More Linguistic Knowledge’, International Conference on Empirical Methods in Natural Language Processing (EMNLP), Sapporo, 216-223, (2003).
[5] K. Kermanidis, N. Fakotakis and G. Kokkinakis, ‘DELOS: An Automatically Tagged Economic Corpus for Modern Greek’, 3rd International Conference on Language Resources and Evaluation (LREC), Las Palmas de Gran Canaria, 93-100, (2002).
[6] A. Kilgarriff, ‘Comparing Corpora’, International Journal of Corpus Linguistics, 6 (1), 1-37, (2001).
[7] C. Manning and H. Schuetze, Foundations of Statistical Natural Language Processing, MIT Press, (1999).
[8] R. Navigli and P. Velardi, ‘Learning Domain Ontologies from Document Warehouses and Dedicated Web Sites’, Computational Linguistics, 30 (2), 151-179, (2004).
[9] Partners of ESPRIT-291/860, Unification of the Word Classes of the ESPRIT Project 860, Internal Report BU-WKL-0376, (1986).
[10] K. Sgarbas, N. Fakotakis and G. Kokkinakis, ‘A Straightforward Approach to Morphological Analysis and Synthesis’, Proceedings of the Workshop on Computational Lexicography and Multimedia Dictionaries (COMLEX), Kato Achaia, Greece, 31-34, (2000).
[11] E. Stamatatos, N. Fakotakis and G. Kokkinakis, ‘A practical chunker for unrestricted text’, Proceedings of the Conference on Natural Language Processing (NLP), Patras, Greece, 139-150, (2000).