Challenges in Extracting Terminology from Modern Greek Texts

Aristomenis Thanopoulos and Katia Kermanidis and Nikos Fakotakis¹

Abstract. This paper describes the automatic extraction of economic terminology from Modern Greek texts as a first step towards creating an ontological thesaurus of economic concepts. Unlike previous approaches, the domain-specific corpus utilized varies in genre, and is therefore rich in vocabulary and linguistic structure, while the pre-processing level is relatively low (basic morphological tagging, the detection of elementary, non-overlapping chunks) and fully automatic. The idiosyncratic properties of Modern Greek noun phrases are taken into account: the freedom in word ordering and the richness in morphology. The peculiarity of the available corpora is also dealt with: the economic corpus is large compared to the balanced corpus. A combination of statistical filters (relative frequency ratios and log likelihood) and smoothing is employed in order to deal with the afore-mentioned challenges when filtering out non-terms.

1 INTRODUCTION

Terms are the linguistic expression of concepts. Domain-specific terms capture the knowledge of a given domain and reflect it in the form of words that are commonly accepted by the members of the domain community, enabling the latter to interact and exchange information. In contrast to the use of static dictionaries, acquiring terminology automatically from domain texts leads to a list of extracted terms that may be dynamically updated and ranked according to usage. Term extraction is a first step towards acquiring a domain ontology. An ontology is a thesaurus that provides the relationships among the terms, and sorts them in a hierarchical structure, based on their semantic specificity and their properties.

Several methods have been employed for the extraction of domain terms. Regarding the linguistic pre-processing of the text corpora, approaches vary from simple tokenization and part-of-speech tagging ([1],[2]), to the use of shallow parsers and higher-level linguistic processors ([4],[8]). The latter aim at identifying syntactic patterns, like noun phrases, and their structure (e.g. head-modifier), in order to rule out tokens that grammatically cannot constitute terms (e.g. adverbs, verbs, pronouns, articles, etc).

The statistical filters that have been employed in previous work to filter out non-terms also vary. Using corpus comparison, the techniques try to identify words/phrases that present a different statistical behavior in the corpus of the target domain, compared to their behavior in the rest of the corpora. Such words/phrases are considered to be terms of the domain in question. In the simplest case, the observed frequencies of the candidate terms are compared ([1]). Kilgarriff in [6] experiments with various other metrics, like the χ2 score, the t-test, mutual information, the Mann-Whitney rank test, the Log Likelihood, Fisher's exact test and the TF.IDF (term frequency-inverse document frequency). Frantzi et al. in [2] present a metric that combines statistical (frequencies of compound terms and their nested sub-terms) and linguistic (context words are assigned a weight of importance) information.

In this paper we present the first phase of the ongoing work towards the creation of an ontology hierarchy of economic concepts. This phase includes the automatic extraction of economic terms from a Modern Greek phrase-analyzed corpus by corpora comparison in combination with applying a threshold to the relative frequency ratios.

An important aspect of the present approach is the stylistic nature of the domain-specific (economic) corpus. In most of the previous work, the domain corpus is to a large extent restricted in the vocabulary it contains and in the variety of syntactic structures it presents. Our economic corpus does not consist of syntactically standardized taglines of economic news. On the contrary, it presents a very rich variety in vocabulary, syntactic formulations, idiomatic expressions and sentence length, making the process of term extraction an interesting challenge.

In addition to this, the employed pre-processing tools (shallow phrase chunker) make use of limited resources (see section 2.2) and the question arises whether the resulting low-level information is sufficient to deal with the linguistic complexity of the corpus.

Another challenge that has been faced by the present work is the language itself. In Modern Greek the ordering of the constituents of a sentence or a phrase is loose and determined primarily by the rich morphology. As a result, the extraction of compound terms, as well as the identification of nested terms, is not straightforward and cannot be treated as a case of simple string concatenation, as in English. Section 2.3 describes an approach for extracting the counts of candidate terms, which takes into account the freedom in word ordering.

Finally, a peculiar trait of the current work is the corpora that are available to us. While the economic corpus is sufficiently large, the balanced corpus is relatively small. As a result, the terms (especially bi-grams) that occur in both corpora are few, while many valid terms appear in the domain-specific corpus alone. This makes it impossible to filter out non-terms using the traditional methodology of corpora comparison alone, which presupposes the appearance of a candidate term in both corpora. A smoothing technique, described in section 3, is applied to overcome this problem.

¹ Wire Communications Laboratory, University of Patras, Greece.
Email: {aristom, kerman, fakotaki}@wcl.ee.upatras.gr

2 LINGUISTIC PROCESSING

A set of linguistic processing tools has been employed in order to parse the textual corpora. The first goal is to detect nouns (e.g. τράπεζα - bank), nominal compounds (αύξηση κεφαλαίου - capital increase) and named entities (Τράπεζα της Ελλάδος - Bank of Greece). All the above structures appear in the noun and prepositional phrases of a sentence. These types of phrases need to be detected, non-content words that appear in them have to be disregarded, and the candidate economic terms need to be formed. This process is described in detail in the rest of this section.

2.1 Modern Greek

Regarding the properties of the language that are strongly related to the current task, it has to be taken into account that Modern Greek is highly inflectional. The rich morphology allows for a larger degree of freedom in the ordering of the constituents of a phrase (headword and modifiers), compared to other languages such as English or German. More specifically, modifiers like adjectives, numerals and pronouns may precede or follow the head noun. Another common property of noun phrases is the presence of nominal modifiers in the genitive case that denote possession, quality, quantity or origin. They are nouns and usually follow the head noun they modify.

The following two examples show the afore-mentioned freedom. The two phrases have exactly the same meaning (bank account). The first phrase is an adjective-noun construction, while the second is a noun-genitive modifier construction.

    τραπεζικός λογαριασμός    bank[ADJECTIVE] account[NOUN]
    λογαριασμός τράπεζας      account[NOUN] bank[NOUN-GENITIVE]

2.2 Corpora and processing tools

The corpora used in our experiments were:

1. The ILSP/ELEFTHEROTYPIA ([3]) and ESPRIT 860 ([9]) Corpora (a total of 300,000 words). Both these corpora are balanced and manually annotated with complete morphological information. Further (phrase structure) information is obtained automatically.

2. The DELOS Corpus ([5]), a collection of economic domain texts of approximately five million words and of varying genre. It has been automatically annotated from the ground up. Morphological tagging on DELOS was performed by the analyzer of [10]. Accuracy in part-of-speech and case tagging reaches 98% and 94% respectively. Further (phrase structure) information is again obtained automatically.

All of the above corpora (including DELOS) are collections of newspaper and journal articles. More specifically, DELOS consists of texts taken from the financial newspaper EXPRESS, reports from the Foundation for Economic and Industrial Research, research papers from the Athens University of Economics and several reports from the Bank of Greece. The documents are of varying genre, like press reportage, news, articles, interviews and scientific studies, and cover all the basic areas of the economic domain, i.e. microeconomics, macroeconomics, international economics, finance, business administration, economic history, economic law, public economics etc. Therefore, the corpus presents a richness in vocabulary, in linguistic structure, and in the use of idiomatic expressions and colloquialisms, which is not encountered in the highly domain- and language-restricted texts normally used for term extraction (e.g. medical records, technical articles, tourist site descriptions). To indicate the linguistic complexity of the corpus, we mention that the length of noun phrases varies from 1 to 53 word tokens.

All the corpora have been phrase-analyzed by the chunker described in detail in [11]. Noun, verb, prepositional and adverbial phrases and conjunctions are detected via multi-pass parsing. Of these, only noun and prepositional phrases are taken into account for the present task, as they are the only types of phrases that may include terms. Regarding the phrases of interest, precision and recall reach 85.6% and 94.5% for noun phrases, and 99.1% and 93.9% for prepositional phrases respectively. The robustness of the chunker and its independence from elaborate linguistic resources make it suitable for a corpus like DELOS, which varies in style and is complex in linguistic structure.

It should be noted that phrases are non-overlapping. Embedded phrases are flatly split into distinct phrases. Nominal modifiers in the genitive case are included in the same phrase with the noun they modify; nouns joined by a coordinating conjunction are grouped into one phrase. The chunker identifies basic phrase constructions during the first passes (e.g. adjective-nouns, article-nouns), and combines smaller phrases into longer ones in later passes (e.g. coordination, inclusion of genitive modifiers, compound phrases). As a result, named entities, proper nouns and compound nominal constructions are identified during chunking among the rest of the noun phrases.

The most significant sources of error during the automatic chunking process, which also affect the performance of the term extraction process, are:

1. Excessive phrase cut-up, usually due to erroneous part-of-speech tagging of a word (the word πλήρες - full - in the following example is erroneously tagged as a noun and not as an adjective)

    NP[To πλήρες] NP[κείμενο της ανακοίνωσης]
    instead of
    NP[To πλήρες κείμενο της ανακοίνωσης]
    (NP[the full text of the announcement])

2. Erroneous NP tagging (unidentifiable adverbs, like όντως - in fact - in the following example, are marked as nouns)

    NP[όντως] instead of ADV[όντως]

In order to detect simple phrases inside larger coordination constructions, we applied the following simple empirical grammar to every noun and prepositional phrase extracted by the chunker. The grammar, which directly identifies conjunctive expressions and produces a list of simple noun phrases, employs the following rules:

    conjunctive_phrase: phrase conjunction phrase
    conjunctive_phrase: phrase comma conjunctive_phrase

Figure 1. The rules for splitting coordinated phrases.

2.3 Candidate terms

As mentioned before, the noun and prepositional phrases of the two corpora are selected, as only these phrases are likely to contain terms. Words of no semantic content (i.e. introductory articles, adverbs, prepositions, punctuation marks and symbols) are removed from the phrases.

Coordination schemes are detected within the phrases, and the latter are split into smaller phrases according to the grammar depicted in Figure 1. The occurrences of words and N-grams, pure as well as nested, are counted. Longer candidate terms are split into smaller units (tri-grams into bi-grams and uni-grams, bi-grams into uni-grams).

Regarding the bi-grams, in order to overcome the freedom in the word ordering discussed in section 2.1, we considered bi-gram A B (A and B being the two lemmata forming the bi-gram) to be identical to bi-gram B A, if the bi-gram is not a named entity. Their joint count in the corpora is calculated and taken into account. The resulting uni-grams and bi-grams are the candidate terms. The candidate term counts in the corpora are then used in the statistical filters described in the next section.

Figure 2 shows the count calculation for the nested candidate terms. The two tri-grams, A B C and C B D, occur in a corpus three and four times respectively. The accumulative counts of the nested terms are shown in parentheses.

    Tri-grams:           A B C (3)    C B D (4)
    Bi-grams:            A B (3)    B C (3)    C B (4)    B D (4)
    B C and C B merged:  B C (3+4)
    Uni-grams:           A (3)  B (3)  C (3)    B (4)  C (4)  D (4)
    Uni-grams merged:    A (3)    B (4+3)    C (3+4)    D (4)

Figure 2. Calculation of n-gram frequencies, given the phrase-chunked corpus. The finally extracted n-gram frequencies are indicated in bold.

3 TERM FILTERING

In this section we describe the statistical filters that have been used to filter out non-terms. With D we denote DELOS and with B the balanced corpus. As a first step, the occurrences of each candidate term w, $c_w(D)$ and $c_w(B)$, are counted in the two corpora separately.

A particularity of the present work is that, unlike in most previous approaches to term extraction, the domain-specific corpus available to us is quite large compared to the balanced corpus. As a result, several terms that appear in DELOS do not appear in the balanced corpus, making it impossible for the LLR statistic to detect them. In other words, these terms cannot be identified by traditional corpora comparison.

In order to deal with this phenomenon, we applied a smoothing technique to take into account terms that do not appear in the balanced corpus. More specifically, we applied Lidstone's law ([7]) to our candidate terms, i.e. we augmented each candidate term count by a value of λ=0.5 in both corpora. Thereby, terms that do not appear in the balanced corpus at all end up having $c_w(B)=0.5$. This value was chosen for λ because, due to the small size of the balanced corpus, the probability of coming across a previously unseen word is significant.

Filtering was then performed in two stages. First, the relative frequency ratio is calculated for each candidate term w, as

\[ RF_w = \frac{f_w(D)}{f_w(B)} \tag{1} \]

\[ f_w(D) = \frac{c_w(D)}{N} \tag{2} \]

\[ f_w(B) = \frac{c_w(B)}{M} \tag{3} \]

N and M denote the counts of all candidate terms in D and B respectively.

In the next step, for those candidate terms that present an $RF_w > 1$, LLR is calculated (according to the formula of [6]) as

\[
\begin{aligned}
LLR_w = 2\,\bigl(\, & c_w(D)\log c_w(D) + c_w(B)\log c_w(B) \\
& + (N - c_w(D))\log(N - c_w(D)) + (M - c_w(B))\log(M - c_w(B)) \\
& - (c_w(D) + c_w(B))\log(c_w(D) + c_w(B)) - M\log M - N\log N \\
& - (N + M - c_w(D) - c_w(B))\log(N + M - c_w(D) - c_w(B)) \\
& + (N + M)\log(N + M) \,\bigr)
\end{aligned}
\tag{4}
\]

The LLR metric detects how surprising (or not) it is for a candidate term to appear in DELOS or in the balanced corpus (compared to its expected appearance count), and therefore to constitute an economic domain term (or not). Unlike other statistics (like the χ2 and mutual information), it is an accurate measure even for rare candidate terms, and for this reason it was selected for the present task. It is asymptotically χ2 distributed. So, for one degree of freedom, candidate terms that present an LLR value greater than 7.88 (critical value) can be considered as valid terms at a significance level of 0.005.

4 EXPERIMENTAL RESULTS

The final list of extracted terms was evaluated by a group of three experts in economics and finance. The evaluators were in constant contact to agree upon ambiguous cases of terms. The most important factor for this ambiguity is the lack of context information, especially for uni-grams. In other words, there are several cases of words that may or may not be economic terms depending on the context in which they appear.

Table 1 lists a window from the list of the candidate terms, selected by chance. Their counts in both corpora are also shown (original counts, prior to smoothing), along with their RF value, and the tags that were given to them by the experts. These are terms with either RF<<1 or RF>>1, i.e. terms that present a significant difference between their frequencies in the two corpora, and so they vary from strongly economic (e.g. tax-related) to non-economic (island).

Table 1. The 24 terms with the highest LLR scores along with their counts and their domain relevance.

| word | translation | DELOS Freq. | IEL Freq. | Relative Freq. Ratio | LLR | Important to Domain | Possibly Important to Domain | Unimportant to Domain |
|---|---|---|---|---|---|---|---|---|
| φορολογικός | tax-related | 352 | 13 | 4.63 | 49.0 | ✓ | - | - |
| παρών | present | 13 | 24 | 0.09 | 48.5 | - | - | ✓ |
| γλώσσα | language | 13 | 24 | 0.09 | 48.5 | - | - | ✓ |
| αριστερός | left, leftist | 7 | 20 | 0.06 | 48.3 | - | ✓ | - |
| εσωκομματικός | intra-party (political) | 10 | 22 | 0.08 | 48.1 | - | ✓ | - |
| διάλογος | dialog | 131 | 68 | 0.33 | 47.4 | - | - | ✓ |
| πετρέλαιο | oil (petrol) | 213 | 3 | 12.14 | 47.2 | ✓ | - | - |
| κερδοφορία | profitability | 164 | 0 | - | 47.1 | ✓ | - | - |
| πρόβλεψη | prediction | 283 | 8 | 6.05 | 46.9 | ✓ | - | - |
| νησί | island | 14 | 24 | 0.10 | 46.8 | - | - | ✓ |
| άγκυρα | anchor | 4 | 17 | 0.04 | 46.2 | - | - | ✓ |
| γιεν | yen | 161 | 0 | - | 46.1 | ✓ | - | - |
| στόχος | target | 821 | 64 | 2.19 | 46.1 | ✓ | - | - |
| αστυνομία | police | 45 | 38 | 0.20 | 46.0 | - | ✓ | - |
| εργάτης | factory worker | 3 | 16 | 0.03 | 45.9 | - | ✓ | - |
| προοπτική | prospect | 446 | 23 | 3.32 | 45.8 | ✓ | - | - |
| OTE | HTO (company) | 149 | 0 | - | 45.8 | ✓ | - | - |
| συμφωνία | agreement | 654 | 45 | 2.49 | 45.8 | ✓ | - | - |
| γερμανικός | German | 238 | 5 | 8.14 | 45.7 | - | ✓ | - |
| πολιτισμός | culture | 31 | 32 | 0.17 | 45.6 | - | ✓ | - |
| δουλειά | job, work | 38 | 35 | 0.19 | 45.6 | - | ✓ | - |
| διευθύνων | chief (executive) | 199 | 3 | 11.43 | 45.6 | ✓ | - | - |
| διοικητικός | administrative | 278 | 8 | 5.94 | 45.6 | ✓ | - | - |
| ισοτιμία | currency | 182 | 2 | 15.68 | 45.4 | ✓ | - | - |

As the LLR threshold value decreases (the N-best number increases), the number of non-economic and mostly non-economic terms that enter the N-best list also increases, causing the precision to drop.

The results cannot be easily compared to those of previous approaches, due to the many differences in resources and pre-processing. Merely as an indication, these results are comparable to the ones reported in [1] (73% to 86% precision, using a threshold on term frequencies in technical corpora on fiber optic networks, depending on the specific domain corpus and the size of the extracted list of candidate terms, which is similar to the list size in the current work).

Figure 3 shows the percentage of terms that have been correctly labeled as valid terms (y-axis) when taking into account the N-best labeled terms (x-axis), i.e. for different LLR thresholds. This graph refers to terms that appear in both corpora and for which $RF_w > 1$. Strongly economic are terms that are characteristic of the domain and necessary for understanding domain texts. Economic are terms that function as economic within a context of this domain, but may also have a different meaning outside this domain. With regard to the aforementioned labeling, this category includes terms connected both directly and indirectly to the domain. Mostly non-economic are words that are connected to the specific domain only indirectly, or more general terms that normally appear outside the economic domain, but may carry an economic sense in certain limited cases. Non-economic are terms that never appear in an economic sense and cannot be related to the domain in any way. For example, referring to Table 1, "φορολογικός" ("tax" [adjective]) is considered a strongly economic term, while "πολιτισμός" ("culture") is characterised as possibly important to the domain of economics, since it often involves a financial level.

[Plot: Precision (y-axis, 0 to 1) against N-best candidate terms (x-axis, 50 to 650). Series: Strongly Economic, Economic, Mostly non-Economic, non-Economic.]

Figure 3. Precision (y-axis) for the N-best candidate terms (x-axis) that appear in both corpora and that present RF>1.

Figure 4 shows the precision achieved for the terms appearing in both corpora that present an RF<1. It is an interesting graph to observe, in combination with Figure 3, as it shows how the method performs for the terms that are more frequent in the balanced corpus than in DELOS.

[Plot: Precision (y-axis, 0 to 1) against N-best candidate non-terms (x-axis, 0 to 700). Series: Strongly Economic (LLR), Economic (LLR), mostly non-Economic (LLR), non-Economic (LLR).]

Figure 4. Precision (y-axis) for the N-best terms (x-axis) that appear in both corpora and that present RF<1.

Figure 5 depicts comparative results between LLR and term extraction based on simple frequency counts on DELOS only. This experiment was performed to show the importance of corpora comparison for term extraction, compared to using only a domain-specific corpus and applying simple frequencies to the candidate terms appearing in it. As expected, corpus comparison (LLR) leads to better results, as can be concluded from the increased distance between the Economic term curves and the non-Economic term ones. Simple frequency counts tend to include many undesired N-grams among the candidate terms with the highest ranks, simply because these N-grams appear frequently in the corpus. As a result, the precision values obtained with frequencies on one corpus only inevitably drop.

[Plot: Precision (y-axis, 0 to 1) against N-best candidate terms (x-axis, 50 to 650). Series: Strongly Economic (F), Strongly Economic (LLR), Economic (F), Economic (LLR), non-Economic (F), non-Economic (LLR).]

Figure 5. Comparative precision between LLR and simple frequency counts on DELOS.

Table 2 shows the RF and LLR scores of the 20 most highly ranked economic terms, ordered by their LLR value. The depicted counts are the original ones, prior to smoothing. An interesting term is "υψηλός", the ancient Greek form for "high", used today almost exclusively in the context of the degree of performance, growth, rise, profit, cost, drop (i.e. the appropriate form in economic context), as opposed to its modern form "ψηλός", which is used for the degree of actual height.

Table 2. The 20 most highly ranked economic terms

| Rank | word | translation | Cw(D) | Cw(B) | RFw | LLR |
|---|---|---|---|---|---|---|
| 1 | εταιρία | company | 5396 | 0 | 1845.9 | 852.0 |
| 2 | δρχ | drachma | 3003 | 1 | 342.5 | 465.5 |
| 3 | μετοχή | stock | 2827 | 6 | 74.4 | 414.0 |
| 4 | αγορά | buy | 2330 | 33 | 11.9 | 257.2 |
| 5 | αύξηση | growth, rise | 2746 | 66 | 7.1 | 247.6 |
| 6 | κέρδος | profit | 1820 | 15 | 20.1 | 228.2 |
| 7 | τράπεζα | bank | 1367 | 11 | 20.3 | 171.8 |
| 8 | επιχείρηση | enterprise | 1969 | 56 | 6.0 | 162.1 |
| 9 | κεφάλαιο | capital | 1325 | 14 | 15.6 | 157.3 |
| 10 | σημαντικός | important | 1872 | 56 | 5.7 | 149.3 |
| 11 | πώληση | sell | 1203 | 11 | 17.9 | 147.3 |
| 12 | προϊόν | product | 1282 | 16 | 13.3 | 146.0 |
| 13 | όμιλος | (company) group | 1036 | 5 | 32.2 | 140.0 |
| 14 | Α.Ε. | INC | 820 | 0 | 280.7 | 126.4 |
| 15 | μετοχικός | stocking | 790 | 2 | 54.1 | 112.8 |
| 16 | τιμή | price | 1722 | 70 | 4.2 | 110.9 |
| 17 | επιτόκιο | interest (financ.) | 821 | 4 | 31.2 | 110.0 |
| 18 | υψηλός | high (old form) | 711 | 0 | 243.4 | 109.2 |
| 19 | κόστος | cost | 1031 | 19 | 9.0 | 103.4 |
| 20 | κλάδος | branch | 833 | 7 | 19.0 | 103.2 |

Figure 6 shows the difference in precision with LLR for the N-best terms with and without the application of smoothing. When smoothing is not applied, the drop in performance is significant (around 20%). The expected performance improvement due to the smoothing process is further enhanced, because the terms that appear only in DELOS (and not in the balanced corpus) are not taken into account when smoothing is not performed.

[Plot: Precision (y-axis, 0 to 1) against N-best candidate terms (x-axis, 50 to 650). Series: Strongly Economic (smoothed), Economic (smoothed), Strongly Economic (no smoothing), Economic (no smoothing).]

Figure 6. Comparative precision using the LLR metric with and without smoothing.

5 CONCLUSION

In this paper we have presented the process of automatically extracting economic terminology from Modern Greek texts. The properties of the language are taken into account by utilizing appropriate pre-processing tools. The linguistic complexity of the domain-specific corpus is addressed by adjusting the traditional candidate term formation methodology to deal with the freedom in word ordering. Finally, the unusual size difference between the two corpora (domain-specific and general) leads to a sparse data problem, which is dealt with satisfactorily by applying Lidstone's smoothing law.

ACKNOWLEDGEMENTS

We thank the European Social Fund (ESF), Operational Program for Educational and Vocational Training II (EPEAEK II), and particularly the Program PYTHAGORAS II, for funding the above work.

REFERENCES

[1] P. Drouin, 'Detection of Domain Specific Terminology Using Corpora Comparison', 4th International Conference on Language Resources and Evaluation (LREC), Lisbon, 79-82, (2004).
[2] K. Frantzi, S. Ananiadou, and H. Mima, 'Automatic Recognition of Multi-word Terms: the C-value/NC-value Method', International Journal on Digital Libraries, 3 (2), 117-132, (2000).
[3] N. Hatzigeorgiu, M. Gavrilidou, S. Piperidis, G. Carayannis, A. Papakostopoulou, A. Spiliotopoulou, A. Vacalopoulou, P. Labropoulou, E. Mantzari, H. Papageorgiou, and I. Demiros, 'Design and Implementation of the online ILSP Greek Corpus', 2nd International Conference on Language Resources and Evaluation (LREC), Athens, 1737-1742, (2000).
[4] A. Hulth, 'Improved Automatic Keyword Extraction Given More Linguistic Knowledge', International Conference on Empirical Methods in Natural Language Processing (EMNLP), Sapporo, 216-223, (2003).
[5] K. Kermanidis, N. Fakotakis and G. Kokkinakis, 'DELOS: An Automatically Tagged Economic Corpus for Modern Greek', 3rd International Conference on Language Resources and Evaluation (LREC), Las Palmas de Gran Canaria, 93-100, (2002).
[6] A. Kilgarriff, 'Comparing Corpora', International Journal of Corpus Linguistics, 6 (1), 1-37, (2001).
[7] C. Manning and H. Schuetze, Foundations of Statistical Natural Language Processing, MIT Press, (1999).
[8] R. Navigli and P. Velardi, 'Learning Domain Ontologies from Document Warehouses and Dedicated Web Sites', Computational Linguistics, 30 (2), 151-179, (2004).
[9] Partners of ESPRIT-291/860, Unification of the Word Classes of the ESPRIT Project 860, Internal Report BU-WKL-0376, (1986).
[10] K. Sgarbas, N. Fakotakis and G. Kokkinakis, 'A Straightforward Approach to Morphological Analysis and Synthesis', Proceedings of the Workshop on Computational Lexicography and Multimedia Dictionaries (COMLEX), Kato Achaia, Greece, 31-34, (2000).
[11] E. Stamatatos, N. Fakotakis and G. Kokkinakis, 'A practical chunker for unrestricted text', Proceedings of the Conference on Natural Language Processing (NLP), Patras, Greece, 139-150, (2000).
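The two-stage filter of Section 3 (Lidstone smoothing, then the relative frequency ratio of Equations (1)-(3), then the LLR of Equation (4)) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy counts and the corpus totals `n` and `m` are hypothetical, the natural logarithm is assumed, and leaving N and M unchanged after smoothing is an assumption, since the paper only states that each term count is augmented by λ=0.5.

```python
import math

LAMBDA = 0.5         # Lidstone increment reported in Section 3
LLR_CRITICAL = 7.88  # chi-square critical value, 1 degree of freedom, p = 0.005


def xlogx(x: float) -> float:
    """Return x * log(x), with the usual convention 0 * log(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0


def relative_frequency_ratio(c_d: float, c_b: float, n: float, m: float) -> float:
    """Equations (1)-(3): (c_w(D) / N) / (c_w(B) / M)."""
    return (c_d / n) / (c_b / m)


def llr(c_d: float, c_b: float, n: float, m: float) -> float:
    """Equation (4): log-likelihood ratio of one term over the two corpora."""
    return 2.0 * (
        xlogx(c_d) + xlogx(c_b)
        + xlogx(n - c_d) + xlogx(m - c_b)
        - xlogx(c_d + c_b) - xlogx(m) - xlogx(n)
        - xlogx(n + m - c_d - c_b)
        + xlogx(n + m)
    )


def filter_terms(counts_d, counts_b, n, m, lam=LAMBDA, threshold=LLR_CRITICAL):
    """Return (term, RF, LLR) triples that pass both filters, best LLR first.

    counts_d / counts_b map each candidate term to its raw count in the
    domain corpus D and the balanced corpus B. Assumption: only the term
    counts are smoothed; n and m are used as given.
    """
    kept = []
    for term in set(counts_d) | set(counts_b):
        c_d = counts_d.get(term, 0) + lam  # smoothed count in D
        c_b = counts_b.get(term, 0) + lam  # smoothed count in B (never zero)
        rf = relative_frequency_ratio(c_d, c_b, n, m)
        if rf <= 1:
            continue  # stage 1: keep only terms relatively more frequent in D
        score = llr(c_d, c_b, n, m)
        if score > threshold:  # stage 2: keep only significant differences
            kept.append((term, rf, score))
    kept.sort(key=lambda item: item[2], reverse=True)
    return kept


if __name__ == "__main__":
    # Raw counts borrowed from Tables 1 and 2; the totals are made up.
    domain = {"εταιρία": 5396, "μετοχή": 2827, "νησί": 14}
    balanced = {"μετοχή": 6, "νησί": 24}
    for term, rf, score in filter_terms(domain, balanced, n=1_000_000, m=171_000):
        print(f"{term}\tRF={rf:.1f}\tLLR={score:.1f}")
```

Because the smoothed balanced-corpus count is never zero, a term such as "εταιρία", which is absent from the balanced corpus, still receives a finite RF and LLR, which is the effect the smoothing step is meant to achieve. The scores printed by the sketch are not expected to reproduce the values in Tables 1 and 2, since the real totals N and M are not given in the text.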