=Paper=
{{Paper
|id=None
|storemode=property
|title=Automatic recognition of domain-specific terms: an experimental evaluation
|pdfUrl=https://ceur-ws.org/Vol-1031/paper3.pdf
|volume=Vol-1031
|dblpUrl=https://dblp.org/rec/conf/syrcodis/FedorenkoAT13
}}
==Automatic recognition of domain-specific terms: an experimental evaluation==
© Denis Fedorenko, Nikita Astrakhantsev, Denis Turdakov
Institute for System Programming of the Russian Academy of Sciences
fedorenko@ispras.ru, astrakhantsev@ispras.ru, turdakov@ispras.ru

Abstract

This paper presents an experimental evaluation of state-of-the-art approaches to automatic term recognition based on multiple features: a machine learning method and a voting algorithm. We show that in most cases the machine learning approach obtains the best results and needs little data for training; we also find the best subsets of the popular features.

1 Introduction

Automatic term recognition (ATR) is an important problem of text processing. The task is to recognize and extract terminological units from domain-specific text collections. The resulting terms can be useful in more complex tasks such as semantic search, question answering, ontology construction, word sense induction, etc.

There have been many studies of ATR. Most of them split the task into three common steps:

1. Extracting term candidates. At this step a special algorithm extracts words and word sequences admissible as terms. In most cases researchers use predefined or generated part-of-speech patterns and filter out word sequences that do not match these patterns. The remaining word sequences become term candidates.

2. Extracting features of term candidates. A feature is a measurable characteristic of a candidate that is used to recognize terms. There are many statistical and linguistic features that can be useful for term recognition.

3. Extracting final terms from candidates. This step varies depending on the way in which researchers use features to recognize terms. In some studies the authors filter out non-terms by comparing feature values with thresholds: if the feature values lie in specific ranges, then the candidate is considered to be a term. Others rank the candidates and expect the top-N ones to be terms. Finally, a few studies apply supervised machine learning methods in order to combine the features effectively.

There are several studies comparing different approaches to ATR. In [18] the authors compare single statistical features by their effectiveness for ranking term candidates. In [24] the same comparison is extended with a voting algorithm that combines multiple features. Studies [17] and [15] again compare a supervised machine learning method with approaches based on a single feature.

In turn, the present study experimentally evaluates ranking methods that combine multiple features: a supervised machine learning approach and a voting algorithm. We pay most attention to the supervised method in order to explore its applicability to ATR.

The purposes of the study are the following:

• To compare the results of the machine learning approach and the voting algorithm;
• To compare different machine learning algorithms applied to ATR;
• To explore how much training data is needed to rank terms;
• To find the most valuable features for the methods.

This study is organized as follows. We first describe the approaches in more detail. Section 3 is devoted to the performed experiments: we describe the evaluation methodology, then report the obtained results, and finally discuss them. In Section 4 we conclude the study and consider further research.

2 Related work

In this section we describe some of the approaches to ATR. Most of them share the same extraction algorithm but consider different feature sets, so the final results depend only on the features used. We also briefly describe the features used in the task. For a more detailed survey of ATR see [10], [2].
2.1 Extracting term candidates overview

Strictly speaking, all of the word sequences, or n-grams, occurring in a text collection can be term candidates, but in most cases researchers consider only unigrams and bigrams [18]. Of course, only a small part of such candidates are terms, because the candidate list mainly consists of sequences like "a", "the", "some of", "so the", etc. Hence this noise should be filtered out.

One of the first methods for such filtering was described in [12]. The algorithm extracts term candidates by matching the text collection against predefined part-of-speech (PoS) patterns, such as:

• Noun
• Adjective Noun
• Adjective Noun Noun

As reported in [12], such patterns cut off much of the noise (word sequences that are not terms) but retain real terms, because in most cases terms are noun phrases [5]. Filtering out term candidates that do not satisfy certain morphological properties of word sequences is known as the linguistic step of ATR.

In [17] the authors do not use predefined patterns, appealing to the fact that a PoS tagger may not be precise enough on some texts; instead, they generate patterns for each text collection. In [7] no linguistic step is used: the algorithm considers all n-grams from the text collection.
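The linguistic step can be illustrated with a short sketch. The following is a minimal example, not the implementation used in any of the cited works: it tags a text with NLTK (assuming the punkt, averaged_perceptron_tagger and universal_tagset resources are installed) and keeps the n-grams whose coarse PoS sequence matches a small illustrative pattern set.

<pre>
# A minimal sketch of pattern-based candidate extraction (the linguistic step).
# Assumes NLTK with the punkt, averaged_perceptron_tagger and universal_tagset
# resources installed; the pattern set below is illustrative, not the one from [12].
import nltk

PATTERNS = {
    ("NOUN",),
    ("NOUN", "NOUN"),
    ("ADJ", "NOUN"),
    ("NOUN", "NOUN", "NOUN"),
    ("ADJ", "NOUN", "NOUN"),
    ("NOUN", "ADJ", "NOUN"),
}

def extract_candidates(text, max_len=3):
    tagged = nltk.pos_tag(nltk.word_tokenize(text), tagset="universal")
    candidates = set()
    for n in range(1, max_len + 1):
        for i in range(len(tagged) - n + 1):
            window = tagged[i:i + n]
            if tuple(tag for _, tag in window) in PATTERNS:
                candidates.add(" ".join(word.lower() for word, _ in window))
    return candidates

print(extract_candidates("Automatic term recognition extracts domain-specific terms."))
</pre>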
2.2 Features overview

Having a large set of term candidates, it is necessary to recognize the domain-specific ones among them. This can be done using statistical features computed on the basis of the text collection or some other resource, for example a general corpus [12], a domain ontology [23] or the Web [6]. This part of the ATR algorithm is known as the statistical step.

Term Frequency (TF) is the number of occurrences of a word sequence in the text collection. This feature is based on the assumption that if a word sequence is specific to some domain, then it occurs often in texts of that domain. In some studies frequency is also used as an initial filter of term candidates [3]: if a candidate has a very low frequency, then it is filtered out. This removes much of the noise and improves the precision of the results.

TF*IDF has high values for terms that occur often but only in few documents: TF is the term frequency and IDF is based on the inverted number of documents in which the term occurs:

TF*IDF(t) = TF(t) · log(|Docs| / |{Doc : t ∈ Doc}|)   (1)

To find domain-specific terms that are distributed over the whole text collection, [12] computes IDF from the inverted number of documents of a reference corpus in which the term occurs. A reference corpus is a general, i.e. non-specific, text collection.

The features described so far show how a word sequence is related to the text collection, i.e. the termhood of a candidate. There is another class of features that show the inner strength of the cohesion of words, or unithood [10]. One of the first features of this class is the T-test.

The T-test [12] is a statistical test that was initially designed for bigrams and checks the hypothesis of independence of the words constituting a term:

T-stat(t) = (TF(t)/N − p) / √(p(1 − p)/N)   (2)

where p is the probability under the hypothesis of independence and N is the number of bigrams in the corpus. The assumption behind this feature is that the text is a Bernoulli process in which meeting the bigram t is a "success", while meeting any other bigram is a "failure". The hypothesis of independence is usually expressed as p = P(w1 w2) = P(w1) · P(w2), where P(w1) is the probability of encountering the first word of the bigram and P(w2) the probability of encountering the second one. This expression can be estimated by replacing the word probabilities with their normalized frequencies in the text: p = (TF(w1)/N) · (TF(w2)/N), where N is the overall number of words in the text.

If words are independently distributed in the text collection, then they do not form a persistent collocation. It is assumed that any domain-specific term is a collocation, while not every collocation is a domain-specific term. So, considering features like the T-test, we can increase the confidence that a candidate is a collocation, but not necessarily that it is a specific term.

There are many more features used in ATR. C-Value [8] has higher values for candidates that are not parts of other word sequences:

C-Value(t) = log2|t| · TF(t) − (1 / |{seq : t ∈ seq}|) · Σ_{seq : t ∈ seq} TF(seq)   (3)

Domain Consensus [14] recognizes terms that are uniformly distributed over the whole dataset:

DC(t) = − Σ_{d ∈ Docs} (TF_d(t) / TF(t)) · log2(TF_d(t) / TF(t))   (4)

Domain Relevance [20] compares the frequencies of the term in two datasets, a target one and a general one:

DR(t) = TF_target(t) / (TF_target(t) + TF_reference(t))   (5)

Lexical Cohesion [16] is a unithood feature that compares the frequency of the term with the frequencies of the words it consists of:

LC(t) = (|t| · TF(t) · log10 TF(t)) / Σ_{w ∈ t} TF(w)   (6)

Loglikelihood [12] is an analogue of the T-test but without the assumption about how words are distributed in the text:

LL(t) = log [ b(c12; c1, p) · b(c2 − c12; N − c1, p) / (b(c12; c1, p1) · b(c2 − c12; N − c1, p2)) ]   (7)

where c12 is the frequency of the bigram t, c1 is the frequency of the bigram's first word, c2 is the frequency of its second word, p = c2/N, p1 = c12/c1, p2 = (c2 − c12)/(N − c1), and b(·; ·, ·) is the binomial distribution.

Relevance [19] is a more sophisticated analogue of Domain Relevance:

R(t) = 1 − 1 / log2(2 + TF_target(t) · DF_target(t) / TF_reference(t))   (8)

Weirdness [1] also compares frequencies in different collections but additionally takes the sizes of these collections into account:

W(t) = (TF_target(t) · |Corpus_reference|) / (TF_reference(t) · |Corpus_target|)   (9)

The described feature list includes termhood, unithood and hybrid features. The termhood features are Domain Consensus, Domain Relevance, Relevance and Weirdness. The unithood features are Lexical Cohesion and Loglikelihood. The hybrid feature, i.e. a feature that reflects both termhood and unithood, is C-Value.

Many works still concentrate on feature engineering, trying to find more informative features. Nevertheless, the recent trend is to combine all these features effectively.
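To make some of the features above concrete, here is a minimal sketch of equations (1), (5), (6) and (9) in Python; the function names, the toy counts and the +1 guard in weirdness() are our own illustrative choices rather than part of the original definitions.

<pre>
# A sketch of a few termhood/unithood features computed from raw counts.
import math

def tf_idf(tf, n_docs_with_term, n_docs):
    # Equation (1): term frequency weighted by the inverted document count.
    return tf * math.log(n_docs / n_docs_with_term)

def domain_relevance(tf_target, tf_reference):
    # Equation (5): share of the term's occurrences that fall in the target corpus.
    return tf_target / (tf_target + tf_reference)

def lexical_cohesion(n_words, term_tf, word_tfs):
    # Equation (6): frequency of the whole term against the frequencies of its words.
    return n_words * term_tf * math.log10(term_tf) / sum(word_tfs)

def weirdness(tf_target, tf_reference, target_size, reference_size):
    # Equation (9): ratio of relative frequencies; the +1 is an illustrative guard
    # against candidates that never occur in the reference corpus.
    return (tf_target * reference_size) / ((tf_reference + 1) * target_size)

# Toy usage: a candidate seen 12 times in 5 of 100 target documents and 3 times
# in a 1,000,000-word reference corpus; the target corpus has 40,000 words.
print(tf_idf(12, 5, 100), domain_relevance(12, 3), weirdness(12, 3, 40_000, 1_000_000))
</pre>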
2.3 Recognizing terms overview

Given the feature values, the final results can be produced. The studies [8], [12], [1] use a ranking algorithm to output the most probable terms, but this algorithm considers only one feature. The studies [20], [16] describe the simplest way in which multiple features can be combined: all values are reduced to a single weighted average that is then used for ranking.

In [21] the authors introduce special rules based on thresholds for feature values. An example of such a rule is the following:

Rule_i(t) = F_i(t) > a and F_i(t) < b   (10)

where F_i is the i-th feature and a, b are thresholds for its values. Note that the thresholds are selected manually or computed from a marked-up corpus, so this method cannot be considered purely automatic and unsupervised.

An effective way of combining multiple features was introduced in [24]. It combines the features in a voting manner using the following formula:

V(t) = Σ_{i=1}^{n} 1 / rank(F_i(t))   (11)

where n is the number of considered features and rank(F_i(t)) is the rank of the term t among the values of the other terms for feature F_i. In addition, study [24] shows that this voting method in general outperforms most of the methods that consider only one feature or reduce the features to a weighted average. Another important advantage of the voting algorithm is that it does not require normalization of feature values.

There are several studies that apply supervised methods to term recognition. In [17] the authors apply an AdaBoost meta-classifier, while in [7] the Ripper system is used. The study [22] describes a hybrid approach including both unsupervised and supervised methods.

3 Evaluation

For our experiments we implemented two approaches to ATR. We used the voting algorithm as the first one, while in the supervised case we trained two classifiers: Random Forest and Logistic Regression from the WEKA library (http://www.cs.waikato.ac.nz/ml/weka/). These classifiers were chosen because of their effectiveness and the good generalization ability of the resulting models. Furthermore, these classifiers are able to produce a classification confidence, a numeric score that can be used to rank an example within the overall test set. This is an important property of the selected algorithms that allows us to compare their results with the results produced by other ranking methods.

3.1 Evaluation methodology

The quality of the algorithms is usually assessed by two common metrics: precision and recall [11]. Precision is the fraction of retrieved instances that are relevant:

P = |correct returned results| / |all returned results|   (12)

Recall is the fraction of relevant instances that are retrieved:

R = |correct returned results| / |all correct results|   (13)

In addition to precision and recall, Average Precision (AvP) [12] is commonly used [24] to assess ranked results. It is defined as:

AvP = Σ_{i=1}^{N} P(i) · ΔR(i)   (14)

where P(i) is the precision of the top-i results and ΔR(i) is the change in recall from the top-(i−1) to the top-i results. Obviously, this score tends to be higher for algorithms that place correct terms at the top positions of the result.

In our experiments we considered only the AvP score, while precision and recall were omitted. For the voting algorithm there is no simple way to compute recall, because it is not obvious how many top results should be considered as correct terms. Also, in the general case the overall number of terms in a dataset is unknown.
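As an illustration of how the voting ranking of equation (11) and the AvP score of equation (14) fit together, the following sketch ranks candidates by summing reciprocal per-feature ranks and then scores a ranked list against a gold term set; it assumes that higher feature values are better for every feature and uses the standard formulation of average precision.

<pre>
# A sketch of the voting ranking (equation 11) and average precision (equation 14).
# feature_values maps each feature name to a dict of candidate -> value over the
# same candidate set.
def vote(feature_values):
    candidates = list(next(iter(feature_values.values())))
    scores = {c: 0.0 for c in candidates}
    for values in feature_values.values():
        ranked = sorted(candidates, key=lambda c: values[c], reverse=True)
        for rank, c in enumerate(ranked, start=1):
            scores[c] += 1.0 / rank          # 1 / rank(F_i(t))
    return sorted(scores, key=scores.get, reverse=True)

def average_precision(ranked, correct):
    # Sum of P(i) * dR(i); dR(i) equals 1/|correct| at positions holding a correct term.
    hits, total = 0, 0.0
    for i, candidate in enumerate(ranked, start=1):
        if candidate in correct:
            hits += 1
            total += hits / i
    return total / len(correct) if correct else 0.0
</pre>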
3.2 Features

For our experiments we implemented the following features: C-Value, Domain Consensus, Domain Relevance, Frequency, Lexical Cohesion, Loglikelihood, Relevance, TF*IDF, Weirdness and Words Count. Words Count is a simple feature that gives the number of words in a word sequence. This feature may be useful for the classifier since the values of other features may have different meanings for single- and multi-word terms [2].

Most of these features are capable of recognizing both single- and multi-word terms, except the T-test and Loglikelihood, which are designed to recognize only two-word terms (bigrams). We generalize them to the case of n-grams according to [4].

Some of the features use information from a collection of general-domain texts (a reference corpus); in our case these features are Domain Relevance, Relevance and Weirdness. For this purpose we use statistics from the Corpus of Contemporary American English (available at www.ngrams.info).

For extracting term candidates we implemented a simple approach based on predefined part-of-speech patterns. For simplicity, we extracted only unigrams, bigrams and trigrams using patterns such as:

1. Noun
2. Noun Noun
3. Adjective Noun
4. Noun Noun Noun
5. Adjective Noun Noun
6. Noun Adjective Noun

3.3 Datasets

Evaluation of the approaches was performed on two datasets of the medical and biological domains consisting of short English texts with marked-up domain-specific terms:

Corpus | Documents | Words | Terms
GENIA | 2000 | 400000 | 35000
Bio1 | 100 | 20000 | 1200

The second dataset (Bio1) shares some texts with the first one (GENIA), so we filtered out the texts occurring in both corpora: we left GENIA without any modifications, while 20 texts were removed from Bio1 as texts common to both corpora.

3.4 Experimental results

3.4.1 Machine learning method versus voting algorithm

We considered two test scenarios in order to compare the quality of the implemented algorithms. For each scenario we performed two kinds of tests: with and without filtering of rare term candidates. In the following tests the whole feature set was considered and the overall ranked result was assessed.

Cross-validation

We performed 4-fold cross-validation of the algorithms on both corpora. We extracted term candidates from the whole dataset and divided them into train and test sets. In other words, we considered the case when, having some marked-up examples (the train set), we should recognize terms in the rest of the data (the test set) extracted from the same corpus. In the case of the voting algorithm the training set was simply ignored.

The results of cross-validation are shown in Tables 1 and 2. Table 2 presents the results of cross-validation on term candidates that appear at least twice in the corpus.

Table 1: Results of cross-validation without frequency filter
Dataset | Algorithm | AvP
GENIA | Random Forest | 0.54
GENIA | Logistic Regression | 0.55
GENIA | Voting | 0.53
Bio1 | Random Forest | 0.35
Bio1 | Logistic Regression | 0.40
Bio1 | Voting | 0.23

Table 2: Results of cross-validation with frequency filter
Dataset | Algorithm | AvP
GENIA | Random Forest | 0.66
GENIA | Logistic Regression | 0.70
GENIA | Voting | 0.65
Bio1 | Random Forest | 0.52
Bio1 | Logistic Regression | 0.58
Bio1 | Voting | 0.31

As we can see, in both cases the machine learning approach outperformed the voting algorithm. Moreover, in the case without rare terms the difference in scores is higher. This can be explained as follows: the feature values of rare terms (especially Frequency and Domain Consensus) are useless for classification and add noise to the model. When such terms are omitted, the model becomes cleaner.

Also, in most cases the Logistic Regression algorithm outperformed Random Forest, so in most of the further tests we used only the better of the two.
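The supervised scenario can be sketched as follows. The paper uses Random Forest and Logistic Regression from WEKA; the example below substitutes scikit-learn's LogisticRegression as a stand-in and ranks held-out candidates by classification confidence within a 4-fold split, assuming X is a candidate-by-feature matrix and y marks the marked-up terms.

<pre>
# A sketch of the supervised setting with scikit-learn standing in for the WEKA
# classifiers used in the paper. X is a NumPy array of candidate feature vectors,
# y is 1 for marked-up terms and 0 otherwise; each held-out fold is ranked by the
# classifier's confidence that a candidate is a term.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

def rank_by_confidence(X, y, n_folds=4, seed=0):
    ranked_folds = []
    for train_idx, test_idx in KFold(n_folds, shuffle=True, random_state=seed).split(X):
        clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        confidence = clf.predict_proba(X[test_idx])[:, 1]        # P(candidate is a term)
        ranked_folds.append(test_idx[np.argsort(-confidence)])   # most confident first
    return ranked_folds
</pre>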
Separate train and test datasets

Having two datasets of the same field, the idea is to check how well a model trained on one of them can predict the data from the other. For this purpose we used GENIA as the training set and Bio1 as the test set, and then vice versa.

The results are shown in Tables 3 and 4. In the case when Bio1 was used as the training set, the voting algorithm outperformed the trained classifier. This could happen because the training data from Bio1 does not fully reflect the properties of terms in GENIA.

Table 3: Results of evaluation on separate train and test sets without frequency filter
Trainset | Testset | Algorithm | AvP
GENIA | Bio1 | Random Forest | 0.30
GENIA | Bio1 | Logistic Regression | 0.35
– | Bio1 | Voting | 0.25
Bio1 | GENIA | Random Forest | 0.44
Bio1 | GENIA | Logistic Regression | 0.42
– | GENIA | Voting | 0.55

Table 4: Results of evaluation on separate train and test sets with frequency filter
Trainset | Testset | Algorithm | AvP
GENIA | Bio1 | Random Forest | 0.34
GENIA | Bio1 | Logistic Regression | 0.48
– | Bio1 | Voting | 0.31
Bio1 | GENIA | Random Forest | 0.60
Bio1 | GENIA | Logistic Regression | 0.62
– | GENIA | Voting | 0.65

3.4.2 Dependency of average precision on the number of top results

In the previous tests we considered the overall results produced by the algorithms. Descending from the top to the bottom of the ranked list, the AvP score can change significantly, so one algorithm can outperform another on the top-100 results but lose on the top-1000. In order to explore this dependency, we measured AvP for different slices of the top results.

Figure 1 shows the dependency of AvP on the number of top results given by 4-fold cross-validation. We also considered the scenario in which GENIA was used for training and Bio1 for testing; these results are presented in Figure 2.

[Figure 1: Dependency of AvP on the number of top results under cross-validation]
[Figure 2: Dependency of AvP on the number of top results on separate train and test sets]

3.4.3 Dependency of classifier performance on training set size

In order to explore the dependency between the amount of data used for training and average precision, we considered three test scenarios.

First, we trained the classifiers on the GENIA dataset and tested them on Bio1. At each step the amount of training data was decreased, while the test data remained unmodified. The results of this test are presented in Figure 3.

[Figure 3: Dependency of AvP on train set size on separate train and test sets]

Next, we started with 10-fold cross-validation on GENIA and at each step decreased the number of folds used for training Logistic Regression, keeping the number of folds used for testing unchanged. The results are shown in Figures 4–8.

The last test is the same as the previous one, except that the number of test folds was increased at each step. We started with nine folds used for training and one fold used for testing; at each subsequent step we moved one fold from the training set to the test set and evaluated again. The results are presented in Figures 9–13.

An interesting observation is that higher values of AvP correspond to larger test sets. This could happen because, as the test set grows, the number of high-confidence terms also grows: such terms take most of the top positions of the list and improve AvP. In the case of GENIA and Bio1 the top of the list mainly consists of highly domain-specific terms that have high values of features like Domain Relevance, Relevance and Weirdness: such terms occur in the corpora frequently enough.

As we can see, in all of the cases the gain in AvP levels off quickly. So, in the case of GENIA, it is enough to train on 10% of the candidates to rank the remaining 90% with the same performance. This may be explained by the relatively small number of features used and by their specificity: most of them are designed to have a high magnitude for terms and a low one for non-terms. So the data can be easily separated by the classifier given only a few training examples.
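A rough sketch of the third scenario, under the assumption that a ranking function such as the scikit-learn sketch above and the average_precision helper from the evaluation sketch are available; folds, rank_fn and correct_terms are illustrative names.

<pre>
# A sketch of the third scenario in Section 3.4.3: split the labelled candidates
# into 10 folds, start with 9 training folds and 1 test fold, then repeatedly move
# one fold from the training set to the test set and re-measure AvP.
# rank_fn(train, test) is assumed to return the test candidates ranked by classifier
# confidence; average_precision is the helper sketched earlier.
def learning_curve(folds, correct_terms, rank_fn):
    results = []
    for n_train in range(9, 0, -1):
        train = [c for fold in folds[:n_train] for c in fold]
        test = [c for fold in folds[n_train:] for c in fold]
        ranked = rank_fn(train, test)
        results.append((n_train, average_precision(ranked, correct_terms)))
    return results
</pre>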
3.5 Feature selection

Feature selection (FS) is the process of finding the most relevant features for the task. Having many different features, the goal is to exclude redundant and irrelevant ones from the feature set. Redundant features provide no useful information beyond the current feature set, while irrelevant features do not provide useful information in any context.

There are different FS algorithms. Some of them rank individual features by their relevance to the task, while others search for subsets of features that yield the best model for the predictor [9]. The algorithms also differ in their complexity: because of the large number of features used in some tasks, an exhaustive search is not feasible, so features are selected by greedy algorithms [13].

In our task we concentrated on searching for the subsets of features that give the best results. For this purpose we ran quality tests for all possible feature subsets, in other words, we performed an exhaustive search. Having 10 features, we checked 2^10 − 1 different combinations of them. In the case of the machine learning method, we used 9 folds for testing and one fold for training. The reason for this configuration is that the classifier needs little training data to rank terms with the same performance (see the previous section). For the voting algorithm, we simply ranked the candidates and then assessed the overall list. All of the tests were performed on the GENIA corpus, and only Logistic Regression was used as the machine learning algorithm.

The AvP score was computed for different slices of the top terms: 100, 1000, 5000, 10000 and 20000. The same slices are used in [24]. The best results for the algorithms are presented in Tables 5 and 6. These tables show that the voting algorithm has better scores than the machine learning method, but these results are not fully comparable: FS for the voting algorithm was performed on the whole dataset, while Logistic Regression was trained on 10% of the term candidates. The average performance gain for the voting algorithm is about 7%, while for machine learning it is only about 3%.

Table 5: Results of FS for the voting algorithm
Top count | All features | The best features
100 | 0.9256 | 0.9915
1000 | 0.8138 | 0.8761
5000 | 0.7128 | 0.7885
10000 | 0.667 | 0.7380
20000 | 0.6174 | 0.6804

Table 6: Results of FS for Logistic Regression
Top count | All features | The best features
100 | 0.8997 | 0.9856
1000 | 0.8414 | 0.8757
5000 | 0.7694 | 0.7875
10000 | 0.7309 | 0.7329
20000 | 0.6623 | 0.6714

The best features for the voting algorithm:

1. Top-100: Relevance, TF*IDF
2. Top-1000: Relevance, Weirdness, TF*IDF
3. Top-5000: Weirdness
4. Top-10000: Weirdness
5. Top-20000: C-Value, Frequency, Domain Relevance, Weirdness

The best features for the machine learning approach:

1. Top-100: Words Count, Domain Consensus, Normalized Frequency, Domain Relevance, TF*IDF
2. Top-1000: Words Count, Domain Relevance, Weirdness, TF*IDF
3. Top-5000: Words Count, Frequency, Lexical Cohesion, Relevance, Weirdness
4. Top-10000: Words Count, C-Value, Domain Consensus, Frequency, Weirdness, TF*IDF
5. Top-20000: Words Count, C-Value, Domain Relevance, Weirdness, TF*IDF

As we can see, most of the subsets contain features based on a general-domain corpus. The reason may be that the target corpus is highly specific, so most of its terms do not occur in a general corpus.

The next observation is that, in the case of the machine learning algorithm, the Words Count feature occurs in all of the subsets. This confirms the assumption that this feature is useful for algorithms that recognize both single- and multi-word terms.
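A minimal sketch of the exhaustive subset search over the 2^10 − 1 feature combinations; evaluate is a hypothetical callback that re-runs the ranking restricted to the given subset and returns AvP for the requested top-N slice.

<pre>
# A sketch of the exhaustive feature-subset search: evaluate every non-empty subset
# of the 10 features (2**10 - 1 combinations) and keep the one with the best AvP for
# a given top-N slice. evaluate(subset, top_n) is a hypothetical callback.
from itertools import combinations

FEATURES = ["C-Value", "Domain Consensus", "Domain Relevance", "Frequency",
            "Lexical Cohesion", "Loglikelihood", "Relevance", "TF*IDF",
            "Weirdness", "Words Count"]

def best_subset(evaluate, top_n):
    best, best_avp = None, -1.0
    for size in range(1, len(FEATURES) + 1):
        for subset in combinations(FEATURES, size):
            avp = evaluate(subset, top_n)
            if avp > best_avp:
                best, best_avp = subset, avp
    return best, best_avp
</pre>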
3.6 Discussion

Despite the fact that filtering out the candidates occurring only once in the corpus improves the average precision of the methods, it is not always a good idea to exclude such candidates. The reason is that many specific terms can occur only once in a dataset: for example, in GENIA 50% of the considered terms occur only once. Omitting such terms therefore severely affects the recall of the result, so such cases should be taken into account in the ATR task.

Another interesting observation is that the amount of training data needed to rank terms without a significant performance drop is extremely low. This leads to the idea of applying a bootstrapping approach to ATR:

1. Having a few marked-up examples, train the classifier.
2. Use the classifier to extract new terms.
3. Use the most confident terms as initial data at step 1.
4. Iterate until all confident terms have been extracted.

This is a semi-supervised method, because only a little marked-up data is needed to run the algorithm. The method can also be transformed into a fully unsupervised one if the initial data is extracted by some unsupervised approach (for example, by the voting algorithm). A similar idea is implemented in [22].
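A compact sketch of the bootstrapping loop outlined above; train_and_score is a hypothetical helper that trains a classifier on the currently known terms and returns a confidence score for every unlabelled candidate, and the 0.95 threshold is an illustrative choice.

<pre>
# A sketch of the bootstrapping idea from Section 3.6.
def bootstrap(seed_terms, candidates, train_and_score, threshold=0.95):
    known = set(seed_terms)
    unlabelled = set(candidates) - known
    while unlabelled:
        scores = train_and_score(known, unlabelled)   # candidate -> confidence
        confident = {c for c, s in scores.items() if s >= threshold}
        if not confident:
            break
        known |= confident        # accepted terms become training data (step 3)
        unlabelled -= confident
    return known
</pre>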
The Journal of Machine seen data of the same field having the once-trained clas- Learning Research, 3:1157–1182, 2003. sifier. [10] K. Kageura and B. Umino. Methods of auto- matic term recognition: A review. Terminology, For our experiments we implemented the simple 3(2):259–289, 1996. method of term candidates extraction: we filter out n- grams that do not match predefined part-of-speech pat- [11] C.D. Manning and P. Raghavan. Introduction to terns. This step of ATR can be performed in other ways, information retrieval, volume 1. for example by shallow parsing, or chunking 3 , gener- ating patterns from the dataset [17] or recognizing term [12] C.D. Manning and H. Schütze. Foundations of sta- variants. tistical natural language processing. MIT press, 1999. Another direction of further research is related to the [13] L.C. Molina, L. Belanche, and À. Nebot. Feature evaluation of the algorithms on more datasets of different selection algorithms: A survey and experimental languages and researching the ability of cross-domain evaluation. In Data Mining, 2002. ICDM 2003. term recognition, i.e. using a dataset of one domain to Proceedings. 2002 IEEE International Conference recognize terms from others. on, pages 306–313. IEEE, 2002. Also of particular interest is the implementation and [14] R. Navigli and P. Velardi. Semantic interpretation evaluation of semi- and unsupervised methods that in- of terminological strings. In Proc. 6th Intl Conf. volve machine learning techniques. Terminology and Knowledge Eng, pages 95–100, 2002. References [15] M.A. Nokel, E.I. Bolshakova, and N.V. [1] K. Ahmad, L. Gillam, L. Tostevin, et al. University Loukachevitch. Combining multiple features of surrey participation in trec8: Weirdness index- for single- word term extraction. 2012. ing for logical document extrapolation and retrieval [16] Y. Park, R.J. Byrd, and B.K. Boguraev. Automatic (wilder). In The Eighth Text REtrieval Conference glossary extraction: beyond terminology identifi- (TREC-8), 1999. cation. In Proceedings of the 19th international [2] Lars Ahrenberg. Term extraction: A review draft conference on Computational linguistics-Volume 1, version 091221. 2009. pages 1–7. Association for Computational Linguis- tics, 2002. [3] K.W. Church and P. Hanks. Word associa- [17] A. Patry and P. Langlais. Corpus-based ter- tion norms, mutual information, and lexicography. minology extraction. In Terminology and Con- Computational linguistics, 16(1):22–29, 1990. tent Development–Proceedings of 7th International [4] B. Daille. Study and implementation of combined Conference On Terminology and Knowledge Engi- techniques for automatic extraction of terminology. neering, Litera, Copenhagen, 2005. The balancing act: Combining symbolic and statis- [18] M. Pazienza, M. Pennacchiotti, and F. Zanzotto. tical approaches to language, 1:49–66, 1996. Terminology extraction: an analysis of linguis- [5] B. Daille, B. Habert, C. Jacquemin, and J. Royauté. tic and statistical approaches. Knowledge Mining, Empirical observation of term variations and prin- pages 255–279, 2005. ciples for their description. Terminology, 3(2):197– [19] A. Peñas, F. Verdejo, J. Gonzalo, et al. Corpus- 257, 1996. based terminology extraction applied to informa- [6] B. Dobrov and N. Loukachevitch. Multiple evi- tion access. In Proceedings of Corpus Linguistics, dence for term extraction in broad domains. In volume 2001. Citeseer, 2001. Proceedings of the 8th Recent Advances in Natural [20] F. Sclano and P. Velardi. 
[20] F. Sclano and P. Velardi. TermExtractor: a web application to learn the shared terminology of emergent web communities. Enterprise Interoperability II, pages 287–290, 2007.

[21] P. Velardi, M. Missikoff, and R. Basili. Identification of relevant terms to support the construction of domain ontologies. In Proceedings of the Workshop on Human Language Technology and Knowledge Management, Volume 2001, page 5. Association for Computational Linguistics, 2001.

[22] Y. Yang, H. Yu, Y. Meng, Y. Lu, and Y. Xia. Fault-tolerant learning for term extraction. 2011.

[23] W. Zhang, T. Yoshida, and X. Tang. Using ontology to improve precision of terminology extraction from documents. Expert Systems with Applications, 36(5):9333–9339, 2009.

[24] Ziqi Zhang, Christopher Brewster, and Fabio Ciravegna. A comparative evaluation of term recognition algorithms. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC 2008), Marrakech, Morocco, 2008.

[Figures 4–8: Dependency of AvP on the number of excluded folds with fixed test set size (10-fold cross-validation, 1 test fold, 9 to 1 train folds), for the top-100, top-1000, top-5000, top-10000 and top-20000 terms]

[Figures 9–13: Dependency of AvP on the number of excluded folds with changing test set size (10-fold cross-validation, 1 to 9 test folds, 9 to 1 train folds), for the top-100, top-1000, top-5000, top-10000 and top-20000 terms]