           Automatic recognition of domain-specific terms: an
                        experimental evaluation

                 © Denis Fedorenko                 Nikita Astrakhantsev              Denis Turdakov
                         Institute for System Programming of Russian Academy of Sciences
                         fedorenko@ispras.ru, astrakhantsev@ispras.ru, turdakov@ispras.ru



                       Abstract

    This paper presents an experimental evaluation of the state-of-the-art approaches to automatic term recognition based on multiple features: a machine learning method and a voting algorithm. We show that in most cases the machine learning approach obtains the best results and needs little data for training; we also find the best subsets of all popular features.

Proceedings of the Ninth Spring Researcher's Colloquium on Database and Information Systems, Kazan, Russia, 2013

1 Introduction

Automatic term recognition (ATR) is an important problem of text processing. The task is to recognize and extract terminological units from domain-specific text collections. The resulting terms can be useful in more complex tasks such as semantic search, question answering, ontology construction, word sense induction, etc.
   There have been many studies of ATR. Most of them split the task into three common steps (a schematic sketch is given after the list):

 1. Extracting term candidates. At this step a special algorithm extracts words and word sequences that are admissible as terms. In most cases researchers use predefined or generated part-of-speech patterns to filter out word sequences that do not match the patterns. The remaining word sequences become term candidates.

 2. Extracting features of term candidates. A feature is a measurable characteristic of a candidate that is used to recognize terms. There are many statistical and linguistic features that can be useful for term recognition.

 3. Extracting final terms from candidates. This step varies depending upon the way in which researchers use features to recognize terms. In some studies the authors filter out non-terms by comparing feature values with thresholds: if the feature values lie in specific ranges, then the candidate is considered to be a term. Others rank the candidates and expect the top-N ones to be terms. Finally, a few studies apply supervised machine learning methods in order to combine the features effectively.
   There are several studies comparing different approaches to ATR. In [18] the authors compare single statistical features by their effectiveness for ranking term candidates. In [24] the same comparison is extended with a voting algorithm that combines multiple features. Studies [17], [15] again compare a supervised machine learning method with approaches based on a single feature.
   In turn, the present study experimentally evaluates the ranking methods that combine multiple features: a supervised machine learning approach and a voting algorithm. We pay most of the attention to the supervised method in order to explore its applicability to ATR.
   The purposes of the study are the following:

  • To compare the results of the machine learning approach and the voting algorithm;

  • To compare different machine learning algorithms applied to ATR;

  • To explore how much training data is needed to rank terms;

  • To find the most valuable features for the methods.

This study is organized as follows. At the beginning we describe the approaches in more detail. Section 3 is devoted to the performed experiments: first we describe the evaluation methodology, then report the obtained results, and, finally, discuss them. In Section 4 we conclude the study and consider further research.

2 Related work

In this section we describe some of the approaches to ATR. Most of them have the same extraction algorithm but consider different feature sets, so the final results depend only on the used features. We also briefly describe features used in the task. For a more detailed survey of ATR see [10], [2].

2.1 Extracting term candidates overview

Strictly speaking, all of the word sequences, or n-grams, occurring in a text collection can be term candidates. But in most cases researchers consider only unigrams and bigrams [18]. Of course, only a small part of such candidates are terms, because the candidate list mainly consists of sequences like "a", "the", "some of", "so the", etc. Hence such noise should be filtered out.
   One of the first methods for such filtering was described in [12]. The algorithm extracts term candidates by matching the text collection against predefined part-of-speech (PoS) patterns, such as:

  • Noun

  • Adjective Noun

  • Adjective Noun Noun

   As was reported in [12], such patterns cut off much of the noise (word sequences that are not terms) but retain real terms, because in most cases terms are noun phrases [5]. Filtering of term candidates that do not satisfy certain morphological properties of word sequences is known as the linguistic step of ATR.
   In work [17] the authors do not use predefined patterns, appealing to the fact that a PoS tagger may not be precise enough on some texts; instead, they generate patterns for each text collection. In study [7] no linguistic step is used: the algorithm considers all n-grams from the text collection.

2.2 Features overview

Having a lot of term candidates, it is necessary to recognize the domain-specific ones among them. This can be done by using statistical features computed on the basis of the text collection or some other resource, for example a general corpus [12], a domain ontology [23] or the Web [6]. This part of the ATR algorithm is known as the statistical step.
   Term Frequency is the number of occurrences of the word sequence in the text collection. This feature is based on the assumption that if a word sequence is specific for some domain, then it occurs often in texts of that domain. In some studies frequency is also used as an initial filter of term candidates [3]: if a candidate has a very low frequency, then it is filtered out. This helps to reduce much of the noise and improves the precision of the results.
   TF*IDF has high values for terms that occur often but only in few documents: TF is the term frequency and IDF is the inverted number of documents in which the term occurs:

   TF*IDF(t) = TF(t) \cdot \log \frac{|Docs|}{|\{Doc : t \in Doc\}|}    (1)

   To find domain-specific terms that are distributed over the whole text collection, in [12] IDF is computed as the inverted number of documents of a reference corpus in which the term occurs. A reference corpus is some general, i.e. not domain-specific, text collection.
   The described features show how strongly a word sequence is related to the text collection, or the termhood of a candidate. There is another class of features that show the inner strength of word cohesion, or unithood [10]. One of the first features of this class is the T-test.
   T-test [12] is a statistical test that was initially designed for bigrams and checks the hypothesis of independence of the words constituting a term:

   T\text{-}stat(t) = \frac{TF(t)/N - p}{\sqrt{p(1-p)/N}}    (2)

   where p is the hypothesis of independence and N is the number of bigrams in the corpus.
   The assumption behind this feature is that the text is a Bernoulli process, in which meeting the bigram t is a "success", while meeting any other bigram is a "failure".
   The hypothesis of independence is usually expressed as follows: p = P(w1 w2) = P(w1) · P(w2), where P(w1) is the probability to encounter the first word of the bigram and P(w2) is the probability to encounter the second one. This expression can be estimated by replacing the probabilities of the words with their normalized frequencies within the text: p = (TF(w1)/N) · (TF(w2)/N), where N is the overall number of words in the text.
   If words are independently distributed in the text collection, then they do not form a persistent collocation. It is assumed that any domain-specific term is a collocation, while not every collocation is a domain-specific term. So, considering features like the T-test, we can increase the confidence that a candidate is a collocation, but not necessarily a specific term.
   There are many more features that are used in ATR.
   C-Value [8] has higher values for candidates that are not parts of other word sequences:

   C\text{-}Value(t) = \log_2 |t| \cdot TF(t) - \frac{1}{|\{seq : t \in seq\}|} \sum_{seq : t \in seq} TF(seq)    (3)

   Domain Consensus [14] recognizes terms that are uniformly distributed over the whole dataset:

   DC(t) = - \sum_{d \in Docs} \frac{TF_d(t)}{TF(t)} \log_2 \frac{TF_d(t)}{TF(t)}    (4)

   Domain Relevance [20] compares the frequencies of the term in two datasets, a target one and a general one:

   DR(t) = \frac{TF_{target}(t)}{TF_{target}(t) + TF_{reference}(t)}    (5)

   Lexical Cohesion [16] is a unithood feature that compares the frequency of the term with the frequencies of the words of which it consists:

   LC(t) = \frac{|t| \cdot TF(t) \cdot \log_{10} TF(t)}{\sum_{w \in t} TF(w)}    (6)

   Loglikelihood [12] is an analogue of the T-test but without the assumption about how the words in a text are distributed:

   LL(t) = \log \frac{b(c_{12}; c_1, p)\, b(c_2 - c_{12}; N - c_1, p)}{b(c_{12}; c_1, p_1)\, b(c_2 - c_{12}; N - c_1, p_2)}    (7)

   where c_{12} is the frequency of the bigram t, c_1 is the frequency of the bigram's first word, c_2 is the frequency of its second word, p = c_2/N, p_1 = c_{12}/c_1, p_2 = (c_2 - c_{12})/(N - c_1), and b(\cdot; \cdot, \cdot) is the binomial distribution.
   Relevance [19] is a more sophisticated analogue of Domain Relevance:

   R(t) = 1 - \frac{1}{\log_2\left(2 + \frac{TF_{target}(t) \cdot DF_{target}(t)}{TF_{reference}(t)}\right)}    (8)

   Weirdness [1] also compares frequencies in different collections, but additionally takes into account the sizes of these collections:

   W(t) = \frac{TF_{target}(t) \cdot |Corpus_{reference}|}{TF_{reference}(t) \cdot |Corpus_{target}|}    (9)

   The described feature list includes termhood, unithood and hybrid features. The termhood features are Domain Consensus, Domain Relevance, Relevance, and Weirdness. The unithood features are Lexical Cohesion and Loglikelihood. The hybrid feature, i.e. a feature that reflects both termhood and unithood, is C-Value.
   A lot of works still concentrate on feature engineering, trying to find more informative features. Nevertheless, the recent trend is to combine all these features effectively.
2.3 Recognizing terms overview

Having the feature values, the final results can be produced. The studies [8], [12], [1] use a ranking algorithm to provide the most probable terms, but this algorithm considers only one feature. The studies [20], [16] describe the simplest way in which multiple features can be considered: all values are reduced to a single weighted average value that is then used for ranking.
   In work [21] the authors introduce special rules based on thresholds for feature values. An example of such a rule is the following:

   Rule_i(t) = F_i(t) > a \text{ and } F_i(t) < b    (10)

   where F_i is the i-th feature and a, b are thresholds for its values.
   Note that the thresholds are selected manually or computed from a marked-up corpus, so this method cannot be considered as purely automatic and unsupervised.
   An effective way of combining multiple features was introduced in [24]. It combines the features in a voting manner using the following formula:

   V(t) = \sum_{i=1}^{n} \frac{1}{rank(F_i(t))}    (11)

   where n is the number of considered features and rank(F_i(t)) is the rank of the term t among the values of all terms with respect to feature F_i.
   In addition, study [24] shows that the described voting method in general outperforms most of the methods that consider only one feature or reduce the features to a weighted average value. Another important advantage of the voting algorithm is that it does not require normalization of feature values.
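A compact Python sketch of the rank-based voting of equation (11) is given below; each feature contributes the reciprocal of the candidate's rank under that feature. The candidate names and feature values are made up for illustration.

# Sketch of the voting formula (11): V(t) = sum_i 1 / rank(F_i(t)).
from collections import defaultdict

# candidate -> {feature name: value}; here, higher values are more term-like
features = {
    "gene expression":  {"tfidf": 5.2, "weirdness": 9.0},
    "expression level": {"tfidf": 3.1, "weirdness": 4.5},
    "the result":       {"tfidf": 0.4, "weirdness": 0.1},
}

def voting_scores(features):
    scores = defaultdict(float)
    feature_names = {name for values in features.values() for name in values}
    for name in feature_names:
        # rank 1 is the best candidate with respect to this feature
        ranked = sorted(features, key=lambda c: features[c][name], reverse=True)
        for rank, candidate in enumerate(ranked, start=1):
            scores[candidate] += 1.0 / rank
    return dict(scores)

ranking = sorted(voting_scores(features).items(), key=lambda kv: kv[1], reverse=True)
print(ranking)  # candidates with the highest combined vote come first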
   There are several studies that apply supervised methods to term recognition. In [17] the authors apply an AdaBoost meta-classifier, while in [7] the Ripper system is used. The study [22] describes a hybrid approach that includes both unsupervised and supervised methods.

3 Evaluation

For our experiments we implemented two approaches to ATR. We used the voting algorithm as the first one, while in the supervised case we trained two classifiers: Random Forest and Logistic Regression from the WEKA library (http://www.cs.waikato.ac.nz/ml/weka/). These classifiers were chosen because of their effectiveness and the good generalization ability of the resulting models. Furthermore, these classifiers are able to produce a classification confidence, a numeric score that can be used to rank an example within the overall test set. This is an important property of the selected algorithms that allows us to compare their results with the results produced by other ranking methods.
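The paper's experiments use WEKA; purely as an illustration of ranking by classification confidence, the following sketch does the same with scikit-learn's Logistic Regression. The feature vectors, labels and candidate strings are invented for the example.

# Illustrative sketch (scikit-learn, not the WEKA setup used in the paper):
# rank unseen candidates by the predicted probability of the "term" class.
import numpy as np
from sklearn.linear_model import LogisticRegression

# rows: candidates; columns could be, e.g., (TF*IDF, Weirdness, Words Count)
X_train = np.array([[3.2, 8.1, 2], [0.1, 0.3, 1], [2.5, 6.0, 2], [0.2, 0.5, 1]])
y_train = np.array([1, 0, 1, 0])                      # 1 = term, 0 = non-term

X_test = np.array([[2.9, 7.0, 2], [0.3, 0.2, 1], [1.5, 3.0, 2]])
test_candidates = ["gene expression", "so the", "expression level"]

clf = LogisticRegression().fit(X_train, y_train)
confidence = clf.predict_proba(X_test)[:, list(clf.classes_).index(1)]

# the classification confidence induces a ranking over the test candidates
for candidate, score in sorted(zip(test_candidates, confidence), key=lambda x: -x[1]):
    print(f"{candidate}: {score:.3f}")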
3.1 Evaluation methodology

The quality of the algorithms is usually assessed by two common metrics: precision and recall [11]. Precision is the fraction of retrieved instances that are relevant:

   P = \frac{|correct\ returned\ results|}{|all\ returned\ results|}    (12)

   Recall is the fraction of relevant instances that are retrieved:

   R = \frac{|correct\ returned\ results|}{|all\ correct\ results|}    (13)

   In addition to the precision and recall scores, Average Precision (AvP) [12] is commonly used [24] to assess ranked results. It is defined as:

   AvP = \sum_{i=1}^{N} P(i)\, \Delta R(i)    (14)

   where P(i) is the precision of the top-i results and \Delta R(i) is the change in recall from the top-(i-1) to the top-i results.
   Obviously, this score tends to be higher for algorithms that place correct terms at the top positions of the result.
   In our experiments we considered only the AvP score, while precision and recall are omitted. For the voting algorithm there is no simple way to compute recall, because it is not obvious how many of the top results should be considered as correct terms. Also, in the general case, the overall number of terms in a dataset is unknown.
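For reference, a small Python sketch of the AvP computation of equation (14) over a ranked candidate list is shown below; the ranked list and the gold term set are toy examples.

# Sketch of Average Precision, equation (14): sum of P(i) * delta R(i).
def average_precision(ranked_candidates, true_terms):
    hits, avp = 0, 0.0
    for i, candidate in enumerate(ranked_candidates, start=1):
        if candidate in true_terms:
            hits += 1
            # recall changes by 1/|true_terms| exactly at relevant positions,
            # so only those positions contribute P(i) * (1/|true_terms|)
            avp += (hits / i) / len(true_terms)
    return avp

ranked = ["gene expression", "so the", "expression level", "the result"]
gold = {"gene expression", "expression level"}
print(average_precision(ranked, gold))  # (1/1 + 2/3) / 2 = 0.8333...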
3.2 Features

For our experiments we implemented the following features: C-Value, Domain Consensus, Domain Relevance, Frequency, Lexical Cohesion, Loglikelihood, Relevance, TF*IDF, Weirdness and Words Count. Words Count is a simple feature that shows the number of words in a word sequence. This feature may be useful for the classifier since the values of other features may have different meanings for single- and multi-word terms [2].
   Most of these features are capable of recognizing both single- and multi-word terms, except the T-test and Loglikelihood, which are designed to recognize only two-word terms (bigrams). We generalize them to the case of n-grams according to the study [4].
   Some of the features use information from a collection of general-domain texts (a reference corpus); in our case these features are Domain Relevance, Relevance and Weirdness. For this purpose we use statistics from the Corpus of Contemporary American English (statistics available at www.ngrams.info).
   For extracting term candidates we implemented a simple approach based on predefined part-of-speech patterns (a small matching sketch is given after the list). For simplicity, we extracted only unigrams, bigrams and trigrams by using patterns such as:

 1. Noun

 2. Noun Noun

 3. Adjective Noun

 4. Noun Noun Noun

 5. Adjective Noun Noun

 6. Noun Adjective Noun
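The pattern matching itself can be sketched as below, over already PoS-tagged text. The Penn Treebank tag prefixes (NN, JJ) and the toy sentence are assumptions of the example; any off-the-shelf tagger could produce the input.

# Sketch of candidate extraction with the part-of-speech patterns listed above.
PATTERNS = [
    ("Noun",),
    ("Noun", "Noun"),
    ("Adjective", "Noun"),
    ("Noun", "Noun", "Noun"),
    ("Adjective", "Noun", "Noun"),
    ("Noun", "Adjective", "Noun"),
]

def coarse(tag):
    """Map Penn Treebank tags onto the coarse classes used by the patterns."""
    if tag.startswith("NN"):
        return "Noun"
    if tag.startswith("JJ"):
        return "Adjective"
    return "Other"

def extract_candidates(tagged_sentence):
    words = [w for w, _ in tagged_sentence]
    tags = [coarse(t) for _, t in tagged_sentence]
    candidates = set()
    for pattern in PATTERNS:
        n = len(pattern)
        for i in range(len(tags) - n + 1):
            if tuple(tags[i:i + n]) == pattern:
                candidates.add(" ".join(words[i:i + n]).lower())
    return candidates

sentence = [("Nuclear", "JJ"), ("factor", "NN"), ("expression", "NN"),
            ("is", "VBZ"), ("regulated", "VBN")]
print(sorted(extract_candidates(sentence)))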
3.3 Datasets

Evaluation of the approaches was performed on two datasets of the medical and biological domains consisting of short English texts with marked-up domain-specific terms:

       Corpus        Documents      Words      Terms
       GENIA         2000           400000     35000
       Bio1          100            20000      1200

   The second one (Bio1) has texts in common with the first (GENIA), so we filtered out the texts that occur in both corpora. We left GENIA without any modifications, while 20 texts were removed from Bio1 as texts common to both corpora.

3.4 Experimental results

3.4.1 Machine learning method versus Voting algorithm

We considered two test scenarios in order to compare the quality of the implemented algorithms. For each scenario we performed two kinds of tests: with and without filtering of rare term candidates. In the following tests the whole feature set was considered and the overall ranked result was assessed.

Cross-validation

We performed 4-fold cross-validation of the algorithms on both corpora. We extracted term candidates from the whole dataset and divided them into train and test sets. In other words, we considered the case when, having some marked-up examples (the train set), we should recognize terms in the rest of the data (the test set) extracted from the same corpus. In the case of the voting algorithm the training set was simply ignored.
   The results of cross-validation are shown in Tables 1 and 2. Table 2 presents the results of cross-validation on term candidates that appear at least two times in the corpus.

       Dataset      Algorithm               AvP
       GENIA        Random Forest           0.54
       GENIA        Logistic Regression     0.55
       GENIA        Voting                  0.53
       Bio1         Random Forest           0.35
       Bio1         Logistic Regression     0.40
       Bio1         Voting                  0.23

Table 1: Results of cross-validation without the frequency filter

       Dataset      Algorithm               AvP
       GENIA        Random Forest           0.66
       GENIA        Logistic Regression     0.70
       GENIA        Voting                  0.65
       Bio1         Random Forest           0.52
       Bio1         Logistic Regression     0.58
       Bio1         Voting                  0.31

Table 2: Results of cross-validation with the frequency filter

   As we can see, in both cases the machine learning approach outperformed the voting algorithm. Moreover, in the case without rare terms the difference in scores is higher. This can be explained as follows: the feature values of rare terms (especially Frequency and Domain Consensus) are useless for the classification and add noise to the model. When such terms are omitted, the model becomes cleaner.
   Also, in most cases the Logistic Regression algorithm outperformed Random Forest, so in most of the further tests we used only the better one.

Separate train and test datasets

Having two datasets of the same field, the idea is to check how well a model trained on one of them can predict the data from the other. For this purpose we used GENIA as the training set and Bio1 as the test set, and then vice versa.
   The results are shown in Tables 3 and 4. In the case when Bio1 was used as the training set, the voting algorithm outperformed the trained classifier. This could happen due to the fact that the training data from Bio1 does not fully reflect the properties of terms in GENIA.

       Trainset     Testset     Algorithm               AvP
       GENIA        Bio1        Random Forest           0.30
       GENIA        Bio1        Logistic Regression     0.35
       –            Bio1        Voting                  0.25
       Bio1         GENIA       Random Forest           0.44
       Bio1         GENIA       Logistic Regression     0.42
       –            GENIA       Voting                  0.55

Table 3: Results of evaluation on separate train and test sets without the frequency filter

       Trainset     Testset     Algorithm               AvP
       GENIA        Bio1        Random Forest           0.34
       GENIA        Bio1        Logistic Regression     0.48
       –            Bio1        Voting                  0.31
       Bio1         GENIA       Random Forest           0.60
       Bio1         GENIA       Logistic Regression     0.62
       –            GENIA       Voting                  0.65

Table 4: Results of evaluation on separate train and test sets with the frequency filter
3.4.2 Dependency of average precision on the number of top results

In the previous tests we considered the overall results produced by the algorithms. Descending from the top to the bottom of the ranked list, the AvP score can change significantly, so one algorithm can outperform another on the top-100 results but lose on the top-1000. In order to explore this dependency, we measured AvP for different slices of the top results.
   Figure 1 shows the dependency of AvP on the number of top results given by 4-fold cross-validation.

Figure 1: Dependency of AvP on the number of top results given by cross-validation

   We also considered a scenario when GENIA was used for training and Bio1 for testing. The results are presented in Figure 2.

Figure 2: Dependency of AvP on the number of top results on separate train and test sets

3.4.3 Dependency of classifier performance on the training set size

In order to explore the dependency between the amount of data used for training and average precision, we considered three test scenarios.
   At first, we trained the classifiers on the GENIA dataset and tested them on Bio1. At each step the amount of training data was decreased, while the test data remained without any modifications. The results of this test are presented in Figure 3.

Figure 3: Dependency of AvP on the train set size on separate train and test sets

   Next, we started with 10-fold cross-validation on GENIA and at each step decreased the number of folds used for training of Logistic Regression, while the number of folds used for testing was not changed. The results are shown in Figures 4–8.
   The last test is the same as the previous one, except that the number of test folds was increased at each step. So we started with nine folds used for training and one fold used for testing. At the next step we moved one fold from the training set to the test set and evaluated again. The results are presented in Figures 9–13. An interesting observation is that higher values of AvP correspond to bigger sizes of the test set. This could happen because with the growth of the test set the number of high-confidence terms also grows: such terms take most of the top positions of the list and improve AvP. In the case of GENIA and Bio1 the top of the list mainly consists of highly domain-specific terms that take high values for features like Domain Relevance, Relevance and Weirdness: such terms occur in the corpora frequently enough.
   As we can see, in all of the cases the gain of AvP stops quickly. So, in the case of GENIA, it is enough to train on 10% of the candidates to rank the remaining 90% with the same performance. This could happen because of the relatively small number of features used and their specificity: most of them are designed to have high magnitude for terms and low magnitude for non-terms. So the data can be easily separated by the classifier given only a few training examples.

3.5 Feature selection

Feature selection (FS) is the process of finding the most relevant features for the task. Having a lot of different features, the goal is to exclude redundant and irrelevant ones from the feature set. Redundant features provide no useful information compared with the current feature set, while irrelevant features do not provide useful information in any context.
   There are different algorithms of FS. Some of them rank separate features by their relevance to the task, while others search for subsets of features that produce the best model for the predictor [9]. The algorithms also differ in their complexity. Because of the big number of features used in some tasks, it is not always possible to do an exhaustive search, so features are selected by greedy algorithms [13].
   In our task we concentrated on searching for the subsets of features that give the best results. For this purpose we ran quality tests for all possible feature subsets, or, in other words, performed an exhaustive search. Having 10 features, we check 2^10 − 1 different combinations of them. In the case of the machine learning method, we used nine folds for testing and one fold for training. The reason for such a configuration is that the classifier needs little data for training to rank terms with the same performance (see the previous section).
For the voting algorithm, we simply ranked the candidates and then assessed the overall list. All of the tests were performed on the GENIA corpus, and only Logistic Regression was used as the machine learning algorithm.
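The exhaustive search over feature subsets can be organized as in the following sketch: every non-empty subset of the ten features is generated and evaluated, and the best-scoring one is kept. The evaluation function here is a placeholder; in the actual experiments it would train and assess the ranker with the given subset.

# Sketch of the exhaustive feature-subset search (2**10 - 1 = 1023 subsets).
from itertools import combinations

FEATURES = ["C-Value", "Domain Consensus", "Domain Relevance", "Frequency",
            "Lexical Cohesion", "Loglikelihood", "Relevance", "TF*IDF",
            "Weirdness", "Words Count"]

def evaluate(subset):
    """Placeholder: should train/rank with this subset and return its AvP."""
    return len(subset) % 4 + len(subset[0])   # dummy score, for the sketch only

best_subset, best_score, examined = None, float("-inf"), 0
for k in range(1, len(FEATURES) + 1):
    for subset in combinations(FEATURES, k):
        examined += 1
        score = evaluate(subset)
        if score > best_score:
            best_subset, best_score = subset, score

print(examined)                 # 1023 subsets examined
print(best_subset, best_score)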
   The AvP score was computed for different slices of the top terms: 100, 1000, 5000, 10000, and 20000. The same slices are used in [24]. The best results for the algorithms are presented in Tables 5 and 6. These tables show that the voting algorithm has better scores than the machine learning method, but these results are not fully comparable: FS for the voting algorithm was performed on the whole dataset, while Logistic Regression was trained on 10% of the term candidates. The average performance gain for the voting algorithm is about 7%, while for machine learning it is only about 3%.

       Top count    All features    The best features
       100          0.9256          0.9915
       1000         0.8138          0.8761
       5000         0.7128          0.7885
       10000        0.667           0.7380
       20000        0.6174          0.6804

Table 5: Results of FS for the voting algorithm

       Top count    All features    The best features
       100          0.8997          0.9856
       1000         0.8414          0.8757
       5000         0.7694          0.7875
       10000        0.7309          0.7329
       20000        0.6623          0.6714

Table 6: Results of FS for Logistic Regression

   The best features for the voting algorithm:

 1. Top-100: Relevance, TF*IDF

 2. Top-1000: Relevance, Weirdness, TF*IDF

 3. Top-5000: Weirdness

 4. Top-10000: Weirdness

 5. Top-20000: C-Value, Frequency, Domain Relevance, Weirdness

   The best features for the machine learning approach:

 1. Top-100: Words Count, Domain Consensus, Normalized Frequency, Domain Relevance, TF*IDF

 2. Top-1000: Words Count, Domain Relevance, Weirdness, TF*IDF

 3. Top-5000: Words Count, Frequency, Lexical Cohesion, Relevance, Weirdness

 4. Top-10000: Words Count, C-Value, Domain Consensus, Frequency, Weirdness, TF*IDF

 5. Top-20000: Words Count, C-Value, Domain Relevance, Weirdness, TF*IDF

   As we can see, most of the subsets contain features based on a general-domain corpus. The reason may be that the target corpus has high specificity, so most of its terms do not occur in a general corpus.
   The next observation is that, in the case of the machine learning algorithm, the Words Count feature occurs in all of the subsets. This confirms the assumption that this feature is useful for algorithms that recognize both single- and multi-word terms.

3.6 Discussion

Despite the fact that filtering out the candidates occurring only once in the corpus improves the average precision of the methods, it is not always a good idea to exclude such candidates. The reason is that many specific terms can occur only once in a dataset: for example, in GENIA 50% of the considered terms occur only once. Of course, omitting such terms strongly affects the recall of the result. Thus such cases should also be considered in the ATR task.
   One of the interesting observations is that the amount of training data needed to rank terms without a significant performance drop is extremely low. This leads to the idea of applying a bootstrapping approach to ATR (a small sketch is given after the list):

 1. Having a few marked-up examples, train the classifier.

 2. Use the classifier to extract new terms.

 3. Use the most confident terms as the initial data at step 1.

 4. Iterate until all confident terms have been extracted.

   This is a semi-supervised method, because only a little marked-up data is needed to run the algorithm. The method can also be transformed into a fully unsupervised one if the initial data is extracted by some unsupervised approach (for example, by the voting algorithm). A similar idea is implemented in study [22].
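A possible reading of this loop in code is sketched below with scikit-learn; the random data, the confidence threshold of 0.95 and the fixed number of iterations are illustrative assumptions.

# Sketch of the bootstrapping loop: train on a few labelled examples, add the
# most confident predictions to the training set, and repeat.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_pool = rng.normal(size=(200, 3))                      # unlabelled candidate features
X_train = np.array([[2.0, 2.0, 2.0], [-2.0, -2.0, -2.0]])
y_train = np.array([1, 0])                              # a few marked-up examples

for iteration in range(5):
    clf = LogisticRegression().fit(X_train, y_train)
    proba = clf.predict_proba(X_pool)[:, 1]
    confident = (proba > 0.95) | (proba < 0.05)         # most confident predictions
    if not confident.any():
        break                                           # nothing confident is left
    X_train = np.vstack([X_train, X_pool[confident]])
    y_train = np.concatenate([y_train, (proba[confident] > 0.5).astype(int)])
    X_pool = X_pool[~confident]                         # shrink the unlabelled pool
    print(f"iteration {iteration}: training set size {len(y_train)}")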

4 Conclusion and Future work

In this paper we have compared the performance of two approaches to ATR: a machine learning method and a voting algorithm. For this purpose we implemented a set of features that includes linguistic, statistical, termhood and unithood feature types. All of the algorithms produced a ranked list of terms that was then assessed by the average precision score.
   In most tests the machine learning method outperforms the voting algorithm. Moreover, it was shown that for the supervised method it is enough to have few marked-up examples, about 10% in the case of the GENIA dataset, to rank terms with good performance. This leads to the idea of applying bootstrapping to ATR. Furthermore, the initial data for bootstrapping can be obtained by the voting algorithm, because its top results are precise enough (see Figure 1).
   The best feature subsets for the task were also explored. Most of these features are based on a comparison between the domain-specific document collection and a reference general corpus. In the case of the supervised approach, the Words Count feature occurs in all of the subsets, so this feature is useful for the classifier, because the values of other features may have different meanings for single- and multi-word terms.
   In the cases when one dataset is used for training and another for testing, we could not get a stable performance gain using machine learning. Even if the datasets are of the same field, the distribution of terms can be different. So it is still unclear whether it is possible to recognize terms from unseen data of the same field with a once-trained classifier.
   For our experiments we implemented a simple method of term candidate extraction: we filter out n-grams that do not match predefined part-of-speech patterns. This step of ATR can be performed in other ways, for example by shallow parsing, or chunking (a free chunker can be found in the OpenNLP project: http://opennlp.apache.org), by generating patterns from the dataset [17], or by recognizing term variants.
   Another direction of further research is related to the evaluation of the algorithms on more datasets of different languages and to researching the ability of cross-domain term recognition, i.e. using a dataset of one domain to recognize terms from others.
   Also of particular interest is the implementation and evaluation of semi- and unsupervised methods that involve machine learning techniques.

References

 [1] K. Ahmad, L. Gillam, L. Tostevin, et al. University of Surrey participation in TREC8: Weirdness indexing for logical document extrapolation and retrieval (WILDER). In The Eighth Text REtrieval Conference (TREC-8), 1999.

 [2] Lars Ahrenberg. Term extraction: A review. Draft version 091221, 2009.

 [3] K.W. Church and P. Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29, 1990.

 [4] B. Daille. Study and implementation of combined techniques for automatic extraction of terminology. The Balancing Act: Combining Symbolic and Statistical Approaches to Language, 1:49–66, 1996.

 [5] B. Daille, B. Habert, C. Jacquemin, and J. Royauté. Empirical observation of term variations and principles for their description. Terminology, 3(2):197–257, 1996.

 [6] B. Dobrov and N. Loukachevitch. Multiple evidence for term extraction in broad domains. In Proceedings of the 8th Recent Advances in Natural Language Processing Conference (RANLP 2011), Hissar, Bulgaria, pages 710–715, 2011.

 [7] J. Foo. Term extraction using machine learning. 2009.

 [8] K.T. Frantzi and S. Ananiadou. Extracting nested collocations. In Proceedings of the 16th Conference on Computational Linguistics, Volume 1, pages 41–46. Association for Computational Linguistics, 1996.

 [9] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. The Journal of Machine Learning Research, 3:1157–1182, 2003.

[10] K. Kageura and B. Umino. Methods of automatic term recognition: A review. Terminology, 3(2):259–289, 1996.

[11] C.D. Manning and P. Raghavan. Introduction to Information Retrieval, volume 1. Cambridge University Press, 2008.

[12] C.D. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.

[13] L.C. Molina, L. Belanche, and À. Nebot. Feature selection algorithms: A survey and experimental evaluation. In Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM 2002), pages 306–313. IEEE, 2002.

[14] R. Navigli and P. Velardi. Semantic interpretation of terminological strings. In Proceedings of the 6th International Conference on Terminology and Knowledge Engineering, pages 95–100, 2002.

[15] M.A. Nokel, E.I. Bolshakova, and N.V. Loukachevitch. Combining multiple features for single-word term extraction. 2012.

[16] Y. Park, R.J. Byrd, and B.K. Boguraev. Automatic glossary extraction: beyond terminology identification. In Proceedings of the 19th International Conference on Computational Linguistics, Volume 1, pages 1–7. Association for Computational Linguistics, 2002.

[17] A. Patry and P. Langlais. Corpus-based terminology extraction. In Terminology and Content Development: Proceedings of the 7th International Conference on Terminology and Knowledge Engineering, Litera, Copenhagen, 2005.

[18] M. Pazienza, M. Pennacchiotti, and F. Zanzotto. Terminology extraction: an analysis of linguistic and statistical approaches. Knowledge Mining, pages 255–279, 2005.

[19] A. Peñas, F. Verdejo, J. Gonzalo, et al. Corpus-based terminology extraction applied to information access. In Proceedings of Corpus Linguistics, volume 2001. Citeseer, 2001.

[20] F. Sclano and P. Velardi. TermExtractor: a web application to learn the shared terminology of emergent web communities. Enterprise Interoperability II, pages 287–290, 2007.

[21] P. Velardi, M. Missikoff, and R. Basili. Identification of relevant terms to support the construction of domain ontologies. In Proceedings of the Workshop on Human Language Technology and Knowledge Management, Volume 2001, page 5. Association for Computational Linguistics, 2001.
[22] Y. Yang, H. Yu, Y. Meng, Y. Lu, and Y. Xia. Fault-tolerant learning for term extraction. 2011.

[23] W. Zhang, T. Yoshida, and X. Tang. Using ontology to improve precision of terminology extraction from documents. Expert Systems with Applications, 36(5):9333–9339, 2009.

[24] Ziqi Zhang, Christopher Brewster, and Fabio Ciravegna. A comparative evaluation of term recognition algorithms. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC08), Marrakech, Morocco, 2008.
Figure 4: Dependency of AvP on the number of excluded folds with fixed test-set size: 10-fold cross-validation with 1 test fold and 9 to 1 train folds: Top-100 terms

Figure 5: Dependency of AvP on the number of excluded folds with fixed test-set size: Top-1000 terms

Figure 6: Dependency of AvP on the number of excluded folds with fixed test-set size: Top-5000 terms

Figure 7: Dependency of AvP on the number of excluded folds with fixed test-set size: Top-10000 terms

Figure 8: Dependency of AvP on the number of excluded folds with fixed test-set size: Top-20000 terms

Figure 9: Dependency of AvP on the number of excluded folds with changing test-set size: 10-fold cross-validation with 1 to 9 test folds and 9 to 1 train folds: Top-100 terms

Figure 10: Dependency of AvP on the number of excluded folds with changing test-set size: Top-1000 terms

Figure 11: Dependency of AvP on the number of excluded folds with changing test-set size: Top-5000 terms

Figure 12: Dependency of AvP on the number of excluded folds with changing test-set size: Top-10000 terms

Figure 13: Dependency of AvP on the number of excluded folds with changing test-set size: Top-20000 terms