           Automatic recognition of domain-specific terms: an
                        experimental evaluation

                 © Denis Fedorenko                 Nikita Astrakhantsev              Denis Turdakov
                         Institute for System Programming of Russian Academy of Sciences
                         fedorenko@ispras.ru, astrakhantsev@ispras.ru, turdakov@ispras.ru



                       Abstract

    This paper presents an experimental evaluation of the state-of-the-art approaches to automatic term recognition based on multiple features: a machine learning method and a voting algorithm. We show that in most cases the machine learning approach obtains the best results and needs little data for training; we also find the best subsets of all popular features.

Proceedings of the Ninth Spring Researcher's Colloquium on Database and Information Systems, Kazan, Russia, 2013

1 Introduction

Automatic term recognition (ATR) is an important problem of text processing. The task is to recognize and extract terminological units from domain-specific text collections. The resulting terms can be useful in more complex tasks such as semantic search, question answering, ontology construction, word sense induction, etc.
   There have been many studies of ATR. Most of them split the task into three common steps (a schematic sketch is given after the list):

 1. Extracting term candidates. At this step a special algorithm extracts words and word sequences that are admissible as terms. In most cases researchers use predefined or generated part-of-speech patterns to filter out word sequences that do not match the patterns. The remaining word sequences become term candidates.

 2. Extracting features of term candidates. A feature is a measurable characteristic of a candidate that is used to recognize terms. There are many statistical and linguistic features that can be useful for term recognition.

 3. Extracting final terms from candidates. This step varies depending upon the way in which researchers use features to recognize terms. In some studies the authors filter out non-terms by comparing feature values with thresholds: if the feature values lie in specific ranges, then the candidate is considered to be a term. Others rank the candidates and expect the top-N ones to be terms. Finally, a few studies apply supervised machine learning methods in order to combine the features effectively.
   There are several studies comparing different approaches to ATR. In [18] the authors compare single statistical features by their effectiveness for ranking term candidates. In [24] the same comparison is extended with a voting algorithm that combines multiple features. Studies [17], [15] again compare a supervised machine learning method with approaches based on a single feature.
   In turn, the present study experimentally evaluates the ranking methods that combine multiple features: a supervised machine learning approach and a voting algorithm. We pay most of the attention to the supervised method in order to explore its applicability to ATR.
   The purposes of the study are the following:

  • To compare the results of the machine learning approach and the voting algorithm;

  • To compare different machine learning algorithms applied to ATR;

  • To explore how much training data is needed to rank terms;

  • To find the most valuable features for the methods.

This study is organized as follows. At the beginning we describe the approaches in more detail. Section 3 is devoted to the performed experiments: first we describe the evaluation methodology, then report the obtained results, and, finally, discuss them. In Section 4 we conclude the study and consider further research.

2 Related work

In this section we describe some of the approaches to ATR. Most of them have the same extraction algorithm but consider different feature sets, so the final results depend only on the used features. We also briefly describe features used in the task. For a more detailed survey of ATR see [10], [2].

2.1 Extracting term candidates overview

Strictly speaking, all of the word sequences, or n-grams, occurring in a text collection can be term candidates. But in most cases researchers consider only unigrams and bigrams [18]. Of course, only a small part of such candidates are terms, because the candidate list mainly consists of sequences like "a", "the", "some of", "so the", etc. Hence such noise should be filtered out.
   One of the first methods for such filtering was described in [12]. The algorithm extracts term candidates by matching the text collection against predefined part-of-speech (PoS) patterns, such as:

  • Noun

  • Adjective Noun

  • Adjective Noun Noun

   As was reported in [12], such patterns cut off much of the noise (word sequences that are not terms) but retain real terms, because in most cases terms are noun phrases [5]. Filtering of term candidates that do not satisfy certain morphological properties of word sequences is known as the linguistic step of ATR.
   In work [17] the authors do not use predefined patterns, appealing to the fact that a PoS tagger may not be precise enough on some texts; instead, they generate patterns for each text collection. In study [7] no linguistic step is used: the algorithm considers all n-grams from the text collection.

2.2 Features overview

Having a lot of term candidates, it is necessary to recognize the domain-specific ones among them. This can be done by using statistical features computed on the basis of the text collection or some other resource, for example a general corpus [12], a domain ontology [23] or the Web [6]. This part of the ATR algorithm is known as the statistical step.
   Term Frequency is the number of occurrences of the word sequence in the text collection. This feature is based on the assumption that if a word sequence is specific for some domain, then it occurs often in texts of that domain. In some studies frequency is also used as an initial filter of term candidates [3]: if a candidate has a very low frequency, then it is filtered out. This helps to reduce much of the noise and improves the precision of the results.
   TF*IDF has high values for terms that occur often but only in few documents: TF is the term frequency and IDF is the inverted number of documents in which the term occurs:

   TF*IDF(t) = TF(t) \cdot \log \frac{|Docs|}{|\{Doc : t \in Doc\}|}    (1)

   To find domain-specific terms that are distributed over the whole text collection, in [12] IDF is computed as the inverted number of documents of a reference corpus in which the term occurs. A reference corpus is some general, i.e. not domain-specific, text collection.
   The described features show how strongly a word sequence is related to the text collection, or the termhood of a candidate. There is another class of features that show the inner strength of word cohesion, or unithood [10]. One of the first features of this class is the T-test.
   T-test [12] is a statistical test that was initially designed for bigrams and checks the hypothesis of independence of the words constituting a term:

   T\text{-}stat(t) = \frac{TF(t)/N - p}{\sqrt{p(1-p)/N}}    (2)

   where p is the hypothesis of independence and N is the number of bigrams in the corpus.
   The assumption behind this feature is that the text is a Bernoulli process, in which meeting the bigram t is a "success", while meeting any other bigram is a "failure".
   The hypothesis of independence is usually expressed as follows: p = P(w1 w2) = P(w1) · P(w2), where P(w1) is the probability to encounter the first word of the bigram and P(w2) is the probability to encounter the second one. This expression can be estimated by replacing the probabilities of the words with their normalized frequencies within the text: p = (TF(w1)/N) · (TF(w2)/N), where N is the overall number of words in the text.
   If words are independently distributed in the text collection, then they do not form a persistent collocation. It is assumed that any domain-specific term is a collocation, while not every collocation is a domain-specific term. So, considering features like the T-test, we can increase the confidence that a candidate is a collocation, but not necessarily a specific term.
   There are many more features that are used in ATR.
   C-Value [8] has higher values for candidates that are not parts of other word sequences:

   C\text{-}Value(t) = \log_2 |t| \cdot TF(t) - \frac{1}{|\{seq : t \in seq\}|} \sum_{seq : t \in seq} TF(seq)    (3)

   Domain Consensus [14] recognizes terms that are uniformly distributed over the whole dataset:

   DC(t) = - \sum_{d \in Docs} \frac{TF_d(t)}{TF(t)} \log_2 \frac{TF_d(t)}{TF(t)}    (4)

   Domain Relevance [20] compares the frequencies of the term in two datasets, a target one and a general one:

   DR(t) = \frac{TF_{target}(t)}{TF_{target}(t) + TF_{reference}(t)}    (5)

   Lexical Cohesion [16] is a unithood feature that compares the frequency of the term with the frequencies of the words of which it consists:

   LC(t) = \frac{|t| \cdot TF(t) \cdot \log_{10} TF(t)}{\sum_{w \in t} TF(w)}    (6)

   Loglikelihood [12] is an analogue of the T-test but without the assumption about how the words in a text are distributed:

   LL(t) = \log \frac{b(c_{12}; c_1, p)\, b(c_2 - c_{12}; N - c_1, p)}{b(c_{12}; c_1, p_1)\, b(c_2 - c_{12}; N - c_1, p_2)}    (7)

   where c_{12} is the frequency of the bigram t, c_1 is the frequency of the bigram's first word, c_2 is the frequency of its second word, p = c_2/N, p_1 = c_{12}/c_1, p_2 = (c_2 - c_{12})/(N - c_1), and b(\cdot; \cdot, \cdot) is the binomial distribution.
   Relevance [19] is a more sophisticated analogue of Domain Relevance:

   R(t) = 1 - \frac{1}{\log_2\left(2 + \frac{TF_{target}(t) \cdot DF_{target}(t)}{TF_{reference}(t)}\right)}    (8)

   Weirdness [1] also compares frequencies in different collections, but additionally takes into account the sizes of these collections:

   W(t) = \frac{TF_{target}(t) \cdot |Corpus_{reference}|}{TF_{reference}(t) \cdot |Corpus_{target}|}    (9)

   The described feature list includes termhood, unithood and hybrid features. The termhood features are Domain Consensus, Domain Relevance, Relevance, and Weirdness. The unithood features are Lexical Cohesion and Loglikelihood. The hybrid feature, i.e. a feature that reflects both termhood and unithood, is C-Value.
   A lot of works still concentrate on feature engineering, trying to find more informative features. Nevertheless, the recent trend is to combine all these features effectively.
2.3 Recognizing terms overview

Having the feature values, the final results can be produced. The studies [8], [12], [1] use a ranking algorithm to provide the most probable terms, but this algorithm considers only one feature. The studies [20], [16] describe the simplest way in which multiple features can be considered: all values are reduced to a single weighted average value that is then used for ranking.
   In work [21] the authors introduce special rules based on thresholds for feature values. An example of such a rule is the following:

   Rule_i(t) = F_i(t) > a \text{ and } F_i(t) < b    (10)

   where F_i is the i-th feature and a, b are thresholds for its values.
   Note that the thresholds are selected manually or computed from a marked-up corpus, so this method cannot be considered as purely automatic and unsupervised.
   An effective way of combining multiple features was introduced in [24]. It combines the features in a voting manner using the following formula:

   V(t) = \sum_{i=1}^{n} \frac{1}{rank(F_i(t))}    (11)

   where n is the number of considered features and rank(F_i(t)) is the rank of the term t among the values of all terms with respect to feature F_i.
   In addition, study [24] shows that the described voting method in general outperforms most of the methods that consider only one feature or reduce the features to a weighted average value. Another important advantage of the voting algorithm is that it does not require normalization of feature values.
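A compact Python sketch of the rank-based voting of equation (11) is given below; each feature contributes the reciprocal of the candidate's rank under that feature. The candidate names and feature values are made up for illustration.

# Sketch of the voting formula (11): V(t) = sum_i 1 / rank(F_i(t)).
from collections import defaultdict

# candidate -> {feature name: value}; here, higher values are more term-like
features = {
    "gene expression":  {"tfidf": 5.2, "weirdness": 9.0},
    "expression level": {"tfidf": 3.1, "weirdness": 4.5},
    "the result":       {"tfidf": 0.4, "weirdness": 0.1},
}

def voting_scores(features):
    scores = defaultdict(float)
    feature_names = {name for values in features.values() for name in values}
    for name in feature_names:
        # rank 1 is the best candidate with respect to this feature
        ranked = sorted(features, key=lambda c: features[c][name], reverse=True)
        for rank, candidate in enumerate(ranked, start=1):
            scores[candidate] += 1.0 / rank
    return dict(scores)

ranking = sorted(voting_scores(features).items(), key=lambda kv: kv[1], reverse=True)
print(ranking)  # candidates with the highest combined vote come first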
   There are several studies that apply supervised methods to term recognition. In [17] the authors apply an AdaBoost meta-classifier, while in [7] the Ripper system is used. The study [22] describes a hybrid approach that includes both unsupervised and supervised methods.

3 Evaluation

For our experiments we implemented two approaches to ATR. We used the voting algorithm as the first one, while in the supervised case we trained two classifiers: Random Forest and Logistic Regression from the WEKA library (http://www.cs.waikato.ac.nz/ml/weka/). These classifiers were chosen because of their effectiveness and the good generalization ability of the resulting models. Furthermore, these classifiers are able to produce a classification confidence, a numeric score that can be used to rank an example within the overall test set. This is an important property of the selected algorithms that allows us to compare their results with the results produced by other ranking methods.
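The paper's experiments use WEKA; purely as an illustration of ranking by classification confidence, the following sketch does the same with scikit-learn's Logistic Regression. The feature vectors, labels and candidate strings are invented for the example.

# Illustrative sketch (scikit-learn, not the WEKA setup used in the paper):
# rank unseen candidates by the predicted probability of the "term" class.
import numpy as np
from sklearn.linear_model import LogisticRegression

# rows: candidates; columns could be, e.g., (TF*IDF, Weirdness, Words Count)
X_train = np.array([[3.2, 8.1, 2], [0.1, 0.3, 1], [2.5, 6.0, 2], [0.2, 0.5, 1]])
y_train = np.array([1, 0, 1, 0])                      # 1 = term, 0 = non-term

X_test = np.array([[2.9, 7.0, 2], [0.3, 0.2, 1], [1.5, 3.0, 2]])
test_candidates = ["gene expression", "so the", "expression level"]

clf = LogisticRegression().fit(X_train, y_train)
confidence = clf.predict_proba(X_test)[:, list(clf.classes_).index(1)]

# the classification confidence induces a ranking over the test candidates
for candidate, score in sorted(zip(test_candidates, confidence), key=lambda x: -x[1]):
    print(f"{candidate}: {score:.3f}")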
3.1 Evaluation methodology

The quality of the algorithms is usually assessed by two common metrics: precision and recall [11]. Precision is the fraction of retrieved instances that are relevant:

   P = \frac{|correct\ returned\ results|}{|all\ returned\ results|}    (12)

   Recall is the fraction of relevant instances that are retrieved:

   R = \frac{|correct\ returned\ results|}{|all\ correct\ results|}    (13)

   In addition to the precision and recall scores, Average Precision (AvP) [12] is commonly used [24] to assess ranked results. It is defined as:

   AvP = \sum_{i=1}^{N} P(i)\, \Delta R(i)    (14)

   where P(i) is the precision of the top-i results and \Delta R(i) is the change in recall from the top-(i-1) to the top-i results.
   Obviously, this score tends to be higher for algorithms that place correct terms at the top positions of the result.
   In our experiments we considered only the AvP score, while precision and recall are omitted. For the voting algorithm there is no simple way to compute recall, because it is not obvious how many of the top results should be considered as correct terms. Also, in the general case, the overall number of terms in a dataset is unknown.
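For reference, a small Python sketch of the AvP computation of equation (14) over a ranked candidate list is shown below; the ranked list and the gold term set are toy examples.

# Sketch of Average Precision, equation (14): sum of P(i) * delta R(i).
def average_precision(ranked_candidates, true_terms):
    hits, avp = 0, 0.0
    for i, candidate in enumerate(ranked_candidates, start=1):
        if candidate in true_terms:
            hits += 1
            # recall changes by 1/|true_terms| exactly at relevant positions,
            # so only those positions contribute P(i) * (1/|true_terms|)
            avp += (hits / i) / len(true_terms)
    return avp

ranked = ["gene expression", "so the", "expression level", "the result"]
gold = {"gene expression", "expression level"}
print(average_precision(ranked, gold))  # (1/1 + 2/3) / 2 = 0.8333...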
3.2 Features

For our experiments we implemented the following features: C-Value, Domain Consensus, Domain Relevance, Frequency, Lexical Cohesion, Loglikelihood, Relevance, TF*IDF, Weirdness and Words Count. Words Count is a simple feature that shows the number of words in a word sequence. This feature may be useful for the classifier since the values of other features may have different meanings for single- and multi-word terms [2].
   Most of these features are capable of recognizing both single- and multi-word terms, except the T-test and Loglikelihood, which are designed to recognize only two-word terms (bigrams). We generalize them to the case of n-grams according to the study [4].
   Some of the features use information from a collection of general-domain texts (a reference corpus); in our case these features are Domain Relevance, Relevance and Weirdness. For this purpose we use statistics from the Corpus of Contemporary American English (statistics available at www.ngrams.info).
   For extracting term candidates we implemented a simple approach based on predefined part-of-speech patterns (a small matching sketch is given after the list). For simplicity, we extracted only unigrams, bigrams and trigrams by using patterns such as:

 1. Noun

 2. Noun Noun

 3. Adjective Noun

 4. Noun Noun Noun

 5. Adjective Noun Noun

 6. Noun Adjective Noun
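The pattern matching itself can be sketched as below, over already PoS-tagged text. The Penn Treebank tag prefixes (NN, JJ) and the toy sentence are assumptions of the example; any off-the-shelf tagger could produce the input.

# Sketch of candidate extraction with the part-of-speech patterns listed above.
PATTERNS = [
    ("Noun",),
    ("Noun", "Noun"),
    ("Adjective", "Noun"),
    ("Noun", "Noun", "Noun"),
    ("Adjective", "Noun", "Noun"),
    ("Noun", "Adjective", "Noun"),
]

def coarse(tag):
    """Map Penn Treebank tags onto the coarse classes used by the patterns."""
    if tag.startswith("NN"):
        return "Noun"
    if tag.startswith("JJ"):
        return "Adjective"
    return "Other"

def extract_candidates(tagged_sentence):
    words = [w for w, _ in tagged_sentence]
    tags = [coarse(t) for _, t in tagged_sentence]
    candidates = set()
    for pattern in PATTERNS:
        n = len(pattern)
        for i in range(len(tags) - n + 1):
            if tuple(tags[i:i + n]) == pattern:
                candidates.add(" ".join(words[i:i + n]).lower())
    return candidates

sentence = [("Nuclear", "JJ"), ("factor", "NN"), ("expression", "NN"),
            ("is", "VBZ"), ("regulated", "VBN")]
print(sorted(extract_candidates(sentence)))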
3.3 Datasets

Evaluation of the approaches was performed on two datasets of the medical and biological domains consisting of short English texts with marked-up domain-specific terms:

       Corpus        Documents      Words      Terms
       GENIA         2000           400000     35000
       Bio1          100            20000      1200

   The second one (Bio1) has texts in common with the first (GENIA), so we filtered out the texts that occur in both corpora. We left GENIA without any modifications, while 20 texts were removed from Bio1 as texts common to both corpora.

3.4 Experimental results

3.4.1 Machine learning method versus Voting algorithm

We considered two test scenarios in order to compare the quality of the implemented algorithms. For each scenario we performed two kinds of tests: with and without filtering of rare term candidates. In the following tests the whole feature set was considered and the overall ranked result was assessed.

Cross-validation

We performed 4-fold cross-validation of the algorithms on both corpora. We extracted term candidates from the whole dataset and divided them into train and test sets. In other words, we considered the case when, having some marked-up examples (the train set), we should recognize terms in the rest of the data (the test set) extracted from the same corpus. In the case of the voting algorithm the training set was simply ignored.
   The results of cross-validation are shown in Tables 1 and 2. Table 2 presents the results of cross-validation on term candidates that appear at least two times in the corpus.

       Dataset      Algorithm               AvP
       GENIA        Random Forest           0.54
       GENIA        Logistic Regression     0.55
       GENIA        Voting                  0.53
       Bio1         Random Forest           0.35
       Bio1         Logistic Regression     0.40
       Bio1         Voting                  0.23

Table 1: Results of cross-validation without the frequency filter

       Dataset      Algorithm               AvP
       GENIA        Random Forest           0.66
       GENIA        Logistic Regression     0.70
       GENIA        Voting                  0.65
       Bio1         Random Forest           0.52
       Bio1         Logistic Regression     0.58
       Bio1         Voting                  0.31

Table 2: Results of cross-validation with the frequency filter

   As we can see, in both cases the machine learning approach outperformed the voting algorithm. Moreover, in the case without rare terms the difference in scores is higher. This can be explained as follows: the feature values of rare terms (especially Frequency and Domain Consensus) are useless for the classification and add noise to the model. When such terms are omitted, the model becomes cleaner.
   Also, in most cases the Logistic Regression algorithm outperformed Random Forest, so in most of the further tests we used only the better one.

Separate train and test datasets

Having two datasets of the same field, the idea is to check how well a model trained on one of them can predict the data from the other. For this purpose we used GENIA as the training set and Bio1 as the test set, and then vice versa.
   The results are shown in Tables 3 and 4. In the case when Bio1 was used as the training set, the voting algorithm outperformed the trained classifier. This could happen due to the fact that the training data from Bio1 does not fully reflect the properties of terms in GENIA.

       Trainset     Testset     Algorithm               AvP
       GENIA        Bio1        Random Forest           0.30
       GENIA        Bio1        Logistic Regression     0.35
       –            Bio1        Voting                  0.25
       Bio1         GENIA       Random Forest           0.44
       Bio1         GENIA       Logistic Regression     0.42
       –            GENIA       Voting                  0.55

Table 3: Results of evaluation on separate train and test sets without the frequency filter

       Trainset     Testset     Algorithm               AvP
       GENIA        Bio1        Random Forest           0.34
       GENIA        Bio1        Logistic Regression     0.48
       –            Bio1        Voting                  0.31
       Bio1         GENIA       Random Forest           0.60
       Bio1         GENIA       Logistic Regression     0.62
       –            GENIA       Voting                  0.65

Table 4: Results of evaluation on separate train and test sets with the frequency filter
3.4.2 Dependency of average precision on the number of top results

In the previous tests we considered the overall results produced by the algorithms. Descending from the top to the bottom of the ranked list, the AvP score can change significantly, so one algorithm can outperform another on the top-100 results but lose on the top-1000. In order to explore this dependency, we measured AvP for different slices of the top results.
   Figure 1 shows the dependency of AvP on the number of top results given by 4-fold cross-validation.

Figure 1: Dependency of AvP on the number of top results given by cross-validation

   We also considered a scenario when GENIA was used for training and Bio1 for testing. The results are presented in Figure 2.

Figure 2: Dependency of AvP on the number of top results on separate train and test sets

3.4.3 Dependency of classifier performance on the training set size

In order to explore the dependency between the amount of data used for training and average precision, we considered three test scenarios.
   At first, we trained the classifiers on the GENIA dataset and tested them on Bio1. At each step the amount of training data was decreased, while the test data remained without any modifications. The results of this test are presented in Figure 3.

Figure 3: Dependency of AvP on the train set size on separate train and test sets

   Next, we started with 10-fold cross-validation on GENIA and at each step decreased the number of folds used for training of Logistic Regression, while the number of folds used for testing was not changed. The results are shown in Figures 4–8.
   The last test is the same as the previous one, except that the number of test folds was increased at each step. So we started with nine folds used for training and one fold used for testing. At the next step we moved one fold from the training set to the test set and evaluated again. The results are presented in Figures 9–13. An interesting observation is that higher values of AvP correspond to bigger sizes of the test set. This could happen because with the growth of the test set the number of high-confidence terms also grows: such terms take most of the top positions of the list and improve AvP. In the case of GENIA and Bio1 the top of the list mainly consists of highly domain-specific terms that take high values for features like Domain Relevance, Relevance and Weirdness: such terms occur in the corpora frequently enough.
   As we can see, in all of the cases the gain of AvP stops quickly. So, in the case of GENIA, it is enough to train on 10% of the candidates to rank the remaining 90% with the same performance. This could happen because of the relatively small number of features used and their specificity: most of them are designed to have high magnitude for terms and low magnitude for non-terms. So the data can be easily separated by the classifier given only a few training examples.

3.5 Feature selection

Feature selection (FS) is the process of finding the most relevant features for the task. Having a lot of different features, the goal is to exclude redundant and irrelevant ones from the feature set. Redundant features provide no useful information compared with the current feature set, while irrelevant features do not provide useful information in any context.
   There are different algorithms of FS. Some of them rank separate features by their relevance to the task, while others search for subsets of features that produce the best model for the predictor [9]. The algorithms also differ in their complexity. Because of the big number of features used in some tasks, it is not always possible to do an exhaustive search, so features are selected by greedy algorithms [13].
   In our task we concentrated on searching for the subsets of features that give the best results. For this purpose we ran quality tests for all possible feature subsets, or, in other words, performed an exhaustive search. Having 10 features, we check 2^10 − 1 different combinations of them. In the case of the machine learning method, we used nine folds for testing and one fold for training. The reason for such a configuration is that the classifier needs little data for training to rank terms with the same performance (see the previous section).
For the voting algorithm, we simply ranked the candidates and then assessed the overall list. All of the tests were performed on the GENIA corpus, and only Logistic Regression was used as the machine learning algorithm.
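The exhaustive search over feature subsets can be organized as in the following sketch: every non-empty subset of the ten features is generated and evaluated, and the best-scoring one is kept. The evaluation function here is a placeholder; in the actual experiments it would train and assess the ranker with the given subset.

# Sketch of the exhaustive feature-subset search (2**10 - 1 = 1023 subsets).
from itertools import combinations

FEATURES = ["C-Value", "Domain Consensus", "Domain Relevance", "Frequency",
            "Lexical Cohesion", "Loglikelihood", "Relevance", "TF*IDF",
            "Weirdness", "Words Count"]

def evaluate(subset):
    """Placeholder: should train/rank with this subset and return its AvP."""
    return len(subset) % 4 + len(subset[0])   # dummy score, for the sketch only

best_subset, best_score, examined = None, float("-inf"), 0
for k in range(1, len(FEATURES) + 1):
    for subset in combinations(FEATURES, k):
        examined += 1
        score = evaluate(subset)
        if score > best_score:
            best_subset, best_score = subset, score

print(examined)                 # 1023 subsets examined
print(best_subset, best_score)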
   The AvP score was computed for different slices of the top terms: 100, 1000, 5000, 10000, and 20000. The same slices are used in [24]. The best results for the algorithms are presented in Tables 5 and 6. These tables show that the voting algorithm has better scores than the machine learning method, but these results are not fully comparable: FS for the voting algorithm was performed on the whole dataset, while Logistic Regression was trained on 10% of the term candidates. The average performance gain for the voting algorithm is about 7%, while for machine learning it is only about 3%.

       Top count    All features    The best features
       100          0.9256          0.9915
       1000         0.8138          0.8761
       5000         0.7128          0.7885
       10000        0.667           0.7380
       20000        0.6174          0.6804

Table 5: Results of FS for the voting algorithm

       Top count    All features    The best features
       100          0.8997          0.9856
       1000         0.8414          0.8757
       5000         0.7694          0.7875
       10000        0.7309          0.7329
       20000        0.6623          0.6714

Table 6: Results of FS for Logistic Regression

   The best features for the voting algorithm:

 1. Top-100: Relevance, TF*IDF

 2. Top-1000: Relevance, Weirdness, TF*IDF

 3. Top-5000: Weirdness

 4. Top-10000: Weirdness

 5. Top-20000: C-Value, Frequency, Domain Relevance, Weirdness

   The best features for the machine learning approach:

 1. Top-100: Words Count, Domain Consensus, Normalized Frequency, Domain Relevance, TF*IDF

 2. Top-1000: Words Count, Domain Relevance, Weirdness, TF*IDF

 3. Top-5000: Words Count, Frequency, Lexical Cohesion, Relevance, Weirdness

 4. Top-10000: Words Count, C-Value, Domain Consensus, Frequency, Weirdness, TF*IDF

 5. Top-20000: Words Count, C-Value, Domain Relevance, Weirdness, TF*IDF

   As we can see, most of the subsets contain features based on a general-domain corpus. The reason may be that the target corpus has high specificity, so most of its terms do not occur in a general corpus.
   The next observation is that, in the case of the machine learning algorithm, the Words Count feature occurs in all of the subsets. This confirms the assumption that this feature is useful for algorithms that recognize both single- and multi-word terms.

3.6 Discussion

Despite the fact that filtering out the candidates occurring only once in the corpus improves the average precision of the methods, it is not always a good idea to exclude such candidates. The reason is that many specific terms can occur only once in a dataset: for example, in GENIA 50% of the considered terms occur only once. Of course, omitting such terms strongly affects the recall of the result. Thus such cases should also be considered in the ATR task.
   One of the interesting observations is that the amount of training data needed to rank terms without a significant performance drop is extremely low. This leads to the idea of applying a bootstrapping approach to ATR (a small sketch is given after the list):

 1. Having a few marked-up examples, train the classifier.

 2. Use the classifier to extract new terms.

 3. Use the most confident terms as the initial data at step 1.

 4. Iterate until all confident terms have been extracted.

   This is a semi-supervised method, because only a little marked-up data is needed to run the algorithm. The method can also be transformed into a fully unsupervised one if the initial data is extracted by some unsupervised approach (for example, by the voting algorithm). A similar idea is implemented in study [22].
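A possible reading of this loop in code is sketched below with scikit-learn; the random data, the confidence threshold of 0.95 and the fixed number of iterations are illustrative assumptions.

# Sketch of the bootstrapping loop: train on a few labelled examples, add the
# most confident predictions to the training set, and repeat.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_pool = rng.normal(size=(200, 3))                      # unlabelled candidate features
X_train = np.array([[2.0, 2.0, 2.0], [-2.0, -2.0, -2.0]])
y_train = np.array([1, 0])                              # a few marked-up examples

for iteration in range(5):
    clf = LogisticRegression().fit(X_train, y_train)
    proba = clf.predict_proba(X_pool)[:, 1]
    confident = (proba > 0.95) | (proba < 0.05)         # most confident predictions
    if not confident.any():
        break                                           # nothing confident is left
    X_train = np.vstack([X_train, X_pool[confident]])
    y_train = np.concatenate([y_train, (proba[confident] > 0.5).astype(int)])
    X_pool = X_pool[~confident]                         # shrink the unlabelled pool
    print(f"iteration {iteration}: training set size {len(y_train)}")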

4 Conclusion and Future work

In this paper we have compared the performance of two approaches to ATR: a machine learning method and a voting algorithm. For this purpose we implemented a set of features that includes linguistic, statistical, termhood and unithood feature types. All of the algorithms produced a ranked list of terms that was then assessed by the average precision score.
   In most tests the machine learning method outperforms the voting algorithm. Moreover, it was shown that for the supervised method it is enough to have few marked-up examples, about 10% in the case of the GENIA dataset, to rank terms with good performance. This leads to the idea of applying bootstrapping to ATR. Furthermore, the initial data for bootstrapping can be obtained by the voting algorithm, because its top results are precise enough (see Figure 1).
   The best feature subsets for the task were also explored. Most of these features are based on a comparison between the domain-specific document collection and a reference general corpus. In the case of the supervised approach, the Words Count feature occurs in all of the subsets, so this feature is useful for the classifier, because the values of other features may have different meanings for single- and multi-word terms.
   In the cases when one dataset is used for training and another for testing, we could not get a stable performance gain using machine learning. Even if the datasets are of the same field, the distribution of terms can be different. So it is still unclear whether it is possible to recognize terms from unseen data of the same field with a once-trained classifier.
   For our experiments we implemented a simple method of term candidate extraction: we filter out n-grams that do not match predefined part-of-speech patterns. This step of ATR can be performed in other ways, for example by shallow parsing, or chunking (a free chunker can be found in the OpenNLP project: http://opennlp.apache.org), by generating patterns from the dataset [17], or by recognizing term variants.
   Another direction of further research is related to the evaluation of the algorithms on more datasets of different languages and to researching the ability of cross-domain term recognition, i.e. using a dataset of one domain to recognize terms from others.
   Also of particular interest is the implementation and evaluation of semi- and unsupervised methods that involve machine learning techniques.

References

 [1] K. Ahmad, L. Gillam, L. Tostevin, et al. University of Surrey participation in TREC8: Weirdness indexing for logical document extrapolation and retrieval (WILDER). In The Eighth Text REtrieval Conference (TREC-8), 1999.

 [2] Lars Ahrenberg. Term extraction: A review. Draft version 091221, 2009.

 [3] K.W. Church and P. Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29, 1990.

 [4] B. Daille. Study and implementation of combined techniques for automatic extraction of terminology. The Balancing Act: Combining Symbolic and Statistical Approaches to Language, 1:49–66, 1996.

 [5] B. Daille, B. Habert, C. Jacquemin, and J. Royauté. Empirical observation of term variations and principles for their description. Terminology, 3(2):197–257, 1996.

 [6] B. Dobrov and N. Loukachevitch. Multiple evidence for term extraction in broad domains. In Proceedings of the 8th Recent Advances in Natural Language Processing Conference (RANLP 2011), Hissar, Bulgaria, pages 710–715, 2011.

 [7] J. Foo. Term extraction using machine learning. 2009.

 [8] K.T. Frantzi and S. Ananiadou. Extracting nested collocations. In Proceedings of the 16th Conference on Computational Linguistics, Volume 1, pages 41–46. Association for Computational Linguistics, 1996.

 [9] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. The Journal of Machine Learning Research, 3:1157–1182, 2003.

[10] K. Kageura and B. Umino. Methods of automatic term recognition: A review. Terminology, 3(2):259–289, 1996.

[11] C.D. Manning and P. Raghavan. Introduction to Information Retrieval, volume 1. Cambridge University Press, 2008.

[12] C.D. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.

[13] L.C. Molina, L. Belanche, and À. Nebot. Feature selection algorithms: A survey and experimental evaluation. In Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM 2002), pages 306–313. IEEE, 2002.

[14] R. Navigli and P. Velardi. Semantic interpretation of terminological strings. In Proceedings of the 6th International Conference on Terminology and Knowledge Engineering, pages 95–100, 2002.

[15] M.A. Nokel, E.I. Bolshakova, and N.V. Loukachevitch. Combining multiple features for single-word term extraction. 2012.

[16] Y. Park, R.J. Byrd, and B.K. Boguraev. Automatic glossary extraction: beyond terminology identification. In Proceedings of the 19th International Conference on Computational Linguistics, Volume 1, pages 1–7. Association for Computational Linguistics, 2002.

[17] A. Patry and P. Langlais. Corpus-based terminology extraction. In Terminology and Content Development: Proceedings of the 7th International Conference on Terminology and Knowledge Engineering, Litera, Copenhagen, 2005.

[18] M. Pazienza, M. Pennacchiotti, and F. Zanzotto. Terminology extraction: an analysis of linguistic and statistical approaches. Knowledge Mining, pages 255–279, 2005.

[19] A. Peñas, F. Verdejo, J. Gonzalo, et al. Corpus-based terminology extraction applied to information access. In Proceedings of Corpus Linguistics, volume 2001. Citeseer, 2001.

[20] F. Sclano and P. Velardi. TermExtractor: a web application to learn the shared terminology of emergent web communities. Enterprise Interoperability II, pages 287–290, 2007.

[21] P. Velardi, M. Missikoff, and R. Basili. Identification of relevant terms to support the construction of domain ontologies. In Proceedings of the Workshop on Human Language Technology and Knowledge Management, Volume 2001, page 5. Association for Computational Linguistics, 2001.
[22] Y. Yang, H. Yu, Y. Meng, Y. Lu, and Y. Xia. Fault-tolerant learning for term extraction. 2011.

[23] W. Zhang, T. Yoshida, and X. Tang. Using ontology to improve precision of terminology extraction from documents. Expert Systems with Applications, 36(5):9333–9339, 2009.

[24] Ziqi Zhang, Christopher Brewster, and Fabio Ciravegna. A comparative evaluation of term recognition algorithms. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC08), Marrakech, Morocco, 2008.
Figure 4: Dependency of AvP on the number of excluded folds with fixed test-set size: 10-fold cross-validation with 1 test fold and 9 to 1 train folds: Top-100 terms

Figure 5: Dependency of AvP on the number of excluded folds with fixed test-set size: Top-1000 terms

Figure 6: Dependency of AvP on the number of excluded folds with fixed test-set size: Top-5000 terms

Figure 7: Dependency of AvP on the number of excluded folds with fixed test-set size: Top-10000 terms

Figure 8: Dependency of AvP on the number of excluded folds with fixed test-set size: Top-20000 terms

Figure 9: Dependency of AvP on the number of excluded folds with changing test-set size: 10-fold cross-validation with 1 to 9 test folds and 9 to 1 train folds: Top-100 terms

Figure 10: Dependency of AvP on the number of excluded folds with changing test-set size: Top-1000 terms

Figure 11: Dependency of AvP on the number of excluded folds with changing test-set size: Top-5000 terms

Figure 12: Dependency of AvP on the number of excluded folds with changing test-set size: Top-10000 terms

Figure 13: Dependency of AvP on the number of excluded folds with changing test-set size: Top-20000 terms