=Paper= {{Paper |id=Vol-1801/paper8 |storemode=property |title=Combining Dictionary- and Corpus-Based Concept Extraction |pdfUrl=https://ceur-ws.org/Vol-1801/paper8.pdf |volume=Vol-1801 |authors=Joan Codina-Filbà,Leo Wanner |dblpUrl=https://dblp.org/rec/conf/ecai/Codina-FilbaW16 }} ==Combining Dictionary- and Corpus-Based Concept Extraction== https://ceur-ws.org/Vol-1801/paper8.pdf
Combining Dictionary- and Corpus-Based Concept Extraction

Joan Codina-Filbà^1 and Leo Wanner^2


Abstract. Concept extraction is an increasingly popular topic in deep text analysis. Concepts are individual content elements; their extraction thus offers an overview of the content of the material from which they were extracted. In the case of domain-specific material, concept extraction boils down to term identification. The most straightforward strategy for term identification is a look-up in existing terminological resources. In recent research, this strategy has a poor reputation because it is prone to scaling limitations due to neologisms, lexical variation, synonymy, etc., which subject the terminology to constant change. For this reason, many works have developed statistical techniques to extract concepts. But the existence of a crowdsourced resource such as Wikipedia is changing the landscape. We present a hybrid approach that combines state-of-the-art statistical techniques with the use of the large-scale term acquisition tool BabelFy to perform concept extraction. The combination of both allows us to boost the performance, compared to approaches that use these techniques separately.

1   Introduction

Concept extraction is an increasingly popular topic in deep text analysis. Concepts are individual content elements, such that their extraction from textual material offers an overview of the content of this material. In applications in which the material is domain-specific, concept extraction commonly boils down to the identification and extraction of terms, i.e., domain-specific (mono- or multi-word) lexical items. Usually, these are nominal lexical items that denote concrete or abstract entities. The most straightforward strategy for term identification is a look-up in existing terminological dictionaries. In recent research, this strategy has a poor reputation because it is prone to scaling limitations due to neologisms, lexical variation, synonymy, etc., which subject the terminology to constant change [15]. As an alternative, a number of works cast syntactic and/or semantic criteria into rules to determine whether a given lexical item qualifies as a term [3, 4, 7], while others apply the statistical criterion of relative frequency of an item in a domain-specific corpus; see, for example, [1, 10, 22, 24, 25]. Most often, state-of-the-art statistical term identification is preceded by a rule-based stage in which the preselection of term candidates is done drawing upon linguistic criteria.
   However, most of the state-of-the-art proposals neglect that a new generation of terminological (and thus conceptual) resources has emerged and, with them, instruments to keep these resources updated. Consider, for instance, BabelNet (http://www.babelnet.org) [21] and BabelFy (http://www.babelfy.org) [20]. BabelNet captures the terms from Wikipedia,^3 WikiData,^4 OmegaWiki,^5 Wiktionary^6 and WordNet [19] and disambiguates and structures them in terms of an ontology. Wikipedia is nowadays a crowdsourced multilingual encyclopedia that is constantly being updated by more than 100,000 active editors for the English version alone. There are studies, cf., e.g., [11], which show that by observing edits in Wikipedia, one can learn what is happening around the globe. BabelFy is a tool that scans a text in search of terms and named entities (NEs) that are present in BabelNet. Once the terms and NEs are detected, it uses the text as context in order to disambiguate them.
   In the light of this significant change of the terminological dictionary landscape, it is time to assess whether dictionary-driven concept extraction can be factored into linguistic and corpus-driven concept extraction to improve the performance of the overall task. The three techniques complement each other: linguistic criteria filter term candidates, statistical measures help detect domain-specific terms among these candidates, and dictionaries provide terms that we can assume to be semantically meaningful.
   In what follows, we present our work in which we incorporate BabelFy, and by extension BabelNet and Wikipedia, into the process of domain-specific linguistic and statistical term recognition. This work has been carried out in the context of the MULTISENSOR Project, which targets, among other objectives, concept extraction as a basis for content-oriented visual and textual summaries of multilingual online textual material.
   The remainder of the paper is structured as follows. In Section 2, we introduce the basics of statistical and dictionary-based concept extraction. In Section 3, we then outline our approach. The setup of the experiments we carried out to evaluate our approach and the results we achieved are discussed in Sections 4 and 5. In Section 6, we discuss the achieved results, while Section 7, finally, draws some conclusions and points out some future work.

2   The Basics of statistical and dictionary-based concept extraction

Only a few proposals for concept extraction rely solely on linguistic analysis to perform term extraction, always assuming that a term is a nominal phrase (NP). Bourigault [5], as one of the first to address the task of concept extraction, uses part-of-speech (PoS) tags for this purpose. Manning and Schütze [16], and Kaur [14] draw upon regular expressions over PoS sequences.

1 NLP Group, Pompeu Fabra University, Barcelona, email: joan.codina@upf.edu
2 Catalan Institute for Research and Advanced Studies (ICREA) and NLP Group, Pompeu Fabra University, Barcelona, email: leo.wanner@upf.edu
3 http://www.wikipedia.org
4 wikidata.org
5 omegawiki.org
6 wiktionary.org
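To illustrate the PoS-pattern strategy just mentioned, the following sketch detects term candidates with a regular expression over PoS tag sequences, in the spirit of the approaches cited above. It is an illustrative toy, not the implementation of any of the cited systems: the Penn-style tagset and the "(Adj|Noun)* Noun" pattern are our own simplifying assumptions.

```python
import re

# A common noun-phrase term pattern: zero or more adjectives/nouns
# followed by a head noun. JJ = adjective, NN/NNS/NNP/NNPS = nouns.
TERM_PATTERN = re.compile(r"((?:JJ|NN[SP]{0,2})_\S+ )*NN[SP]{0,2}_\S+")

def pos_term_candidates(tagged_tokens):
    """tagged_tokens: list of (word, Penn-style PoS tag) pairs.
    Returns the maximal word sequences matching the Adj/Noun pattern."""
    # Encode the sentence as "TAG_word TAG_word ..." so a single regex
    # can scan over the tags while keeping track of the words.
    encoded = " ".join(f"{tag}_{word}" for word, tag in tagged_tokens)
    candidates = []
    for match in TERM_PATTERN.finditer(encoded):
        words = [chunk.split("_", 1)[1] for chunk in match.group(0).split()]
        candidates.append(" ".join(words))
    return candidates

sentence = [("floating", "JJ"), ("point", "NN"), ("routine", "NN"),
            ("runs", "VBZ"), ("fast", "RB")]
print(pos_term_candidates(sentence))  # → ['floating point routine']
```

As the paper notes for the statistical pipeline, such a filter is tuned for recall: it accepts any adjective–noun sequence and leaves the decision about actual termhood to later scoring stages.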
   More common is the extension of statistical term extraction by a preceding linguistic feature-driven term detection stage, such that we can speak of two core strategies for concept extraction: statistical (or corpus-based) concept extraction and dictionary-based concept extraction. As already pointed out, concept extraction means here "term extraction". Although resources such as BabelNet are considerably richer than traditional terminological dictionaries, they can be considered the modern variant of the latter. Let us revise the basics of both of these two core strategies.

2.1   Statistical term extraction

Corpus-based terminology extraction started to attract attention in the 90s, with the increasing availability of large computerized textual corpora; see [13, 6] for a review of some early proposals. In general, corpus-based concept extraction relies on corpus statistics to score and select the terms among the term candidates. Over the years, a number of different statistics have been suggested to identify relevant terms and best word groupings; cf., e.g., [2].
   As a rule, the extraction is done in a three-step procedure:

1. Term candidate detection. The objective of this first step is to find words and multiword sequences that could be terms. This first step has to offer a high recall, as the terms missed here will not be considered in the remainder of the procedure.
2. Compute features for term candidates. For each term candidate, a set of features is computed. Most of the features are statistical and measure how often the term is found as such in the corpus and in the document, as part of other terms, and also with respect to the words that compose it. These basic features are then combined to compute a global score.
3. Select final terms from candidates. Term candidates that obtain higher scores are selected as terms. The cut-off strategy can be based on a threshold applied to the score (obtained from a training set, in order to optimize precision/recall) or on a fixed number of terms (in that case, the top N terms are selected).

   In what follows, we discuss each of these steps in turn.

2.1.1   Term candidate detection

The most basic statistical term candidate detection strategies are based on n-gram extraction. Any n-gram in a text collection could be a term candidate. For instance, Foo and Merkel [9] use unigrams and bigrams as term candidates.
   n-gram based concept extraction is straightforward to implement. However, it produces too many false positives, which add noise to the subsequent stages. As already mentioned above, for this reason most of the works use linguistic features such as part-of-speech patterns or NP markers [16, 10] for initial filtering. See [23] for an overview.

2.1.2   Feature Extraction

Once the term candidates have been selected, they need to be scored in order to be ranked with respect to the probability that they are actual terms.
   Most of the proposed metrics are based on the term frequency TF, defined as the number of occurrences of a term in a text collection. In Information Retrieval, TF is contrasted with IDF (Inverse Document Frequency), which penalizes the most common terms. For the task of term extraction, the IDF of a term candidate can be computed drawing upon a reference corpus, while the frequency of the candidate term in the target domain corpus can be assumed to be TF, such that we get: $TF_{target}(t) \cdot IDF_{ref}(t)$ [16].
   Other measures have been developed specifically for term detection. The most common of them are:

• C-Value [10]. The objective of the C-Value score is to assign a termhood value to each candidate token sequence, considering also its occurrence inside other terms. The C-Value expands each term candidate with all its possible nested multiword subterms, which also become term candidates. For instance, the term candidate floating point routine includes two nested terms: floating point, which is a term, and point routine, which is not a meaningful expression.
  The following formula formalizes the calculation of the C-Value measure:

      $C\text{-}value(t) = \begin{cases} \log_2|t| \cdot TF(t), & t \text{ is not nested} \\ \log_2|t| \cdot \left( TF(t) - \frac{1}{P(T_t)} \sum_{b \in T_t} TF(b) \right), & \text{otherwise} \end{cases}$   (1)

  where t is the candidate token sequence, $T_t$ the set of extracted candidate terms that contain t, and $P(T_t)$ the number of these candidate terms.
• Lexical Cohesion [22]. Lexical cohesion computes the cohesion of multiword terms, that is, at this stage, of any arbitrary n-gram. This measure is a generalization of the Dice coefficient; it is proportional to the length of the term and to its frequency:

      $LC(t) = \frac{|t| \cdot \log_{10}(TF(t)) \cdot TF(t)}{\sum_{w \in t} TF(w)}$   (2)

  where |t| is the length of the term (in words) and w ranges over the words that compose it.
• Domain Relevance [25]. This measure compares the frequencies of the term between the target and reference datasets:

      $DR(t) = \frac{TF_{target}(t)}{TF_{target}(t) + TF_{ref}(t)}$   (3)

• Relevance [24]. This measure has been developed in an application that focuses on Spanish. The syntactic patterns used to detect term candidates are thus specific to Spanish, but the term scoring is language-independent. The formula aims to give less weight to terms with a low frequency in the target corpus and a higher value to very frequent terms, unless they are also very frequent in the reference corpus or are not evenly distributed in the target corpus:

      $Relevance(t) = 1 - \frac{1}{\log_2\left( \frac{TF_{target}(t) + DF_{target}(t)}{TF_{ref}(t)} \right)}$   (4)

  where TF(t) is the relative term frequency, while DF(t) is the relative number of documents in which t appears. The document frequency penalizes terms that appear many times in a single document.
• Weirdness [1]. Weirdness takes into account the relative sizes of the corpora when comparing frequencies:

      $Weirdness(t) = \frac{TF_{target}(t) \cdot |Corpus_{ref}|}{TF_{ref}(t) \cdot |Corpus_{target}|}$   (5)
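As a rough illustration of how two of these scores behave in practice, the sketch below implements C-Value (Equation 1) and Weirdness (Equation 5) over plain frequency dictionaries. The function and variable names, and the toy data, are our own hypothetical choices, not those of the cited systems:

```python
from math import log2

def c_value(term, tf, nested_in):
    """C-Value of a candidate term (Equation 1).
    term: tuple of words; tf: dict mapping candidate tuple -> frequency;
    nested_in: dict mapping a candidate to the list of longer extracted
    candidates that contain it (missing/empty if it is not nested)."""
    containers = nested_in.get(term, [])
    if not containers:
        return log2(len(term)) * tf[term]
    # Nested case: subtract the average frequency of the candidates
    # that contain `term`, then weight by log2 of the term length.
    avg = sum(tf[b] for b in containers) / len(containers)
    return log2(len(term)) * (tf[term] - avg)

def weirdness(term, tf_target, tf_ref, size_target, size_ref):
    """Weirdness (Equation 5): ratio of the term's relative frequencies
    in the target and reference corpora. Defaulting a missing reference
    frequency to 1 is a simplifying smoothing choice of this sketch."""
    return (tf_target.get(term, 0) * size_ref) / (tf_ref.get(term, 1) * size_target)

tf = {("floating", "point", "routine"): 5,
      ("floating", "point"): 8,
      ("point", "routine"): 5}
nested_in = {("floating", "point"): [("floating", "point", "routine")],
             ("point", "routine"): [("floating", "point", "routine")]}
# "floating point" occurs 8 times, 5 of them inside the longer term:
print(c_value(("floating", "point"), tf, nested_in))  # log2(2) * (8 - 5) = 3.0
```

Note how the nested term floating point from the paper's own example is discounted by the frequency of its container, exactly the effect the C-Value measure is designed to capture.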
2.1.3   Term selection

Each of the metrics in the previous subsection produces a score for each term candidate. The final step is to use the scores produced by the chosen metric to filter out the terms under a given threshold.
   Taking the terms sorted by their scores, we expect precision to decrease as we move down the list, while recall increases. The F-score reaches a maximum around the point where precision and recall cross. The list should be truncated at this point, which defines the minimum threshold. But, of course, each dataset provides a different threshold that needs to be set after observing different training sets. Some authors (as, e.g., Frantzi et al. [10]) set an arbitrary threshold; others just measure precision and recall when truncating the list after some fixed number of terms [8].
   When more than one metric is available, the different metrics can be combined to produce a single score. There are two main strategies to do so: the first one is to feed a machine learning model with the different metrics and let it learn how to combine them [26]. The simplest procedure in this case is to calculate a weighted average tuned by linear regression; cf., e.g., [22]. The second strategy is to come up with a decision for each metric, trained with its own threshold, and then apply majority voting [27].

2.2   Use of terminological resources for terminology detection

The problem of the use of traditional terminological resources for concept (i.e., term) identification mentioned in Section 1 is reflected by the low recall usually achieved by dictionary-based concept extraction. For instance, studies on the medical domain with the Gene Ontology (GO) terms show a recall between 28% and 53% [17]. To overcome this limitation, different techniques have been developed to expand the quantity of matched terms. Thus, Jacquemin [12] uses a derivational morphological processor for the analysis and generation of term variants. Other authors, like Medelyan [18], use a thesaurus to annotate a training set for the discovery of terms within similar contexts.
   BabelNet is a new type of terminological resource. It reflects the state of continuously updated large-scale resources such as Wikipedia, WikiData, etc. At least in theory, BabelNet should thus not suffer from the coverage shortcoming of traditionally static terminological resources.^7
   BabelFy takes all the n-grams (with n ≤ 5) of a given text that contain at least one noun and checks whether they are substrings of any item in BabelNet. To perform the match, BabelFy uses lemmas.
   We can thus hypothesize that an approach that draws upon BabelNet is likely to benefit from its large coverage and continuous update.

3   Our Approach

In the MULTISENSOR project, term recognition is realized as a hybrid module, which combines corpus-driven term identification with dictionary-based term identification based on BabelFy. Combining corpus-driven and dictionary-based term identification, we aim to enrich BabelFy's domain-neutral strategy with domain information in order to be able to identify domain-specific terms.
   Based on the insights from [8, 27], who compare different metrics, we decided to implement the C-Value measure and the Weirdness metric. The C-Value measure serves us to measure the termhood of a candidate term, while the Weirdness metric reveals to what extent a term candidate is domain-specific.
   However, the Weirdness metric requires some adaptation. The original Weirdness metric can range from 0 to infinity, which is not desirable. To keep the possible values within a limited range, we changed the quotient between probabilities to a quotient between IDFs. As a result, Equation 5 is transformed to:

      $DomWeight(t) = \frac{IDF_{ref}(t)}{IDF_{target}(t)}$   (6)

   BabelFy offers an API that annotates the terms of a given text that are found in one of the resources it consults (WordNet, Wikipedia, WikiData, Wiktionary, etc.), distinguishing between named entities and concepts. Cf. Figure 1 for illustration. The figure shows the result of processing a sentence with BabelFy's web interface. As can be observed, BabelFy annotates nouns (including multiword nouns), adjectives and verbs (such as working or examine). In accordance with the goals of MULTISENSOR, we keep only nominal annotations and discard verbal and adjectival ones. Furthermore, BabelFy can be considered a general-purpose thesaurus, which is not tailored to any specific domain. For this reason, during domain-specific term extraction as in MULTISENSOR, not all terms that have been annotated by BabelFy should be considered part of the domain terminology.

Figure 1. Concepts and named entities detected in a sentence using the BabelFy web interface

   To ensure domain specificity, we index the documents for which the IDF(t) is computed in a Solr index,^8 with a field that indicates the domain to which each of them belongs. This allows for an incremental setup in which new documents can always be indexed and the statistics can be continuously updated.

7 Note, however, that even if Wikipedia is continuously updated, BabelNet is updated in batch mode from time to time, producing a delay between the crowdsourced changes and their availability in BabelNet.
8 http://lucene.apache.org/solr

   The documents indexed in Solr comprise the texts of these documents, together with all the term candidates in them. To index the term candidates, and in order to allow for queries that may match either a full term or parts of it (which can be, again, full terms), we use lemmas (instead of word forms) and underscores between the lemmas to indicate the beginning, middle, and end of the term. The first
lemma of the term is suffixed with an underscore, the middle lemmas are prefixed and suffixed with underscores, while the last lemma is prefixed with an underscore (for instance, the term candidate real time clocks would be indexed as real_ _time_ _clock).
   At the beginning, the index is filled with the documents that make up the reference and domain corpora. When a new document arrives, we check in both corpora the frequencies of the term candidates, as well as the frequencies of their parts both as terms and as parts of other terms. To extract these frequencies, several partial matches are required, which can be specified taking advantage of the underscores within the term notation. For instance, to obtain the frequency of the expression real time as a term, without it being part of a longer term, we must search for real_ _time. To obtain the frequency of the same sequence of lemmas as part of longer terms, the corresponding query would be real_ _time_ OR _real_ _time_ OR _real_ _time. In this last query, the first part matches terms starting with the sequence under consideration (as, e.g., real time clock); the second part matches terms that contain the sequence in the middle (as, e.g., near real time system); and the last part seeks terms ending with the sequence (as, e.g., near real time).
   Queries in Solr provide the number of documents matching the query. This implies that a document with multiple occurrences of a term will be counted only once. In some of the formulas of Section 2.1.2, document frequencies are considered, while in others it is the term frequency. In order to minimize this discrepancy, and to weight very long and very short documents evenly, we split long documents into groups of about 20 sentences.
   To generate term candidates for the statistical term extraction, all NPs in the text are detected. The module takes as input the already tokenized sentences of a document. Tokens are lemmatized and annotated with PoS tags and syntactic dependencies. To detect NPs, we go over all the nodes of the tree in pre-order, finding the head nouns and their dependent elements. A set of rules indicates which nouns and which dependants will form the NP. The system includes sets of rules for all the languages we work with: English, German, French and Spanish. Each term candidate is expanded with all the subterms (i.e., the n-grams that compose it). The term candidates and all the substrings they contain are then scored using the C-Value and DomWeight metrics. Those with a DomWeight below 0.8 and nested terms with a lower C-Value than the term they belong to are filtered out. The remaining candidates are sorted by decreasing C-Value and, when there is a tie, by DomWeight.
   After processing the text with BabelFy, we obtain another list of term candidates, namely those that are found in BabelNet. Both lists are merged by intersection and again sorted according to their C-Value and DomWeight scores.

4   Experimental setup

The term extraction methodology described above has been tested on three different use cases. All three use cases are composed of a selection of 1,000 news articles, blogs and other web pages related to different domains. The reference corpus is a set of about 22,000 documents from different sources.
   The first use case contains documents about household appliances, with information both about the appliances as such and about companies involved in the manufacturing and trading of household appliances. The second use case is about energy policies; it includes news and web pages on green and renewable energy. The third use case covers the yoghurt industry; it contains documents about yoghurt products, legal regulations concerning the production of and trade in yoghurt, and the dairy industry.

Table 1. Number of documents and concepts annotated for each use case. The number of indexed chunks indicates into how many different text portions the documents have been split (at sentence boundaries)

   Use Case   Name                   Num. of documents   Num. of indexed chunks   Annotated terms
   0          Reference Corpus       21,994              43,808                   —
   1          Household Appliances   1,000               2,171                    123
   2          Energy Policies       1,000               1,565                    80
   3          Yoghurt Industry       1,000               2,096                    118

   The collection of documents for the three use cases has been extracted from controlled sources, which ensures that the texts within the collection are clean. The documents have first been processed with the goal to detect term candidates, i.e., tokenized, parsed and passed through the NP detector. Once processed, they have been indexed in a Solr index. In addition, all documents have been split into chunks of about 20 sentences to balance the length of the processed texts. In order to evaluate the performance of our hybrid term extraction, for each use case a set of 20 sentences (from different documents) has been annotated as ground truth by a team of three annotators.
   Table 1 summarizes the information about the different use cases: the reference corpus, the number of original documents, the number of documents after indexing (with some of the documents split as mentioned above), and the number of manually annotated terms for each domain.

5   Evaluation

In order to evaluate the proposed approach to concept extraction, and to observe the impact of the merge of corpus-driven and dictionary-based extraction, we first measured the performance of both of them separately and then of the merge. Table 2 shows the precision and recall of the three runs.

Table 2. Results obtained by the different approaches and the hybrid system in the three use cases ('p' = precision; 'r' = recall)

   Use Case   Corpus-driven      Dictionary-based   Hybrid
              p        r         p        r         p        r
   1          38.1%    93.5%     50.3%    76.4%     65.2%    71.5%
   2          28.0%    97.3%     36.2%    74.7%     48.3%    70.9%
   3          34.8%    79.5%     46.2%    68.4%     60.9%    57.3%
   avg        33.6%    90.1%     44.2%    73.2%     58.1%    66.6%

   It can be observed that the hybrid approach increases the precision by between 14 and 25 percentage points and decreases the recall by between 7 and 24 points. To assess whether the increase in precision compensates for the loss of coverage, we computed the F-score in Table 3.
   The table shows that the F-score of the hybrid approach is 7 points above the score of the BabelFy (i.e., dictionary-based) approach and 13 points above the corpus-driven approach.
   The results shown in Tables 2 and 3 have been calculated with all terms provided by corpus-driven and dictionary-based term extraction; only terms with a DomWeight under 0.8 and nested terms
                                                                            4
    Table 3. F-scores obtained by the different approaches and the hybrid
                         system in the 3 use cases

         Use Case     Corpus-driven    Dictionary-driven    Hybrid
         1               54.1%               60.7%          68.2%
         2               43.5%               48.8%          57.4%
         3               48.4%               55.1%          59.1%
         avg             49.0%               55.1%          62.1%
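The F-scores in Table 3 are simply the harmonic means of the precision/recall pairs reported in Table 2, so the computation can be checked directly (the values below are hard-coded from the tables; for use case 1 the results match Table 3):

```python
# F = 2*p*r / (p + r), computed from the precision/recall pairs in Table 2.

def f_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall, both given in percent."""
    return 2 * p * r / (p + r)

# (precision, recall) per approach for use case 1, taken from Table 2
use_case_1 = {
    "corpus-driven": (38.1, 93.5),
    "dictionary-driven": (50.3, 76.4),
    "hybrid": (65.2, 71.54),
}

for name, (p, r) in use_case_1.items():
    print(f"{name}: F = {f_score(p, r):.1f}%")
```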




                                                                                   Figure 3. Evolution of precision, recall and F-score as we move down to
                                                                                  the list of terms generated by the hybrid system, sorted by the score obtained
                                                                                                               by statistical metrics


                                                                                  Approaches that are based exclusively on linguistic features serve
Figure 2. Evolution of precision, recall and F-score as we move down the          well to find very rare terms, but they tend to be language- and
 list of terms generated by the corpus-driven term extraction and sorted by       domain-dependent, which reduces their scalability and coverage. The
                                  their score                                     same applies to approaches that use gazetteers.
                                                                                     Corpus-driven term identification provides term candidates that
with a C − V alue lower than the one of the term they belong to                   are domain-specific and common enough to be considered terms, but
have been filtered out without any further threshold adjustment. In               may be semantically meaningless.
other words, the ordering of the terms according to their C − V alue                 Both corpus-driven and dictionary-based approaches offer a high
and DomW eight scores has not been considered. If we use only the                 recall at the expense of low precision because each of them adds its
top N terms with the highest scores, the precision of corpus-based                own noise. When combining the two techniques, we increase the pre-
term identification increases. In our current implementation, we do               cision but lose some recall. However, the decrease of recall is over-
not implement a threshold to cut off the list because the users request           compensated by a sufficient increase of precision that leads to the
the top N terms (with N = 20) as a concept profile of a document.                 improvement of the F-score. This increase is more evident when we
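The filtering and top-N selection just described can be sketched as follows; the candidate representation and the substring test for nestedness are illustrative assumptions, not the actual implementation:

```python
# Sketch of the candidate filtering: drop candidates with DomWeight < 0.8,
# drop nested terms whose C-Value is below that of a containing term, and
# return the top-N remaining candidates as the concept profile.

def concept_profile(candidates, n=20, min_dom_weight=0.8):
    """candidates: dicts with 'term', 'c_value' and 'dom_weight' keys."""
    kept = [c for c in candidates if c["dom_weight"] >= min_dom_weight]
    filtered = []
    for c in kept:
        # containing terms: longer candidates in which this term is nested
        containers = [o for o in kept
                      if o["term"] != c["term"] and c["term"] in o["term"]]
        if any(o["c_value"] > c["c_value"] for o in containers):
            continue  # nested term scored below its containing term
        filtered.append(c)
    filtered.sort(key=lambda c: c["c_value"], reverse=True)
    return [c["term"] for c in filtered[:n]]

# invented toy candidates for illustration
candidates = [
    {"term": "renewable energy", "c_value": 5.0, "dom_weight": 0.9},
    {"term": "energy",           "c_value": 3.0, "dom_weight": 0.95},
    {"term": "green energy",     "c_value": 2.0, "dom_weight": 0.5},
]
profile = concept_profile(candidates)
```

Here "energy" is discarded as a nested term of the higher-scoring "renewable energy", and "green energy" falls below the DomWeight threshold.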
   Figure 2 shows how precision, recall and F-score evolve as we                  concentrate on terms with a higher score.
move down the list of terms sorted by the score obtained with corpus-                The use of an index like Solr to maintain the corpus data allows
driven term extraction (recall that BabelFy does not provide any con-             for the creation of an incremental system that can be updated with
fidence score).                                                                   upcoming news, making the response dynamic when new concepts
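The corpus statistics behind these scores must stay current as new documents arrive. A minimal in-memory stand-in for the Solr index (an illustration of the design, not the system's actual code) shows the idea: adding a document immediately updates the document frequencies on which the statistical term scores depend.

```python
# Minimal in-memory stand-in for an incremental index: document frequencies
# are updated as each new article is indexed, so newly emerging concepts are
# reflected without rebuilding the corpus.
from collections import defaultdict

class IncrementalIndex:
    def __init__(self):
        self.doc_freq = defaultdict(int)  # term -> number of documents
        self.num_docs = 0

    def add_document(self, tokens):
        """Index one document; corpus statistics are updated on the spot."""
        self.num_docs += 1
        for term in set(tokens):
            self.doc_freq[term] += 1

    def document_frequency(self, term):
        return self.doc_freq[term]

index = IncrementalIndex()
index.add_document(["renewable", "energy", "policy"])
index.add_document(["energy", "market"])
```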
   The score places the most relevant terms at the top of the list, in-           appear in a domain.
creasing the precision by more than 25 points over the average (as can
be observed in the precision/recall/F-score graph, the first 30 terms
                                                                                  7    Conclusions and Future Work
maintain a precision over 70%).
   Figure 3 shows the evolution of precision, recall and F-score for              We presented a hybrid approach to concept (i.e., term) identifica-
the hybrid term extraction, keeping the ranking provided by the                   tion and extraction. The approach combines a state-of-the-art corpus-
corpus-driven approach. In this case, hybrid term extraction main-                driven approach with a dictionary lookup based on BabelFy. The
tains a 100% precision for the first 17 terms and ends with 95% of                combination of both increases the overall performance as it takes
precision after the first 20 (a single term is wrong among them); 80%             the best of both. While statistics are very good in detecting domain-
precision are maintained for the first 35 terms.                                  specific terms, dictionaries provide terms which are semantically
   A baseline term identification that does not use scores would ob-              meaningful.
tain a precision of 33%, or 44% using BabelFy and selecting 20                       The use of BabelFy (and thus of BabelNet) allows us to avoid the
terms at random. When scores are used, the precision of the corpus-               typical limitation of dictionary-based term identification of coverage.
driven approach increases up to 47.7%. When both approaches are                   As already argued above, BabelNet, which has been generated au-
combined, the average precision for the three use cases increases to              tomatically from Wikipedia and other resources, is a crowdsourced
73.6%, resulting in an overall increase of 26% compared to the indi-              terminological resource that can be considered to contain a critical
vidual techniques.                                                                mass of terms needed for our task.
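One way to read the combination step, sketched below, is as an intersection: keep the corpus-driven ranking, but retain only those candidates that the dictionary lookup (BabelFy) also recognises as terms. This is a simplified sketch with invented example lists, not the exact implementation:

```python
# Hybrid selection as an intersection: corpus-driven ranking filtered by
# dictionary membership (term lists below are invented examples).

def hybrid_terms(corpus_ranked, dictionary_terms):
    """corpus_ranked: terms sorted by corpus-driven score, best first;
    dictionary_terms: set of terms confirmed by the dictionary lookup."""
    return [t for t in corpus_ranked if t in dictionary_terms]

corpus_ranked = ["washing machine", "machine market week",
                 "energy label", "per cent"]
babelfy_hits = {"washing machine", "energy label"}
profile = hybrid_terms(corpus_ranked, babelfy_hits)
```

Statistically frequent but semantically meaningless candidates ("machine market week", "per cent") are removed, which raises precision at the cost of any valid term the dictionary misses.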
                                                                                     Crowdsourced and continuously updated dictionaries ensure the
                                                                                  availability of up-to-date resources, but there is still a time off-
6     Discussion
                                                                                  set between the emergence of a new term and its inclusion in the
The performance figures displayed in the previous section show that               Wikipedia. In the future, it can be insightful to observe the first oc-
a combination of corpus-driven and dictionary-based term identifica-              currences of a term and assess its potential status of an emerging
tion achieves better results than in separation, especially when the              concept that cannot be expected to be already in the Wikipedia. This
corpus-driven approach is preceded by a linguistic filtering stage.               would allow us to give those terms an appropriate score and thus

                                                                              5
   A relevant topic that we have not looked at yet in our current work is the detection of synonymy between terms, which would further increase the accuracy of the retrieved concept profiles of the documents.

ACKNOWLEDGEMENTS

This work was partially supported by the European Commission under the contract number FP7-ICT-610411 (MULTISENSOR).

REFERENCES

 [1] Khurshid Ahmad, Lee Gillam, Lena Tostevin, et al., 'University of Surrey participation in TREC8: Weirdness indexing for logical document extrapolation and retrieval (WILDER)', in Proceedings of TREC, (1999).
 [2] Lars Ahrenberg. Term extraction: A review. Draft version 091221, http://www.ida.liu.se/~larah03/publications/tereview_v2.pdf, 2009.
 [3] Hassan Al-Haj and Shuly Wintner, 'Identifying multi-word expressions by leveraging morphological and syntactic idiosyncrasy', in Proceedings of the 23rd International Conference on Computational Linguistics, pp. 10–18. Association for Computational Linguistics, (2010).
 [4] Colin Bannard, 'A measure of syntactic flexibility for automatically identifying multiword expressions in corpora', in Proceedings of the Workshop on a Broader Perspective on Multiword Expressions, pp. 1–8. Association for Computational Linguistics, (2007).
 [5] Didier Bourigault, 'Surface grammatical analysis for the extraction of terminological noun phrases', in Proceedings of the 14th Conference on Computational Linguistics - Volume 3, pp. 977–981. Association for Computational Linguistics, (1992).
 [6] M. Teresa Cabré Castellví, Rosa Estopa Bagot, and Jordi Vivaldi Palatresi, 'Automatic term detection: A review of current systems', Recent Advances in Computational Terminology, 2, 53–88, (2001).
 [7] Paul Cook, Afsaneh Fazly, and Suzanne Stevenson, 'Pulling their weight: Exploiting syntactic forms for the automatic identification of idiomatic expressions in context', in Proceedings of the Workshop on a Broader Perspective on Multiword Expressions, pp. 41–48. Association for Computational Linguistics, (2007).
 [8] Denis Fedorenko, Nikita Astrakhantsev, and Denis Turdakov, 'Automatic recognition of domain-specific terms: an experimental evaluation', in SYRCoDIS, pp. 15–23, (2013).
 [9] Jody Foo and Magnus Merkel, 'Using machine learning to perform automatic term recognition', in Proceedings of the LREC 2010 Workshop on Methods for Automatic Acquisition of Language Resources and their Evaluation Methods, 23 May 2010, Valletta, Malta, pp. 49–54, (2010).
[10] Katerina T. Frantzi, Sophia Ananiadou, and Junichi Tsujii, 'The C-value/NC-value method of automatic recognition for multi-word terms', in Research and Advanced Technology for Digital Libraries, 585–604, Springer, (1998).
[11] Martin Rudi Holaker and Eirik Emanuelsen, 'Event detection using Wikipedia', Technical report, Institutt for datateknikk og informasjonsvitenskap, (2013).
[12] Christian Jacquemin, Judith L. Klavans, and Evelyne Tzoukermann, 'Expansion of multi-word terms for indexing and retrieval using morphology and syntax', in Proceedings of the Eighth Conference of the European Chapter of the Association for Computational Linguistics, EACL '97, pp. 24–31, Stroudsburg, PA, USA, (1997). Association for Computational Linguistics.
[13] Kyo Kageura and Bin Umino, 'Methods of automatic term recognition: A review', Terminology, 3(2), 259–289, (1996).
[14] Gagandeep Kaur, S. K. Jain, Saurabh Parmar, and Anand Kumar, 'Extraction of domain-specific concepts to create expertise profiles', in Global Trends in Computing and Communication Systems, 763–771, Springer, (2012).
[15] Michael Krauthammer and Goran Nenadic, 'Term identification in the biomedical literature', Journal of Biomedical Informatics, 37(6), 512–526, (2004).
[16] Christopher D. Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, volume 999, MIT Press, 1999.
[17] Alexa T. McCray, Allen C. Browne, and Olivier Bodenreider, 'The lexical properties of the gene ontology', in Proceedings of the AMIA Symposium, p. 504. American Medical Informatics Association, (2002).
[18] Olena Medelyan and Ian H. Witten, 'Thesaurus based automatic keyphrase indexing', in Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL '06, pp. 296–297, New York, NY, USA, (2006). ACM.
[19] George A. Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine J. Miller, 'Introduction to WordNet: An on-line lexical database', International Journal of Lexicography, 3(4), 235–244, (1990).
[20] Andrea Moro, Alessandro Raganato, and Roberto Navigli, 'Entity linking meets word sense disambiguation: a unified approach', Transactions of the Association for Computational Linguistics, 2, 231–244, (2014).
[21] Roberto Navigli and Simone Paolo Ponzetto, 'BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network', Artificial Intelligence, 193, 217–250, (December 2012).
[22] Youngja Park, Roy J. Byrd, and Branimir K. Boguraev, 'Automatic glossary extraction: beyond terminology identification', in Proceedings of the 19th International Conference on Computational Linguistics - Volume 1, pp. 1–7. Association for Computational Linguistics, (2002).
[23] Maria Teresa Pazienza, Marco Pennacchiotti, and Fabio Massimo Zanzotto, 'Terminology extraction: an analysis of linguistic and statistical approaches', in Knowledge Mining, 255–279, Springer, (2005).
[24] Anselmo Peñas, Felisa Verdejo, Julio Gonzalo, et al., 'Corpus-based terminology extraction applied to information access', in Proceedings of Corpus Linguistics, volume 2001. Citeseer, (2001).
[25] Francesco Sclano and Paola Velardi, 'TermExtractor: a web application to learn the shared terminology of emergent web communities', in Enterprise Interoperability II, 287–290, Springer, (2007).
[26] Jordi Vivaldi, Horacio Rodríguez, et al., 'Improving term extraction by system combination using boosting', in Machine Learning: ECML 2001, 515–526, Springer, (2001).
[27] Ziqi Zhang, José Iria, Christopher Brewster, and Fabio Ciravegna, 'A comparative evaluation of term recognition algorithms', in Proceedings of LREC, (2008).