<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Combining Dictionary- and Corpus-Based Concept Extraction</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Joan</forename><surname>Codina-Filbà</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Leo</forename><surname>Wanner</surname></persName>
						</author>
						<title level="a" type="main">Combining Dictionary- and Corpus-Based Concept Extraction</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">582B64E757FA1E7C0D114F80053D770B</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T06:35+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Concept extraction is an increasingly popular topic in deep text analysis. Concepts are individual content elements; their extraction thus offers an overview of the content of the material from which they were extracted. In the case of domain-specific material, concept extraction boils down to term identification. The most straightforward strategy for term identification is a lookup in existing terminological resources. In recent research, this strategy has a poor reputation because it is prone to scaling limitations due to neologisms, lexical variation, synonymy, etc., which subject the terminology to constant change. For this reason, many works have developed statistical techniques to extract concepts. But the existence of a crowdsourced resource such as Wikipedia is changing the landscape. We present a hybrid approach that combines state-of-the-art statistical techniques with the large-scale term acquisition tool BabelFy to perform concept extraction. The combination of the two boosts performance compared to approaches that use these techniques separately.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Concept extraction is an increasingly popular topic in deep text analysis. Concepts are individual content elements, such that their extraction from textual material offers an overview of the content of this material. In applications in which the material is domain-specific, concept extraction commonly boils down to the identification and extraction of terms, i.e., domain-specific (mono- or multi-word) lexical items. Usually, these are nominal lexical items that denote concrete or abstract entities. The most straightforward strategy for term identification is a lookup in existing terminological dictionaries. In recent research, this strategy has a poor reputation because it is prone to scaling limitations due to neologisms, lexical variation, synonymy, etc., which subject the terminology to constant change <ref type="bibr" target="#b14">[15]</ref>. As an alternative, a number of works cast syntactic and/or semantic criteria into rules to determine whether a given lexical item qualifies as a term <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b3">4,</ref><ref type="bibr" target="#b6">7]</ref>, while others apply the statistical criterion of the relative frequency of an item in a domain-specific corpus; see, for example, <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b9">10,</ref><ref type="bibr" target="#b21">22,</ref><ref type="bibr" target="#b23">24,</ref><ref type="bibr" target="#b24">25]</ref>. 
Most often, state-of-the-art statistical term identification is preceded by a rule-based stage in which the preselection of term candidates is done drawing upon linguistic criteria.</p><p>However, most of the state-of-the-art proposals neglect that a new generation of terminological (and thus conceptual) resources has emerged and, with it, instruments to keep these resources updated.</p><p>1 NLP Group, Pompeu Fabra University, Barcelona, email: joan.codina@upf.edu; 2 Catalan Institute for Research and Advanced Studies (ICREA) and NLP Group, Pompeu Fabra University, Barcelona, email: leo.wanner@upf.edu</p><p>Consider, for instance, BabelNet http://www.babelnet.org <ref type="bibr" target="#b20">[21]</ref> and BabelFy http://www.babelfy.org <ref type="bibr" target="#b19">[20]</ref>. BabelNet captures the terms from Wikipedia<ref type="foot" target="#foot_0">3</ref> , WikiData <ref type="foot" target="#foot_1">4</ref> , OmegaWiki <ref type="foot" target="#foot_2">5</ref> , Wiktionary <ref type="foot" target="#foot_3">6</ref> and WordNet <ref type="bibr" target="#b18">[19]</ref> and disambiguates and structures them in terms of an ontology. Wikipedia is a crowd-sourced multilingual encyclopedia that is constantly updated by more than 100,000 active editors for the English version alone. There are studies, cf., e.g., <ref type="bibr" target="#b10">[11]</ref>, which show that by observing edits in Wikipedia, one can learn what is happening around the globe. BabelFy is a tool that scans a text in search of terms and named entities (NEs) that are present in BabelNet. Once the terms and NEs are detected, it uses the text as context in order to disambiguate them.</p><p>In light of this significant change of the terminological dictionary landscape, it is time to assess whether dictionary-driven concept extraction can be factored into linguistic and corpus-driven concept extraction to improve the performance of the overall task. 
The three techniques complement each other: while linguistic criteria filter term candidates, statistical measures help detect domain-specific terms among these candidates, and dictionaries provide terms that can be assumed to be semantically meaningful.</p><p>In what follows, we present our work in which we incorporate BabelFy, and by extension BabelNet and Wikipedia, into the process of domain-specific linguistic and statistical term recognition. This work has been carried out in the context of the MULTISENSOR Project, which targets, among other objectives, concept extraction as a basis for content-oriented visual and textual summaries of multilingual online textual material.</p><p>The remainder of the paper is structured as follows. In Section 2, we introduce the basics of statistical and dictionary-based concept extraction. In Section 3, we then outline our approach. The setup of the experiments we carried out to evaluate our approach and the results we achieved are presented in Sections 4 and 5. In Section 6, we discuss the achieved results; Section 7 draws some conclusions and outlines future work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">The Basics of statistical and dictionary-based concept extraction</head><p>Only a few proposals for concept extraction rely solely on linguistic analysis to do term extraction, always assuming that a term is a noun phrase (NP). Bourigault <ref type="bibr" target="#b4">[5]</ref>, one of the first to address the task of concept extraction, uses part-of-speech (PoS) tags for this purpose. Manning and Schütze <ref type="bibr" target="#b15">[16]</ref>, and Kaur <ref type="bibr" target="#b13">[14]</ref> draw upon regular expressions over PoS sequences.</p><p>More common is the extension of statistical term extraction by a preceding linguistic feature-driven term detection stage, such that we can speak of two core strategies for concept extraction: statistical (or corpus-based) concept extraction and dictionary-based concept extraction. As already pointed out, concept extraction means here "term extraction". Although resources such as BabelNet are considerably richer than traditional terminological dictionaries, they can be considered the modern variant of the latter. Let us review the basics of these two core strategies.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Statistical term extraction</head><p>Corpus-based terminology extraction started to attract attention in the 90s, with the increasing availability of large computerized textual corpora; see <ref type="bibr" target="#b12">[13,</ref><ref type="bibr" target="#b5">6]</ref> for a review of some early proposals. In general, corpus-based concept extraction relies on corpus statistics to score and select the terms among the term candidates. Over the years, a number of different statistics have been suggested to identify relevant terms and the best word groupings; cf., e.g., <ref type="bibr" target="#b1">[2]</ref>.</p><p>As a rule, the extraction is done in a three-step procedure:</p><p>1. Term candidate detection. The objective of this first step is to find words and multiword sequences that could be terms. This first step has to offer high recall, as terms missed here will not be considered in the remainder of the procedure. 2. Compute features for term candidates. For each term candidate, a set of features is computed. Most of the features are statistical and measure how often the term is found as such in the corpus and in the document, as part of other terms, and also with respect to the words that compose it. These basic features are then combined to compute a global score. 3. Select final terms from candidates. Term candidates with the highest scores are selected as terms. The cut-off strategy can be based on a threshold applied to the score (obtained from a training set in order to optimize precision/recall) or on a fixed number of terms (in that case, the top N terms are selected).</p><p>In what follows, we discuss each of these steps in turn.</p></div>
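The three-step procedure can be read as a minimal pipeline in which the concrete detection, scoring and selection strategies are plugged in as functions. The following Python sketch illustrates this; all function names and the toy instantiation are invented for illustration and are not part of any of the cited systems:

```python
def extract_terms(documents, detect, score, select):
    """Three-step term extraction: (1) detect candidates with high recall,
    (2) score every candidate, (3) select the final terms from the scores."""
    candidates = {c for doc in documents for c in detect(doc)}   # step 1
    scored = {c: score(c) for c in candidates}                   # step 2
    return select(scored)                                        # step 3

# Toy instantiation: whitespace tokens as candidates, token length as the
# "score", and a fixed top-N cut-off as the selection strategy.
top2 = lambda scored: sorted(scored, key=scored.get, reverse=True)[:2]
result = extract_terms(["solar panel", "solar energy policy"], str.split, len, top2)
```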
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.1">Term candidate detection</head><p>The most basic statistical term candidate detection strategies are based on n-gram extraction: any n-gram in a text collection could be a term candidate. For instance, Foo and Merkel <ref type="bibr" target="#b8">[9]</ref> use unigrams and bigrams as term candidates. n-gram-based candidate detection is straightforward to implement. However, it produces too many false positives, which add noise to the subsequent stages. For this reason, as already mentioned above, most works use linguistic features such as part-of-speech patterns or NP markers <ref type="bibr" target="#b15">[16,</ref><ref type="bibr" target="#b9">10]</ref> for initial filtering. See <ref type="bibr" target="#b22">[23]</ref> for an overview.</p></div>
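A minimal Python sketch of PoS-filtered n-gram candidate detection; the Penn-style tags and the noun-only filter are simplifying assumptions, not the exact filters of the cited systems:

```python
def ngrams(tokens, max_n=3):
    """All contiguous n-grams up to max_n, as tuples of (token, pos) pairs."""
    return [tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

def candidate_terms(tagged_tokens, max_n=3):
    """Keep only n-grams containing at least one noun -- a crude stand-in
    for the PoS-pattern filters used in the literature."""
    return [" ".join(tok for tok, pos in gram)
            for gram in ngrams(tagged_tokens, max_n)
            if any(pos.startswith("NN") for tok, pos in gram)]

sentence = [("floating", "JJ"), ("point", "NN"), ("routine", "NN")]
print(candidate_terms(sentence))
# → ['point', 'routine', 'floating point', 'point routine', 'floating point routine']
```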
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.2">Feature Extraction</head><p>Once the term candidates have been selected, they need to be scored in order to be ranked with respect to the probability that they are actual terms.</p><p>Most of the proposed metrics are based on the term frequency TF, i.e., the number of occurrences of a term in a text collection. In Information Retrieval, TF is contrasted with IDF (Inverse Document Frequency), which penalizes the most common terms. For the task of term extraction, the IDF of a term candidate can be computed drawing upon a reference corpus, while the frequency of the candidate term in the target domain corpus can be taken as TF, such that we get: TF_target * IDF_ref <ref type="bibr" target="#b15">[16]</ref>.</p><p>Other measures have been developed specifically for term detection. The most common of them are:</p><p>• C-Value <ref type="bibr" target="#b9">[10]</ref>. The objective of the C-Value score is to assign a termhood value to each candidate token sequence, considering also its occurrence inside other terms. The C-Value expands each term candidate with all its possible nested multiword subterms, which also become term candidates. For instance, the term candidate floating point routine includes two nested terms: floating point, which is a term, and point routine, which is not a meaningful expression.</p><p>The following formula formalizes the calculation of the C-Value measure:</p><formula xml:id="formula_0">C\text{-}Value(t) = \begin{cases} \log_2|t| \cdot TF(t) &amp; \text{if } t \text{ is not nested} \\ \log_2|t| \cdot \left( TF(t) - \frac{1}{P(T_t)} \sum_{b \in T_t} TF(b) \right) &amp; \text{otherwise} \end{cases}<label>(1)</label></formula><p>where t is the candidate token sequence, T_t the set of extracted candidate terms that contain t, and P(T_t) the number of these candidate terms. • Lexical Cohesion <ref type="bibr" target="#b21">[22]</ref>. Lexical cohesion computes the cohesion of multiword terms, that is, at this stage, of any arbitrary n-gram. 
This measure is a generalization of the Dice coefficient; it is proportional to the length of the term and its frequency:</p><formula xml:id="formula_2">LC(t) = \frac{|t| \cdot \log_{10}(TF(t)) \cdot TF(t)}{\sum_{w \in t} TF(w)}<label>(2)</label></formula><p>where |t| is the length of the term in words and w ranges over the words that compose it. • Domain Relevance <ref type="bibr" target="#b24">[25]</ref>. This measure compares the frequencies of the term between the target and reference datasets:</p><formula xml:id="formula_3">DR(t) = \frac{TF_{target}(t)}{TF_{target}(t) + TF_{ref}(t)}<label>(3)</label></formula><p>• Relevance <ref type="bibr" target="#b23">[24]</ref>. This measure has been developed in an application that focuses on Spanish. The syntactic patterns used to detect term candidates are thus specific to Spanish, but the term scoring is language-independent. The formula aims to give less weight to terms with a lower frequency in the target corpus and a higher value to very frequent terms, unless they are also very frequent in the reference corpus or are not evenly distributed in the target corpus:</p><formula xml:id="formula_4">Relevance(t) = 1 - \frac{1}{\log_2\left(\frac{TF_{target}(t) + DF_{target}(t)}{TF_{ref}(t)}\right)}<label>(4)</label></formula><p>where TF(t) is the relative term frequency and DF(t) the relative number of documents in which t appears. The document frequency helps penalize terms that appear many times in only a single document. • Weirdness <ref type="bibr" target="#b0">[1]</ref>. Weirdness takes into account the relative sizes of the corpora when comparing frequencies:</p><formula xml:id="formula_5">Weirdness(t) = \frac{TF_{target}(t) \cdot |Corpus_{ref}|}{TF_{ref}(t) \cdot |Corpus_{target}|}<label>(5)</label></formula></div>
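The measures above can be sketched in Python as follows. This is a simplified reading of Equations 1, 3 and 5 with toy frequency dictionaries, not a reference implementation; note that the log2 |t| factor makes single-word candidates score 0 under the C-Value as given, and a zero reference frequency is clamped to 1 in Weirdness to avoid division by zero (our assumption):

```python
import math

def c_value(term, tf, containers):
    """C-Value (Eq. 1): term is a tuple of words, tf maps candidates to their
    frequencies, containers lists the longer candidates that nest `term`."""
    base = math.log2(len(term))
    if not containers:
        return base * tf[term]
    return base * (tf[term] - sum(tf[b] for b in containers) / len(containers))

def domain_relevance(term, tf_target, tf_ref):
    """Domain Relevance (Eq. 3): target-vs-reference frequency ratio."""
    t, r = tf_target.get(term, 0), tf_ref.get(term, 0)
    return t / (t + r) if t + r else 0.0

def weirdness(term, tf_target, tf_ref, size_target, size_ref):
    """Weirdness (Eq. 5): frequency ratio normalized by corpus sizes."""
    return (tf_target.get(term, 0) * size_ref) / (max(tf_ref.get(term, 0), 1) * size_target)
```

For the paper's floating point routine example, the nested candidate floating point is scored on its own frequency minus the average frequency of the terms that contain it.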
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.3">Term selection</head><p>Each of the metrics in the previous subsection produces a score for each term candidate. The final step is to use the scores produced by the chosen metric to filter out the terms under a given threshold.</p><p>Taking the terms sorted by their scores, we expect precision to decrease as we move down the list, while recall increases. The F-score reaches a maximum around the point where precision and recall cross. The list should be truncated at this point, which defines the minimum threshold. But, of course, each dataset yields a different threshold, which needs to be set by observing different training sets. Some authors (e.g., Frantzi et al. <ref type="bibr" target="#b9">[10]</ref>) set an arbitrary threshold; others simply measure precision and recall when truncating the list after some fixed number of terms <ref type="bibr" target="#b7">[8]</ref>.</p><p>When more than one metric is available, the different metrics can be combined to produce a single score. There are two main strategies to do this: the first is to feed a machine learning model with the different metrics and let it learn how to combine them <ref type="bibr" target="#b25">[26]</ref>. The simplest procedure in this case is to calculate a weighted average tuned by linear regression; cf., e.g., <ref type="bibr" target="#b21">[22]</ref>. The second strategy is to make a separate decision for each metric, each trained with its own threshold, and then apply majority voting <ref type="bibr" target="#b26">[27]</ref>.</p></div>
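The majority-voting strategy can be sketched as follows; the metric names and thresholds are invented for illustration:

```python
def majority_vote(candidate_scores, thresholds):
    """Accept a term when more than half of the metrics, each with its own
    trained threshold, vote for it."""
    accepted = []
    for term, scores in candidate_scores.items():
        votes = sum(scores[m] >= cut for m, cut in thresholds.items())
        if votes > len(thresholds) / 2:
            accepted.append(term)
    return accepted

scores = {"renewable energy": {"c_value": 5.2, "weirdness": 3.1, "relevance": 0.9},
          "last week":        {"c_value": 1.0, "weirdness": 0.4, "relevance": 0.2}}
cuts = {"c_value": 2.0, "weirdness": 1.0, "relevance": 0.5}
print(majority_vote(scores, cuts))  # → ['renewable energy']
```

The first strategy would instead collapse the metrics into one score, e.g. a weighted average with weights fitted by linear regression on a labelled training set.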
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Use of terminological resources for terminology detection</head><p>The problem with the use of traditional terminological resources for concept (i.e., term) identification mentioned in Section 1 is reflected by the low recall usually achieved by dictionary-based concept extraction. For instance, studies on the medical domain with the Gene Ontology (GO) terms show a recall between 28% and 53% <ref type="bibr" target="#b16">[17]</ref>. To overcome this limitation, different techniques have been developed to increase the number of matched terms. Thus, Jacquemin <ref type="bibr" target="#b11">[12]</ref> uses a derivational morphological processor for the analysis and generation of term variants. Other authors, like Medelyan <ref type="bibr" target="#b17">[18]</ref>, use a thesaurus to annotate a training set for the discovery of terms within similar contexts. BabelNet is a new type of terminological resource. It reflects the state of continuously updated large-scale resources such as Wikipedia, WikiData, etc. At least in theory, BabelNet should thus not suffer from the coverage shortcoming of traditionally static terminological resources. BabelFy takes all the n-grams (with n ≤ 5) of a given text that contain at least one noun, and checks whether they are substrings of any item in BabelNet. To perform the match, BabelFy uses lemmas.</p><p>We can thus hypothesize that an approach that draws upon BabelNet is likely to benefit from its large coverage and continuous updates.</p></div>
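BabelFy's candidate matching, as described above, can be approximated by checking every noun-containing lemma n-gram (n ≤ 5) against a lexicon of known lemma sequences; the lexicon and tags below are toy stand-ins for BabelNet and a real tagger, not BabelFy's actual API:

```python
def dictionary_matches(lemmas, pos_tags, lexicon, max_n=5):
    """Return (start, end, surface) spans whose lemma sequence is listed in
    `lexicon` and which contain at least one noun."""
    hits = []
    for i in range(len(lemmas)):
        for j in range(i + 1, min(i + max_n, len(lemmas)) + 1):
            span = tuple(lemmas[i:j])
            if span in lexicon and any(p.startswith("NN") for p in pos_tags[i:j]):
                hits.append((i, j, " ".join(span)))
    return hits

lemmas = ["real", "time", "clock"]
tags = ["JJ", "NN", "NN"]
lexicon = {("real", "time"), ("real", "time", "clock"), ("clock",)}
print(dictionary_matches(lemmas, tags, lexicon))
# → [(0, 2, 'real time'), (0, 3, 'real time clock'), (2, 3, 'clock')]
```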
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Our Approach</head><p>In the MULTISENSOR project, term recognition is realized as a hybrid module, which combines corpus-driven term identification with dictionary-based term identification based on BabelFy. By combining corpus-driven and dictionary-based term identification, we aim to enrich BabelFy's domain-neutral strategy with domain information in order to be able to identify domain-specific terms.</p><p>Based on the insights from <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b26">27]</ref>, who compare different metrics, we decided to implement the C-Value measure and the Weirdness metric. The C-Value measure serves to assess the termhood of a candidate term, while the Weirdness metric reveals to what extent a term candidate is domain-specific. However, the Weirdness metric requires some adaptation: it can range from 0 to infinity, which is not desirable. To keep the possible values within a limited range, we changed the quotient between probabilities to a quotient between IDFs. As a result, Equation 5 is transformed to:</p><formula xml:id="formula_6">DomWeight(t) = \frac{IDF_{ref}(t)}{IDF_{target}(t)}<label>(6)</label></formula><p>BabelFy offers an API that annotates the terms of a given text found in one of the resources it consults (WordNet, Wikipedia, WikiData, Wiktionary, etc.), distinguishing between named entities and concepts. (Note that even though Wikipedia is continuously updated, BabelNet is only updated in batch mode from time to time, producing a delay between the crowdsourced changes and their availability in BabelNet.) Cf. Figure <ref type="figure" target="#fig_0">1</ref> for illustration. The figure shows the result of processing a sentence with BabelFy's web interface. As can be observed, BabelFy annotates nouns (including multiword nouns), adjectives and verbs (such as working or examine). 
In accordance with the goals of MULTISENSOR, we keep only nominal annotations and discard verbal and adjectival ones. Furthermore, BabelFy can be considered a general-purpose thesaurus that is not tailored to any specific domain. For this reason, during domain-specific term extraction as in MULTISENSOR, not all terms that have been annotated by BabelFy should be considered part of the domain terminology.</p><p>To ensure domain specificity, we index the documents for which IDF(t) is computed in a Solr index, <ref type="foot" target="#foot_4">8</ref> with a field that indicates the domain to which each of them belongs. This allows for an incremental setup in which new documents can be indexed at any time and the statistics can be continuously updated. The documents indexed in Solr comprise the texts of these documents together with all the term candidates found in them. To index the term candidates, and in order to allow for queries that may match either a full term or parts of it (which can be, again, full terms), we use lemmas (instead of word forms) and underscores between the lemmas to mark the beginning, middle and end of the term. The first lemma of the term is suffixed with an underscore, the middle lemmas are prefixed and suffixed with underscores, and the last lemma is prefixed with an underscore (for instance, the term candidate real time clocks would be indexed as real_ _time_ _clock).</p><p>Initially, the index is filled with the documents that make up the reference and domain corpora. When a new document arrives, we check in both corpora the frequencies of its term candidates as well as the frequencies of their parts, both as terms and as parts of other terms. To extract these frequencies, several partial matches are required, which can be specified taking advantage of the underscores in the term notation. 
For instance, to obtain the frequency of the expression real time as a term on its own, i.e., not as part of a longer term, we must search for real_ _time. To obtain the frequency of the same sequence of lemmas as part of longer terms, the corresponding query would be real_ _time_ OR _real_ _time_ OR _real_ _time. In this last query, the first part matches terms starting with the sequence under consideration (as, e.g., real time clock); the second part matches terms that contain the sequence in the middle (as, e.g., near real time system); and the last part matches terms ending with the sequence (as, e.g., near real time).</p><p>Queries in Solr return the number of documents matching the query. This implies that a document with multiple occurrences of a term is counted only once. Some of the formulas in Section 2.1.2 consider document frequencies, while others consider term frequencies. In order to minimize this discrepancy, and to weight very long and very short documents evenly, we split long documents into chunks of about 20 sentences.</p><p>To generate term candidates for the statistical term extraction, all NPs in the text are detected. The module takes as input the already tokenized sentences of a document. Tokens are lemmatized and annotated with PoS tags and syntactic dependencies. To detect NPs, we traverse all the nodes of the tree in pre-order, finding the head nouns and their dependent elements. A set of rules indicates which nouns and which dependants form the NP. The system includes sets of rules for all the languages we work with: English, German, French and Spanish. Each term candidate is expanded with all its subterms (i.e., the n-grams that compose it). The term candidates and all the substrings they contain are then scored using the C-Value and DomWeight metrics. Candidates with a DomWeight below 0.8, as well as nested terms with a lower C-Value than the term they belong to, are filtered out. 
The remaining candidates are sorted by decreasing C-Value and, in case of a tie, by DomWeight.</p><p>After processing the text with BabelFy, we obtain another list of term candidates, namely those that are found in BabelNet. The two lists are merged by intersection and again sorted according to their C-Value and DomWeight scores.</p></div>
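The underscore-based term notation and the partial-match queries described above can be sketched as follows; the query strings mirror the real time example, while Solr's actual phrase-query syntax and analysis chain are omitted:

```python
def encode_term(lemmas):
    """'real time clock' -> 'real_ _time_ _clock': first lemma suffixed,
    middle lemmas prefixed and suffixed, last lemma prefixed."""
    if len(lemmas) == 1:
        return lemmas[0]
    return " ".join([lemmas[0] + "_"]
                    + ["_" + l + "_" for l in lemmas[1:-1]]
                    + ["_" + lemmas[-1]])

def nested_queries(lemmas):
    """Phrase queries matching the sequence at the start, in the middle,
    or at the end of a longer indexed term."""
    start = " ".join([lemmas[0] + "_"] + ["_" + l + "_" for l in lemmas[1:]])
    middle = " ".join("_" + l + "_" for l in lemmas)
    end = " ".join(["_" + l + "_" for l in lemmas[:-1]] + ["_" + lemmas[-1]])
    return start, middle, end

print(encode_term(["real", "time", "clock"]))  # → real_ _time_ _clock
print(nested_queries(["real", "time"]))
# → ('real_ _time_', '_real_ _time_', '_real_ _time')
```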
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Experimental setup</head><p>The term extraction methodology described above has been tested on three different use cases. Each use case is composed of a selection of 1,000 news articles, blogs and other web pages related to a different domain. The reference corpus is a set of about 22,000 documents from different sources.</p><p>The first use case contains documents about household appliances, with information both about the appliances as such and about companies involved in the market of household appliance manufacturing and trading. The second use case is about energy policies; it includes news and web pages on green and renewable energy. The third use case covers the yoghurt industry; it contains documents about yoghurt products, legal regulations concerning the production of and trade in yoghurt, and the dairy industry. The collection of documents for the three use cases has been extracted from controlled sources, which ensures that the texts within the collection are clean. The documents have first been processed to detect term candidates, i.e., tokenized, parsed and passed through the NP detector. Once processed, they have been indexed in a Solr index. In addition, all documents have been split into chunks of about 20 sentences to balance the length of the processed texts. In order to evaluate the performance of our hybrid term extraction, for each use case a set of 20 sentences (from different documents) has been annotated as ground truth by a team of three annotators.</p><p>Table <ref type="table" target="#tab_0">1</ref> summarizes the information about the different use cases, the reference corpus, the number of original documents, the number of documents after indexing (with some of the documents split as mentioned above), and the number of manually annotated terms for each domain.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Evaluation</head><p>In order to evaluate the proposed approach to concept extraction, and to observe the impact of merging corpus-driven and dictionary-based extraction, we first measured the performance of each of them separately and then of the merge. Table <ref type="table" target="#tab_1">2</ref> shows the precision and recall of the three runs. It can be observed that the hybrid approach increases the precision by between 14 and 25 percentage points and decreases the recall by between 7 and 24 points. To assess whether the increase in precision compensates for the loss of coverage, we computed the F-score in Table <ref type="table" target="#tab_2">3</ref>.</p><p>The table shows that the F-score of the hybrid approach is 7 points above the score of the BabelFy (i.e., dictionary-based) approach and 13 points above the corpus-driven approach.</p><p>The results shown in Tables <ref type="table" target="#tab_2">2 and 3</ref> have been calculated with all terms provided by corpus-driven and dictionary-based term extraction; only terms with a DomWeight under 0.8 and nested terms with a lower C-Value than the term they belong to have been filtered out. Figure <ref type="figure" target="#fig_1">2</ref> shows how precision, recall and F-score evolve as we move down the list of terms sorted by the score obtained with corpus-driven term extraction (recall that BabelFy does not provide any confidence score).</p><p>The score places the most relevant terms at the top of the list, increasing the precision by more than 25 points over the average (as can be observed in the precision/recall/F-score graph, the first 30 terms maintain a precision over 70%).</p><p>Figure <ref type="figure" target="#fig_2">3</ref> shows the evolution of precision, recall and F-score for the hybrid term extraction, keeping the ranking provided by the corpus-driven approach. 
In this case, the hybrid term extraction maintains 100% precision for the first 17 terms and still reaches 95% precision after the first 20 (a single wrong term among them); 80% precision is maintained for the first 35 terms.</p><p>A baseline term identification that does not use scores would obtain a precision of 33%, or 44% when using BabelFy, when selecting 20 terms at random. When scores are used, the precision of the corpus-driven approach increases to 47.7%. When both approaches are combined, the average precision for the three use cases increases to 73.6%, an overall increase of 26 points compared to the individual techniques.</p></div>
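The precision/recall/F-score curves of Figures 2 and 3 correspond to evaluating the ranked term list at every cut-off, which can be sketched as follows (the ranking and gold set are toy data):

```python
def pr_curve(ranked_terms, gold):
    """Precision, recall and F-score after each prefix of the ranked list."""
    curve, hits = [], 0
    for k, term in enumerate(ranked_terms, start=1):
        hits += term in gold
        p, r = hits / k, hits / len(gold)
        f = 2 * p * r / (p + r) if p + r else 0.0
        curve.append((p, r, f))
    return curve

curve = pr_curve(["solar panel", "last week", "wind turbine"],
                 {"solar panel", "wind turbine"})
print(curve[1])  # → (0.5, 0.5, 0.5)
```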
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Discussion</head><p>The performance figures reported in the previous section show that a combination of corpus-driven and dictionary-based term identification achieves better results than either of them in isolation, especially when the corpus-driven approach is preceded by a linguistic filtering stage. Approaches that are based exclusively on linguistic features serve well to find very rare terms, but they tend to be language- and domain-dependent, which reduces their scalability and coverage. The same applies to approaches that use gazetteers.</p><p>Corpus-driven term identification provides term candidates that are domain-specific and common enough to be considered terms, but may be semantically meaningless.</p><p>Both corpus-driven and dictionary-based approaches offer high recall at the expense of low precision, because each of them adds its own noise. When combining the two techniques, we increase the precision but lose some recall. However, the decrease in recall is more than compensated by the increase in precision, which leads to an improvement of the F-score. This increase is more evident when we concentrate on terms with a higher score.</p><p>The use of an index like Solr to maintain the corpus data allows for the creation of an incremental system that can be updated with incoming news, so that the system responds dynamically when new concepts appear in a domain.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Conclusions and Future Work</head><p>We presented a hybrid approach to concept (i.e., term) identification and extraction. The approach combines a state-of-the-art corpus-driven approach with a dictionary lookup based on BabelFy. The combination of the two increases the overall performance, as it takes the best of both: while statistics are very good at detecting domain-specific terms, dictionaries provide terms that are semantically meaningful.</p><p>The use of BabelFy (and thus of BabelNet) allows us to avoid the typical coverage limitation of dictionary-based term identification. As already argued above, BabelNet, which has been generated automatically from Wikipedia and other resources, is a crowdsourced terminological resource that can be considered to contain the critical mass of terms needed for our task.</p><p>Crowdsourced and continuously updated dictionaries ensure the availability of up-to-date resources, but there is still a time offset between the emergence of a new term and its inclusion in Wikipedia. In the future, it could be insightful to observe the first occurrences of a term and assess its potential status as an emerging concept that cannot be expected to be in Wikipedia yet. This would allow us to give such terms an appropriate score and thus prevent them from being filtered out.</p><p>A relevant topic that we have not yet addressed in our current work is the detection of term synonymy, which would further increase the accuracy of the retrieved concept profiles of the documents.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 .</head><label>1</label><figDesc>Figure 1. Concepts and named entities detected in a sentence using the BabelFy web interface</figDesc><graphic coords="3,305.13,396.00,245.09,215.55" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 .</head><label>2</label><figDesc>Figure 2. Evolution of precision, recall and F-score as we move down the list of terms generated by the corpus-driven term extraction and sorted by their score</figDesc><graphic coords="5,42.11,161.93,245.09,130.69" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 .</head><label>3</label><figDesc>Figure 3. Evolution of precision, recall and F-score as we move down the list of terms generated by the hybrid system, sorted by the score obtained by statistical metrics</figDesc><graphic coords="5,305.13,71.12,245.08,161.01" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 .</head><label>1</label><figDesc>Number of documents and concepts annotated for each use case. The number of indexed chunks indicates in how many different text portions the documents have been split (at sentence boundaries)</figDesc><table><row><cell>Use</cell><cell>Name</cell><cell>Num. of</cell><cell>Num. of</cell><cell>annotated</cell></row><row><cell>Case</cell><cell></cell><cell>documents</cell><cell>indexed</cell><cell>terms</cell></row><row><cell></cell><cell></cell><cell></cell><cell>chunks</cell><cell></cell></row><row><cell>0</cell><cell>Reference</cell><cell>21,994</cell><cell>43,808</cell><cell>-</cell></row><row><cell></cell><cell>Corpus</cell><cell></cell><cell></cell><cell></cell></row><row><cell>1</cell><cell>Household</cell><cell>1,000</cell><cell>2,171</cell><cell>123</cell></row><row><cell></cell><cell>Appliances</cell><cell></cell><cell></cell><cell></cell></row><row><cell>2</cell><cell>Energy</cell><cell>1,000</cell><cell>1,565</cell><cell>80</cell></row><row><cell></cell><cell>Policies</cell><cell></cell><cell></cell><cell></cell></row><row><cell>3</cell><cell>Yoghurt</cell><cell>1,000</cell><cell>2,096</cell><cell>118</cell></row><row><cell></cell><cell>Industry</cell><cell></cell><cell></cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 .</head><label>2</label><figDesc>Results obtained by the different approaches and the hybrid system in the three use cases ('p' = precision; 'r' = recall)</figDesc><table><row><cell>Use</cell><cell cols="2">Corpus-driven</cell><cell cols="2">Dictionary-based</cell><cell cols="2">Hybrid</cell></row><row><cell>Case</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell></cell><cell>p</cell><cell>r</cell><cell>p</cell><cell>r</cell><cell>p</cell><cell>r</cell></row><row><cell>1</cell><cell>38.1%</cell><cell>93.5%</cell><cell>50.3%</cell><cell>76.4%</cell><cell>65.2%</cell><cell>71.54%</cell></row><row><cell>2</cell><cell>28.0%</cell><cell>97.3%</cell><cell>36.2%</cell><cell>74.68%</cell><cell>48.3%</cell><cell>70.9%</cell></row><row><cell>3</cell><cell>34.8%</cell><cell>79.5%</cell><cell>46.2%</cell><cell>68.4%</cell><cell>60.9%</cell><cell>57.3%</cell></row><row><cell>avg</cell><cell>33.6%</cell><cell>90.1%</cell><cell>44.2%</cell><cell>73.2%</cell><cell>58.1%</cell><cell>66.6%</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3 .</head><label>3</label><figDesc>F-scores obtained by the different approaches and the hybrid system in the 3 use cases</figDesc><table><row><cell cols="4">Use Case Corpus-driven Dictionary-driven Hybrid</cell></row><row><cell>1</cell><cell>54.1%</cell><cell>60.7%</cell><cell>68.2%</cell></row><row><cell>2</cell><cell>43.5%</cell><cell>48.8%</cell><cell>57.4%</cell></row><row><cell>3</cell><cell>48.4%</cell><cell>55.1%</cell><cell>59.1%</cell></row><row><cell>avg</cell><cell>49.0%</cell><cell>55.1%</cell><cell>62.1%</cell></row></table></figure>
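As a sanity check on Tables 2 and 3, each F-score in Table 3 is the harmonic mean of the corresponding precision and recall in Table 2. A minimal sketch (the sample values are the hybrid column of use case 1; the function name is illustrative):

```python
# Balanced F1-score: harmonic mean of precision and recall.
def f_score(p, r):
    return 2 * p * r / (p + r)

# Hybrid system, use case 1: p = 65.2%, r = 71.54% -> F = 68.2% (Table 3)
print(round(f_score(0.652, 0.7154), 3))
```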
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_0">http://www.wikipedia.org</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_1">wikidata.org</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_2">omegaWiki.org</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_3">wiktionary.org</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_4">http://lucene.apache.org/solr</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>ACKNOWLEDGEMENTS</head><p>This work was partially supported by the European Commission under the contract number FP7-ICT-610411 (MULTISENSOR).</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">University of Surrey participation in TREC8: Weirdness indexing for logical document extrapolation and retrieval (WILDER)</title>
		<author>
			<persName><forename type="first">Khurshid</forename><surname>Ahmad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Lee</forename><surname>Gillam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Lena</forename><surname>Tostevin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of TREC</title>
				<meeting>Proceedings of TREC</meeting>
		<imprint>
			<date type="published" when="1999">1999</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">Lars</forename><surname>Ahrenberg</surname></persName>
		</author>
		<ptr target="http://www.ida.liu.se/larah03/publications/tereviewv2.pdf" />
		<title level="m">Term extraction: A review draft version 091221</title>
				<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Identifying multi-word expressions by leveraging morphological and syntactic idiosyncrasy</title>
		<author>
			<persName><forename type="first">Hassan</forename><surname>Al-Haj</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Shuly</forename><surname>Wintner</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 23rd International conference on Computational Linguistics</title>
				<meeting>the 23rd International conference on Computational Linguistics</meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="10" to="18" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">A measure of syntactic flexibility for automatically identifying multiword expressions in corpora</title>
		<author>
			<persName><forename type="first">Colin</forename><surname>Bannard</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Workshop on a Broader Perspective on Multiword Expressions</title>
				<meeting>the Workshop on a Broader Perspective on Multiword Expressions</meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2007">2007</date>
			<biblScope unit="page" from="1" to="8" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Surface grammatical analysis for the extraction of terminological noun phrases</title>
		<author>
			<persName><forename type="first">Didier</forename><surname>Bourigault</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 14th conference on Computational linguistics</title>
				<meeting>the 14th conference on Computational linguistics</meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="1992">1992</date>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page" from="977" to="981" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Automatic term detection: A review of current systems</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">Teresa</forename><surname>Cabré Castellví</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Rosa</forename><surname>Estopà Bagot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jordi</forename><surname>Vivaldi Palatresi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Recent advances in computational terminology</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="53" to="88" />
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Pulling their weight: Exploiting syntactic forms for the automatic identification of idiomatic expressions in context</title>
		<author>
			<persName><forename type="first">Paul</forename><surname>Cook</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Afsaneh</forename><surname>Fazly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Suzanne</forename><surname>Stevenson</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the workshop on a broader perspective on multiword expressions</title>
				<meeting>the workshop on a broader perspective on multiword expressions</meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2007">2007</date>
			<biblScope unit="page" from="41" to="48" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Automatic recognition of domain-specific terms: an experimental evaluation</title>
		<author>
			<persName><forename type="first">Denis</forename><surname>Fedorenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nikita</forename><surname>Astrakhantsev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Denis</forename><surname>Turdakov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">SYRCoDIS</title>
		<imprint>
			<biblScope unit="page" from="15" to="23" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Using machine learning to perform automatic term recognition</title>
		<author>
			<persName><forename type="first">Jody</forename><surname>Foo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Magnus</forename><surname>Merkel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the LREC 2010 Workshop on Methods for automatic acquisition of Language Resources and their evaluation methods</title>
				<meeting>the LREC 2010 Workshop on Methods for automatic acquisition of Language Resources and their evaluation methods<address><addrLine>Valletta, Malta</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2010-05-23">23 May 2010. 2010</date>
			<biblScope unit="page" from="49" to="54" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">The C-value/NC-value method of automatic recognition for multi-word terms</title>
		<author>
			<persName><forename type="first">Katerina</forename><forename type="middle">T</forename><surname>Frantzi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sophia</forename><surname>Ananiadou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Junichi</forename><surname>Tsujii</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Research and advanced technology for digital libraries</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="1998">1998</date>
			<biblScope unit="page" from="585" to="604" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m" type="main">Event detection using Wikipedia</title>
		<author>
			<persName><forename type="first">Rudi</forename><forename type="middle">Martin</forename><surname>Holaker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Eirik</forename><surname>Emanuelsen</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
		<respStmt>
			<orgName>Institutt for datateknikk og informasjonsvitenskap</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Technical report</note>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Expansion of multi-word terms for indexing and retrieval using morphology and syntax</title>
		<author>
			<persName><forename type="first">Christian</forename><surname>Jacquemin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Judith</forename><forename type="middle">L</forename><surname>Klavans</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Evelyne</forename><surname>Tzoukermann</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Eighth Conference on European Chapter of the Association for Computational Linguistics, EACL &apos;97</title>
				<meeting>the Eighth Conference on European Chapter of the Association for Computational Linguistics, EACL &apos;97<address><addrLine>Stroudsburg, PA, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="1997">1997</date>
			<biblScope unit="page" from="24" to="31" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Methods of automatic term recognition: A review</title>
		<author>
			<persName><forename type="first">Kyo</forename><surname>Kageura</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Bin</forename><surname>Umino</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Terminology</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="259" to="289" />
			<date type="published" when="1996">1996</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Extraction of domain-specific concepts to create expertise profiles</title>
		<author>
			<persName><forename type="first">Gagandeep</forename><surname>Kaur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Saurabh</forename><surname>Jain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Anand</forename><surname>Parmar Kumar</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Global Trends in Computing and Communication Systems</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="763" to="771" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Term identification in the biomedical literature</title>
		<author>
			<persName><forename type="first">Michael</forename><surname>Krauthammer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Goran</forename><surname>Nenadic</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of biomedical informatics</title>
		<imprint>
			<biblScope unit="volume">37</biblScope>
			<biblScope unit="issue">6</biblScope>
			<biblScope unit="page" from="512" to="526" />
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<title level="m" type="main">Foundations of statistical natural language processing</title>
		<author>
			<persName><forename type="first">Christopher</forename><forename type="middle">D</forename><surname>Manning</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hinrich</forename><surname>Schütze</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1999">1999</date>
			<publisher>MIT Press</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">The lexical properties of the Gene Ontology</title>
		<author>
			<persName><forename type="first">Alexa</forename><forename type="middle">T</forename><surname>McCray</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Allen</forename><forename type="middle">C</forename><surname>Browne</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Olivier</forename><surname>Bodenreider</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the AMIA Symposium</title>
				<meeting>the AMIA Symposium</meeting>
		<imprint>
			<publisher>American Medical Informatics Association</publisher>
			<date type="published" when="2002">2002</date>
			<biblScope unit="page">504</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Thesaurus based automatic keyphrase indexing</title>
		<author>
			<persName><forename type="first">Olena</forename><surname>Medelyan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ian</forename><forename type="middle">H</forename><surname>Witten</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL &apos;06</title>
				<meeting>the 6th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL &apos;06<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2006">2006</date>
			<biblScope unit="page" from="296" to="297" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Introduction to WordNet: An on-line lexical database</title>
		<author>
			<persName><forename type="first">George</forename><forename type="middle">A</forename><surname>Miller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Richard</forename><surname>Beckwith</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Christiane</forename><surname>Fellbaum</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Derek</forename><surname>Gross</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Katherine</forename><forename type="middle">J</forename><surname>Miller</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International journal of lexicography</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="235" to="244" />
			<date type="published" when="1990">1990</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Entity linking meets word sense disambiguation: a unified approach</title>
		<author>
			<persName><forename type="first">Andrea</forename><surname>Moro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alessandro</forename><surname>Raganato</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Roberto</forename><surname>Navigli</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Transactions of the Association for Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="231" to="244" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network</title>
		<author>
			<persName><forename type="first">Roberto</forename><surname>Navigli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Simone</forename><forename type="middle">Paolo</forename><surname>Ponzetto</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Artif. Intell</title>
		<imprint>
			<biblScope unit="volume">193</biblScope>
			<biblScope unit="page" from="217" to="250" />
			<date type="published" when="2012-12">December 2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Automatic glossary extraction: beyond terminology identification</title>
		<author>
			<persName><forename type="first">Youngja</forename><surname>Park</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Roy</forename><forename type="middle">J</forename><surname>Byrd</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Branimir</forename><surname>Boguraev</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 19th international conference on Computational linguistics</title>
				<meeting>the 19th international conference on Computational linguistics</meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2002">2002</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="1" to="7" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Terminology extraction: an analysis of linguistic and statistical approaches</title>
		<author>
			<persName><forename type="first">Maria</forename><forename type="middle">Teresa</forename><surname>Pazienza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Marco</forename><surname>Pennacchiotti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Fabio</forename><forename type="middle">Massimo</forename><surname>Zanzotto</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Knowledge mining</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2005">2005</date>
			<biblScope unit="page" from="255" to="279" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Corpus-based terminology extraction applied to information access</title>
		<author>
			<persName><forename type="first">Anselmo</forename><surname>Peñas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Felisa</forename><surname>Verdejo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Julio</forename><surname>Gonzalo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of Corpus Linguistics</title>
				<meeting>Corpus Linguistics</meeting>
		<imprint>
			<publisher>Citeseer</publisher>
			<date type="published" when="2001">2001. 2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">TermExtractor: a web application to learn the shared terminology of emergent web communities</title>
		<author>
			<persName><forename type="first">Francesco</forename><surname>Sclano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Paola</forename><surname>Velardi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Enterprise Interoperability II</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2007">2007</date>
			<biblScope unit="page" from="287" to="290" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Improving term extraction by system combination using boosting</title>
		<author>
			<persName><forename type="first">Jordi</forename><surname>Vivaldi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Horacio</forename><surname>Rodríguez</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Machine Learning: ECML 2001</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2001">2001</date>
			<biblScope unit="page" from="515" to="526" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">A comparative evaluation of term recognition algorithms</title>
		<author>
			<persName><forename type="first">Ziqi</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">José</forename><surname>Iria</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Christopher</forename><surname>Brewster</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Fabio</forename><surname>Ciravegna</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of LREC</title>
				<meeting>LREC</meeting>
		<imprint>
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
