=Paper=
{{Paper
|id=Vol-1384/paper5
|storemode=property
|title=The Termolator: Terminology Recognition based on Chunking, Statistical and Search-based Scores
|pdfUrl=https://ceur-ws.org/Vol-1384/paper5.pdf
|volume=Vol-1384
|dblpUrl=https://dblp.org/rec/conf/issi/MeyersHGB15
}}
==The Termolator: Terminology Recognition based on Chunking, Statistical and Search-based Scores==
The Termolator: Terminology Recognition based on Chunking, Statistical and Search-based Scores [1]

Adam Meyers (1), Yifan He (2), Zachary Glass (3) and Olga Babko-Malaya (4)
(1) meyers@cs.nyu.edu, New York University, Dept. of Computer Science, 715 Broadway, NY, NY (USA)
(2) yhe@cs.nyu.edu, New York University, Dept. of Computer Science, 715 Broadway, NY, NY (USA)
(3) zglass@alumni.princeton.edu, New York University, Dept. of Computer Science, 715 Broadway, NY, NY (USA)
(4) olga.babko-malaya@baesystems.com, BAE Systems, 6 New England Executive Park, Burlington, MA (USA)

[1] Approved for public release; unlimited distribution

Abstract
The Termolator is a high-performing terminology extraction system, which will soon be available as open source software. The Termolator combines several different approaches to achieve superior coverage and accuracy. The system identifies potential instances of terminology using a chunking procedure, similar to noun group chunking, but favoring chunks that contain out-of-vocabulary words, nominalizations, technical adjectives, and other specialized word classes. The system ranks such term chunks according to several metrics, including: (a) a set of metrics that favors term chunks that are relatively more frequent in a "foreground" corpus about a single topic than they are in a "background" or multi-topic corpus; and (b) a relevance score which measures how often terms appear in articles and patents in a Yahoo web search. We analyse the contributions made by each of these metrics and show that all modules contribute to the system's performance, both in terms of the number and the quality of terms identified.

Workshop Topic
Terminology Extraction

Introduction
Automatic terminology extraction systems aim to collect word sequences to be used as Information Retrieval key words or as terms to be included in domain-specific glossaries or ontologies. Terms are also potential arguments of information extraction relations or entities to be tracked for technology forecasting applications. This paper describes the Termolator, a terminology extraction system which will soon be released as open source software. The Termolator selects the terms (scientific noun sequences) that are characteristic of a particular technical area. The system identifies all instances of terms in sets of files using a sequential pattern matching process called chunking. It is similar to the noun group chunkers used in many natural language processing systems, but adds additional constraints so that the noun group chunks must contain words belonging to specialized vocabulary classes, including out-of-vocabulary words, nominalizations, technical adjectives, and others. To find chunks that are characteristic of a topic, the system compares the frequencies of particular terms in two sets of documents: the foreground corpus (documents about a single topic) and the background corpus (documents about a mixture of topics). It uses several statistical measures to make this determination, including Document Relevance Document Consensus or DRDC (Navigli and Velardi, 2004), Term Frequency-Inverse Document Frequency (TFIDF) and Kullback-Leibler Divergence or KLD (Cover and Thomas, 1991; Hisamitsu et al., 1999). For each foreground set of documents, the system produces a list of terms, which is initially ordered based on the distributional measures just described.
Two other types of scores are factored into the system's ranking: a well-formedness score based on linguistic constraints, and a relevance score based on how often Yahoo (https://search.yahoo.com) web-search results for that term point to patents or articles. The final ranking is used to extract the top terms. We have found that, given about 5000 foreground documents and 5000 background documents, we can generate about 5000 terms that are approximately 85% accurate. The system has been tested and scored on US patents and Web of Science abstracts. We have also performed preliminary tests on English journal articles (PubMed Central corpus, http://www.ncbi.nlm.nih.gov/pmc/). We have implemented some of the components of a Chinese version of the system and plan to continue development.

System Description

System Overview
Our system consists of three stages: terminological chunking and abbreviation, distributional ranking, and filtering. The first stage identifies instances of potential terms in text. The second stage orders the terms according to their relative distribution in the foreground and background corpora. The final stage reorders the top N terms from the second stage based on a well-formedness metric and a relevance metric. The assumption behind the ranking is that higher ranked terms are preferred over lower ranked ones in three respects: 1) higher ranked terms are less likely to be errors, i.e., noun sequences that are not really instances of terminology because they are ill-formed as noun groups or represent phrases that are part of the general vocabulary rather than the specialized vocabulary; 2) higher ranking terms tend to be more characteristic of a particular field of interest than lower ranking terms; and 3) higher ranking terms tend to have greater relevance than low ranking ones, i.e., specialists and others are currently more interested in the concepts represented by the high ranking terms.

Stage 1: Terminological Chunking and Abbreviation
In Meyers et al. (2014a), we describe the component of our system designed for identifying terms in sentences, independent of their distribution in sets of documents. Like Justeson and Katz (1995), we assume that most instances of terminology are noun groups: head nouns and pre-modifiers other than determiners. Consequently, we currently exclude non-noun instances of terminology (verbs like calcify or coactivate, adjectives like covalent or model theoretic, and adverbs like deterministically or stochastically). Unlike previous approaches, we consider only a subset of noun groups, as we adopt a more stringent set of chunking rules than is used for standard noun group detection. We also identify an additional set of terms by means of rules for identifying abbreviations. We incorporate into our chunking rules requirements that constituents contain nominalizations, out-of-vocabulary words, technical adjectives and other classes of a more fine-grained nature than the typical parts of speech used in noun chunking. Nominalizations, such as amplification and radiation, are identified and classified using NOMLEX_PLUS (Macleod et al., 1998; Meyers et al., 2004), contributing to the recognition of terms such as optical amplification medium fiber and optical radiation. Out-of-vocabulary words (e.g., photoconductor and collimate) are words not found in COMLEX Syntax (Macleod et al., 1997) or classified as names (thus selecting terms like electrophotographic photoconductor and optical collimate).
Technical adjectives are adjectives, found in COMLEX or classified as adjectives by a POS tagger, that end in -ic, -cal, or -ous, but are not part of a manually selected exclusion list (e.g., public, jealous). [2]

The chunking component is modelled as a finite state machine using a fine-grained set of parts of speech (FPOS) to determine transitions between Beginning, Ending, Internal and Other states, in the style of Ramshaw and Marcus (1995). The FPOS include nominalizations, technical adjectives and Out of Vocabulary (OOV) words as defined above, as well as several other categories such as nationalities (the adjectival form of a country, state, city or continent, e.g., European, Indian, and Peruvian), adjectives or nouns with the first letter capitalized, person names, and roman numerals. The technical noun chunks are sequences of these categories, omitting preceding determiners, normal adjectives and other words that are not likely to be parts of instances of terminology. [3] (A minimal sketch of this style of chunker appears at the end of this subsection, after the relation examples.)

We extract instances of abbreviations and their full forms using pattern matching similar to Schwartz and Hearst (2003), in contexts where a full form/abbreviation pair is separated by an open parenthesis, e.g., Hypertext Markup Language (HTML). In the simplest case, the abbreviation consists of the initials of each word of the full form, but variations in which words are skipped, multiple letters match a single word, etc., are handled as well. Keyword-based heuristics and gazetteers are used to differentiate non-terminology abbreviation cases from terminology ones; e.g., New York University and Acbel Polytech Inc. are ruled out because the words University and Inc. indicate organizations, and British Columbia is ruled out due to a gazetteer.

Both the terminology chunker and the abbreviation system identify terms in sentences in each document. These instances are collected and output to be used in stage 2. We also use the stage 1 output independently of the rest of the Termolator, to find instances of terms that are arguments of the Information Extraction relations discussed in Meyers et al. (2014b). Some example relations from the PubMed corpus follow:

1. found in the IκB protein, an inhibitor of NF-κB
• Relation: Exemplify, Arg1: IκB protein, Arg2: inhibitor of NF-κB
• Interpretation: Arg1 is an instance of Arg2
2. a necrotrophic effector system that is an exciting contrast to the biotrophic effector models that have been intensively studied
• Relation: Contrast, Arg1: necrotrophic effector system, Arg2: biotrophic effector models
• Interpretation: Arg1 and Arg2 are in contrast with each other
3. Bayesian networks hold a considerable advantage over pairwise association tests
• Relation: Better than, Arg1: Bayesian networks, Arg2: pairwise association tests
• Interpretation: Arg1 is better than Arg2 (in some respect)
4. housekeeping gene 36B4 (acidic ribosomal phosphoprotein P0)
• Relation: Alias, Arg1: housekeeping gene 36B4, Arg2: acidic ribosomal phosphoprotein P0
• Interpretation: Arg1 and Arg2 are alternative names for the same concept, but neither is a shortened form (acronym or abbreviation).

[2] There are 1445 adjectives with these endings in COMLEX, and it is possible to go through them by eye in a few hours. All but 237 of these adjectives were deemed to be technical.
[3] This set of constraints is based on informal observations of the composition of valid terms in corpora. We validate this set of constraints by showing that results that are constrained in this way have higher scores than results that are not so constrained, as discussed below in the Evaluation section.
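The following is a minimal Python sketch of a chunker in this BIO style. It is not the Termolator's actual implementation: the fpos() function, the small FUNCTION_WORDS and GENERAL_NOUNS lists, and the category names are simplified, hypothetical stand-ins for the dictionary lookups (COMLEX, NOMLEX_PLUS), POS tagging, name lists and exclusion lists described above.

```python
# Toy BIO-style terminological chunker: collect runs of "nominal" fine-grained
# POS categories and keep only chunks containing at least one technical category.

FUNCTION_WORDS = {"the": "DET", "a": "DET", "an": "DET", "of": "PREP",
                  "uses": "VERB", "is": "VERB"}
GENERAL_NOUNS = {"device", "system", "example"}
TECHNICAL = {"NOMINALIZATION", "TECH_ADJ", "OOV"}
NOMINAL = TECHNICAL | {"COMMON_NOUN"}

def fpos(token):
    """Toy fine-grained POS: nominalizations and technical adjectives by suffix,
    unknown alphabetic words treated as out-of-vocabulary (OOV)."""
    t = token.lower()
    if t in FUNCTION_WORDS:
        return FUNCTION_WORDS[t]
    if t.endswith(("ation", "ization", "ment")):
        return "NOMINALIZATION"
    if t.endswith(("ic", "cal", "ous")):
        return "TECH_ADJ"
    if t in GENERAL_NOUNS:
        return "COMMON_NOUN"
    return "OOV" if t.isalpha() else "OTHER"

def term_chunks(tokens):
    """Collect maximal runs of nominal categories (the B/I states); a non-nominal
    category (the 'Other' state) closes the current chunk."""
    chunks, current, technical = [], [], False
    for token in tokens + [""]:            # empty sentinel closes the final chunk
        category = fpos(token) if token else "OTHER"
        if category in NOMINAL:
            current.append(token)
            technical = technical or category in TECHNICAL
        else:
            if current and technical:
                chunks.append(" ".join(current))
            current, technical = [], False
    return chunks

# e.g. term_chunks("the electrophotographic photoconductor uses an optical device".split())
# -> ['electrophotographic photoconductor', 'optical device']
```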
Stage 2: Distributional Ranking
While stage 1 identifies term instances or tokens, stage 2 groups these tokens together into general types, i.e., it determines which noun sequences would belong in a terminology dictionary for a particular field. Furthermore, this classification is relative to a particular field or topic, represented by contrasting sets of documents. This methodology is based on many previous systems for identifying terminology (Damerau 1993, Drouin 2003, Navigli and Velardi 2004, etc.), which aim to find nouns or noun sequences (n-grams or noun groups) that are most characteristic of a topic. Towards this goal, noun sequences are ranked according to how characteristic they are of one topic, where a noun sequence N1 is more characteristic of a topic T than a noun sequence N2 if N1 scores higher than N2 using a metric that rewards a term for occurring more frequently in some target set of documents about a single topic than it does in a set of documents about a wide variety of topics. The output of systems of this type has been used as Information Retrieval key words (Jacquemin and Bourigault 2003) or as terms to be defined in thesauri or glossaries for a particular field (Velardi et al., 2001). We plan to use terms derived this way as part of a technology forecasting system (Daim et al., 2006; Babko-Malaya et al., 2015).

We rank our terms using a combination of three metrics: (1) the standard Term Frequency Inverse Document Frequency (TFIDF); (2) the Document Relevance Document Consensus (DRDC) metric (Navigli and Velardi, 2004); and (3) the Kullback-Leibler Divergence (KLD) metric (Cover and Thomas, 1991; Hisamitsu et al., 1999).

The TFIDF metric selects terms specific to a domain by favoring terms that occur more frequently in the relevant (foreground) documents than they do in the background. The formula is:

TFIDF(t) = (freqRDG(t) / freqTotalDoc(t)) * log(numberOfDocs / numberOfDocsContaining(t))

In the DRDC metric, two factors are considered: (i) document relevance (DR), which measures the specificity of a terminological candidate with respect to the target domain via comparative analysis against a general domain; and (ii) document consensus (DC), which measures the distributed use of a terminological candidate in the target domain. The formula for DRDC is:

DRDC(t) = (freqRDG(t) / freqTotalDoc(t)) * Σ_{d ∈ RDG} (freq(t,d) / freqRDG(t)) * log(freqRDG(t) / freq(t,d))

where freqRDG(t) is the frequency of a specific term t (for example, "cth2 mrna") in the Related Document Group (RDG), i.e., the documents relevant to the same topic; freqTotalDoc(t) is the frequency of t in all documents (RDG + non-RDG); and freq(t,d) is the frequency of t in document d.

The KLD metric measures the difference between two probability distributions: the probability that a term will appear in the foreground corpus vs. the background corpus. The formula is:

KLD(t) = (log(freqRDG(t)) − log(freqTotalDoc(t))) * freqRDG(t)

These three metrics are combined with equal weights, ranking both the terms produced in stage 1 and substrings of those terms, producing an ordered list. (A minimal sketch of these metrics and their equal-weight combination is given below.)
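The sketch below implements the three distributional metrics as reconstructed above. The function and argument names (fg_doc_counts, total_freq, etc.) are illustrative rather than the Termolator's actual API, and the equal-weight combination via averaged percentile ranks is one plausible reading of "combined with equal weights".

```python
import math

def tfidf(fg_freq, total_freq, num_docs, num_docs_with_term):
    """Foreground-relative term frequency times inverse document frequency."""
    if fg_freq == 0 or num_docs_with_term == 0:
        return 0.0
    return (fg_freq / total_freq) * math.log(num_docs / num_docs_with_term)

def drdc(fg_doc_counts, total_freq):
    """DRDC (Navigli and Velardi, 2004): document relevance (DR) times document
    consensus (DC). fg_doc_counts holds the term's count in each foreground (RDG)
    document; total_freq is its count in all documents."""
    freq_rdg = sum(fg_doc_counts)
    if freq_rdg == 0 or total_freq == 0:
        return 0.0
    dr = freq_rdg / total_freq
    dc = sum((c / freq_rdg) * math.log(freq_rdg / c) for c in fg_doc_counts if c > 0)
    return dr * dc

def kld(fg_freq, total_freq):
    """KLD-style score: (log freqRDG - log freqTotalDoc) * freqRDG."""
    if fg_freq == 0:
        return 0.0
    return (math.log(fg_freq) - math.log(total_freq)) * fg_freq

def percentile(scores):
    """Map raw scores to percentile ranks in [0, 1]."""
    order = {s: i for i, s in enumerate(sorted(set(scores)))}
    top = max(len(order) - 1, 1)
    return [order[s] / top for s in scores]

def distributional_rank(term_stats):
    """term_stats: {term: (fg_doc_counts, total_freq, num_docs, num_docs_with_term)}.
    Returns terms sorted by an equal-weight combination of the three metrics."""
    terms = list(term_stats)
    t = percentile([tfidf(sum(term_stats[x][0]), *term_stats[x][1:]) for x in terms])
    d = percentile([drdc(term_stats[x][0], term_stats[x][1]) for x in terms])
    k = percentile([kld(sum(term_stats[x][0]), term_stats[x][1]) for x in terms])
    combined = {x: (t[i] + d[i] + k[i]) / 3 for i, x in enumerate(terms)}
    return sorted(terms, key=combined.get, reverse=True)
```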
Stage 3: Well-formedness Score and Relevance Score
The previous stages produce a ranked list of terms, the ranking derived from the distributional score, which we encode as D, a percentile score between 0 and 1. This score can be reranked by creating other scores between 0 and 1 and multiplying all the scores together. Weights can be applied as exponents on each of the scores, resulting in one aggregate score that we use for reranking the terms. However, we currently assume all weights to equal 1. We add two additional scores: W, a well-formedness score, and R, a relevance score. The aggregate score which we use for reranking purposes is simply D*W*R. Like the stage 1 components, the stage 3 components (W and R) can be used separately from the other portions of the Termolator, to score or rank terms entered by a user, e.g., terms produced by other terminology extraction systems. [4]

[4] We have used these components to evaluate sets of terms that were not produced by the Termolator. Our subjective analysis is that they can be used effectively in this way to rate or rerank such terms, but a formal evaluation is outside the scope of this paper.

Well-formedness Score
Our well-formedness (W) score is based on several linguistic rules and subjective evaluations about violations of those rules. Although many of these linguistic rules are built into the chunking rules in stage 1, stage 2 includes highly frequent substrings of stage 1 terms in the output. Also, the abbreviation rules may introduce terms that would not have been licensed by the chunking rules. So when applied to our own terms, W usually has a value of 1 for terms produced in stage 1, 0 for substrings that are not well-formed, and very rarely intermediate values. The W score filters out erroneous substrings, since a W score of 0 multiplies with all other scores to produce an aggregate score of 0. As mentioned, the stage 3 filters can be applied to term lists not produced by the Termolator, e.g., terms based on N-grams rather than noun groups. For this application, the W score is likely to have a larger effect than it does for terms produced by the Termolator.

We assume that applications of the following rules are reason to give a candidate term a perfect score (1.0):
• ABBREVIATION_OR_TERM_THAT_IS_ABBREVIATED – This rule matches terms that are either abbreviations or full-length terms that have been abbreviated, e.g., html, hypertext markup language, OCR, optical character recognition, ...
• Out_of_Vocabulary_Word – This rule matches terms consisting of single words (and their plurals) that are not found in our dictionaries, e.g., radionuclide, photoconductor, ...
• Hyphenated Word + OOV Noun – This applies if a word contains one or more hyphens and the part of the word following the last hyphen matches the conditions described in the previous bullet, e.g., mono-axial, lens-pixel, ...
These rules yield a score of 0.7:
• Common_Noun_Nominalization – This means that the term is a single word, identified as a nominalization using dictionary lookup, e.g., demagnetization, overexposure
• Hyphenated Word + Nominalization – This applies if a word contains one or more hyphens and the part of the word following the last hyphen matches the conditions described in the previous bullet, e.g., de-escalation, cross-fertilization
This rule gives a score of 0.3:
• Normal_Common_Noun_or_Number – This means that the term consists of a single word that is either a number, a common noun, a name, or a combination of numbers and letters (e.g., ripcord, H1D2).
The following rules have scores that vary, depending on the type of words found in the phrase:
• Normal_NP – This means that the term consists of a word sequence that is part of a noun group according to our chunker, described above. The score can be as high as 1.0 if the term would be recognised as such by our stage 1 chunker (e.g., electrophotographic photoconductor). A noun group containing one "unnecessary" element, such as a preceding adjective, would have a score of 0.5 (acceptable organic solvent). Other noun groups or noun phrases would have scores of 0.2 (wheel drive capacity).
• 2_Part_NP – This means that the term consists of 2 noun groups according to our chunker, possibly separated by a preposition. Currently, 2_Part_NPs containing prepositions receive scores of 0.45 (voltage of the output buffer), and those without receive scores of 0 (service provider issues remittance).
There are several other rules which have scores of 0 associated with them, including:
• Single_Word_Non_Noun – This means that the word is identified as a non-noun, either by dictionary lookup or by simple morphological rules; e.g., we assume that an out-of-vocabulary word ending in -ly is an adverb, e.g., downwardly, optical, tightening
• Bad_character – This means that the term contains at least one character that is not: a) a letter; b) a number; c) a space; d) a hyphen; e) a period; or f) an apostrophe, e.g., box™, sum_l, slope Δa
• Contains_conjunction – This rule matches sequences including coordinate conjunctions (and, or, but, nor), e.g., or reproducing, asic or other integrated
• Too many verbs – This means that the sequence contains multiple verbs, e.g., insulating film corresponding, emitting diodes disposed
• Verbal or Sentential Structure – This means that some chunking rules found a verbal constituent other than an adjective-like pre-modifier (broken record), e.g., developer containing, photoelectric converting
• Unexpected_POS_sequence – This applies to multi-word terms that do not fit any of the profiles above, e.g., of the developing roll, beam area of the charged

Relevance Score
The relevance score is derived by searching for the term using Yahoo's search engine (powered by Microsoft Bing) and applying some heuristics to the search result. This score is intended to measure the "relevance" of a term to the technical literature. The relevance score is R = H × T², where the two factors H and T are defined as follows and the weight (exponent) on T was determined experimentally:
• H = the total number of hits for an exact match. The log10 of this number (up to a maximum of 10) is normalized to a value between 0 and 1.
• T = the percentage of the top 10 hits that are either articles or patents.
The following information from a Yahoo search is used to compute this score: (1) the total number of hits; (2) a check to see whether this result is based on the actual search or whether a similar search was substituted, i.e., if the result includes the phrase "including results for" or the phrase "showing results for", then we know that our search was not matched at all and we should downgrade the value of H to a very small number; [5] and (3) the top 10 search results, as represented by URLs, titles and summaries. For each result, we search the URL, title and summary for key words which indicate that this hit is probably an article or a patent (patent, article, sciencedirect, proceedings, journal, dissertation, thesis, abstract). T is equal to the number of these search results that match, divided by 10. (A minimal sketch of this computation and of the aggregate D*W*R score is given below.)

[5] Our current strategy is to treat instances of fewer than 10 hits the same as if the term did match, but to set H as if there were 500 hits.
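The sketch below illustrates the relevance score R = H × T² and the aggregate D*W*R score. It assumes the hit count and the top 10 results (as dicts with url/title/summary fields) have already been obtained from a web search; the search call itself and the special handling of substituted or near-zero-hit queries are outside the sketch.

```python
import math

RELEVANCE_KEYWORDS = ("patent", "article", "sciencedirect", "proceedings",
                      "journal", "dissertation", "thesis", "abstract")

def relevance_score(total_hits, top_results):
    """R = H * T**2.  total_hits: number of exact-match hits reported by the
    search engine; top_results: up to 10 result dicts with 'url', 'title' and
    'summary' keys (retrieving them is outside this sketch)."""
    # H: log10 of the hit count, capped at 10, scaled to [0, 1]
    h = min(math.log10(max(total_hits, 1)), 10) / 10.0
    # T: fraction of the top-10 hits whose URL/title/summary suggest an article or patent
    matches = sum(
        any(kw in (r.get("url", "") + " " + r.get("title", "") + " " + r.get("summary", "")).lower()
            for kw in RELEVANCE_KEYWORDS)
        for r in top_results[:10])
    t = matches / 10.0
    return h * t ** 2

def aggregate_score(d, w, r):
    """Final reranking score: D * W * R, with all exponent weights assumed to be 1."""
    return d * w * r

# e.g. aggregate_score(d=0.866, w=1.0, r=relevance_score(120000, top_results))
```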
In practice, this heuristic seems to capture the intuition that a good term is likely to be the topic of current scientific articles or patents, i.e., that the term is relevant. Runtime is a limiting factor for the relevance scores because it takes about 0.75 seconds to search for each term. This means that producing relevance scores for 30K terms takes about 6 hours, whereas the rest of the term-producing system runs in minutes.

Evaluation
We ran the complete system with 5000 patents about optical systems and components as the foreground (US patent codes 250, 349, 356, 359, 362, 385, 398 and 399) and 5000 diverse patents as background. We collected a total of 219K terms, ranked by the stage 2 system. We selected the top 30K of these terms and ran the stage 3 processes on them. We ranked these top terms in three different ways, each time selecting a different top 5000 terms for evaluation. We selected the top 5000 terms after ranking the 30K terms in the following ways: (a) according to the stage 2 score (the distributional score D); (b) according to the relevance score R; and (c) according to the combined score (D*R*W). As W was primarily used to remove ill-formed examples, it was not well suited to serve as a separate factor in this test. For each list of 5000 terms, we sampled 100 terms (20 random terms from each 20% interval), manually inspected the output, and rated each term as correct or incorrect. 71% of the terms ranked according to D only were correct; 82% of the terms ranked according to R were correct; and 86% of the terms ranked according to the combined score were correct. While we believe that it is significant that the combined score produced the best result, it is unclear whether R alone did better than the stage 2 ranking simply because the R score was only applied to the 30K terms (out of 219K) with the highest D scores. While in principle we could run R on all 219K terms, time constraints make it impractical to do this, in general, for all output of our system. [6]

Coverage of a term extractor is difficult to measure without having a human being do the task, e.g., reading all 5000 articles and writing out the list of terms. [7] Informally, however, we have observed a significant increase in term output since we adopted the chunking model described above, compared to a previous version of the system that used a standard noun chunker.
In other words, we are able to take a larger number of top-ranked terms than before without a major decline in accuracy. One of the tasks for future work is to develop a good metric for measuring this.

[6] We evaluated the correctness of terms ourselves. We previously did some experiments in which graduate biology students evaluated our biology terms. We discontinued this practice primarily because we could not afford to have experts in all of the domains for which we had terms. In addition, the domain expertise was rarely accompanied by linguistic expertise, so the process of training domain experts to make consistent determinations about what does and does not constitute a linguistic unit was difficult. In contrast, using one set of annotators resulted in more consistent evaluation. Most unknown terms could be looked up and identified with high accuracy.
[7] There are no established sets of manually encoded data to test the system with. Note that the SemEval keyword extraction task (Kim et al., 2010), while overlapping with terminology extraction, does not capture the task we are doing here. In particular, we are not attempting to find a small number of keywords for a small number of articles, but rather large sets of terms that cover fields of study. We believe that constructing such a shared task manually would be prohibitive.

Examples
Table 1 provides some sample potential terms along with the scores D, W, R and the aggregate score. The table is arranged in descending order by the aggregate score. These terms are excerpts from the best of the three rankings described in the previous section, i.e., the terms ordered by the total score. The right-most column indicates whether or not each is a valid term, as per the judgement of one of the authors. The incorrect examples include: (a) irradiation time t, which is really a variable (a particular irradiation time), not a productively used noun group that should be part of a glossary or a key word; (b) evolution, a common word that is part of the general language and should no longer be relegated to a list of specialized vocabulary; and (c) crystal adjacent, a word sequence that does not form a natural constituent – it is part of longer phrases like a one-dimensional photonic crystal adjacent to the magneto-optical metal film. In this sequence, the word crystal is modified by a long adjectival modifier beginning with the word adjacent, and it would be an error to consider this pair of words a single constituent.

Table 1: System output with aggregate scores, component scores and correctness judgements
Rank | Term | D | W | R | Total | Correct
41 | stimulable phosphor | .866 | 1 | .174 | .151 | Yes
104 | ion beam profile | .889 | 1 | .117 | .126 | Yes
346 | x-ray receiver | .906 | 1 | .099 | .089 | Yes
533 | wavelength-variable | .838 | 1 | .091 | .076 | Yes
556 | irradiation time t | .460 | 1 | .163 | .075 | No
1275 | quadrupole lens | .460 | 1 | .113 | .052 | Yes
1502 | evolution | .439 | 1 | .109 | .048 | No
1581 | proximity correction | .451 | 1 | .103 | .046 | Yes
1613 | dfb laser | .943 | 1 | .049 | .046 | Yes
1685 | asymmetric stress | .493 | 1 | .067 | .033 | Yes
3834 | panoramagram | .483 | 1 | .056 | .027 | Yes
4203 | crystal adjacent | .316 | 1 | .080 | .025 | No
4244 | single-mode optical fiber | .875 | 1 | .029 | .025 | Yes
4467 | total reflection plane | .988 | 1 | .024 | .024 | Yes
4879 | photosensitive epoxy resin | .286 | 1 | .079 | .022 | Yes

The Chinese System
Our current Chinese Termolator implements several components parallel to the English system, and we intend to implement additional components in future work. The Chinese Termolator uses an in-house CTB [8] word segmenter and part-of-speech tagger and a rule-based noun group chunker, but without the additional rules requiring technical words. Stage 2 is similar to the English system in that we compare the word distribution in a given domain with the word distribution in a general background set and find topic words of the given domain.

[8] https://catalog.ldc.upenn.edu/LDC2013T21

One challenge for the Chinese system is that Chinese word boundaries are implicit and are automatically induced by the word segmenter, which is prone to errors. We accordingly implemented an accessor-variety (AV) based filter (Feng et al., 2004), which calculates an accessor-variety score for each word based on the number of distinct words that appear before or after it. Character sequences with low AV scores are not independent enough and usually should not be considered valid Chinese words (Feng et al., 2004). We therefore filter out words whose accessor-variety scores are less than 3. A minimal sketch of such a filter follows.
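The sketch below shows one way an accessor-variety filter of this kind could be implemented. It computes AV over adjacent characters (the paper describes distinct neighboring words; Feng et al. define character-based variants), and the function names and the corpus/candidate interface are illustrative rather than the Termolator's actual code.

```python
from collections import defaultdict

def accessor_variety(corpus, candidates):
    """Accessor variety in the spirit of Feng et al. (2004): for each candidate
    string, AV is the smaller of the number of distinct characters immediately
    preceding it and the number of distinct characters immediately following it
    across the corpus (text boundaries count as accessors)."""
    left, right = defaultdict(set), defaultdict(set)
    for text in corpus:
        for cand in candidates:
            start = text.find(cand)
            while start != -1:
                end = start + len(cand)
                left[cand].add(text[start - 1] if start > 0 else "<S>")
                right[cand].add(text[end] if end < len(text) else "</S>")
                start = text.find(cand, start + 1)
    return {c: min(len(left[c]), len(right[c])) for c in candidates}

def av_filter(corpus, candidates, threshold=3):
    """Keep only candidates whose accessor-variety score meets the threshold."""
    av = accessor_variety(corpus, candidates)
    return [c for c in candidates if av[c] >= threshold]
```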
We evaluated the precision of extracted terms on a set of speech processing patents: the precision was 85% for the top 20 terms and 78% for the top 50 terms. This evaluation was based on 1,100 terms extracted from 2,000 patents related to speech processing.

We developed a well-formedness-based automatic evaluation metric for Chinese terms, which follows the same spirit as the English well-formedness score. This metric penalizes noun phrases that contain non-Chinese characters, contain words that are not nouns or adjectives, contain too many single-character words, or are longer than 3 characters. Since these are exactly the sorts of errors that would be ruled out by the AV-based filter, we do not use this metric as part of our own terminology system. Rather, we use it when applying our filters to score term lists created externally, just as we do with parts of the English system. We expect to implement a version of the relevance score that will work with Chinese language search engines in future work. As with the English system, this will be a separable component that can be applied to Chinese term lists created independently of our system.

Concluding Remarks
We have described a terminology system with state-of-the-art results for English that combines several different methods, including linguistically motivated rules, a statistical distribution metric and a web-based relevance metric. We can derive at least 5000 highly accurate (86%) terms from 5000 documents about a topic. We have partially implemented this system for Chinese and are currently achieving high accuracy for Chinese as well. In future work, we intend to further develop the system for Chinese and improve the evaluation measures for English.

Acknowledgments
Supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract number D11PC20154. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the U.S. Government.

References
Babko-Malaya, O., Seidel, A., Hunter, D., HandUber, J., Torrelli, M., and Barlos, F. (2015). Forecasting Technology Emergence from Metadata and Language of Scientific Publications and Patents. 15th International Conference on Scientometrics and Informetrics.
Cover, T. and Thomas, J. A. (1991). Elements of Information Theory. Wiley-Interscience, New York.
Daim, T. U., Rueda, G., Martin, H., and Gerdsri, P. (2006). Forecasting emerging technologies: Use of bibliometrics and patent analysis. Technological Forecasting and Social Change, 73(8):981–1012.
Damerau, F. J. (1993). Generating and evaluating domain-oriented multiword terms from texts. Information Processing and Management, 29:433–447.
Drouin, P. (2003). Term Extraction Using Non-technical Corpora as a Point of Leverage. Terminology, 9:99–115.
Feng, H., Chen, K., Deng, X., and Zheng, W. (2004). Accessor variety criteria for Chinese word extraction. Computational Linguistics, 30:75–93.
Hisamitsu, T., Niwa, Y., Nishioka, S., Sakurai, H., Imaichi, O., Iwayama, M., and Takano, A. (1999). Term extraction using a new measure of term representativeness. Proceedings of the First NTCIR Workshop on Research in Japanese Text Retrieval and Term Recognition.
Jacquemin, C. and Bourigault, D. (2003). Term Extraction and Automatic Indexing. In Mitkov, R., editor, Handbook of Computational Linguistics. Oxford University Press, Oxford.
Justeson, J. S. and Katz, S. M. (1995). Technical terminology: some linguistic properties and an algorithm for identification in text. Natural Language Engineering, 1(1):9–27.
Kim, S. N., Medelyan, O., Kan, M. Y., and Baldwin, T. (2010). SemEval-2010 Task 5: Automatic Keyphrase Extraction from Scientific Articles. SemEval 2010, pages 21–26.
Macleod, C., Grishman, R., and Meyers, A. (1997). COMLEX Syntax. Computers and the Humanities, 31:459–481.
Macleod, C., Grishman, R., Meyers, A., Barrett, L., and Reeves, R. (1998). Nomlex: A lexicon of nominalizations. Proceedings of Euralex98.
Meyers, A., Glass, Z., Grieve-Smith, A., Y. He, S. L., and Grishman, R. (2014a). Jargon-Term Extraction by Chunking. COLING Workshop on Synchronic and Diachronic Approaches to Analyzing Technical Language.
Meyers, A., Lee, G., Grieve-Smith, A., He, Y., and Taber, H. (2014b). Annotating Relations in Scientific Articles. LREC-2014.
Meyers, A., Reeves, R., Macleod, C., Szekeley, R., Zielinska, V., and Young, B. (2004). The Cross-Breeding of Dictionaries. Proceedings of LREC-2004, Lisbon, Portugal.
Navigli, R. and Velardi, P. (2004). Learning Domain Ontologies from Document Warehouses and Dedicated Web Sites. Computational Linguistics, 30.
Ramshaw, L. A. and Marcus, M. P. (1995). Text Chunking using Transformation-Based Learning. ACL Third Workshop on Very Large Corpora, pages 82–94.
Schwartz, A. and Hearst, M. (2003). A simple algorithm for identifying abbreviation definitions in biomedical text. Pacific Symposium on Biocomputing.
Velardi, P., Missikoff, M., and Basili, R. (2001). Identification of relevant terms to support the construction of domain ontologies. Workshop on Human Language Technology and Knowledge Management.