=Paper=
{{Paper
|id=Vol-1292/ipamin2014_paper10
|storemode=property
|title=Insight to Hyponymy Lexical Relation Extraction in the Patent Genre Versus Other Text Genres
|pdfUrl=https://ceur-ws.org/Vol-1292/ipamin2014_paper10.pdf
|volume=Vol-1292
|dblpUrl=https://dblp.org/rec/conf/konvens/AnderssonLPPHR14
}}
==Insight to Hyponymy Lexical Relation Extraction in the Patent Genre Versus Other Text Genres==
Insight to Hyponymy Lexical Relation Extraction in the Patent Genre Versus Other Text Genres∗

Linda Andersson, Mihai Lupu, João Pallotti, Florina Piroi, Allan Hanbury, Andreas Rauber
Vienna University of Technology, Information and Software Engineering Group (IFS)
Favoritenstrasse 9-11, 1040 Vienna, Austria
surname@ifs.tuwien.ac.at

∗Copyright © 2014 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. Published at Ceur-ws.org. Proceedings of the First International Workshop on Patent Mining and Its Applications (IPAMIN) 2014, Hildesheim, Oct. 7th, 2014, at KONVENS'14, October 8–10, 2014, Hildesheim, Germany.

ABSTRACT
Due to the large amount of available patent data, it is no longer feasible for industry actors to manually create their own terminology lists and ontologies. Furthermore, domain specific thesauruses are rarely accessible to the research community. In this paper we present the extraction of hyponymy lexical relations from patent text using lexico-syntactic patterns. Since this kind of extraction involves Natural Language Processing (NLP), we also compare the extractions made with and without domain adaptation of the extraction pipeline. We also deployed our modified extraction method on other text genres in order to demonstrate the method's portability to other text domains. From our study we conclude that the lexico-syntactic patterns are portable to domain specific text genres such as the patent genre. We observed that general NLP tools, when not adapted to the patent genre, reduce the number of correct hyponymy lexical relation extractions and increase the number of incomplete extractions. This was also observed in other domain specific text genres.

Categories and Subject Descriptors
H.3.1 [Content Analysis and Indexing]: linguistic processing

Keywords
Patent Text Mining, Natural Language Processing, Ontology

1. INTRODUCTION
One of the first tasks of a patent examiner when given a new patent application is to identify essential patent aspects and extract terms which can later be used in the search query session. When conducting Prior Art Search it is essential to find the different aspects of a patent. Each aspect can be divided into term pairs consisting of a general term and a specific term [1].

This task requires both domain knowledge and access to technical terminology (both explicit and implicit knowledge). However, previous studies of the patent genre have observed that patent writers intentionally use entirely different word combinations to re-create a "concept", which increases the vocabulary mismatch issue [5] and thereby makes commercial technical terminology dictionaries such as EuroTermBank (http://project.eurotermbank.com/) and IATE (http://iate.europa.eu/) less re-usable [17].

In this paper we explore hyponymy relation extraction from the collection itself using the lexico-syntactic patterns defined in [13]. Given the variation in concept formulations, where paraphrasing of existing concepts is generally applied, a support tool such as a thesaurus or an ontology based on automatic extraction of lexical relations from the patent genre would be a usable search aid. Automatic ontology population consists of several steps: normalization of data, tokenization, Part of Speech (PoS) tagging, etc. However, the problem of using standard Natural Language Processing (NLP) tools is that the source data and the target data do not have the same feature distribution, this being a pre-requisite for their correct use [26]. Too many unseen events will decrease the performance of broad coverage NLP tools. In order to reduce the gap between source and target data, several studies on patent domain adaptation of broad coverage NLP tools have been conducted [16, 10, 2, 11, 23, 8]. The focus of these adaptations has been either on reducing the sentence length or on increasing the lexicon. Only [2] and [8] have target adaptations incorporating domain information about the noun phrases' (NP) syntactic distributions. In this paper, we re-use the heuristic rules presented in [2].

The objectives of this study are:
1. to examine if it is possible to extract hyponymy lexical relations using the general lexico-syntactic patterns defined in [5];
2. to verify if the heuristic domain adaptation rules deployed in the extraction pipeline improve the candidate extractions;
3. to examine the portability of our modified extraction method, developed for the patent text genre, to other domain specific genres;
4. to examine if it is possible to simplify the evaluation process of hyponymy relation extraction.

The remainder of this paper is organized as follows. We first present related work and terminology in Section 2. In Section 3 we present our experimental set-up. In Section 4 we report our general results. Section 5 presents our conclusions and future work.

2. RELATED WORK
In the Information Retrieval (IR) community, patent retrieval research has focused mainly on improvements and method developments within systems supporting patent experts in the process of Prior Art search. Less research attention has been given to other types of resources that support the patent examiner in the information process activities.

2.1 Terminology Effect on NLP
Before we can re-use NLP tools in text genres with a high density of scientific terminology and new words, we need to understand the word formation process of the English language. The most productive word formation process in English is affixation, i.e. adding prefixes or suffixes to a root [6]. The suffixes '-ing' and '-ed' are especially problematic for NLP applications because, when they are added to verbs, the newly formed word may be a noun, an adjective, or remain a verb (as in sentence 8, Figure 1 in the Appendix).

One of the major mechanisms of word formation is the morphological composite, which allows the formation of compound nouns out of two nouns (e.g. floppy disk, air flow) [18], thereby creating a Multi Word Unit (MWU). It has been observed that in technical jargon a heavy use of noun compounds constitutes the majority of scientific terminologies [14]. The compounding strategy not only causes unseen events on the word level with new orthographical units, it also generates a diversity of syntactic structures among noun phrases, which is problematic for NLP tools [10, 24]. Furthermore, many NLP applications have chosen to overlook MWUs due to their complexity and flexible nature [4].

NPs can consist of single tokens, or can be as long and complex as any other phrase occurring in a sentence [15]. NPs have an internal structure that dictates where additional elements can occur in relation to the head noun (e.g. pre- and post-modifiers). A range of elements can take the pre-modifier role in an NP, but adjectives are the most typical pre-modifiers. In hyponymy lexical relation extraction, adjectives have a semantic significance, since an adjective modifier together with the head noun can be considered a hyponym of the head noun [7]. For example, 'apple juice' is a valid hyponym of 'juice', but only in this combination, since the modifier 'apple' specifies the head 'juice' [6]. The post-modifier construction is more complex, since a head noun can be post-modified by both phrases and clauses.

One central concept when analyzing NPs is to define the head [24]. The head of an NP is of supreme importance, as it is the central part of the noun phrase (e.g. "the poet Burns", "Burns the poet") [15]. When an NP contains a prepositional phrase, traditional linguists promote the proper name (e.g. "the city of Stockholm") or the NP following the preposition (e.g. "a group of DNA strings") as the main head noun, since the NP after the preposition tends to have the highest degree of lexicalization [6, 24, 15]. However, what should be identified as the head noun of an NP is not straightforward [24]. Moreover, in [10] it was observed that the syntactic parsers' right-headed bias caused problems during the analysis of patent sentences, thereby yielding erroneous analyses.

2.2 Patent Text Effects on NLP
Patents are semi-structured documents which offer many different applications for text mining [3]. In patent documents, abstract and non-standard terminology is used to avoid narrowing the scope of the invention, unlike the style of other genres such as newspapers and scientific articles [21]. Moreover, the vocabulary varies over time, with terms such as "LP" and "water closet" being regarded as instances of obsolescence [12]. This type of discourse characteristic makes the patent text mining task more challenging. Many patent retrieval studies have tried to address different patent search problems by applying linguistic knowledge, using broad coverage NLP tools. However, as the generic NLP tools are not trained on the patent domain, they experience problems with parsing long and complex NPs [10, 8]. Several studies have focused on reducing the gap between the source and target data, the focus being placed mainly on sentence reduction [11], on lexicon increase [16], or on both [23]. However, just increasing the lexical coverage or decomposing sentences will not solve the problem, since token coverage and sentence length are only part of the problem. [28] concluded that, since there is no significant difference between general English and the English used in the patent discourse with regard to single token coverage, the technical terminology is more likely present in multi-word constructions consisting of complex NPs. Information about the NPs' syntactic distribution has only been deployed in [2, 8] in order to improve the NLP analysis. In [8] a hierarchical chunker was designed to fit the syntactic structure of the patent sentence, targeting embedded NPs, while in [2] heuristic rules addressing the most commonly observed errors made by the NLP tools were used as a post-correcting filter.

2.3 Ontology Population
Automatic ontology population relates to the methods used in Information Extraction (IE), as the general purpose is to extract pre-defined relations from text; it is hence referred to as ontology based information extraction (OBIE) [19]. There are several applications where OBIE is used to enhance domain knowledge, to create a customized ontology, and to enrich existing ontologies. OBIE techniques consist of identifying named entities (NE), technical terms, or relations. The OBIE process consists of several steps (data normalization, tokenization, PoS tagging, etc.), followed by recognition steps such as gazetteers combined with rule-based grammars, ontology design patterns (ODP), and pattern slot identification such as lexico-syntactic patterns (LSP).

Different techniques for hyponymy lexical relation extraction have been explored, many of them depending on pre-encoded knowledge such as domain ontologies and machine readable dictionaries [9]. In order to avoid the need for pre-existing domain knowledge and to remain independent of the sub-language, one option is to use generic LSPs for hyponymy lexical relation extraction. [13] proposed a method to extract hyponymy lexical relations based on the LSPs shown in Table 1.

Table 1: Sentence examples for each lexico-syntactic pattern

  #  Example sentence                                            LSP
  1  ... works by such authors as Herrick,                       such NP as {NP,}* {(or|and)} NP
     Goldsmith, and Shakespeare
  2  Even then, we would trail behind other European             NP{,} such as {NP,}* {(or|and)} NP
     Community members, such as Germany, France and Italy
  3  Bruises, wounds, broken bones or other injuries             NP{, NP}*{,} or other NP
  4  Temples, treasuries, and other important civic buildings    NP{, NP}*{,} and other NP
  5  All common-law countries, including Canada and England      NP{,} including {NP,}* {(or|and)} NP
  6  ... most European countries, especially France,             NP{,} especially {NP,}* {(or|and)} NP
     England, and Spain

There are several issues related to extracting relations from raw text based on LSPs. For instance, the LSP examples 2, 5 and 6 in Table 1 are not as clear cases of hyponymy lexical relations as, e.g., 'domestic pets such as cats and dogs': in LSP example 2, Germany, France and Italy are members of the European Community, and in LSP example 6, France, England and Spain are countries in Europe, i.e. part of the geographic entity called Europe [20]. With a wider semantic definition of the hyponymy property, we can include both 'part of' and 'member of' in the definition:

"... an expression A is a hyponym of an expression B iff the meaning of B is part of the meaning of A and A is subordinated to B. In addition to the meaning of B, the meaning of A must contain further specifications, rendering the meaning of A, the hyponym, more specific than the meaning of B. If A is a hyponym of B, B is called a hypernym of A." [18, p. 83]

Hearst's patterns [13] give high precision but low recall, while ODP gives high recall and low precision [19]. In [13], LSP 1 was used to extract candidate relations from Grolier's American Academic Encyclopaedia (8.6M words). In that study, 7,067 sentences matched LSP 1 and 152 relations fit the restriction, i.e. to contain an unmodified noun (or a noun with just one modifier).

A common approach to evaluating hyponymy relation extractions is to use an existing ontology as a gold standard [9]. For instance, in [13] the assessment was conducted by looking up whether the relation was found in WordNet. Out of 226 unique words, 180 existed in the WordNet hierarchy, and 61 out of 106 relations already existed in WordNet. However, since most of the terms in WordNet are unmodified nouns or nouns with a single modifier, using WordNet in the evaluation process of this study was not feasible.

In [5] the gold standard was created by linguists, but this type of labeling task is both time-consuming and costly, which makes the approach feasible only for small gold standards. The annotators were asked to manually identify domain-specific terms, NEs, and synonymy and hyponymy relationships between the identified terms and NEs. The annotation task requires linguistic knowledge as well as some domain specific knowledge.

The gold standard was used to evaluate automatic hyponymy relation extractions from technical corpora in English and Dutch. The data consisted of dredging year reports and news articles from the financial domain. The data was enriched with PoS tags and lemmas produced by the LeTs Preprocessing Toolkit, which was trained on similar data, where the accuracy of the PoS tagger was 96.3%. The NE extractor only achieved a recall of 62.92% and a precision of 59.33% [27]. For the hyponymy lexical relation extraction, three different techniques were used: 1) a lexico-syntactic pattern model based on the LSPs in [13], 2) a distributional model using context clusters obtained by an agglomerative clustering technique, and 3) a morpho-syntactic model. The morpho-syntactic model is based on the head-modifier principle:

• Single-word NP: if lexical item L0 is a suffix string of lexical item L1, then L0 is a hypernym of L1.
• MWU NP: if lexical item L0 is the head term of lexical item L1, then L0 is a hypernym of L1.
• NP + prepositional phrase: if lexical item L0 is the first part of a term in L1 consisting of an NP plus a preposition (EN: of, for, before, from, to, on), then L0 is the hypernym of L1.

[17] concluded that the pattern-based methods, and especially the morpho-syntactic approach, achieved good performance on the technical domain data, thereby demonstrating that general purpose hypernym detection models are portable to other domain and user-specific data.

In [21], hyponymy relations were extracted from US and Japanese patents, re-using the LSP patterns of [13]. For English, 3,898,000 and for Japanese, 7,031,149 candidate hyponymy relations were identified. The alignment between the language pair was conducted via citation analysis; 2,635 pairs of English-Japanese hyponymy relations were manually evaluated. The best method obtained a recall of 79.4% and a precision of 77.5%.

3. OUR APPROACH
Our data sets consist of five different text genres: the Brown corpus (http://www.hit.uib.no/icame/brown/bcm.html), henceforth Brown; the WO and EP patent documents of IREC (Patent), the corrected version of MAREC (http://www.ifs.tuwien.ac.at/imp/marec.shtml); the TREC test collection for the Clinical Decision Support Track (MedIR, http://www.trec-cds.org/2014.html); the test collection for Mathematical retrieval provided by NTCIR (MathIR, http://ntcir-math.nii.ac.jp/); and the papers produced during the Conference and Labs of the Evaluation Forum (CLEFpaper, http://www.clef-initiative.eu/publication/proceedings). In Table 2 we present the total number of sentences fitting the LSPs per data set and extraction method. Example sentences from each data set are shown in the Appendix, Figure 1.

Table 2: Sentences per LSPs, data collection and extraction method

                 Patent     MedIR       MathIR    CLEFpaper   Brown
  DomainRules     92,702    1,643,254    48,922      3,698      762
  SimpleRules    135,550    2,084,529    70,822      5,748      950
  NoRules        135,946    2,252,056    73,472      6,164      944

3.1 Method
For this experiment we applied exactly the same methodology to all five data sets. We used all of the LSP patterns in Table 1. For the NLP pipeline we enriched all data sets with PoS tags using the Stanford tagger (english-left3words-distsim.tagger model) [25]. In order to allow more flexibility at the phrase boundary we chose to use the baseNP chunker [22]. We defined three pipeline extraction methods:

1. No rules (NoRules): no modification of the NLP pipeline analyses.
2. Simple rules (SimpleRules): three rules addressing errors observed among sentences fitting the LSP patterns. The rules address different types of conjunction and comma issues: i) an NP such as [cat and dogs] is changed to two NPs, [cat] and [dog]; ii) [cat or dogs] is changed to two NPs, [cat] or [dog]; iii) enumerations with commas are split accordingly.
3. Domain rules (DomainRules): the simple rules of (2) plus the rules presented in [2].

Figure 1 in the appendix displays the difference between NoRules and DomainRules for the sentence pairs (3,4), (5,6) and (7,8).

For the evaluation, a smaller set was sampled (1,647 instances) for manual assessment, approximately 100 instances per data collection and method. One instance corresponds to one relation extracted from a sentence; if there are several possible extractions in a single sentence, each extraction corresponds to one instance (see Figure 1 in the appendix). Therefore, not exactly 1,500 instances were evaluated, since some sentences contain more than one instance. Because there are very few people having the level of linguistic knowledge as well as the domain specific knowledge required to conduct the assessment, we decided upon a more generic evaluation schema. The assessors were divided into three groups: linguist, expert and non-expert. The linguist has domain knowledge of the patent domain and the computer science domain.

For the evaluation task, we constructed a simple interface, see Figure 2 in the appendix. The evaluation tool shows the original sentence and five definitions of relations between L0 and L1: i) L0 is a kind of L1; ii) L0 is a part of L1; iii) L0 is a member of L1; iv) L0 is in another relation with L1; v) L0 has no relation to L1. For assessor uncertainty we added "Cannot say anything about the two", and for erroneous extractions we added "The sentence makes no sense". Since the NP boundaries were not correctly identified for all extractions, we added a check box for wrong boundary (for L0 and L1). In the instructions for the evaluation task, a simple example and a domain example were given for all types of relations. In order to find out how difficult the assessors considered the task to be, we asked each assessor to grade each relation on a scale from 1 (very easy) to 5 (very difficult). Furthermore, since it was observed in [3] that web searches for many candidate phrases were required in order to understand their meaning, we gave the assessors the possibility to search for the concept via a web service. We aim to improve the evaluation tool and give better interactive support; therefore this feedback information is valuable for us.

4. RESULTS
In Table 3 we present the evaluation results based upon the linguist assessor. We see that the NoRules method generates more candidate extractions with correct boundary identification compared to the other methods. This fact puzzled us, since our experience during the assessment indicated the opposite. For instance, a common error was the exclusion of deverbal nouns. This error especially decreased correct and complete extractions for the domain specific text genres when using NoRules: when the head noun is a deverbal noun, the PoS tagger assigns the label verb instead of noun (e.g. "ultrasonic/JJ welding/VBN" versus "laser/NN welding/VBN"; compare sentences 7 and 8 in Figure 1, appendix).

Table 3: The total number of correctly identified relations and NP boundaries

  Group:        DomainRules              NoRules                  SimpleRules
  Linguist      hyper  hypo   total      hyper  hypo   total     hyper  hypo   total
                 ok     ok                ok     ok               ok     ok
  Brown          74     82     103        94    113     135       95    114     137
  MedIR         110    116     126       150    159     172      142    144     163
  MathIR         83     74      96        84    101     123       70     83     103
  CLEFpaper      70     82      92        99    113     129       86     98     117
  Patent        109    125     138       147    172     191      150    169     188

Our first assumption regarding this contradiction was that one of the rules in the DomainRules method, which unifies NPs with an 'of'-construction, harmed the extractions. In Example 1, the hypernym consists of an embedded NP with a prepositional 'of'-construction modifying the head noun.

Example 1: Embedded NP 'of'-construction
"The novel conjugate molecules are provided for the manufacture of a medicament for gene therapy, apoptosis, or for the treatment of diseases such as cancer, autoimmune diseases or infectious diseases."

If we include the entire NP, i.e. "the treatment of diseases", the hyponymy lexical relation becomes incorrect, since "cancer", "autoimmune diseases" and "infectious diseases" are "diseases" and not "treatments". On the other hand, in sentence 5 (Figure 1, appendix) the relation between the hypernym and the hyponyms becomes incorrect, since the hyponyms constitute properties of the hypernym; therefore the NP should be unified. In sentences 3 and 4 (Figure 1, appendix) the unification of the 'of'-construction NPs is more doubtful for the hypernym, where "potential risk factors" (sentence 3), compared to "the distribution of potential risk factors" (sentence 4), seems to be the better choice. However, one of the hyponyms is overlooked in sentence 3 but extracted in sentence 4 with the help of the domain rule unifying 'of'-construction NPs. When examining the outcome of this rule we found that 131 instances were considered correct (i.e. the NP with the 'of'-construction should be unified) and only 44 instances were incorrect. The more likely reason for the more complete and correctly identified hyponymy relations of NoRules is that NoRules generated more extractions than DomainRules, which has a stricter extraction rule schema.

Table 4 shows the percentage of the most dominant relation, "a kind of", and of all positive relations ("a kind of", "a part of", "a member of", "another relation") for each method and data set. The preferred method for hypernyms is DomainRules, regardless of the data set. For hyponyms, the result is more inconclusive, since several methods ended up having the same percentages. For the "a kind of" relation the preferred method is either SimpleRules or NoRules, as seen in Table 4.

Table 4: Correctly identified positive relations and NP boundaries in relation to the sample, and for the most dominant relation "a kind of"

                               DomainRules     NoRules         SimpleRules
  Group: Linguist              hyper  hypo     hyper  hypo     hyper  hypo
                                ok     ok       ok     ok       ok     ok
  Brown      A kind of          70%    78%      71%    83%      71%    83%
             Relations          72%    80%      70%    84%      69%    83%
  MedIR      A kind of          84%    96%      84%    93%      93%    91%
             Relations          87%    92%      87%    92%      87%    88%
  MathIR     A kind of          85%    78%      64%    64%      65%    79%
             Relations          86%    77%      68%    82%      68%    81%
  CLEFpaper  A kind of          71%    90%      75%    75%      71%    83%
             Relations          76%    89%      77%    88%      74%    84%
  Patent     A kind of          82%    92%      76%    76%      79%    90%
             Relations          79%    91%      77%    90%      80%    90%

Table 5 displays the percentage of all examined sentences matching the LSP patterns for which a positive and correct extraction was identified. For three out of five data sets the SimpleRules method was preferred.

Table 5: Number of positive extractions in relation to all extractions made for each sample and method

  Group: Linguist   DomainRules   NoRules   SimpleRules
  Brown                 39%         40%        40%
  MedIR                 52%         33%        54%
  MathIR                44%         66%        33%
  CLEFpaper             50%         47%        56%
  Patent                64%         71%        81%

In order to examine the simplification of the evaluation process, we computed the inter-annotator agreement between the three groups: expert, linguist and non-expert. The inter-annotator agreement for identifying relations ranges between 81% and 88% (Table 6), regardless of the group comparison, for Brown and for the scientific paper data sets. Similar agreement values were found for the patent and medical text domains. The inter-annotator agreement decreases for wrong NP boundary identifications, which can be explained by the fact that it requires linguistic schooling to correctly identify NPs.

Table 6: Inter-annotator agreement between assessment groups

                           MathIR:        Brown:          CLEFpaper:      Linguist+Domain
                           Linguist vs    Linguist vs     Linguist vs     knowledge vs
                           Expert         Non-Linguist    Non-Linguist    Expert
  Relations                  85%            81%             83%             88%
  No relation                68%            72%             72%             75%
  Cannot tell                86%            77%             83%             89%
  Makes no sense             90%            89%             80%             93%
  hypernymBoundaryWrong      64%            67%             83%             67%
  hyponymBoundaryWrong       62%            67%             85%             82%

5. CONCLUSIONS
We conclude the following:

• It is possible to re-use LSPs for hyponymy lexical relation extraction. We thereby confirm the observation made in [1] that the LSP method for relation extraction is portable to different text genres.
• We also confirm that for domain specific text genres, such as the patent or medical genres, modification of the NLP tools is required, at least for the hypernyms. For detecting hyponyms the additional rules were less successful. On the other hand, as seen in sentences 7 and 8 (Figure 1, appendix), the rules addressing deverbal nouns make it possible to extract more correct instances.
• The simplified process of evaluating hyponymy lexical relation extractions using non-linguists and non-experts reaches an acceptable inter-annotator agreement level. However, more information regarding the identification of NP boundaries should be added to future evaluation guidelines.

In the future we will explore machine learning algorithms to select which extraction method should be used for a specific relation, instance and data collection. The additional modification of the NLP pipeline needs further examination, since it is counter-productive for some instances but an improvement for others. Furthermore, we also want to examine additional patterns exploring the similarity between the internal structures of NPs, as described in [1].

6. ACKNOWLEDGMENTS
This research was partly funded by the Austrian Science Fund (FWF) projects P25905-N23 (ADmIRE) and I1094-N23 (MUCKE).

7. REFERENCES
[1] A. Adams. Personal correspondence. PatOlympics 2011, Vienna, 2011.
[2] L. Andersson, M. Lupu, and A. Hanbury. Domain adaptation of general natural language processing tools for a patent claim visualization system. In M. Lupu, E. Kanoulas, and F. Loizides, editors, Multidisciplinary Information Retrieval, volume 8201 of Lecture Notes in Computer Science, pages 70–82. Springer Berlin Heidelberg, 2013.
[3] P. Anick, M. Verhagen, and J. Pustejovsky. Identification of multiword expressions in the brwac. In N. Calzolari, K. Choukri, T. Declerck, H. Loftsson, B. Maegaard, J. Mariani, A. Moreno, J. Odijk, and S. Piperidis, editors, Proceedings of LREC-2014, Reykjavik, Iceland, May 2014. European Language Resources Association (ELRA).
[4] V. Arranz, J. Atserias, and M. Castillo. Multiwords and word sense disambiguation. In A. Gelbukh, editor, Computational Linguistics and Intelligent Text Processing, volume 3406 of Lecture Notes in Computer Science, pages 250–262. Springer Berlin Heidelberg, 2005.
[5] K. H. Atkinson. Toward a more rational patent search paradigm. In Proceedings of the 1st ACM Workshop on Patent Information Retrieval, PaIR '08, pages 37–40, New York, NY, USA, 2008. ACM.
[6] K. Ballard. The Frameworks of English. Palgrave Macmillan, 2007.
[7] W. Bosma and P. Vossen. Bootstrapping language neutral term extraction. In N. Calzolari, K. Choukri, B. Maegaard, J. Mariani, J. Odijk, S. Piperidis, M. Rosner, and D. Tapias, editors, Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Valletta, Malta, May 2010. European Language Resources Association (ELRA).
[8] N. Bouayad-Agha, A. Burga, G. Casamayor, J. Codina, R. Nazar, and L. Wanner. An exercise in reuse of resources: Adapting general discourse coreference resolution for detecting lexical chains in patent documentation. In N. Calzolari, K. Choukri, T. Declerck, H. Loftsson, B. Maegaard, J. Mariani, A. Moreno, J. Odijk, and S. Piperidis, editors, Proceedings of LREC-2014, Reykjavik, Iceland, May 2014. European Language Resources Association (ELRA).
[9] P. Cimiano. Ontology Learning and Population from Text: Algorithms, Evaluation and Applications. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
[10] E. D'hondt, S. Verberne, C. H. A. Koster, and L. Boves. Text representations for patent classification. Computational Linguistics, 39(3):755–775, 2013.
[11] G. Ferraro. Towards deep content extraction from specialized discourse: the case of verbal relations in patent claims. PhD thesis, Universitat Pompeu Fabra, 2012.
[12] C. G. Harris, R. Arens, and P. Srinivasan. Using classification code hierarchies for patent prior art searches. In M. Lupu, K. Mayer, J. Tait, and A. J. Trippe, editors, Current Challenges in Patent Information Retrieval, volume 29 of The Information Retrieval Series, pages 287–304. Springer Berlin Heidelberg, 2011.
[13] M. A. Hearst. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th Conference on Computational Linguistics - Volume 2, COLING '92, pages 539–545, Stroudsburg, PA, USA, 1992. Association for Computational Linguistics.
[14] J. S. Justeson and S. M. Katz. Technical terminology: some linguistic properties and an algorithm for identification in text. Natural Language Engineering, 1:9–27, 1995.
[15] E. Keizer. The English Noun Phrase: The Nature of Linguistic Categorization. Cambridge University Press, 2010.
[16] C. H. Koster and J. G. Beney. Phrase-based document categorization revisited. In Proceedings of the 2nd International Workshop on Patent Information Retrieval, PaIR '09, pages 49–56, New York, NY, USA, 2009. ACM.
[17] E. Lefever, M. V. de Kauter, and V. Hoste. Evaluation of automatic hypernym extraction from technical corpora in English and Dutch. In N. Calzolari, K. Choukri, T. Declerck, H. Loftsson, B. Maegaard, J. Mariani, A. Moreno, J. Odijk, and S. Piperidis, editors, Proceedings of LREC-2014, pages 490–497, 2014.
[18] S. Löbner. Understanding Semantics. Oxford University Press, New York, 2002.
[19] D. Maynard, Y. Li, and W. Peters. NLP techniques for term extraction and ontology population. In Proceedings of the 2008 Conference on Ontology Learning and Population: Bridging the Gap Between Text and Knowledge, pages 107–127, Amsterdam, The Netherlands, 2008. IOS Press.
[20] V. Mititelu. Automatic extraction of patterns displaying hyponym-hypernym co-occurrence from corpora. In Proceedings of the 1st CESCL, Budapest, Hungary, 2006.
[21] H. Nanba, S. Mayumi, and T. Takezawa. Automatic construction of a bilingual thesaurus using citation analysis. In Proceedings of the 4th Workshop on Patent Information Retrieval, PaIR '11, pages 25–30, New York, NY, USA, 2011. ACM.
[22] L. A. Ramshaw and M. P. Marcus. Text chunking using transformation-based learning. In S. Armstrong, K. Church, P. Isabelle, S. Manzi, E. Tzoukermann, and D. Yarowsky, editors, Natural Language Processing Using Very Large Corpora, volume 11 of Text, Speech and Language Technology, pages 157–176. Springer Netherlands, 1999.
[23] S. Sheremetyeva. Natural language analysis of patent claims. In Proceedings of the ACL-2003 Workshop on Patent Corpus Processing - Volume 20, PATENT '03, pages 66–73, Stroudsburg, PA, USA, 2003. Association for Computational Linguistics.
[24] K. Spärck Jones. Compound noun interpretation problems. In F. Fallside and W. A. Woods, editors, Computer Speech Processing. Prentice-Hall, Englewood Cliffs, NJ, 1983.
[25] K. Toutanova, D. Klein, C. D. Manning, and Y. Singer. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL '03, pages 173–180, Stroudsburg, PA, USA, 2003. Association for Computational Linguistics.
[26] J. Turmo, A. Ageno, and N. Català. Adaptive information extraction. ACM Computing Surveys, 38(2), July 2006.
[27] M. van de Kauter, G. Coorman, E. Lefever, B. Desmet, L. Macken, and V. Hoste. LeTs Preprocess: The multilingual LT3 linguistic preprocessing toolkit. Computational Linguistics in the Netherlands Journal, 3:103–120, 2013.
[28] S. Verberne, C. H. A. Koster, and N. Oostdijk. Quantifying the challenges in parsing patent claims. In Proceedings of the 1st International Workshop on Advances in Patent Information Retrieval (AsPIRe 2010), pages 14–21, 2010.

APPENDIX
Figure 1: Sentence examples for the different data sets, with and without Domain Rules. (Figure not reproduced in this text version.)
Figure 2: Evaluation tool interface. (Figure not reproduced in this text version.)
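To make the pattern notation of Table 1 concrete, the following is a minimal, self-contained sketch of matching LSP 2 (the "such as" pattern). It is not the authors' pipeline, which PoS-tags the text with the Stanford tagger and chunks NPs first; here an NP is crudely approximated as a short run of words, so the sketch is illustrative only.

```python
import re

# Crude NP approximation: one word plus up to three following words.
NP = r"[A-Za-z][\w-]*(?:\s+[A-Za-z][\w-]*){0,3}"

# LSP 2: "NP{,} such as {NP,}* {(or|and)} NP"  ->  (hypernym, hyponym list)
LSP2 = re.compile(
    rf"({NP})\s*,?\s+such\s+as\s+({NP}(?:\s*,\s*{NP})*(?:\s*,?\s+(?:and|or)\s+{NP})?)"
)

def extract_lsp2(sentence: str):
    """Return (hypernym, hyponyms) candidates for the 'such as' pattern."""
    results = []
    for m in LSP2.finditer(sentence):
        hypernym = m.group(1)
        # Split the enumeration on commas and conjunctions.
        hyponyms = re.split(r"\s*,\s*|\s+(?:and|or)\s+", m.group(2))
        results.append((hypernym, [h for h in hyponyms if h]))
    return results
```

On the paper's running example this yields ("Domestic pets", ["cats", "dogs"]) for "Domestic pets such as cats and dogs"; a real implementation would replace the NP regex with chunker output to get reliable phrase boundaries.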
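The head-modifier principle behind the morpho-syntactic model summarized in Section 2.3 can likewise be sketched in code. The function below is our own illustrative realization of the three heuristics (suffix string, MWU head, NP plus preposition); [17] describes the principle, not this exact code.

```python
# Prepositions listed for the NP + prepositional phrase rule (EN).
PREPOSITIONS = {"of", "for", "before", "from", "to", "on"}

def is_hypernym(l0: str, l1: str) -> bool:
    """Check whether l0 is a hypernym of l1 under the head-modifier heuristics."""
    l0, l1 = l0.lower().strip(), l1.lower().strip()
    if l0 == l1:
        return False
    t1 = l1.split()
    if len(t1) == 1:
        # Single-word NP: l0 is a suffix string of l1 ("room" vs "bedroom").
        return l1.endswith(l0)
    # NP + prepositional phrase: l0 is the part before the preposition
    # ("treatment" vs "treatment of diseases").
    for i, tok in enumerate(t1):
        if tok in PREPOSITIONS and i > 0:
            return l0 == " ".join(t1[:i])
    # MWU NP: l0 is the head of l1, approximated as its final token(s)
    # ("juice" vs "apple juice").
    return l1.endswith(" " + l0)
```

Note that this is exactly where the 'of'-construction discussion of Section 4 bites: for "the treatment of diseases" the rule picks "treatment" as head, which is correct here, but deciding when to unify the whole 'of'-NP instead is the judgment the DomainRules try to encode.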
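The SimpleRules conjunction handling of Section 3.1 ([cat and dogs] split into [cat] and [dog]) could be approximated as below; the splitting regex and the naive depluralization are our assumptions for illustration, not the paper's exact rules.

```python
import re

def split_np(np: str) -> list[str]:
    """Split a chunked NP containing 'and'/'or' or a comma list into separate NPs,
    lightly normalizing plural heads as in the paper's example ('dogs' -> 'dog')."""
    parts = re.split(r"\s*,\s*|\s+(?:and|or)\s+", np.strip())
    # Naive singularization: strip a trailing 's' from each conjunct.
    return [re.sub(r"(?<=\w)s$", "", p) for p in parts if p]
```

For instance, split_np("cat and dogs") gives ["cat", "dog"], after which each conjunct can be paired with the hypernym separately instead of the whole coordinated phrase.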