=Paper=
{{Paper
|id=Vol-1292/ipamin2014_paper5
|storemode=property
|title=Research of Semantic Role Labeling and Application in Patent Knowledge Extraction
|pdfUrl=https://ceur-ws.org/Vol-1292/ipamin2014_paper5.pdf
|volume=Vol-1292
|dblpUrl=https://dblp.org/rec/conf/konvens/MengHL14
}}
==Research of Semantic Role Labeling and Application in Patent Knowledge Extraction==
Ling'en Meng, Yanqing He*, Ying Li*
Institute of Scientific and Technical Information of China, Beijing
mengle2013@istic.ac.cn, heyq@istic.ac.cn, liying@istic.ac.cn

ABSTRACT

Semantic Role Labeling (SRL) is the task of identifying the arguments of a predicate and assigning semantically meaningful labels to them. SRL is crucial to information extraction, question answering, and machine translation. When applied to patent text, existing SRL tools perform poorly because of the long sentences. To improve the performance of SRL on patents, this study separates each sentence in a patent abstract into simpler structures and then labels semantic roles for the simplified sentences. Finally, the semantic labels and the semantic frameworks of frequently used words are used to extract patent knowledge. Our work demonstrates that this method improves SRL performance and obtains useful knowledge from patents.

Categories and Subject Descriptors

I.2.7 [Computing Methodologies]: Language Constructs and Features - Language parsing and understanding, Text analysis.

General Terms

Algorithms, Experimentation, Languages

Keywords

Semantic role labeling, Patent text, Patent knowledge extraction

1. INTRODUCTION

Semantic Role Labeling is the process of annotating the predicate-argument structure of a text with semantic labels. SRL includes two sub-tasks: the identification of the syntactic constituents that are likely to be semantic roles, and the labeling of those constituents with the correct semantic role [1]. Most current research on SRL uses supervised learning, including generative and discriminative models. Generative models were the first to be used for SRL classification. They train quickly and do not depend strongly on the training corpus, but their weak descriptive power and strong feature-independence assumptions lead to unsatisfactory performance. Discriminative models directly estimate the final optimization goal, the conditional probability, usually by iterative methods that search for optimized coefficients. Discriminative models include linear interpolation, SVM [2], Perceptron [3], SNoW (Sparse Network of Winnows) [4], Boosting [5], Maximum Entropy, decision trees, random forests [6], etc. Combining the results produced by multiple classifiers is a promising direction and can obtain better results than any single classifier. These supervised learning methods depend on the quality of syntactic parsing and on accurate SRL annotation. SRL is widely used in information extraction, question answering, and machine translation.

SRL is of vital significance for shallow semantic parsing of text, especially patent text. Patent texts contain useful information about technologies: analyzing them helps to understand the current state of a field, predict emerging hotspots, and grasp technological trends. Existing patent platforms such as Patsnap (http://cn.patsnap.com/), TechGlory (a patent risk control and competitive intelligence analysis system, http://www.tek-glory.cn/), and the system of Wang Xuefeng [7] rely on manually annotated corpora, which are costly and slow to produce. Researchers have also adopted automatic extraction methods to obtain key information from patent texts. Jiang Caihong [8] constructs an ontology and writes rules for patent knowledge extraction. Zhai Dongsheng [9] uses ontology knowledge and semantic inference to construct a patent reference network.

This article combines SRL information with semantic framework rules to extract the patent technical topic from patent abstracts. Patent text is characterized by long sentences with complex structures. When SRL systems are ported to patent texts, they produce poor results, which harms the subsequent semantic analysis and knowledge extraction.
Compare the following examples:

Long sentence: A plurality of resonance units are arranged [ARGM-TMP in the shell], wherein one end of each resonance unit is fixed on the inner wall at one side of the shell.

Simplified sentence: A plurality of resonance units are arranged [ARGM-LOC in the shell]; one end of each resonance unit is fixed on the inner wall at one side of the shell.

The semantic tag ARGM-TMP (which represents time; see Section 2.2) assigned in the long sentence is obviously wrong; the correct tag ARGM-LOC (which represents location) is obtained for the simplified sentence. To resolve this problem, our approach separates each long, complicated sentence in a patent abstract into simpler structures, then labels semantic roles for the simplified sentences, and finally synthesizes all the semantic labels with semantic frameworks to extract the patent topic. In this way, SRL information is used to extract patent knowledge from patent abstracts and to obtain useful topic knowledge from patents.

2. SYSTEM ARCHITECTURE AND TECHNICAL DETAILS

In a patent text, the abstract contains the patent's topic, effect, components, and features; all of them are important information about the patent. The purpose of this article is to automatically extract the patent topic from the patent abstract. The patent topic mainly involves the patent type and the patent field. An example is given in Figure 1: the phrase "An electrically tunable filter" indicates the patent type and the phrase "technical field of electronic communication" indicates the patent field. These two phrases, the patent type phrase and the patent field phrase, need to be extracted to form the patent technical topic.
Abstract: The embodiment of the invention provides an electrically tunable filter, relating to the technical field of electronic communication. The electrically tunable filter comprises a shell, a signal input end and a signal output end. A plurality of resonance units are arranged in the shell, wherein one end of each resonance unit is fixed on the inner wall at one side of the shell. Gaps are kept among adjacent resonance units. Two resonance units with the farthest distance from each other are respectively connected with the signal input end and the signal output end. Medium sheets which are used for adjusting the resonance frequencies of the corresponding resonance units by means of ascending and descending are arranged below the resonance units. The electrically tunable filter not only has a small number of tuning parameters, but also has a simple structure and can better realize the free movement of a center frequency point and the bandwidth of a passband.

Technical Topic: An electrically tunable filter; technical field of electronic communication

Figure 1. An example of a patent abstract and its technical topic

Figure 2 shows the processing pipeline: English patent abstract -> sentence analysis -> simplified sentences -> semantic role labeling for the simplified sentences -> labeled simplified sentences -> SRL-based patent topic extraction (using the frequently-used-words list) -> patent technical topic.

Figure 2. The system processing pipeline

As shown in Figure 2, our processing is divided into three steps. First, English patent abstracts are separated into simplified sentences by the sentence analysis module. Next, the simplified sentences are labeled with semantic roles. Finally, the frequently used words with their semantic frameworks and the labeled simplified sentences are fed into a patent topic extraction module to obtain the patent technical topics.

2.1 Sentence Analysis

A patent abstract often contains long sentences, some of which involve clauses such as adverbial clauses, object clauses, and attributive clauses. Clauses can cause inaccuracies in syntactic parsing, and these errors can propagate to SRL. For these reasons, we take the clauses out of a long sentence and turn the long sentence into simplified sentences. Here we mainly separate attributive clauses introduced by 'which' and 'wherein'.

The Stanford Parser (http://cemantix.org/software.html) is used to find clause boundaries. Because sentences of more than 70 words cannot be parsed, such sentences are first divided at ';' and 'wherein' before parsing; this practice maintains the integrity of the sentence structure. Fewer than 7% of the sentences are still longer than 70 words after this step; they are divided at the middle ',' by a simple iterative method. After parsing, if the long sentence contains a 'wherein' clause, we separate the long sentence at 'wherein' into two parts; if the long sentence contains a 'which' clause, we handle it with the procedure whose pseudo-code is given in Figure 3.

```
Begin
  Input: long sentence
  Parse the long sentence to obtain its syntactic tree, parseLongSentence.
  if parseLongSentence contains the guide word '(which)':
      Find the guide word (which) in the syntactic tree and record its position as whichPosition.
      /* Search from whichPosition and judge whether 'NP(...)' or 'VP(...)' comes first. */
      if 'NP' comes first when searching from whichPosition:
          Search from whichPosition and take out the first S(...) closest to whichPosition as sentence1.
      else:  /* 'VP' comes first when searching from whichPosition */
          Search from whichPosition in the opposite direction and take out the first NP(...) closest to whichPosition.
          Search from whichPosition and take out the first S(...) closest to whichPosition.
          Combine the NP(...) with the S(...) as a new simplified sentence, sentence2.
      Output: sentence1, sentence2                       /* the extracted clause sentences */
      Output: long sentence - sentence1 - sentence2      /* the long sentence with the clauses removed */
End
```

Figure 3. Pseudo-code for handling a sentence that contains a 'which' clause

After sentence analysis, the patent abstract shown in Figure 1 is turned into the simplified sentences shown in Figure 4.

Sub-sentences:
The embodiment of the invention provides an electrically tunable filter, relating to the technical field of electronic communication.
The electrically tunable filter comprises a shell, a signal input end and a signal output end.
A plurality of resonance units are arranged in the shell.
One end of each resonance unit is fixed on the inner wall at one side of the shell.
Gaps are kept among adjacent resonance units.
Two resonance units with the farthest distance from each other are respectively connected with the signal input end and the signal output end.
Medium sheets are used for adjusting the resonance frequencies of the corresponding resonance units by means of ascending and descending.
Medium sheets are arranged below the resonance units.
The electrically tunable filter not only has a small number of tuning parameters, but also has a simple structure and can better realize the free movement of a center frequency point and the bandwidth of a passband.

Figure 4. Simplified sentences of the patent abstract shown in Figure 1
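As a rough illustration of this pre-processing, the sketch below splits over-long sentences at ';' and 'wherein' before parsing and separates a 'wherein' clause into two simplified sentences. It is a minimal sketch under our own assumptions: the function names are ours, and the parse-tree-based handling of 'which' clauses (Figure 3) and the middle-comma fallback are not reproduced.

```python
import re

MAX_WORDS = 70  # the parser cannot handle sentences longer than this (Section 2.1)

def pre_split(sentence):
    """Before parsing, divide sentences of more than 70 words at ';' and 'wherein'."""
    if len(sentence.split()) <= MAX_WORDS:
        return [sentence.strip()]
    parts = re.split(r";|\bwherein\b", sentence)
    return [p.strip(" ,") for p in parts if p.strip(" ,")]

def split_at_wherein(sentence):
    """After parsing, a sentence containing a 'wherein' clause is separated
    at 'wherein' into two simplified sentences (Section 2.1)."""
    main, sep, clause = sentence.partition("wherein")
    if not sep:
        return [sentence]
    return [main.strip(" ,") + ".", clause.strip(" ,.").capitalize() + "."]

if __name__ == "__main__":
    s = ("A plurality of resonance units are arranged in the shell, wherein one end "
         "of each resonance unit is fixed on the inner wall at one side of the shell.")
    for simplified in split_at_wherein(s):
        print(simplified)
    # A plurality of resonance units are arranged in the shell.
    # One end of each resonance unit is fixed on the inner wall at one side of the shell.
```

Applied to the 'wherein' sentence of Figure 1, this yields the two corresponding simplified sentences of Figure 4.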
2.2 SRL System for Simplified Sentences

After obtaining the simplified sentences, we label them with the Automatic Statistical SEmantic Role Tagger (ASSERT; more information about this tool is available at http://cemantix.org/publications.html). A sentence is annotated with tags such as TARGET, ARG0-ARG5, and ARGM. Each predicate verb of the sentence is marked as TARGET. ARG0 and ARG1 represent the agent and the patient, respectively, while ARG2-ARG5 have different meanings in different situations. ARGM has thirteen subtypes; they are shown in Table 1.

Table 1. Subtypes of the ARGM modifier tag

  ARGM-LOC  location                 ARGM-CAU  cause
  ARGM-EXT  extent                   ARGM-TMP  time
  ARGM-DIS  discourse connectives    ARGM-PNC  purpose
  ARGM-ADV  general purpose          ARGM-MNR  manner
  ARGM-NEG  negation marker          ARGM-DIR  direction
  ARGM-MOD  modal verb

For more information about semantic roles, please refer to Martha Palmer [10]. Table 2 shows the difference between the SRL results for the long sentences of the patent abstract in Figure 1 and for the simplified sentences in Figure 4.

Table 2. Differences in SRL between long sentences and simplified sentences

SRL error in the long sentence:
  A plurality of resonance units are arranged [ARGM-TMP in the shell], wherein one end of each resonance unit is fixed on the inner wall at one side of the shell.
Correct SRL result in the simplified sentence:
  A plurality of resonance units are arranged [ARGM-LOC in the shell]

SRL error in the long sentence:
  A plurality of resonance units are arranged in the shell, [TARGET wherein] one end of each resonance unit is fixed on the inner wall at one side of the shell.
Correct SRL result in the simplified sentence:
  The word 'wherein' is removed when the long sentence is separated, so this error can no longer arise in the simplified sentence.

SRL error in the long sentence:
  [ARG1 Medium sheets which are used for adjusting the resonance frequencies of the corresponding resonance units by means of ascending and descending]
Correct SRL result in the simplified sentence:
  [ARG1 Medium sheets] are used for adjusting the resonance frequencies of the corresponding resonance units by means of ascending and descending
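The bracketed notation used in Tables 2 and 3 can be read programmatically. The sketch below is our own illustrative helper (not part of ASSERT): it turns a labeled sentence into a list of (role, phrase) pairs that the later extraction step can work with.

```python
import re

# Matches one bracketed constituent, e.g. "[ARG1 Medium sheets]" or "[ARGM-LOC in the shell]".
ROLE_PATTERN = re.compile(r"\[(TARGET|ARG[0-5]|ARGM(?:-[A-Z]{3})?)\s+([^\]]+)\]")

def parse_labeled_sentence(labeled):
    """Return (role, phrase) pairs from an ASSERT-style labeled sentence.
    A list is used rather than a dict because a role such as ARGM may occur more than once."""
    return [(role, phrase.strip()) for role, phrase in ROLE_PATTERN.findall(labeled)]

if __name__ == "__main__":
    labeled = ("[ARG0 The embodiment of the invention] [TARGET provides] "
               "[ARG1 an electrically tunable filter relating to the technical "
               "field of electronic communication]")
    for role, phrase in parse_labeled_sentence(labeled):
        print(role, "->", phrase)
    # prints ARG0, TARGET, and ARG1 with their phrases, in order of appearance
```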
2.3 Patent Topic Extraction Based on SRL

As stated above, the patent topic consists of two parts, the type phrase and the field phrase, and we extract them separately. First, we build a frequently-used-words list for patent topics. In this step we manually annotate a small set of patent abstracts, and the predicates that appear frequently in the sentences containing the patent topic are collected to form the list. Next, we analyze every frequently used word to obtain its linguistic features and assign it a framework of SRL information. The semantic framework helps us decide which semantic role should be extracted as the patent topic. Two examples of semantic frameworks of frequently used words are shown in Table 3. For instance, if a sentence contains 'provide' as the TARGET (the predicate tag of the sentence), ARG1 is taken out of the sentence as the type phrase.

Table 3. TARGET semantic frameworks of frequently used words

Frequent word: relate
Semantic framework: relate [to ARG2]
Example: [ARG1 The invention] [TARGET relates] [ARG2 to a double-shielded mineral-insulated cable]

Frequent word: provide
Semantic framework: provide [ARG1]
Example: [ARG0 The embodiment of the invention] [TARGET provides] [ARG1 an electrically tunable filter relating to the technical field of electronic communication]

Next, we match the words from the list against the TARGET of each simplified sentence in the abstract. If a word matches, the phrase filling the semantic role (ARG0-ARG5) specified by its framework is extracted from the sentence.

For the field phrase, we first choose a labeled sentence that contains a phrase with "field" between "[" and "]". If the semantic role of that phrase is ARGM, we extract the corresponding phrase as the field phrase. Otherwise, we locate the TARGET of the sentence containing "field" and judge from its semantic framework which semantic role, from ARG0 to ARG5, should be extracted.

In order to improve extraction performance, post-processing is applied, such as removing a preposition at the beginning of a phrase or removing some gerundial phrases.
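The extraction rules of this section can be sketched as follows. This is an illustrative implementation under our own assumptions, not the authors' code: the framework table contains only the two entries of Table 3, the bracket parser repeats the helper from the earlier sketch, and the post-processing steps are omitted.

```python
import re

_ROLE = re.compile(r"\[(TARGET|ARG[0-5]|ARGM(?:-[A-Z]{3})?)\s+([^\]]+)\]")

def parse_labeled_sentence(labeled):
    """(role, phrase) pairs from an ASSERT-style bracketed sentence (cf. the sketch above)."""
    return [(r, p.strip()) for r, p in _ROLE.findall(labeled)]

# Table 3 frameworks: predicate (TARGET) -> semantic role extracted by its framework.
FRAMEWORKS = {
    "provides": "ARG1",   # provide [ARG1]   -> patent type phrase
    "relates": "ARG2",    # relate [to ARG2] -> patent field phrase
}

def extract_type_phrase(labeled_sentences):
    """Type phrase: the role named by the framework of the first matched TARGET."""
    for labeled in labeled_sentences:
        roles = dict(parse_labeled_sentence(labeled))
        wanted = FRAMEWORKS.get(roles.get("TARGET", ""))
        if wanted and wanted in roles:
            return roles[wanted]
    return None

def extract_field_phrase(labeled_sentences):
    """Field phrase: prefer a bracketed constituent containing 'field' whose role is ARGM-*;
    otherwise fall back to the role named by the TARGET's framework (Section 2.3)."""
    for labeled in labeled_sentences:
        roles = parse_labeled_sentence(labeled)
        if not any("field" in phrase for _, phrase in roles):
            continue
        for role, phrase in roles:
            if role.startswith("ARGM") and "field" in phrase:
                return phrase
        wanted = FRAMEWORKS.get(dict(roles).get("TARGET", ""))
        if wanted:
            return dict(roles).get(wanted)
    return None
```

Applied to labeled versions of the simplified sentences in Figure 4, extract_type_phrase would return the ARG1 of 'provides', which post-processing would then trim to "An electrically tunable filter".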
3. EXPERIMENT

In this section we evaluate our patent topic extraction based on SRL. Precision, recall, and F1 are used as evaluation measures. We chose 50 patent abstracts from the communication field as experimental data; detailed statistics of the corpus are given in Table 4. We take the clauses out of the long sentences using the method described in Section 2.1; the results of this sentence analysis are shown in Table 5. The precision for 'which' clauses is 73.61%, while 'wherein' clauses reach a higher precision of 96.07%. Taken together, the precision is 79.61%. Error analysis shows that the errors are mainly due to inaccurate or erroneous syntactic analysis. In addition, the syntactic structure is lost for the fewer than 7% of sentences split at the middle comma, which probably contributes a small performance loss.

Table 4. Detailed statistics of the experimental corpus

  Data             Language   Number of sentences   Vocabulary   Average sentence length
  Long sentences   English    175                   8195         47

Table 5. Performance of sentence analysis

  Clauses           Precision (%)   Recall (%)   F1 (%)
  which             73.61           67.08        70.19
  wherein           96.07           96.07        96.07
  which + wherein   79.61           78.09        78.84

Using the SRL tool ASSERT, we obtain the simplified sentences with semantic tags, and patent topics are then extracted from the abstracts according to the algorithm in Section 2.3. To evaluate the performance of topic extraction, three experts labeled the topics of the 50 English patent abstracts, and these labels are regarded as the gold standard. Three non-experts were then asked to judge whether each extracted topic is correct; when at least two of them give a correct judgment for an extracted topic, we regard it as a right one. The result shows that more than 35 of the 50 patent abstracts match the manually annotated results, which corresponds to a precision of about 70% for topic extraction. After careful examination, we attribute the errors to two main reasons: (1) the frequently-used-words list covers only a small part of the vocabulary, and its frameworks are not precise enough to always yield a correct patent type phrase or patent field phrase; (2) when one sentence contains several predicates sharing the same word, it is a challenge to decide which one is the best.
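For reference, the precision, recall, and F1 values reported above are the standard measures; a minimal computation is sketched below with hypothetical counts of our own (not the paper's raw numbers), chosen so that the result is close to the 'which' row of Table 5.

```python
def precision_recall_f1(correct, extracted, gold):
    """Standard evaluation measures used in Table 5 and for topic extraction.
    correct: correctly extracted items, extracted: items the system produced,
    gold: items in the gold standard."""
    precision = correct / extracted if extracted else 0.0
    recall = correct / gold if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

# Hypothetical counts for illustration only:
p, r, f1 = precision_recall_f1(correct=53, extracted=72, gold=79)
print(f"P={p:.2%}  R={r:.2%}  F1={f1:.2%}")   # ~73.6% / ~67.1% / ~70.2%
```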
4. CONCLUSION

This research studied SRL and applied it to patent knowledge extraction. The patent abstract is separated into simplified sentences by sentence analysis, and semantic roles are then labeled for them. The patent technical topic is generated by combining the patent type phrase and the patent field phrase, and the patent topics are automatically extracted from the simplified sentences using SRL. Our work demonstrates that the method is effective.

So far, this research has only performed a simple pre-processing step before SRL, and our semantic framework extraction rules are far from comprehensive. To obtain further improvements, the following work needs to be done: (1) a larger high-frequency vocabulary can be constructed with deeper semantic information from the patent context; (2) the pre-processing for SRL needs to be further optimized; (3) this research only extracted the patent technical topic, and more information, such as patent components, characteristics, and effects, can be extracted. Our system will be extended to support more patent information mining, and we plan to explore the patent semantic level further.

5. ACKNOWLEDGMENTS

This work has been carried out within the project "Context analysis on statistical machine translation for patent texts" funded by the China Natural Science Funds (No. 61303152). The work described in this paper would not have been possible without the collaboration of a number of people. We wish to thank our colleagues Jin WEI, Zhaofeng ZHANG, and Peng QU.

6. REFERENCES

[1] Sameer Pradhan, Wayne Ward, Daniel Jurafsky, Kadri Hacioglu, and James H. Martin. 2005. Semantic Role Labeling Using Different Syntactic Views. In Proceedings of ACL-05, Association for Computational Linguistics Annual Meeting (Ann Arbor, MI, US, June 25-30, 2005), 581-588.

[2] Sameer Pradhan, Kadri Hacioglu, Valerie Krugler, Wayne Ward, James H. Martin, and Daniel Jurafsky. 2005. Support Vector Learning for Semantic Argument Classification. Machine Learning Journal 60, 1/3 (2005), 11-39.

[3] Hierarchical Recognition of Propositional Arguments with Perceptrons. 2004. In Proceedings of the CoNLL 2004 Shared Task.

[4] P. Koomen, V. Punyakanok, D. Roth, and Wen-tau Yih. 2005. Generalized Inference with Multiple Semantic Role Labeling Systems. In Proceedings of CoNLL-2005 (Ann Arbor, Michigan), 181-184.

[5] R. E. Schapire and Y. Singer. 1998. Improved Boosting Algorithms Using Confidence-rated Predictions. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory (Madison, WI, US), 80-91.

[6] R. D. Nielsen and S. Pradhan. 2004. Mixing Weak Learners in Semantic Parsing. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (Barcelona, ES), 1-8.

[7] Wang Xuefeng, Wang Youguo, and Liu Yuqin. 2013. Construction of Patent Analysis System Based on Data Collaboration. Library and Information Service 57, 14 (2013), 92-96. DOI=http://dx.doi.org/10.7536/j.issn.0252-3116.2013.14.01.

[8] Jiang Caihong, Qiao Xiaodong, and Zhu Lijun. 2009. Ontology-based Patent Abstracts' Knowledge Extraction. New Technology of Library and Information Service 2 (2009), 23-28. DOI=http://dx.doi.org/10.3969/j.issn.1003-3513.2009.02.004.

[9] Zhai Dongsheng, Zhang Xinqi, and Zhang Jie. 2013. Design and Implementation of Derwent Patent Ontology. Information Science 31, 12 (2013), 95-100.

[10] Martha Palmer, Daniel Gildea, and Paul Kingsbury. 2005. The Proposition Bank: An Annotated Corpus of Semantic Roles. Computational Linguistics 31, 1 (2005), 71-105. DOI=http://doi.acm.org/10.1162/0891201053630264.