=Paper= {{Paper |id=Vol-1292/ipamin2014_paper5 |storemode=property |title=Research of Semantic Role Labeling and Application in Patent Knowledge Extraction |pdfUrl=https://ceur-ws.org/Vol-1292/ipamin2014_paper5.pdf |volume=Vol-1292 |dblpUrl=https://dblp.org/rec/conf/konvens/MengHL14 }} ==Research of Semantic Role Labeling and Application in Patent Knowledge Extraction== https://ceur-ws.org/Vol-1292/ipamin2014_paper5.pdf
Research of Semantic Role Labeling and Application in Patent Knowledge Extraction

Ling’en Meng, Institute of Scientific and Technical Information of China, Beijing, mengle2013@istic.ac.cn
Yanqing He *, Institute of Scientific and Technical Information of China, Beijing, heyq@istic.ac.cn
Ying Li *, Institute of Scientific and Technical Information of China, Beijing, liying@istic.ac.cn

ABSTRACT
Semantic Role Labeling (SRL) is the task of identifying the arguments of a predicate and assigning semantically meaningful labels to them. SRL is crucial to information extraction, question answering, and machine translation. When applied to patent text, existing SRL tools perform unsatisfactorily because of long sentences. To improve performance in patent SRL systems, this study separates each sentence in patent abstracts into a simpler structure, and then labels semantic roles for the simplified sentences. Finally, semantic information and a semantic framework for frequently used words are used to extract patent knowledge. Our work demonstrates that the method used in this article can improve the performance of an SRL system and obtain beneficial knowledge from patents.

Categories and Subject Descriptors
I.2.7 [Computing Methodologies]: Language Constructs and Features – Language parsing and understanding, Text analysis.

General Terms
Algorithms, Experimentation, Languages

Keywords
Semantic role labeling, Patent text, Patent knowledge extraction

Copyright © 2014 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.
Published at Ceur-ws.org
Proceedings of the First International Workshop on Patent Mining and Its Applications (IPAMIN) 2014. Hildesheim. Oct. 7th. 2014.
At KONVENS'14, October 8-10, 2014, Hildesheim, Germany.

1. INTRODUCTION
Semantic Role Labeling is the process of annotating the predicate-argument structure in text with semantic labels. SRL includes two sub-tasks: the identification of syntactic constituents that are likely to be semantic roles, and the labeling of those constituents with the correct semantic role [1]. Most current research on SRL focuses on supervised learning methods, including generative models and discriminative models. Generative models were the first to be used as SRL classification models. They train quickly and depend only weakly on the training corpus, but their poor descriptive ability and strong assumption of feature independence lead to unsatisfactory performance. Discriminative models directly estimate the final goal of optimization, the conditional probability. The process is usually performed by iterative methods to find optimized coefficients. Discriminative models generally include linear interpolation, SVM [2], Perceptron [3], SNoW (Sparse Network of Winnows) [4], Boosting [5], Maximum Entropy, Decision tree, Random forest [6], etc. Combining the results produced by multiple classifiers is one direction of development and can obtain better results than any single classifier. These supervised learning methods often depend on the quality of syntactic parsing and on accurate SRL annotation. SRL is widely used in information extraction, question answering, and machine translation.

SRL is of vital significance for shallow semantic parsing of text, especially patent texts. Patent texts contain useful information about technologies. Analyzing them makes it possible to understand the present situation of patents, to identify hotspots in a timely manner, and to grasp technology trends. The existing patent platforms Patsnap (http://cn.patsnap.com/) and TechGlory (a patent risk control and competitive intelligence analysis system, http://www.tek-glory.cn/), as well as Wang Xuefeng [7], use a manually annotated corpus, which is costly and slow. Researchers also adopt automatic extraction methods to obtain key information from patent texts. Jiang Caihong [8] constructs an ontology and writes rules for patent knowledge extraction. Zhai Dongsheng [9] uses ontology knowledge and a semantic inference measure to construct a patent reference network.

This article combines SRL information with semantic framework rules to extract the patent technical topic from the patent abstract. Patent text is typically characterized by long sentences with complex structures. When SRL systems are ported to patent texts, they produce poor results, which reduces the effectiveness of semantic analysis and knowledge extraction. Compare the following examples:

    Long sentence:
    A plurality of resonance units are arranged [ARGM-TMP in the shell], wherein one end of each resonance unit is fixed on the inner wall at one side of the shell.

    Simplified sentence:
    A plurality of resonance units are arranged [ARGM-LOC in the shell]
    one end of each resonance unit is fixed on the inner wall at one side of the shell.

It is obvious that the semantic tag ARGM-TMP (which represents time; see Section 2.2 for details) in the long sentence is wrong. The correct tag, ARGM-LOC (which represents location), is assigned in the simplified sentence. To resolve this problem, our approach separates each long, complicated sentence in patent abstracts into a simpler structure, then labels semantic roles for the simplified sentences, and then combines the semantic labels with the semantic framework to extract the patent topic. Finally,
SRL information is used to extract patent knowledge from the patent abstract and to obtain beneficial topic knowledge from patents.

2. SYSTEM ARCHITECTURE AND TECHNICAL DETAILS
In a patent text, the abstract contains the topic, effect, components and features of the patent; all of them are important information. The purpose of this article is to automatically extract the patent topic from the patent abstract. The patent topic mainly comprises the patent type and the patent field. An example is given in Figure 1. For the patent abstract, the phrase "An electrically tunable filter" indicates the patent type and the phrase "technical field of electronic communication" shows the patent field. The two phrases, the patent type phrase and the patent field phrase, need to be extracted to form the patent technical topic.

Abstract: The embodiment of the invention provides an electrically tunable filter, relating to the technical field of electronic communication. The electrically tunable filter comprises a shell, a signal input end and a signal output end. A plurality of resonance units are arranged in the shell, wherein one end of each resonance unit is fixed on the inner wall at one side of the shell. Gaps are kept among adjacent resonance units. Two resonance units with the farthest distance from each other are respectively connected with the signal input end and the signal output end. Medium sheets which are used for adjusting the resonance frequencies of the corresponding resonance units by means of ascending and descending are arranged below the resonance units. The electrically tunable filter not only has a small number of tuning parameters, but also has a simple structure and can better realize the free movement of a center frequency point and the bandwidth of a passband.

Technical Topic: An electrically tunable filter, technical field of electronic communication

Figure 1. An example of patent abstract and its technical topic

Stanford Parser (http://cemantix.org/software.html) is used to help find clause boundaries. Because sentences longer than 70 words cannot be parsed, such sentences are divided at ';' and 'wherein' before parsing. This practice maintains the integrity of the sentence structure. Fewer than 7% of the sentences are still longer than 70 words after this step; they are divided at the middle ',' by a simple iterative method. After parsing, if the long sentence contains 'wherein' clauses, we separate it at 'wherein' into two parts; if it contains 'which' clauses, we handle them with a program whose pseudo-code is given in Figure 3.

Begin
Input: long sentence
    parse the long sentence to obtain the syntactic tree parseLongSentence
    if parseLongSentence contains the guide word '(which)'
        find the guide word (which) in the syntactic tree and record its position as whichPosition
        /* search from whichPosition and judge whether 'NP(…)' or 'VP(…)' comes first; if NP(…), record TRUE */
        if, searching from whichPosition, 'NP' comes first
            search from whichPosition, then take out the first S(…) close to whichPosition as sentence1
        else (searching from whichPosition, 'VP' comes first)
            search from whichPosition in the opposite direction, then take out the first NP(…) close to whichPosition
            search from whichPosition, then take out the first S(…) close to whichPosition
            combine NP(…) with S(…) as a new simplified sentence, sentence2
Output: sentence1, sentence2
// print the sentence that remains after the clause sentences are removed
Output: long sentence - sentence1 - sentence2
End

Figure 3. The pseudo-code for sentences that contain 'which' clauses
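The length-limiting step described for sentences of more than 70 words (division at ';' and 'wherein', then repeated division at the middle ',') can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the "middle" comma is taken to be the comma nearest the character midpoint, and the marker word 'wherein' is dropped when splitting.

```python
import re

MAX_WORDS = 70  # parser length limit stated in the text


def word_count(sentence):
    """Count whitespace-separated tokens."""
    return len(sentence.split())


def split_at_markers(sentence):
    """Split at ';' and at the word 'wherein' (assumption: the marker is dropped)."""
    parts = re.split(r";|,?\s*\bwherein\b", sentence, flags=re.IGNORECASE)
    return [p.strip(" ,") for p in parts if p.strip(" ,")]


def split_at_middle_comma(sentence):
    """Split once at the comma nearest the character midpoint (assumed meaning of 'middle')."""
    commas = [i for i, ch in enumerate(sentence) if ch == ","]
    if not commas:
        return [sentence]
    middle = len(sentence) / 2
    cut = min(commas, key=lambda i: abs(i - middle))
    halves = [sentence[:cut].strip(), sentence[cut + 1:].strip()]
    return [h for h in halves if h]


def limit_length(sentence, max_words=MAX_WORDS):
    """Return pieces of at most max_words words where the split rules allow it."""
    if word_count(sentence) <= max_words:
        return [sentence]
    queue = split_at_markers(sentence)
    result = []
    while queue:
        piece = queue.pop(0)
        if word_count(piece) <= max_words:
            result.append(piece)
            continue
        halves = split_at_middle_comma(piece)
        if len(halves) < 2:
            result.append(piece)  # no usable comma left; keep the piece as it is
        else:
            queue = halves + queue  # re-check both halves, preserving order
    return result
```

For instance, `limit_length(text)` leaves a 30-word sentence untouched, while `split_at_markers` applied to the "resonance units" sentence of Figure 1 yields the two parts before and after 'wherein'. Splitting on raw text like this is only the pre-parse step; the clause handling that follows parsing relies on the syntactic tree.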
[Figure 2 diagram: English Patent Abstract -> Sentence Analysis -> Simplified sentence -> Semantic Role Labeling for simplified sentence -> Labeled Simplified Sentences -> SRL based Patent topic extraction -> Patent Technical Topic; Frequently used words -> SRL based Patent topic extraction]

Figure 2. The system processing pipeline

As shown in Figure 2, our processing is divided into three steps. First, English patent abstracts are separated into simplified sentences by the sentence analysis module. Next, the simplified sentences are labeled with semantic roles. Finally, the frequently used words with their semantic frameworks and the labeled simplified sentences are input into a patent topic extraction module to obtain the patent technical topics.

2.1 Sentence Analysis
A patent abstract often contains long sentences, some of which may involve clauses, such as adverbial clauses, object clauses and attributive clauses. Clauses can cause inaccuracy in syntactic parsing, and these errors can propagate to SRL. For these reasons, we take the clauses out of the long sentence and turn it into simplified sentences. Here we mainly separate attributive clauses containing 'which' and 'wherein'.

After sentence analysis, the patent abstract shown in Figure 1 is turned into some simplified sentences (bold fonts) shown in Figure 4.

Sub-sentences:
The embodiment of the invention provides an electrically tunable filter, relating to the technical field of electronic communication.
The electrically tunable filter comprises a shell, a signal input end and a signal output end.
A plurality of resonance units are arranged in the shell.
One end of each resonance unit is fixed on the inner wall at one side of the shell.
Gaps are kept among adjacent resonance units.
Two resonance units with the farthest distance from each other are respectively connected with the signal input end and the signal output end.
Medium sheets are used for adjusting the resonance frequencies of the corresponding resonance units by means of ascending and descending
Medium sheets are arranged below the resonance units.
The electrically tunable filter not only has a small number of tuning parameters, but also has a simple structure and can better realize the free movement of a center frequency point and the bandwidth of a passband.

Figure 4. Simplified sentences of patent abstract shown in Figure 1

2.2 SRL System for Simplified Sentences
After obtaining the simplified sentences, we use the tool Automatic Statistical SEmantic Role Tagger (ASSERT) to label them (more information about this tool is available at http://cemantix.org/publications.html). A sentence is annotated with tags such as TARGET, ARG0 to ARG5, and ARGM. Each predicate verb of the sentence is marked with TARGET. ARG0 and ARG1 represent the agent and the patient, respectively. ARG2 - ARG5 have
different meanings in different situations. As to ARGM, it has thirteen subtypes, which are shown in Table 1.

Table 1. Subtypes of the ARGM modifier tag
ARGM-LOC    location                 ARGM-CAU    cause
ARGM-EXT    extent                   ARGM-TMP    time
ARGM-DIS    discourse connectives    ARGM-PNC    purpose
ARGM-ADV    general purpose          ARGM-MNR    manner
ARGM-NEG    negation marker          ARGM-DIR    direction
ARGM-MOD    modal verb

For more information about semantic roles, please refer to Martha Palmer [10]. Table 2 shows the difference in SRL for the patent abstract as shown in Figure 1 and in Figure 4.

Table 2. Difference of SRL between Long Sentence and Simplified Sentence

SRL error in long sentence: A plurality of resonance units are arranged [ARGM-TMP in the shell], wherein one end of each resonance unit is fixed on the inner wall at one side of the shell.
Correct SRL result in simplified sentence: A plurality of resonance units are arranged [ARGM-LOC in the shell]

SRL error in long sentence: A plurality of resonance units are arranged in the shell, [TARGET wherein] one end of each resonance unit is fixed on the inner wall at one side of the shell.
Correct SRL result in simplified sentence: In the process of separating the long sentence, the word 'wherein' is removed. This error can no longer arise in the simplified sentence.

SRL error in long sentence: [ARG1 Medium sheets which are used for adjusting the resonance frequencies of the corresponding resonance units by means of ascending and descending]
Correct SRL result in simplified sentence: [ARG1 Medium sheets] are used for adjusting the resonance frequencies of the corresponding resonance units by means of ascending and descending

2.3 Patent Topic Extraction Based on SRL
As stated above, the patent topic includes two parts, the type-phrase and the field-phrase, so we extract them separately. First, we build a frequently-used-words list for patent topics. In this step, we manually annotate a small number of patent abstracts, and then collect the predicates that appear frequently in the sentences containing the patent topic to form this list. Next, we analyze every frequently-used-word to obtain its linguistic features and assign it a framework of SRL information. The semantic framework helps us decide which semantic role should be extracted as the patent topic. Two examples of the semantic framework of frequently-used-words are shown in Table 3. If a sentence contains 'provide' as the TARGET (the predicate tag of the sentence), ARG1 is taken out from the sentence as the type-phrase.

Table 3. TARGET semantic framework of frequently-used-words
Frequent Word    Semantic Framework    Example
relate           relate [to ARG2]      [ARG1 The invention] [TARGET relates] [ARG2 to a double-shielded mineral-insulated cable]
provide          provide [ARG1]        [ARG0 The embodiment of the invention] [TARGET provides] [ARG1 an electrically tunable filter relating to the technical field of electronic communication]

Next, we match the words from the list against the TARGET of each simplified sentence in the abstract. If a word matches, the phrase for the semantic role (ARG0 to ARG5) of that TARGET is extracted from the sentence according to its framework.

For the field-phrase, we first choose the labeled sentence that contains a phrase with "field" between "[" and "]". If the semantic role for the phrase is ARGM, we extract the corresponding phrase as the field-phrase. Otherwise, we locate the TARGET in the sentence containing "field", and then consult the TARGET semantic framework to determine which semantic role from ARG0 to ARG5 should be extracted.

In addition, to improve extraction performance, post-processing methods are used, such as getting rid of the preposition at the beginning or removing some gerundial phrases.

3. EXPERIMENT
In this section, we perform an experiment to evaluate our patent topic extraction based on SRL. The evaluation measures precision, recall and F1 are used to evaluate the system. We choose 50 patent abstracts relating to the communication field as our experimental data. Detailed statistics of the corpus are shown in Table 4. We take the clauses out of the long sentences using the method described in Section 2.1. The experimental results are shown in Table 5. According to the table, the precision for 'which' clauses is 73.61%, and 'wherein' clauses reach a higher precision of 96.07%. When the two are put together, the precision is 79.61%. Error analysis shows that the errors are mainly due to inaccurate syntactic analysis or even syntactic errors. In addition, the syntactic structure is lost for less than 7% of the sentences, which probably contributes to the small performance loss.

Table 4. Detailed statistics of experimental corpus
Data              Language    Number of sentences    Vocabulary    Average sentence length
Long sentences    English     175                    8195          47

Table 5. The performance of sentence analysis
Clauses          Precision(%)    Recall(%)    F1(%)
which            73.61           67.08        70.19
wherein          96.07           96.07        96.07
which+wherein    79.61           78.09        78.84

Using the SRL tool ASSERT, we obtain the simplified sentences with semantic tags. Patent topics are then extracted from the abstracts according to the algorithm in Section 2.3. To evaluate the performance of topic extraction, we asked three experts to label the topics in the 50 English patent abstracts, and regard their labels as the gold standard. Three non-experts were asked to judge whether the extracted topics are correct. When more than two of them judge an extracted topic to be correct, we regard it as correct.

The result shows that more than 35 patent abstracts match the manually annotated results. This means our method has a precision of 70% for topic extraction. After careful examination, we attribute the errors to two main causes:
(1) The high-frequency word list covers only a small vocabulary, and its frameworks are not precise enough to obtain a correct patent type phrase or patent field phrase.
(2) If one sentence has predicates that share the same words, it is a challenge to decide which one is the best.

4. CONCLUSION
This research studied SRL and applied it to patent knowledge extraction. The patent abstract is separated into simplified sentences by sentence analysis, and semantic roles are then labeled for them. The patent technical topic is generated by combining the patent type phrase and the patent field phrase. The patent topics are automatically extracted from the simplified sentences with SRL. Our work demonstrates that the method we used is effective.

So far, the research has performed only simple preprocessing before SRL, and our semantic framework extraction rules are also far from comprehensive. To improve further, the following work needs to be done: (1) A high-frequency vocabulary can be constructed on a larger scale with deeper semantic information about the patent context. (2) The pre-processing for SRL needs to be further optimized. (3) This research only extracted the patent technical topic; more information, such as patent components, patent characteristics and effects, can be extracted. Our system will be modified to support more patent information mining. We intend to explore the patent semantic level further.

5. ACKNOWLEDGMENTS
This activity has been carried out within the China funded project, Natural Science Funds "context analysis on statistical machine translation for patent texts" (No. 61303152). The work described in this paper would not have been possible without the collaboration of a number of people. We wish to thank our colleagues Jin WEI, Zhaofeng ZHANG, and Peng QU.

6. REFERENCES
[1] Sameer Pradhan, Wayne Ward, Daniel Jurafsky, Kadri Hacioglu and James H. Martin. 2005. Semantic Role Labeling Using Different Syntactic Views. ACL-05. Association for Computational Linguistics Annual Meeting (Ann Arbor, MI (US), June 25-30, 2005). 2005, 581-588.
[2] Sameer Pradhan, Kadri Hacioglu, Valerie Krugler, Wayne Ward, James H. Martin, and Daniel Jurafsky. 2005. Support Vector Learning for Semantic Argument Classification. Machine Learning Journal. 60, 1/3 (2005), 11-39.
[3] Hierarchical Recognition of Propositional Arguments with Perceptrons (2004). In Proceedings of CoNLL 2004 Shared Task. 2004.
[4] P. Koomen, V. Punyakanok, D. Roth, and Wen-tau Yih. 2005. Generalized Inference with Multiple Semantic Role Labeling Systems. Proceedings of CoNLL-2005 (Ann Arbor, Michigan). 2005, 181-184.
[5] R. E. Schapire, and Y. Singer. Improved Boosting Algorithms Using Confidence-rated Predictions. 1998. Proceedings of the Eleventh annual conference on Computational learning theory (Madison, WI (US)). 1998, 80-91.
[6] R. D. Nielsen, and S. Pradhan. 2004. Mixing Weak Learners in Semantic Parsing. 42nd Annual Meeting of the Association for Computational Linguistics (Barcelona (ES)). 2004, 1-8.
[7] Wang Xuefeng, Wang Youguo, and Liu Yuqin. Construction of Patent Analysis System Based on Data Collaboration. Library and Information Service. 57, 14 (2013), 92-96. DOI=http://dx.doi.org/10.7536/j.issn.0252-3116.2013.14.01.
[8] Jiang Caihong, Qiao Xiaodong, and Zhu Lijun. 2009. Ontology-based Patent Abstracts' Knowledge Extraction. New Technology of Library and Information Service. 2, (July 2009): 23-28. DOI=http://dx.doi.org/10.3969/j.issn.1003-3513.2009.02.004
[9] Zhai Dongsheng, Zhang Xinqi, and Zhang Jie. 2013. Design and Implementation of Derwent Patent Ontology. Information Science. 31.12 (2013): 95-100.
[10] Martha Palmer, Daniel Gildea, and Paul Kingsbury. 2004. The Proposition Bank: An annotated corpus of semantic roles. Computational Linguistics. 31, 1 (July, 2004), 71-105. DOI=http://doi.acm.org/10.1162/0891201053630264