<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Research of Semantic Role Labeling and Application in Patent Knowledge Extraction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ling'en Meng</string-name>
          <email>mengle2013@istic.ac.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yanqing He *</string-name>
          <email>heyq@istic.ac.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ying Li *</string-name>
          <email>liying@istic.ac.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Scientific and Technical Information of China</institution>
          ,
          <addr-line>Beijing</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <fpage>8</fpage>
      <lpage>10</lpage>
      <permissions>
        <copyright-statement>Copyright © 2014 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. Published at CEUR-WS.org.</copyright-statement>
      </permissions>
      <abstract>
        <p>Semantic Role Labeling (SRL) is the task of identifying the arguments of a predicate and assigning semantically meaningful labels to them. SRL is crucial to information extraction, question answering, and machine translation. When applied to patent text, existing SRL tools perform poorly because the sentences are long. To improve SRL performance on patents, this study splits each sentence in a patent abstract into simpler sentences and then labels semantic roles on the simplified sentences. Finally, the semantic role information and semantic frameworks for frequently used words are used to extract patent knowledge. Our work shows that this method can improve the performance of an SRL system and obtain useful knowledge from patents.</p>
      </abstract>
      <kwd-group>
        <kwd>Semantic role labeling</kwd>
        <kwd>Patent text</kwd>
        <kwd>Patent knowledge extraction</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>Semantic Role Labeling is the process of annotating the predicate-argument structure in text with semantic labels. SRL includes two sub-tasks: identifying the syntactic constituents that are likely to be semantic roles, and labeling those constituents with the correct semantic role [1]. Most current research on SRL uses supervised learning methods, including generative and discriminative models. Generative models were the first to be used for SRL classification. They train quickly and do not depend strongly on the training corpus, but their limited descriptive power and strong feature-independence assumption lead to unsatisfactory performance. Discriminative models directly estimate the final optimization target, the conditional probability. The process is usually performed by iterative methods that search for optimized coefficients. Discriminative models generally include linear interpolation, SVM [2], Perceptron [3], SNoW (Sparse Network of Winnows) [4], Boosting [5], Maximum Entropy, Decision Tree, Random Forest [6], etc. Combining the results produced by multiple classifiers is a promising direction and can give better results than any single classifier. The supervised learning methods above often depend on the quality of syntactic parsing and on accurate SRL annotation.</p>
      <p>SRL is widely used in information extraction, question answering, and machine translation. It is of vital significance for the shallow semantic parsing of text, especially patent texts. Patent texts contain useful information about technologies. Analyzing them makes it possible to understand the current state of patents, identify hot spots in time, and grasp technology trends. Existing patent platforms such as Patsnap (http://cn.patsnap.com/) and TechGlory (a patent risk control and competitive intelligence analysis system, http://www.tekglory.cn/), as well as the system of Wang Xuefeng [7], use manually annotated corpora, which are costly and slow to build.</p>
      <p>Researchers have also adopted automatic extraction methods to obtain key information from patent texts. Jiang Caihong [8] constructs an ontology and writes rules for patent knowledge extraction. Zhai Dongsheng [9] uses ontology knowledge and a semantic inference measure to construct a patent reference network. This article combines SRL information with semantic framework rules to extract the patent technical topic from the patent abstract. Patent text is well known for its long sentences with complex structures. When SRL systems are ported to patent texts, they produce poor results, which reduces the effectiveness of semantic analysis and knowledge extraction. Compare the following examples:</p>
      <sec id="sec-1-2">
        <p>Long sentence:</p>
        <p>A plurality of resonance units are arranged [ARGM-TMP in the shell], wherein one end of each resonance unit is fixed on the inner wall at one side of the shell.</p>
        <p>Simplified sentence:</p>
        <p>A plurality of resonance units are arranged [ARGM-LOC in the shell]</p>
        <p>one end of each resonance unit is fixed on the inner wall at one side of the shell.</p>
        <p>Clearly, the semantic tag ARGM-TMP (which denotes time; see Section 2.2) assigned in the long sentence is wrong. In the simplified sentence the phrase receives the correct tag, ARGM-LOC (which denotes location). To resolve this problem, our approach splits each long, complicated sentence in a patent abstract into simpler sentences, labels semantic roles on the simplified sentences, and then combines the semantic labels with semantic frameworks to extract the patent topic. In this way, SRL information is used to extract patent knowledge from patent abstracts and to obtain useful topic knowledge from patents.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>2. SYSTEM ARCHITECTURE AND TECHNICAL DETAILS</title>
      <p>In a patent text, the abstract states the topic, effect, components, and features of the invention, all of which are important information about the patent. The purpose of this article is to extract the patent topic from the patent abstract automatically. The patent topic mainly involves the patent type and the patent field. An example is given in Figure 1. In that abstract, the phrase "An electrically tunable filter" indicates the patent type and the phrase "technical field of electronic communication" indicates the patent field. These two phrases, the patent type phrase and the patent field phrase, need to be extracted to form the patent technical topic.</p>
      <p>Abstract: The embodiment of the invention provides an electrically tunable filter, relating
to the technical field of electronic communication. The electrically tunable filter comprises a shell, a signal
input end and a signal output end. A plurality of resonance units are arranged in the shell, wherein one end of each
resonance unit is fixed on the inner wall at one side of the shell. Gaps are
kept among adjacent resonance units. Two resonance units with the farthest distance from each other are
respectively connected with the signal input end and the signal output end. Medium sheets which
are used for adjusting the resonance frequencies of the corresponding resonance units by means of
ascending and descending are arranged below the resonance units. The electrically tunable filter not
only has a small number of tuning parameters, but also has a simple structure and can better realize the free
movement of a center frequency point and the bandwidth of a passband.</p>
      <p>Technical Topic: An electrically tunable filter, technical field of electronic communication</p>
    </sec>
    <sec id="sec-4">
      <title>2.1 Sentence Analysis</title>
      <p>A patent abstract often contains long sentences, some of which include clauses such as adverbial clauses, object clauses, and attributive clauses. Clauses can cause errors in syntactic parsing, and these errors can propagate to SRL. For these reasons, we take the clauses out of each long sentence and turn it into simplified sentences. Here we mainly separate attributive clauses introduced by 'which' and 'wherein'. The Stanford Parser (http://cemantix.org/software.html) is used to help us find clause boundaries. Because sentences longer than 70 words cannot be parsed, such sentences are first divided at ';' and 'wherein' before parsing, which preserves the integrity of the sentence structure. Fewer than 7% of the sentences still exceed 70 words after this step; these are divided at the middle ',' by a simple iterative method. After parsing, if the long sentence contains a 'wherein' clause, we split it at 'wherein' into two parts; if it contains 'which' clauses, we handle them with a program whose pseudocode is given in Figure 3.</p>
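      <p>The length-based pre-splitting described above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, words are counted by whitespace, and "the middle ','" is read as the comma closest to the character midpoint of the sentence; the paper does not specify these details.</p>
      <preformat><![CDATA[
```python
import re

MAX_WORDS = 70  # parser length limit stated in the text


def word_count(text):
    """Count whitespace-separated tokens (assumption: this is how length is measured)."""
    return len(text.split())


def split_at_middle_comma(text):
    """Split at the comma nearest the middle of the text; None if there is no comma."""
    commas = [i for i, ch in enumerate(text) if ch == ","]
    if not commas:
        return None
    mid = len(text) / 2
    cut = min(commas, key=lambda i: abs(i - mid))
    return text[:cut].strip(), text[cut + 1:].strip()


def presplit(sentence, max_words=MAX_WORDS):
    """Divide an over-long sentence before it is handed to the parser."""
    if word_count(sentence) <= max_words:
        return [sentence.strip()]
    # First pass: divide at ';' and at the word 'wherein'.
    pieces = [p.strip(" ,") for p in re.split(r";|\bwherein\b", sentence)]
    pieces = [p for p in pieces if p]
    # Second pass: keep halving pieces that are still too long.
    result = []
    stack = list(reversed(pieces))
    while stack:
        piece = stack.pop()
        if word_count(piece) <= max_words:
            result.append(piece)
            continue
        halves = split_at_middle_comma(piece)
        if halves is None or not all(halves):
            result.append(piece)  # no usable comma: leave the piece as it is
        else:
            stack.extend(reversed(halves))
    return result
```
]]></preformat>
      <p>Sentences of 70 words or fewer are returned unchanged, so the routine can be applied to every sentence of an abstract before parsing.</p>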
      <preformat>Begin
Input: long sentence
Parse the long sentence to obtain its syntactic tree, parseLongSentence.
if parseLongSentence contains the guide word '(which)'
    find the guide word '(which)' in the syntactic tree and record its position as whichPosition
    /* search from whichPosition and judge whether 'NP(...)' or 'VP(...)' comes first; if NP(...), record TRUE */
    if, searching from whichPosition, 'NP' comes first
        search from whichPosition and take out the first S(...) closest to whichPosition as sentence1
    else (searching from whichPosition, 'VP' comes first)
        search from whichPosition in the opposite direction and take out the first NP(...) closest to whichPosition
        search from whichPosition and take out the first S(...) closest to whichPosition
        combine NP(...) with S(...) into a new simplified sentence, sentence2
    Output: sentence1, sentence2
// Print the sentence from which the clause sentences have been removed.
Output: long sentence - sentence1 - sentence2
End</preformat>
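      <p>To make the 'which' procedure concrete, the following is a minimal Python sketch rather than the authors' program. It assumes the parser output is available as a Penn-style bracketed string, that the clause appears as an SBAR node whose first word is 'which', and that the antecedent is the nearest preceding NP sibling. All function names and the hand-written example tree are hypothetical.</p>
      <preformat><![CDATA[
```python
def parse_tree(text):
    """Read a bracketed parse into nested (label, children) tuples; leaves are strings."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return read()


def leaves(node):
    """Words under a node, in sentence order."""
    if isinstance(node, str):
        return [node]
    return [w for child in node[1] for w in leaves(child)]


def leaves_without(node, skip):
    """Words under a node, leaving out the subtree `skip`."""
    if node is skip:
        return []
    if isinstance(node, str):
        return [node]
    return [w for child in node[1] for w in leaves_without(child, skip)]


def first_phrase_label(node):
    """Label of the first NP or VP met in a left-to-right, depth-first walk."""
    if isinstance(node, str):
        return None
    if node[0] in ("NP", "VP"):
        return node[0]
    for child in node[1]:
        found = first_phrase_label(child)
        if found:
            return found
    return None


def find_which(node, parent=None, index=0):
    """Depth-first search for the first SBAR that starts with 'which'."""
    if isinstance(node, str):
        return None
    if node[0] == "SBAR" and leaves(node)[:1] == ["which"]:
        return node, parent, index
    for i, child in enumerate(node[1]):
        hit = find_which(child, node, i)
        if hit:
            return hit
    return None


def split_which_clause(tree):
    """Return (sentence without the clause, clause as a sentence), or None."""
    hit = find_which(tree)
    if hit is None:
        return None
    sbar, parent, index = hit
    body = [c for c in sbar[1] if not isinstance(c, str) and c[0] == "S"]
    if not body:
        return None
    clause = leaves(body[0])
    if first_phrase_label(body[0]) == "VP" and parent is not None:
        # 'which' is the subject of the clause: borrow the nearest preceding NP.
        for sibling in reversed(parent[1][:index]):
            if not isinstance(sibling, str) and sibling[0] == "NP":
                clause = leaves(sibling) + clause
                break
    return " ".join(leaves_without(tree, sbar)), " ".join(clause)


# Hand-written, simplified parse used only for illustration.
EXAMPLE = """(S (NP (NP (NNS sheets))
                 (SBAR (WHNP (WDT which)) (S (VP (VBP adjust) (NP (DT the) (NN frequency))))))
              (VP (VBP are) (VP (VBN arranged) (PP (IN below) (NP (DT the) (NNS units))))))"""
```
]]></preformat>
      <p>Calling split_which_clause(parse_tree(EXAMPLE)) yields the remaining main sentence and the clause rewritten with its antecedent as subject, which corresponds to the "long sentence - sentence1 - sentence2" output and to sentence2 in the pseudocode.</p>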
    </sec>
    <sec id="sec-5">
      <title>2.2 SRL System for Simplified Sentences</title>
      <p>After obtaining the simplified sentences, we label them with the Automatic Statistical SEmantic Role Tagger (ASSERT; more information about this tool is available at http://cemantix.org/publications.html). A sentence is annotated with tags such as TARGET, ARG0 to ARG5, and ARGM. Each predicate verb of the sentence is marked with TARGET. ARG0 and ARG1 represent the agent and the patient, respectively. ARG2 to ARG5 have different meanings in different situations. ARGM has thirteen subtypes, which are shown in Table 1. For more information about semantic roles, please refer to Martha Palmer [10]. Table 2 shows the difference in SRL results for the patent abstract shown in Figure 1 and Figure 4.</p>
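      <p>For later processing it is convenient to turn a labeled sentence into a list of role and phrase pairs. The sketch below assumes the bracketed notation used in the examples of Section 1, such as [ARGM-LOC in the shell]. The helper names are hypothetical, and the ARG1 and TARGET labels in the usage note are illustrative rather than actual tool output.</p>
      <preformat><![CDATA[
```python
import re

# One bracketed span: a role name, whitespace, then the phrase up to "]".
ROLE_PATTERN = re.compile(r"\[(TARGET|ARGM-[A-Z]+|ARG[0-5])\s+([^\]]+)\]")


def parse_roles(labeled):
    """Return (role, phrase) pairs in the order they appear in the sentence."""
    return [(role, phrase.strip()) for role, phrase in ROLE_PATTERN.findall(labeled)]


def plain_text(labeled):
    """Remove brackets and role names, keeping only the words."""
    without_tags = ROLE_PATTERN.sub(r"\2", labeled)
    return re.sub(r"\s+", " ", without_tags).strip()
```
]]></preformat>
      <p>For the input "[ARG1 A plurality of resonance units] are [TARGET arranged] [ARGM-LOC in the shell]", parse_roles returns three pairs (ARG1, TARGET, ARGM-LOC) and plain_text recovers the unlabeled sentence.</p>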
    </sec>
    <sec id="sec-6">
      <title>2.3 Patent Topic Extraction Based on SRL</title>
      <p>As stated above, the patent topic consists of two parts, the type-phrase and the field-phrase, so we extract them separately. First, we build a list of frequently used words for patent topics. In this step, we manually annotate a small set of patent abstracts and collect the predicates that appear frequently in the sentences containing the patent topic. Next, we analyze every frequently used word to obtain its linguistic features and assign it a framework of SRL information. The semantic framework helps us decide which semantic role should be extracted as the patent topic. Two examples of semantic frameworks for frequently used words are shown in Table 3. For instance, if a sentence contains 'provide' as its TARGET (the predicate tag of the sentence), ARG1 is taken from the sentence as the type-phrase. Finally, we match the words in the list against the TARGET of each simplified sentence in the abstract. If a word matches, the phrase filling the semantic role (ARG0 to ARG5) specified by its framework is extracted from the sentence.</p>
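      <p>The matching step can be sketched as a lookup from predicate to role. Only the 'provide' to ARG1 entry comes from the text; the crude inflection helper, the function names, and the role labels in the usage note are assumptions made for illustration.</p>
      <preformat><![CDATA[
```python
# Semantic frameworks: base form of a frequently used predicate -> role to extract.
# Only the 'provide' entry is stated in the text.
FRAMES = {"provide": "ARG1"}


def inflections(word):
    """Crude set of verb forms for a base form (an assumption, not a real lemmatizer)."""
    stem = word[:-1] if word.endswith("e") else word
    return {word, word + "s", stem + "ed", stem + "ing"}


def extract_type_phrase(roles, frames=FRAMES):
    """roles: list of (role, phrase) pairs for one simplified sentence."""
    targets = [phrase.lower() for role, phrase in roles if role == "TARGET"]
    for word, wanted_role in frames.items():
        if any(target in inflections(word) for target in targets):
            for role, phrase in roles:
                if role == wanted_role:
                    return phrase
    return None
```
]]></preformat>
      <p>With the hand-labeled roles ARG0 "The embodiment of the invention", TARGET "provides", and ARG1 "an electrically tunable filter", the function returns the ARG1 phrase, which is the type-phrase of the example in Figure 1.</p>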
      <p>For the field-phrase, we first choose the labeled sentence that contains a phrase with the word "field" between "[" and "]". If the semantic role of that phrase is ARGM, we extract the phrase as the field-phrase. Otherwise, we locate the TARGET in the sentence containing "field" and consult the semantic framework of the TARGET to determine which semantic role, from ARG0 to ARG5, should be extracted.</p>
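      <p>The field-phrase rule can be sketched in the same style. The function name is hypothetical, and the frames dictionary is keyed by the surface form of the predicate for simplicity.</p>
      <preformat><![CDATA[
```python
import re


def extract_field_phrase(roles, frames):
    """roles: (role, phrase) pairs of one labeled sentence; frames: predicate -> role name."""
    with_field = [(role, phrase) for role, phrase in roles
                  if re.search(r"\bfield\b", phrase, re.IGNORECASE)]
    if not with_field:
        return None
    role, phrase = with_field[0]
    if role.startswith("ARGM"):
        return phrase
    # Otherwise fall back on the framework of the sentence's TARGET.
    for target_role, target in roles:
        if target_role == "TARGET" and target.lower() in frames:
            wanted = frames[target.lower()]
            for candidate_role, candidate in roles:
                if candidate_role == wanted:
                    return candidate
    return None
```
]]></preformat>
      <p>If the phrase containing "field" carries an ARGM label it is returned directly; otherwise the result depends on whether the TARGET of the sentence has a framework entry.</p>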
      <p>To improve extraction performance, post-processing methods are also used, such as removing the preposition at the beginning of a phrase or removing some gerundial phrases.</p>
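      <p>The two post-processing steps might look like the following sketch. The preposition list and the rule for recognizing a gerundial phrase (a comma followed by a word ending in "ing") are assumptions, since the paper does not give the exact rules.</p>
      <preformat><![CDATA[
```python
import re

# Illustrative list only; the paper does not enumerate the prepositions it removes.
PREPOSITIONS = {"to", "in", "of", "for", "on", "at", "by", "with", "from"}


def strip_leading_preposition(phrase):
    """Drop prepositions at the start of an extracted phrase."""
    words = phrase.split()
    while words and words[0].lower() in PREPOSITIONS:
        words.pop(0)
    return " ".join(words)


def strip_trailing_gerund_phrase(phrase):
    """Drop a trailing segment of the form ', <word>ing ...' (heuristic)."""
    return re.sub(r",\s*\w+ing\b.*$", "", phrase).strip()
```
]]></preformat>
      <p>Applied to the example of Figure 1, the first function removes the leading "to" from the field phrase, and the second cuts "an electrically tunable filter, relating to ..." back to the type phrase.</p>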
    </sec>
    <sec id="sec-7">
      <title>3. EXPERIMENT</title>
      <p>In this section, we perform an experiment to evaluate our SRL-based patent topic extraction. Precision, recall, and F1 are used as the evaluation measures. We chose 50 patent abstracts from the communication field as our experimental data. Detailed statistics of the corpus are shown in Table 4. We extract the clauses from the long sentences using the method described in Section 2.1. The experimental results are shown in Table 5. The precision for 'which' clauses is 73.61%, and 'wherein' clauses reach a higher precision of 96.07%. Taken together, the precision is 79.61%. Error analysis shows that the errors are mainly due to inaccurate syntactic analysis or outright parsing errors. Moreover, the syntactic structure is lost for fewer than 7% of the sentences, which probably contributes to the small performance loss.</p>
      <p>Using the SRL tool ASSERT, we obtain the simplified sentences with semantic tags. Patent topics are then extracted from the abstracts according to the algorithm in Section 2.3. To evaluate the performance of topic extraction, we had three experts label the topics in the 50 English patent abstracts and took their labels as the gold standard. Three non-experts were asked to judge whether the extracted topics were correct. When more than two of them judged an extracted topic to be correct, we counted it as correct.</p>
      <p>The results show that the extracted topics match the manually annotated results for more than 35 patent abstracts, which means that our method reaches a precision of about 70% for topic extraction. After careful examination, we attribute the errors to two main causes: (1) The high-frequency word list covers only a small vocabulary, and its frameworks are not precise enough to yield a correct patent type phrase or patent field phrase. (2) If one sentence has several predicates that share the same words, it is difficult to decide which one is best.</p>
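      <p>For reference, the three evaluation measures can be computed as in the sketch below. The function name is hypothetical. The worked numbers in the usage note assume exactly one extracted topic and one gold topic per abstract, which the paper does not state explicitly.</p>
      <preformat><![CDATA[
```python
def precision_recall_f1(true_positive, false_positive, false_negative):
    """Standard precision, recall, and F1 from raw counts."""
    predicted = true_positive + false_positive
    actual = true_positive + false_negative
    precision = true_positive / predicted if predicted else 0.0
    recall = true_positive / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```
]]></preformat>
      <p>With 35 correct topics out of 50 abstracts, precision_recall_f1(35, 15, 15) gives a precision of 35/50 = 0.70, in line with the figure reported above.</p>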
    </sec>
    <sec id="sec-8">
      <title>4. CONCLUSION</title>
      <p>This research studied SRL and applied it to patent knowledge extraction. The patent abstract is split into simplified sentences by sentence analysis, and semantic roles are then labeled on them. The patent technical topic is generated by combining the patent type phrase and the patent field phrase, both of which are extracted automatically from the simplified sentences using SRL. Our work demonstrates that the method is effective. So far, the research performs only simple preprocessing before SRL, and our semantic framework extraction rules are far from comprehensive. To obtain further improvement, the following work needs to be done: (1) A high-frequency vocabulary can be constructed on a larger scale with deeper semantic information about the patent context. (2) The preprocessing for SRL needs to be further optimized. (3) This research extracts only the patent technical topic; more information, such as patent components, patent characteristics, and effects, can also be extracted. Our system will be modified to support more kinds of patent information mining, and we plan to explore patent semantics further.</p>
    </sec>
    <sec id="sec-9">
      <title>5. ACKNOWLEDGMENTS</title>
      <p>This work was carried out within the China-funded Natural Science Funds project "Context analysis on statistical machine translation for patent texts" (No. 61303152). The work described in this paper would not have been possible without the collaboration of a number of people. We wish to thank our colleagues Jin WEI, Zhaofeng ZHANG, and Peng QU.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>Sameer</given-names> <surname>Pradhan</surname></string-name>,
          <string-name><given-names>Wayne</given-names> <surname>Ward</surname></string-name>,
          <string-name><given-names>Daniel</given-names> <surname>Jurafsky</surname></string-name>,
          <string-name><given-names>Kadri</given-names> <surname>Hacioglu</surname></string-name>, and
          <string-name><given-names>James H.</given-names> <surname>Martin</surname></string-name>.
          <year>2005</year>.
          <article-title>Semantic Role Labeling Using Different Syntactic Views</article-title>.
          <source>ACL05. Association for Computational Linguistics Annual Meeting</source>
          (Ann Arbor, MI (US), June 25-30, 2005),
          <fpage>581</fpage>-<lpage>588</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
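          [2]
          <string-name><given-names>Sameer</given-names> <surname>Pradhan</surname></string-name>,
          <string-name><given-names>Kadri</given-names> <surname>Hacioglu</surname></string-name>,
          <string-name><given-names>Valerie</given-names> <surname>Krugler</surname></string-name>,
          <string-name><given-names>Wayne</given-names> <surname>Ward</surname></string-name>,
          <string-name><given-names>James H.</given-names> <surname>Martin</surname></string-name>, and
          <string-name><given-names>Daniel</given-names> <surname>Jurafsky</surname></string-name>.
          <year>2005</year>.
          <article-title>Support Vector Learning for Semantic Argument Classification</article-title>.
          <source>Machine Learning Journal</source>.
          <volume>60</volume>, <issue>1/3</issue> (2005),
          <fpage>11</fpage>-<lpage>39</lpage>.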
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <article-title>Hierarchical Recognition of Propositional Arguments with Perceptrons</article-title>.
          <year>2004</year>.
          In <source>Proceedings of CoNLL 2004 Shared Task</source>.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><given-names>P.</given-names> <surname>Koomen</surname></string-name>,
          <string-name><given-names>V.</given-names> <surname>Punyakanok</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Roth</surname></string-name>, and
          <string-name><given-names>Wen-tau</given-names> <surname>Yih</surname></string-name>.
          <year>2005</year>.
          <article-title>Generalized Inference with Multiple Semantic Role Labeling Systems</article-title>.
          <source>Proceedings of CoNLL-2005</source>
          (Ann Arbor, Michigan),
          <fpage>181</fpage>-<lpage>184</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>R. E.</given-names> <surname>Schapire</surname></string-name> and
          <string-name><given-names>Y.</given-names> <surname>Singer</surname></string-name>.
          <year>1998</year>.
          <article-title>Improved Boosting Algorithms Using Confidence-rated Predictions</article-title>.
          <source>Proceedings of the Eleventh Annual Conference on Computational Learning Theory</source>
          (Madison, WI (US)),
          <fpage>80</fpage>-<lpage>91</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name><given-names>R. D.</given-names> <surname>Nielsen</surname></string-name> and
          <string-name><given-names>S.</given-names> <surname>Pradhan</surname></string-name>.
          <year>2004</year>.
          <article-title>Mixing Weak Learners in Semantic Parsing</article-title>.
          <source>42nd Annual Meeting of the Association for Computational Linguistics</source>
          (Barcelona (ES)),
          <fpage>1</fpage>-<lpage>8</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name><surname>Wang</surname> <given-names>Xuefeng</given-names></string-name>,
          <string-name><surname>Wang</surname> <given-names>Youguo</given-names></string-name>, and
          <string-name><surname>Liu</surname> <given-names>Yuqin</given-names></string-name>.
          <article-title>Construction of Patent Analysis System Based on Data Collaboration</article-title>.
          <source>Library and Information Service</source>.
          <volume>57</volume>, <issue>14</issue> (<year>2013</year>),
          <fpage>92</fpage>-<lpage>96</lpage>.
          DOI=http://dx.doi.org/10.7536/j.issn.0252-3116.2013.14.01.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name><surname>Jiang</surname> <given-names>Caihong</given-names></string-name>,
          <string-name><surname>Qiao</surname> <given-names>Xiaodong</given-names></string-name>, and
          <string-name><surname>Zhu</surname> <given-names>Lijun</given-names></string-name>.
          <year>2009</year>.
          <article-title>Ontology-based Patent Abstracts' Knowledge Extraction</article-title>.
          <source>New Technology of Library and Information Service</source>.
          <volume>2</volume> (July 2009):
          <fpage>23</fpage>-<lpage>28</lpage>.
          DOI=http://dx.doi.org/10.3969/j.issn.1003-3513.2009.02.004
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name><surname>Zhai</surname> <given-names>Dongsheng</given-names></string-name>,
          <string-name><surname>Zhang</surname> <given-names>Xinqi</given-names></string-name>, and
          <string-name><surname>Zhang</surname> <given-names>Jie</given-names></string-name>.
          <year>2013</year>.
          <article-title>Design and Implementation of Derwent Patent Ontology</article-title>.
          <source>Information Science</source>.
          <volume>31</volume>, <issue>12</issue> (2013):
          <fpage>95</fpage>-<lpage>100</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name><given-names>Martha</given-names> <surname>Palmer</surname></string-name>,
          <string-name><given-names>Daniel</given-names> <surname>Gildea</surname></string-name>, and
          <string-name><given-names>Paul</given-names> <surname>Kingsbury</surname></string-name>.
          <year>2004</year>.
          <article-title>The Proposition Bank: An annotated corpus of semantic roles</article-title>.
          <source>Computational Linguistics</source>.
          <volume>31</volume>, <issue>1</issue> (July, 2004),
          <fpage>71</fpage>-<lpage>105</lpage>.
          DOI=http://doi.acm.org/10.1162/0891201053630264
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>