<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Get Your Hands Dirty: Evaluating Word2Vec Models for Patent Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hidir Aras</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rima Turker</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dieter Geiss</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Max Milbradt</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Harald Sack</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>FIZ Karlsruhe - Leibniz Institute for Information Infrastructure</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Karlsruhe Institute of Technology, Institute AIFB</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Patent search systems allow complex queries to be formulated by combining different search terms using Boolean and other operators such as proximity, wildcards, etc. in order to find relevant patents. This widely adopted approach is based on exact matching, making it difficult to efficiently identify and analyze relevant patents, as the search terms often do not match the terminology used by the inventors. Another problem concerns the large number of relevant hits due to weekly and monthly updates of patent applications and grants. Although some semantic search systems for patents based on latent semantic analysis have been implemented as black-box systems in the past, word embeddings, which have been successfully applied to generate semantic representations of text, have rarely been employed and evaluated on a (large) patent corpus. The work described here evaluates semantic representations for patent data, comparing a pre-trained general model with an adapted word embedding model created from a patent corpus, in order to contribute to a multitude of semantic analysis tasks for patents such as similarity search, content analysis, entity linking, etc.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Patents are regarded as an important source allowing companies to define new
business strategies and to support high-level decision-making processes. With the
increasing complexity and volume of patent data, enhanced search systems and
novel methods for analyzing patents [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] which aid the time-consuming patent
reviewing process are required. With these, experts can gather valuable insights
for detecting novel inventions, analyzing patent trends, identifying technological
hotspots, etc. The expectations towards new types of search systems for patents
are high [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], as information professionals not only wish to find more accurate
results but also to frequently detect hits which could not be found using traditional
search. Although patents provide valuable scientific information which can be
gathered via text mining [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and are able to indicate novel scientific
relationships in earlier literature, they clearly focus on commercial applications, e.g. the use
of drugs for medical purposes. Besides that, patents also entail considerable
difficulties due to their broad claims, non-relevant references embedded in the patent
text which may lead to wrong relations, and the heavy use of acronyms,
leading to more false positives. In this paper, we evaluate word embedding models
created from different corpora for calculating the semantic similarity of patent
documents, a task which is crucial in several patent analysis use cases such as
prior art search, freedom-to-operate analysis, and infringement analysis.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        In [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] latent semantic indexing (LSI) was applied for automatic indexing in
information retrieval. However, the results showed only little improvement over the
vector space model. Commercial systems like TotalPatent1 or
OctiMine2 are accessible only as black-box systems for search and retrieval based
on the (semantic) similarity of patent documents to determine their relevance
for a given query. Other systems like PatBase3 also enable semantic search based
on the semantic analysis of citation networks. In [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] the peculiarities of patent
search systems, such as semantic similarity and semantic search, are described in
more detail. The works in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] describe semantic representations of
paragraphs and short text, respectively, while research on the semantic similarity of
complex document types such as patents is still lacking.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Approach: Using Word Embeddings for Patent Data</title>
      <p>Dataset. In order to create a domain-specific word embedding model, a subset
of patent documents from the WIPO (World Intellectual Property Organization)
and the EPO (European Patent Office) patent databases has been sampled by
filtering for specific patent classification codes, as shown in Table 1.</p>
      <p>Table 1. Search fields: IPC, CPC. Used IPC/CPC codes: G06* AND NOT G06M* OR
H04L0009* OR H04W0012-00 OR H04H0060-23 OR G11*</p>
      <p>In total,
410,607 patent documents have been retrieved, and the English patent description
texts were used to generate the patent embedding model, henceforth referred to
as the IT corpus.</p>
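<p>The classification-code filter of Table 1 can be sketched as a predicate over a patent's IPC/CPC codes. This is a hypothetical illustration, not the authors' pipeline; the toy records and the grouping of the boolean expression (AND binding tighter than OR) are our assumptions.</p>

```python
# Hypothetical sketch of the corpus filtering step: select patents whose
# classification codes match the expression from Table 1, assuming the
# usual precedence (AND binds tighter than OR).
def matches_filter(codes):
    """Apply: (G06* AND NOT G06M*) OR H04L0009* OR H04W0012-00 OR H04H0060-23 OR G11*."""
    has = lambda prefix: any(c.startswith(prefix) for c in codes)
    return ((has("G06") and not has("G06M"))
            or has("H04L0009") or has("H04W0012-00")
            or has("H04H0060-23") or has("G11"))

# Toy records standing in for WIPO/EPO database entries.
patents = [
    {"id": "EP1", "codes": ["G06F001700"]},  # kept: G06* and not G06M*
    {"id": "EP2", "codes": ["G06M000300"]},  # dropped: G06M*
    {"id": "EP3", "codes": ["H04L0009xx"]},  # kept: H04L0009*
]
it_corpus = [p["id"] for p in patents if matches_filter(p["codes"])]
print(it_corpus)  # ['EP1', 'EP3']
```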
      <p>Model Creation. To create a word2vec model, the Gensim library was applied
with default parameters. The model creation and the document vector
representations were based on the description texts of the patent documents. For comparison,
the pre-trained Google word2vec model has been used as a baseline.</p>
      <p>1 https://www.lexisnexis.com/totalpatent 2 https://www.octimine.com 3 https://www.patbase.com</p>
    </sec>
    <sec id="sec-4">
      <title>Evaluation</title>
      <sec id="sec-4-1">
        <title>Setup</title>
        <p>The evaluation3 of both models for the task of semantic similarity of patent
documents was performed based on a randomly selected set of 25 documents
(14 WIPO and 11 EPO) from the IT corpus. For each document, the 10 most
similar patent documents were determined based on the vector representation
of the description text. To obtain document vectors, the word vectors of the patent
description text were averaged. The cosine similarity was used to compute the
similarity among document vectors. As, to the best of our knowledge, no
ground truth exists for this task, a qualitative evaluation was carried out with patent
experts. They assessed the relevance of the top 10 similar patents
according to the following 3-level scale: 0: irrelevant patent, 1: related patent, 2:
similar patent.</p>
      </sec>
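<p>The document representation used in this setup can be sketched as follows: a description is embedded as the average of its word vectors, and documents are compared with cosine similarity. This is a minimal illustration with a toy two-dimensional vocabulary, not the evaluation code itself.</p>

```python
# Average word vectors into a document vector, compare with cosine similarity.
import numpy as np

def document_vector(tokens, word_vectors):
    """Average the vectors of all in-vocabulary tokens."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vecs, axis=0)

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vocabulary standing in for a trained word2vec model.
wv = {"data": np.array([1.0, 0.0]),
      "storage": np.array([0.0, 1.0]),
      "memory": np.array([0.5, 0.5])}

d1 = document_vector(["data", "storage"], wv)
d2 = document_vector(["data", "memory"], wv)
print(round(cosine_similarity(d1, d2), 3))  # 0.894
```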
      <sec id="sec-4-2">
        <title>Results</title>
        <p>[Fig. 1 and Fig. 2: cumulated and normalized scores by rank for the Google Model and the Patent Model, broken down into irrelevant, related, and similar judgments.]</p>
        <p>In order to compare the results of both models, we analyzed the cumulated
and normalized scores (Fig. 1 and Fig. 2) as a function of the rank of the
top 10 documents. Looking at similarity, we see a better average score for
the Patent Model with increasing rank, while relatedness shows a different
picture: here, the higher ranks of the Google Model4 show better results, while
with increasing rank the relatedness scores meet at the same level. It can also be
observed that the average score for the irrelevant documents rises with increasing
rank in the Google Model, while in the Patent Model the score stabilizes around
40%. We assume that, for similarity, the representation of domain-specific
words in the customized model is more sophisticated compared to the Google
model, which was trained on general (news) data. In contrast, the patent model
was trained on a much smaller amount of domain-specific patent data
and is thus not able to cover all aspects of relatedness appropriately.</p>
        <p>3 https://github.com/dlat z/w2v4pat 4 https://code.google.com/archive/p/word2vec/</p>
      </sec>
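<p>The aggregation described above can be sketched as follows: expert labels (0 irrelevant, 1 related, 2 similar) per rank are cumulated over ranks and normalized per label class. The label matrix is a toy stand-in, not the study's data.</p>

```python
# Cumulated and normalized share of one label class up to each rank.
import numpy as np

# Toy labels: rows = query patents, columns = ranks 1..10.
labels = np.array([
    [2, 2, 1, 0, 1, 0, 2, 1, 0, 0],
    [2, 1, 1, 2, 0, 0, 1, 0, 0, 0],
])

def cumulated_share(labels, target):
    """Fraction of judgments equal to `target` among ranks 1..k, for each k."""
    hits = (labels == target).mean(axis=0)               # share per rank
    return np.cumsum(hits) / np.arange(1, labels.shape[1] + 1)

similar_curve = cumulated_share(labels, 2)
print(similar_curve[0])  # share of "similar" judgments at rank 1
```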
    </sec>
    <sec id="sec-5">
      <title>Conclusion and Future Work</title>
      <p>In this paper, we have evaluated semantic representations of patent texts via
word embeddings created from distinct corpora. We compared the results of
the two word2vec models for the tasks of semantic similarity
and relatedness. The results showed that the mean average scores for
the domain-specific word embedding model are much higher in comparison to the
general Google word2vec model, while the relatedness aspect must be evaluated
in additional experiments employing a more fine-grained scoring scheme for the
experts. In future work, we aim to enhance our approach with a more
sophisticated pre-processing that also takes into account the inherent structure of a
patent document. In this work, the evaluation was performed based on the patent
description text only, while in the future we also want to analyze how abstracts and
claims of the patent text can be exploited for different patent analysis use cases
in different selected domains such as life science, engineering, etc.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. Assad Abbas, Limin Zhang, Samee U. Khan. A literature review on the state-of-the-art in patent analysis. World Patent Information, Volume 37, 2014.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. Mihai Lupu, Katja Mayer, Noriko Kando, and Anthony J. Trippe. Current Challenges in Patent Information Retrieval (2nd ed.). Springer Publishing Company, Incorporated, 2017.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>3. Hidir Aras, Rene Hackl-Sommer, Michael Schwantner, and Mustafa Sofean. Applications and Challenges of Text Mining with Patents. IPaMin@KONVENS, 2014.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>4. A. Moldovan, R. I. Bot, G. Wanka. Latent semantic indexing for patent documents. International Journal of Applied Mathematics and Computer Science 15(4), 551-560, 2005.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>5. Björn Jürgens, Nigel Clarke. Study and comparison of the unique selling propositions (USPs) of free-to-use multinational patent search systems. World Patent Information, 2014.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>6. Quoc Le and Tomas Mikolov. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning (ICML'14), Eric P. Xing and Tony Jebara (Eds.), Vol. 32. JMLR.org, II-1188-II-1196, 2014.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>7. Tom Kenter and Maarten de Rijke. Short Text Similarity with Word Embeddings. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management (CIKM '15). ACM, New York, NY, USA, 1411-1420, 2015.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>