<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>in: Journal of Physics: Conference Series</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1109/ACCESS.2023.3253388</article-id>
      <title-group>
        <article-title>Diving into Knowledge Graphs for Patents: Open Challenges and Benefits</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Danilo Dessí</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rima Dessí</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>GESIS Leibniz Institute for the Social Sciences</institution>
          ,
          <addr-line>Cologne</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Patent for Science Department, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure</institution>
          ,
          <addr-line>Karlsruhe</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2060</year>
      </pub-date>
      <volume>2060</volume>
      <fpage>9</fpage>
      <lpage>12</lpage>
      <abstract>
        <p>Textual documents are the primary means of sharing information and preserving knowledge across a large variety of domains. The patent domain also follows this paradigm, which is becoming difficult to maintain and limits the potential of advanced AI systems for domain analysis. To overcome this issue, approaches that transform textual representations into Knowledge Graphs (KGs) are increasingly common. In this position paper, we discuss KGs within the patent domain, present the associated challenges, and envision the benefits of such technologies for this domain. In addition, this paper provides insights into such KGs by reproducing an existing pipeline for KG creation and applying it to patents in the computer science domain.</p>
      </abstract>
      <kwd-group>
        <kwd>Patent Domain</kwd>
        <kwd>Knowledge Graph</kwd>
        <kwd>Intellectual Property</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Publishing patents as natural language documents is the current paradigm for
describing and disclosing technological innovations and protecting intellectual property. Searching,
analyzing, and understanding patents are key to assessing the current state of industry and
society, identifying their needs, and driving their future development. However, these goals
are hard to achieve with today’s publishing paradigm due to the complexity, heterogeneity,
and length of patent documents. This limits patent searchers in finding and exploring patents
to support business-critical decisions and makes such processes expensive and time-consuming.
The main limitation of this paradigm lies in the complexity of natural
language, which requires critical human thinking based on grammar, semantics, and a deep
understanding of what is conveyed in patents’ text [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This limitation is further aggravated
by the striking growth in the number of patents, which makes the evaluation of new patents
difficult for patent offices [2]. Patent writers as well as other stakeholders (e.g., patent
offices, academic institutions, industries, and policymakers) usually rely on search engines
such as Google Patents, which only allow keyword search to obtain a list of patents related to
the desired topic and do not provide any ready-to-use insight about the protected invention.
Furthermore, we are today witnessing the birth of new intelligent systems such as ChatGPT,
which can provide general knowledge about a large variety of domains. However, it is still
difficult to rely on such systems because they might provide incorrect information,
partially invented answers, and unverifiable information due to a lack of transparency in the
underlying technology and algorithms [3]. Therefore, there is still a need to explore ad-hoc solutions
for sensitive domains that involve technical and legal aspects, where intelligent systems can
help stakeholders make sense of patent content for a variety of applications.
      </p>
      <p>One prominent solution that is gaining ground in various domains is the development of
Knowledge Graphs (KGs), i.e., interlinked graphs of entities that describe a domain based on well-defined
and formal semantics, to support a variety of tasks. KGs have been extensively utilized in a
variety of fields, including artificial intelligence, the semantic web, and information retrieval.
By structuring facts in a graph-like architecture, KGs enable machines to reason, infer
implicit knowledge, and perform sophisticated analysis. Furthermore, KGs have drawn great attention
from researchers, especially after the announcement of Google’s Knowledge Graph [4]. Several
papers have been published on the creation, completion, and alignment of KGs [5, 6, 7, 8, 9, 10].
Examples of KGs, among others, can be found in the biology [11], scholarly [12, 5], and medical
domains [10]. Biology benefits from such KGs because new discoveries can be shared quickly
among several institutions, the scholarly domain benefits in the search and analysis
of research trends, and the medical domain uses KGs for clinical decision support systems.</p>
      <p>However, there has been comparatively less investment in and focus on the creation of
patent-related KGs, although first investigations can be found. For example, by utilizing patent claims,
the authors of [13] aim to create an engineering knowledge graph. They extract facts based on
pre-defined rules which are solely related to the engineering domain. The KG can be exploited
only for the extraction of technical elements, i.e., engineering knowledge. One drawback of this
solution is that defining these rules is expensive and time-consuming. Similarly
to [13], the authors of [14] aim to build a patent KG that consists of facts related to engineering
design. The authors use patents that have specific Cooperative Patent Classification (CPC)<sup>2</sup>
codes, and thus the contained facts are very domain dependent. As a result, the applicability of
such a KG is limited to that specific domain. Pipelines and methods for other domains, or
applicable to a broader range of domains, are increasingly in demand today.</p>
      <p>With this in mind, in this position paper, we sketch the potential challenges of building KGs
about patents, we discuss the positive implications that such technologies can have for the
stakeholders, and finally, we provide insights into how pipelines for this field might be built by
reproducing an existing one. More precisely, the contributions of this paper are:
        <list list-type="bullet">
          <list-item><p>We provide an overview of the current state of development of KGs for patents.</p></list-item>
          <list-item><p>We discuss the challenges of building such KGs for the patent domain, highlighting
the differences from existing solutions available in other domains.</p></list-item>
          <list-item><p>We describe how the patent domain can benefit from the development of such resources.</p></list-item>
          <list-item><p>We introduce our first efforts and results on the use of existing solutions to build
knowledge graphs about patents.</p></list-item>
        </list>
        <fn><label>2</label><p>https://www.epo.org/searching-for-patents/helpful-resources/first-time-here/classification/cpc.html</p></fn>
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Challenges in Patent Knowledge Graph Construction</title>
      <p>This section describes the challenges of constructing KGs for patents from three
perspectives: the complexity of patents’ structure, natural language, and intended use.</p>
      <sec id="sec-2-1">
        <title>2.1. Complexity of Patents’ Structure</title>
        <p>Patents are challenging to process and explore due to their length, structure, and
domain-specific vocabulary. Patents are published and preserved through an article-centric
paradigm which usually includes a title, an abstract, innovative claims, images, and a detailed
description of the protected invention. The title is a concise text which introduces the
main subject of the patent. It differs from titles given to other kinds of documents
because its purpose is not to raise the interest of the reader; it must precisely describe the
item or intellectual property the patent document covers. The abstract is a brief paragraph
that gives an overview of the content of the patent
without exposing too many details about the intellectual property; it is a key element used
today by patent writers and patent offices for exploring the patent landscape because it offers a
lightweight first impression of the protected item. Claims are short paragraphs
made up of only a few sentences that state what the patent legally protects; they are the most
sensitive section of a patent because, if an innovative element is not explicitly defined in the claims,
other patents can claim it as their own innovation.
The description section provides detailed information about what a patent protects, lists all its
components, and provides specific information about its intended uses. The images section
provides pictures that support the description of what is protected and are referenced in the
text. Finally, patents refer to other patents or scientific publications by listing them
in the final section of the document. This variety of sections makes it challenging to convey
their intended purposes in KGs. For example, representing in graph form the relationship
between an image and the text that describes it is not easy. In fact, the
semantic representation should reflect how a human
reads the text and connects it with the visual information delivered by the
image. This challenge remains open. Another challenge related to the structure
of patents is the representation of what is protected by the patent claims. A KG
describing this part of a patent document should be highly precise and free of errors
due to the legal requirements; however, this cannot be guaranteed with today’s technology, and
future research is required. Finally, because current systems are not fully precise,
the extraction of details might fail, producing incomplete KGs which cannot be completed
with existing techniques.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Natural Language Challenges and Limitations</title>
        <p>A main challenge of patent documents is that they store and preserve most of their
information in natural language text. Therefore, patent documents inherit the challenges and
drawbacks well known to the Natural Language Processing (NLP) community. Natural
language is unstructured, and thus it is difficult for machines and automatic systems to parse and use
it. Specifically, in the context of KG construction, NLP tools must be employed to extract
entities and identify relationships among them. However, this presents several challenges: (i)
it is not easy to determine whether a text span represents an entity relevant to the
subject of the patent, (ii) the same entity can appear in different surface forms (e.g., different texts
can be used to refer to the same thing), (iii) the same text can refer to two or more different
entities, and (iv) natural language is complex, and it is difficult to reproduce with triples the
relationships among entities described in the text. Addressing these challenges is crucial for the
patent domain. In fact, if two patents refer to different items but use the same vocabulary
(e.g., the word cup used to describe a small bowl-shaped container for drinking<sup>3</sup> is different from
the word cup used to describe trophies<sup>4</sup>), this must be taken into account while building the
knowledge graph by solving tasks such as entity disambiguation. Another important aspect
is the ambiguity of natural language, which might make
the information extracted by NLP tools misleading or incomplete. More precisely, triples in
the form &lt;head, predicate, tail&gt; might not contain sufficient context or complete information
to be used. For example, given the sentence “A method of assembling a water bottle cap system
positioning the small ring around a base of the small cap” from the patent US9771189B2<sup>5</sup>, the triple
&lt;water bottle cap system, position, small ring&gt; might be extracted; however, this
triple is incomplete since it does not state where the small ring
should be positioned.</p>
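        <p>The loss of context described above can be made concrete with a minimal Python sketch. It is not part of the pipeline discussed in this paper: the Triple class, its qualifiers field, and the "location" role are hypothetical names introduced only for illustration. The sketch shows one possible way to attach the missing positional context to the example triple.</p>
        <preformat><![CDATA[
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """A <head, predicate, tail> statement with optional context."""
    head: str
    predicate: str
    tail: str
    # Hypothetical extension: (role, value) pairs that keep the context
    # a bare triple would otherwise drop.
    qualifiers: tuple = ()

    def is_contextualized(self) -> bool:
        """Return True when at least one qualifier is attached."""
        return len(self.qualifiers) > 0


# The bare triple from the example sentence: where the ring goes is lost.
bare = Triple("water bottle cap system", "position", "small ring")

# The same statement with an assumed "location" qualifier taken from the sentence.
qualified = Triple(
    "water bottle cap system",
    "position",
    "small ring",
    qualifiers=(("location", "around a base of the small cap"),),
)

print(bare.is_contextualized(), qualified.is_contextualized())
```
]]></preformat>
        <p>Qualifiers are only one option; reification or n-ary relations would be alternative designs, and which of them suits legal text best is left open here.</p>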
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Knowledge Graph Intended Use</title>
        <p>One of the most difficult challenges in the construction of KGs is the definition of their usage
boundaries. In fact, several KGs about patents can be envisioned to answer the demands of the domain.
More precisely, the following KGs might be built on the basis of the intended use:
          <list list-type="bullet">
            <list-item><p>Metadata KGs. This type of KG describes the metadata of patents and represents
relationships that exist among patents themselves (the patent citation
network), between authors and patents,
among patent authors, and between patent authors and
patent offices. They can also be built by exploiting the keywords associated with patents
or the CPC codes used for their classification. Building these KGs requires the definition of
proper schemas and ontologies since their main goal is to allow efficient and fast search
over a large number of patent documents.</p></list-item>
            <list-item><p>Entity Mention KGs. This type of KG might be used to describe entities that are
directly mentioned in patent documents. More precisely, they can be used to describe
whether an entity is specifically created by a patent document, whether an entity is used
to create a new innovation, whether an entity is a legally protected item, and so on. Such
KGs present challenges in detecting the entities from the textual or visual content
of the patent document. Furthermore, they are challenging because they
should specify the role of each identified entity in the KG.</p></list-item>
            <list-item><p>Content-based KGs. These KGs describe the content of patents by extracting and formally
representing the relationships between extracted entities, converting the meaning of
natural language sentences into RDF triples. These KGs are difficult to build because of
(i) the complexity of formally representing natural language, and (ii) the kind of intent
that should be conveyed in the triples based on the section in which the represented
content is placed.</p></list-item>
          </list>
          <fn><label>3</label><p>https://patents.google.com/patent/US8807371B2/en?q=(cup+drinking)&amp;oq=cup+drinking</p></fn>
          <fn><label>4</label><p>https://patents.google.com/patent/US6783255B1/en?q=(cup+trophy)&amp;oq=cup+trophy</p></fn>
          <fn><label>5</label><p>https://patents.google.com/patent/US9771189B2/en?q=(water+bottle)&amp;oq=water+bottle</p></fn>
        </p>
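        <p>As an illustration of the metadata KG described above, the following Python sketch stores a few statements as triples, serializes them as N-Triples, and builds a reverse index over the citation network. It is a sketch under stated assumptions: the example.org namespace, the predicate names (cites, hasAuthor, hasCpcCode), and all identifiers are hypothetical placeholders, not a proposed schema and not real patent data.</p>
        <preformat><![CDATA[
```python
from collections import defaultdict

# Hypothetical namespace, used only for illustration.
BASE = "http://example.org/patent-kg/"


def iri(local_name: str) -> str:
    """Wrap a local name into an N-Triples IRI term."""
    return "<" + BASE + local_name + ">"


def to_ntriples(triples):
    """Serialize (subject, predicate, object) local names as N-Triples lines."""
    return [f"{iri(s)} {iri(p)} {iri(o)} ." for s, p, o in triples]


def cited_by(triples):
    """Reverse index of the citation network: cited patent -> citing patents."""
    index = defaultdict(set)
    for s, p, o in triples:
        if p == "cites":
            index[o].add(s)
    return dict(index)


# Placeholder records, not real patent metadata.
metadata_triples = [
    ("patent_A", "cites", "patent_B"),
    ("patent_A", "hasAuthor", "author_1"),
    ("patent_B", "hasAuthor", "author_1"),
    ("patent_A", "hasCpcCode", "cpc_code_1"),
]

for line in to_ntriples(metadata_triples):
    print(line)
print(cited_by(metadata_triples))
```
]]></preformat>
        <p>A production setting would more likely rely on an RDF library and a formally defined ontology; plain tuples are used here only to keep the sketch self-contained.</p>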
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Patent Knowledge Graph Benefits</title>
      <p>This section describes some of the main benefits that KGs can bring to the patent domain to
address its challenges. Due to the rapid growth of patent
data available online, efficient and effective analysis of such documents has become a crucial task. Furthermore,
it is critical for patent searchers to find the right information to support
business-critical decisions.</p>
      <p>Efficient Search and Discovery. Patent KGs which capture the semantic relations between
patents have the potential to facilitate the efficient and effective discovery, exploration, and
inference of patent-relevant information. Various information retrieval systems can easily
exploit such KGs. For instance, patent landscaping systems, which aim to identify patents related
to a specific topic, often rely on sophisticated models; incorporating patent
KGs can lead to more efficient and effective solutions. Another example is a patent
question-answering system. To date, little effort has been devoted to developing such systems for patents.
KGs would provide the basis for researchers to develop a question-answering
system for patents.</p>
      <p>Provenance. KGs might play a relevant role in tracking novel intellectual properties over time,
allowing a deep analysis of what is being developed at specific points in time and, therefore,
enabling the preservation of relevant historical facts about patents. For example, KGs can be
used to represent the history of patents from application to publication, thus enabling
the formal representation of provenance information about new intellectual properties.</p>
      <p>Explainability and Interpretability. KGs are becoming increasingly important for
the explainability and interpretability of models applied to any kind of data. The patent domain
is no exception, and patent KGs can help explain why a certain patent is classified
under a certain category as well as why a patent document refers to another one. Furthermore,
interpretability and explainability assessment criteria (e.g., reliability, causality) are increasingly
required by society and industry, and patent KGs would support such evaluations.
They would enhance our understanding of patent processes, uncover patterns used by inner
mechanisms, and empower patent platforms with systems that increase people’s trust
in intelligent systems for patents.</p>
      <p>Automatization. KGs can unlock the understanding of how new intellectual properties and
items might be exploited for uses they were not designed for. More precisely, patent KGs can
be used with state-of-the-art technologies to represent inventions and their characteristics in
complex vector models that can be used to find similarities among intellectual properties as
well as enable the use of machine learning models on such data for tasks such as classification
and clustering.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Computer Science Patent KG</title>
      <p>In this section, we introduce our first efforts to build a KG about the content of patent
documents. To this end, we explain the modules that we are currently experimenting
with on a subset of patents from the computer science domain. We reproduce the
pipeline described in [5], describe the benefits that a full-fledged KG may bring, and outline
future developments tailored to the patent domain.</p>
      <sec id="sec-4-1">
        <title>4.1. Pipeline outlook and Use Case</title>
        <p>The pipeline is based on supervised and unsupervised components. More precisely, it is
composed of:
          <list list-type="bullet">
            <list-item><p>Extractor Modules. These modules use state-of-the-art extractors to find computer science
entities in natural language text. More precisely, the tools used are DyGIEpp, the
CSO classifier, and the Stanford Core NLP suite. DyGIEpp and the CSO classifier are used
to extract entities and a set of predefined relationships. Entities are associated with 5
different types: method, task, material, metric, and other entity. Stanford Core NLP is
used to extract, directly from the text, the verbs that relate entities. These modules
provide the basic set of triples that are used to build the KG.</p></list-item>
            <list-item><p>Cleaning Modules. The cleaning modules use a set of heuristics to lemmatize the
entities, merge similar entities, and link the entities to external knowledge bases such
as DBpedia and Wikidata. Moreover, these modules map verbs with similar
meanings to unique representatives given by a hand-crafted taxonomy (for example, the
verbs use, employ, utilize, and exploit are all mapped to the same verb use).</p></list-item>
            <list-item><p>Classification Module. The classification module uses transformers to automatically
validate the extracted triples, which might be incorrect or misleading. To do so, this
module exploits a classifier trained on scientific documents and fine-tuned on trustworthy
triples (i.e., triples that have been frequently extracted and, hence, are supported by several
scientific papers).</p></list-item>
            <list-item><p>Ontology-based Module. This module uses a formally defined ontology for representing
the relationships among methods, metrics, tasks, and materials; it maps all the generated
triples to this ontology and discards triples that do not comply with its defined semantics.</p></list-item>
          </list>
        </p>
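        <p>The cleaning and ontology-based steps listed above can be sketched in a few lines of Python. This is a simplified illustration and not the implementation of [5]: only the use/employ/utilize/exploit group comes from the text, while the remaining taxonomy entries, the allowed type patterns, and the function names are hypothetical. Whitespace and case normalization stands in for real lemmatization, and verbs are assumed to arrive already lemmatized.</p>
        <preformat><![CDATA[
```python
# Verb taxonomy: only the "use" group is taken from the text;
# the "include" group is a hypothetical addition for illustration.
VERB_TAXONOMY = {
    "use": "use", "employ": "use", "utilize": "use", "exploit": "use",
    "include": "include", "comprise": "include",
}

# Hypothetical ontology constraints: (head type, predicate, tail type).
ALLOWED_PATTERNS = {
    ("method", "use", "material"),
    ("method", "use", "method"),
    ("task", "use", "method"),
}


def normalize_entity(text: str) -> str:
    """Naive stand-in for lemmatization: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def clean(raw_triples, entity_types):
    """Map verbs to representatives and keep only ontology-compliant triples."""
    cleaned = []
    for head, verb, tail in raw_triples:
        predicate = VERB_TAXONOMY.get(verb.lower())
        if predicate is None:
            continue  # verb not covered by the taxonomy
        head, tail = normalize_entity(head), normalize_entity(tail)
        pattern = (
            entity_types.get(head, "other entity"),
            predicate,
            entity_types.get(tail, "other entity"),
        )
        if pattern in ALLOWED_PATTERNS:
            cleaned.append((head, predicate, tail))
    return cleaned


raw = [
    ("Query  Generator", "employ", "domain specific language"),
    ("neural network", "evaluate", "accuracy"),
    ("accuracy", "utilize", "neural network"),
]
types = {
    "query generator": "method",
    "domain specific language": "material",
    "neural network": "method",
    "accuracy": "metric",
}
print(clean(raw, types))
```
]]></preformat>
        <p>In this toy run only the first triple survives: the second uses a verb outside the taxonomy, and the third violates the assumed type patterns.</p>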
        <p>The pipeline has been applied to 1085 patents published in 2017 and 2018 from
the Harvard USPTO Patent Dataset (HUPD)<sup>6</sup>. To limit our investigation to the computer science
domain, we selected patents whose USPTO class is Data Processing - Artificial Intelligence<sup>7</sup>. The
generated KG includes more than 3K entities and more than 4K triples. The reader can find the
patents used as well as the generated KG at https://github.com/danilo-dessi/patent.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. PatentKG Analysis</title>
        <p>In this section, we describe the benefits that can be envisioned with the generated KG. To start
with, it allows the investigation of fine-grained patent elements and their interactions. For example,
we can observe generated triples describing language generator systems for specific domains, e.g.,
&lt;query generator, uses, domain specific language&gt;, or triples such as &lt;boltzmann
machine, analyzes, gibbs distribution&gt; describing complex relationships which have
been explored in domains like physics and cognitive sciences. Additionally, such KGs can also
provide information about broader concepts that exist within the computer science domain.
For example, it is possible to study triples that have been found in more than one patent and,
therefore, describe pieces of knowledge that are more common in the domain. Examples of such
pieces of knowledge are: &lt;computer program, uses, computer storage medium&gt;, &lt;ai
model, uses, learner module&gt;, &lt;neural network, includes, synapsis&gt;. In addition,
this kind of KG can help patent offices and stakeholders explore how elements have been used
in already protected inventions, thus supporting the evaluation process for newly submitted
patents. Finally, these KGs might enhance the study of patent domain dynamics by
analyzing how protected inventions and their components are related over time, thus providing
a means to make sense of the patent landscape.
          <fn><label>6</label><p>https://huggingface.co/datasets/HUPD/hupd</p></fn>
          <fn><label>7</label><p>https://www.uspto.gov/web/patents/classification/uspc706/defs706.htm</p></fn>
        </p>
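        <p>The idea of studying triples found in more than one patent can be sketched as a simple support count. The Python code below is a minimal illustration, assuming the KG is available as a mapping from patent identifiers to sets of triples; the function names, the min_patents parameter, and the patent identifiers are hypothetical, and the assignment of triples to patents is made up for the example.</p>
        <preformat><![CDATA[
```python
from collections import defaultdict


def triple_support(triples_by_patent):
    """Count in how many distinct patents each triple occurs."""
    support = defaultdict(set)
    for patent_id, triples in triples_by_patent.items():
        for triple in triples:
            support[triple].add(patent_id)
    return {triple: len(patents) for triple, patents in support.items()}


def common_triples(triples_by_patent, min_patents=2):
    """Return, in sorted order, the triples supported by at least min_patents patents."""
    counts = triple_support(triples_by_patent)
    return sorted(t for t, n in counts.items() if n >= min_patents)


# Triples are taken from the text; their assignment to the placeholder
# patents below is invented for illustration.
triples_by_patent = {
    "patent_1": {
        ("computer program", "uses", "computer storage medium"),
        ("query generator", "uses", "domain specific language"),
    },
    "patent_2": {
        ("computer program", "uses", "computer storage medium"),
        ("boltzmann machine", "analyzes", "gibbs distribution"),
    },
}
print(common_triples(triples_by_patent))
```
]]></preformat>
        <p>Grouping the same counts by publication year would be one possible starting point for the analysis of domain dynamics over time mentioned above.</p>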
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Limitation and Development Plan</title>
        <p>To better fit the patent domain and overcome some limitations, we plan to revise some modules
of the pipeline. To start with, we are developing ad-hoc modules to extract entities and relations
from patents. This will allow the new pipeline to focus only on entities that are relevant for patent
documents. To do so, we are experimenting with deep learning models for key-phrase
extraction. Second, we will revise the current verb taxonomy to better represent the use of verbs
in the resulting knowledge graph. This is an important factor in a patent KG because
the legal nature of its content restricts the usage and meaning of the vocabulary. Third,
we plan to create specific modules for each patent section; as explained in Section 2.1, patent
sections have specific intents, and thus the ontology used to represent the information of each
section should be designed accordingly.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this paper, we presented an overview of the use of KGs for the patent domain. In particular,
we highlighted three challenges related to the current paradigm for the protection of intellectual
property. Then, we presented the benefits such technologies can bring to the domain and
envisioned their use for a multitude of tasks. Finally, we introduced a reproducibility
study that adapts an existing pipeline tailored to the scholarly domain to
the patent domain, and presented some examples of the benefits KG-based technologies can bring to
the patent domain.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M. Y.</given-names>
            <surname>Jaradeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Oelen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. E.</given-names>
            <surname>Farfar</surname>
          </string-name>
          , et al.,
          <article-title>Open research knowledge graph: next generation infrastructure for semantic scholarly knowledge</article-title>
          ,
          <source>in: Proceedings of the 10th International Conference on Knowledge Capture</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>243</fpage>
          -
          <lpage>246</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>