<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Modular Development in Patent AI Space: A Case Study</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mahesh Maan</string-name>
          <email>mahesh@projectpq.ai</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sam Zellner</string-name>
          <email>sam@projectpq.ai</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anirudh Sanutra</string-name>
          <email>anirudh@projectpq.ai</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>GreyB</institution>
          ,
          <addr-line>Mohali</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>PQAI</institution>
          ,
          <addr-line>Atlanta</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <fpage>36</fpage>
      <lpage>39</lpage>
      <abstract>
        <p>Numerous development efforts are underway that aim to apply recent advances in artificial intelligence to a variety of patent-related tasks such as prior-art searching, technology landscaping, and patent classification. In this paper, we advocate for researchers to align their work with a modular system of software components. We show how such alignment will make it easier for researchers to prototype new systems, upgrade existing systems, collaborate, and build upon each other's work. We also present such a system of modular components that we created while developing PQAI, an AI-based prior-art search platform.</p>
      </abstract>
      <kwd-group>
        <kwd>Patent AI</kwd>
        <kwd>Patent Retrieval</kwd>
        <kwd>Prior Art Search</kwd>
        <kwd>Open Source</kwd>
        <kwd>Modular Software Architecture</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 INTRODUCTION</title>
      <p>
        As a result of the rapid advancements in the field of machine
learning, the prevalent optimism surrounding AI in general, and the
ongoing efforts in the IP industry [
        <xref ref-type="bibr" rid="ref1 ref16 ref17 ref3 ref8 ref9">1, 3, 8, 9, 16, 17</xref>
        ] and academia
[
        <xref ref-type="bibr" rid="ref12 ref13 ref15 ref6">6, 12, 13, 15</xref>
        ], we foresee the development of many AI-centered
patent data mining software platforms in the coming years. If the
current trend persists, the field of AI will continue to evolve rapidly
during this time. To stay relevant, these platforms will therefore
need to continuously experiment with and adopt the best
AI tools as they become available. In this paper, we show that
by designing modular components, researchers involved in the
development and refinement of such platforms can ensure that
their systems are flexible enough to evolve quickly and versatile
enough to leverage a spectrum of AI techniques. We advocate for
more researchers to consider fitting their work within a modular
system. This would enable the community of patent data mining
researchers to collaborate more easily and build upon each other’s
work. We also present an example of such a modular system,
designed during the development of PQAI [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], an AI-based prior-art
search platform. The structure of this paper is as follows: Section 1
sets the context of the paper. Section 2 highlights a number of
existing problems associated with the development of AI-centric
patent data mining platforms (problems that might become more
critical in the future if these platforms continue to develop in isolation).
Section 3 describes how a common, open-source schema of modular
components could mitigate these problems. Section 4 delves into
the details of the modular system used in PQAI and describes some
of its components. Section 5 presents a few examples that show how such
components could be interlinked to build modular systems that target
specific patent data mining problems.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2 PROBLEMS</title>
      <p>Currently, most research groups developing patent-AI platforms
work in isolation. This leads to a number of inefficiencies
in the development efforts and also creates limitations for the users of
such tools. We highlight them below:
(1) Although each platform has a unique proposition that
focuses on a specific AI capability, rendering it usable in the
form of a software tool also requires developing a number of ‘commodity’ software
components for handling patent data. Such components include patent number parsers, patent
data repository wrappers, patent reading interfaces, and
exporting and reporting functions. Each development group
has to spend considerable resources building commodity
components that do not add to its core value proposition.
(2) The lack of standardization of patent-software
components makes it difficult for groups to
collaborate, or to upgrade, license, or sell their components for reuse by
others. Even though some platforms make a few API
services available, these are mostly incompatible with each
other, making it difficult to merge the data or functionalities of
two systems.
(3) The closed-source implementation of most platforms means
that users such as corporate IP teams and
law firms cannot build a custom UI or
custom functionality on top of what is available on the platform
by default.
(4) Since most platforms do not provide access to their inner
components or APIs, the entry effort for AI researchers who
want to explore and experiment with patent data (e.g., in
universities) is high. They have to arrange their own databases
and create a base functionality layer as a prerequisite, which
takes considerable time and effort.</p>
    </sec>
    <sec id="sec-3">
      <title>3 SOLUTION</title>
      <p>
        To mitigate the problems described in the preceding section, we
propose a solution centered on the development, dissemination, and
adoption, by the research community working in the patent-AI space,
of an open-source schema of components with standardized interfaces,
together with their reference implementations. Our solution is inspired
partly by a few initiatives [
        <xref ref-type="bibr" rid="ref2 ref4 ref5 ref7">2, 4, 5, 7</xref>
        ] that facilitate modular and
efficient development in neighboring fields and partly by our own
exploration of the design space of a patent-AI system while
developing a prior-art search engine. Although modular development
in patent information retrieval has been attempted before [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ],
in practice there is a general lack of the open-source frameworks and
resources critical for driving industry-wide adoption and efficiency.
We believe that the availability of such resources would help
researchers to develop experimental prototypes faster, upgrade
existing software platforms with ease, and smoothly build upon each
other’s work. These resources include:
(1) An open-source schema of standardized software
components: This schema acts as a blueprint of a family of highly
customizable and frequently used software components that
can be interconnected like Lego pieces. These
components should be standard in the sense that they have
well-defined input-output characteristics, although the schema
imposes no restriction on their implementation. Following
such a schema can help ensure that components
created by groups completely isolated from each other are still
inter-compatible. This schema should be defined at a high
enough level to accommodate a variety of AI techniques
(a few examples are presented in the next section).
(2) Open-source implementations of software components:
Access to a library of such components will enable researchers
and developers to avoid spending time creating their own
‘base layer’ functionalities (such as standard searching and
filtering operations, patent number parsing, patent
rendering, data management, etc.) and reference implementations
for comparison. Instead, maximum effort could then be
invested in creating new and improved components or building
new capabilities by combining existing components in new
ways.
(3) Remote API access to software components and datasets: API
access will facilitate and encourage small-scale development
and experimentation with patent data. This would be useful
for resource-constrained efforts, such as where a small
in-house IP team hires a freelance developer to create a
lightweight dashboard for accessing and exploring patent
data in their own specific field, or where a small team of
students is carrying out a university project involving patent
data analysis. API access would obviate the need for
such groups to set up a heavy system to get a little done.
      </p>
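      <p>The idea of a schema that fixes input-output characteristics while leaving the implementation open can be illustrated with a minimal Python sketch. The names used here (Encoder, LengthEncoder, VowelEncoder, encode_all) are hypothetical and are not taken from the PQAI code base; the two toy classes merely stand in for components written by groups working in isolation.</p>
      <preformat>
```python
from typing import List, Protocol, runtime_checkable


@runtime_checkable
class Encoder(Protocol):
    """Hypothetical schema entry: the interface is fixed, the implementation is free."""

    def encode(self, text: str) -> List[float]:
        ...


class LengthEncoder:
    """Toy implementation from 'group A': character count and word count."""

    def encode(self, text: str) -> List[float]:
        return [float(len(text)), float(len(text.split()))]


class VowelEncoder:
    """Toy implementation from 'group B': one count per vowel."""

    def encode(self, text: str) -> List[float]:
        lowered = text.lower()
        return [float(lowered.count(v)) for v in "aeiou"]


def encode_all(encoder: Encoder, texts: List[str]) -> List[List[float]]:
    """Downstream code depends only on the interface, never on a concrete class."""
    return [encoder.encode(t) for t in texts]


texts = ["solar cell", "fuel injector"]
vectors_a = encode_all(LengthEncoder(), texts)
vectors_b = encode_all(VowelEncoder(), texts)
```
      </preformat>
      <p>Because both classes satisfy the same interface, either can be swapped into encode_all without changing the calling code, which is the kind of inter-compatibility the schema is meant to guarantee.</p>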
    </sec>
    <sec id="sec-4">
      <title>4 COMPONENTS</title>
      <p>
        In this section, we present a non-exhaustive list of versatile
components from the PQAI library [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], which offer functionalities
frequently required in a range of patent data mining operations.
Under the PQAI initiative, the authors aim to define a
standardized schema of these components and release their concrete
implementations as open-source code. For some of the components, we
first describe the abstract version and then the patent-specific
version. The abstract version is unlikely to be used in real-world
development, but knowing its behavior helps in understanding the
behavior of the family of components that can be derived from it.
      </p>
      <p>Patent Database: This is an instance of the abstract component
Storage (described at the end of this list). Patent Database is a special
component in the sense that all components can be configured to
access it. A major benefit of this approach is that components can
pass around references into the Patent Database (e.g., patent
numbers) instead of the patent data itself. This keeps the component
interfaces clean and lightweight.</p>
      <p>Encoder: An encoder takes in an
entity and returns its representation. The input entities and output
representations can both take many forms, making this component
very versatile. For instance, one instantiation of an encoder is
a Patent Vectorizer, which accepts a patent
number (as described earlier, all components can retrieve patent
data given the patent number) and returns a vector embedding of
the given patent in a high-dimensional space. A
bag-of-words encoder is another example of this component.</p>
      <p>Index: An index is a data structure optimized for searching among
entity representations. It differs from a Storage in that it may not
necessarily be able to return the original representation. It accepts
a compatible query and returns a set of entity pointers. A Patent
Vector Index, for example, might accept a query vector and return
a set of patent numbers as top matches for the query. Note that the
Index accepts query representations and not raw queries. It therefore
has to be plugged into a suitable Encoder that turns the raw query
into a compatible representation.</p>
      <p>Ranker: A ranker accepts a set of
entities and returns a list of the same entities, the order of which is
determined by a ranking criterion. A Patent Ranker, for instance,
would accept a set of patents and a user query as input and order
those patents by decreasing relevance to the given query.</p>
      <p>Classifier: A classifier associates one of a finite set of predefined
labels with a patent, where each label has a unique meaning.
A Patent Classifier, for instance, could take as input a
set of patent numbers and associate with each patent number a
label that indicates, for example, whether the patent is related
to solar cell technology or not. Internally, classifiers can make use
of configurable classifier models, which can be initialized with
inputs such as (patent-number, label) pairs or a textual description.</p>
      <p>Consolidator: A consolidator accepts a set of patents and associates
one of a finite set of arbitrary labels with each patent. Essentially, it
creates clusters of patents where the patents in each cluster share some
common characteristics. A Technology Consolidator, for instance,
can accept a set of patent numbers and then group them into, say, 3
groups, depending on the technologies they relate to.</p>
      <p>Filter: A filter
accepts a set of entities and, depending on a filter criterion, returns a
subset of them. A Patent Filter, for instance, would return only those patents
that satisfy a condition such as a publication date criterion. Filters
can be cascaded to create a Filter Sequence. For instance, a date
period filter can be created by cascading a before-date filter and
an after-date filter.</p>
      <p>Sorter: The input-output characteristics of a
sorter are similar to those of a ranker, but in its output only the relative
positions of the entities matter. A Patent Sorter can, for instance,
accept a list of patent numbers and arrange them such that any
patent in the list is succeeded by the patent most similar to it in
the rest of the list. Such a sorter can be useful during a manual
review of patents (all related patents come in sequence and the
reviewer can make use of insights still available in their short-term
memory).</p>
      <p>Patent Number Parser: It accepts plain text as input,
detects and extracts any patent numbers in it, translates them into
a standard format (e.g., by truncating or adding zeros), and then
outputs a list of patent numbers that can be passed directly to the
Patent Database component. This component, when used at the
boundary of a patent data mining system, can eliminate the issues
that arise from patent number format mismatches.</p>
      <p>Storage: Storage is an
abstract wrapper around a data source. It stores entities that are
all of the same type, but other than this it makes no assumption
about how the data is stored (e.g., whether it is stored in a local
database, in primary memory, or on a remote server). A Storage
component performs two operations: it saves and retrieves entities.
In the saving operation, it accepts an entity and returns its entity
identifier. The retrieval operation is the opposite of the saving operation:
an entity is returned in response to a supplied identifier. Storages
can be configured to be read-only too.
      </p>
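      <p>To make a few of these descriptions concrete, the following Python sketch shows one way a Storage, a Patent Number Parser, and a cascaded date filter might look. It is an illustration under stated assumptions, not the PQAI implementation: all class and function names are hypothetical, the parser only recognizes a simplified US-style pattern, the seven-digit zero-padded target format is an arbitrary choice, and the two stored records carry made-up dates.</p>
      <preformat>
```python
import re
from dataclasses import dataclass
from datetime import date
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Patent:
    number: str
    published: date


class InMemoryStorage:
    """Storage sketch: save returns an identifier, retrieve returns the entity."""

    def __init__(self) -> None:
        self._items: Dict[str, Patent] = {}

    def save(self, patent: Patent) -> str:
        self._items[patent.number] = patent
        return patent.number

    def retrieve(self, identifier: str) -> Patent:
        return self._items[identifier]


# Simplified, hypothetical pattern: "US", digits with optional commas, optional kind code.
_PATENT_RE = re.compile(r"\bUS\s*(\d[\d,]*\d)\s*([AB]\d)?\b")


def parse_patent_numbers(text: str) -> List[str]:
    """Extract patent numbers and normalize them to an assumed standard format."""
    numbers = []
    for match in _PATENT_RE.finditer(text):
        digits = match.group(1).replace(",", "")
        kind = match.group(2) or ""
        numbers.append("US" + digits.zfill(7) + kind)  # assumption: pad to 7 digits
    return numbers


# A filter maps a list of patent numbers to a sub-list of them.
Filter = Callable[[List[str]], List[str]]


def published_after(db: InMemoryStorage, cutoff: date) -> Filter:
    return lambda numbers: [n for n in numbers if db.retrieve(n).published > cutoff]


def published_before(db: InMemoryStorage, cutoff: date) -> Filter:
    return lambda numbers: [n for n in numbers if cutoff > db.retrieve(n).published]


def cascade(*filters: Filter) -> Filter:
    """Filter Sequence: apply each filter to the output of the previous one."""
    def run(numbers: List[str]) -> List[str]:
        for apply_filter in filters:
            numbers = apply_filter(numbers)
        return numbers
    return run


db = InMemoryStorage()
db.save(Patent("US7654321B2", date(2010, 2, 2)))  # made-up publication date
db.save(Patent("US0123456", date(1990, 1, 1)))    # made-up publication date

numbers = parse_patent_numbers("Compare US 7,654,321 B2 with US123456.")
date_period = cascade(
    published_after(db, date(2000, 1, 1)),
    published_before(db, date(2020, 1, 1)),
)
selected = date_period(numbers)
```
      </preformat>
      <p>Note that the filters exchange only patent numbers and look up the publication date in the storage, mirroring the idea that components pass references rather than patent data.</p>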
    </sec>
    <sec id="sec-5">
      <title>5 EXEMPLARY SYSTEM</title>
      <p>In this section we show a few examples to demonstrate how the
components described in the last section can be interconnected to
form systems that carry out specific patent-related tasks.</p>
      <p>Patent search engine: The input here is a user query and possibly
one or more filters (such as date restrictions) and the output is a
ranked list of patents. Diagram 1 shows a possible implementation
using standard components.</p>
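      <p>One possible wiring of such a search engine is sketched below in Python. It is not necessarily the arrangement shown in the diagram: the ordering Encoder, Index, Filter, Ranker is an assumption, the bag-of-words encoder and inverted index are deliberately simple stand-ins for learned models, and the three records and their years are made-up toy data. All names are hypothetical.</p>
      <preformat>
```python
import math
from collections import Counter
from typing import Dict, List, Set

# Made-up toy records standing in for the Patent Database.
PATENT_DB: Dict[str, str] = {
    "P1": "solar cell with perovskite absorber layer",
    "P2": "fuel injector nozzle for diesel engine",
    "P3": "solar panel mounting bracket",
}
PUBLICATION_YEAR: Dict[str, int] = {"P1": 2019, "P2": 2015, "P3": 2008}


def encode(text: str) -> Counter:
    """Encoder: raw text to a bag-of-words representation."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class InvertedIndex:
    """Index: accepts a query representation and returns patent numbers only."""

    def __init__(self, numbers: List[str]) -> None:
        self._postings: Dict[str, Set[str]] = {}
        for number in numbers:
            for term in encode(PATENT_DB[number]):
                self._postings.setdefault(term, set()).add(number)

    def search(self, query_vector: Counter) -> Set[str]:
        hits: Set[str] = set()
        for term in query_vector:
            hits |= self._postings.get(term, set())
        return hits


def filter_by_year(numbers: Set[str], min_year: int) -> Set[str]:
    """Filter: keep patents published in or after min_year."""
    return {n for n in numbers if PUBLICATION_YEAR[n] >= min_year}


def rank(numbers: Set[str], query_vector: Counter) -> List[str]:
    """Ranker: order by decreasing cosine similarity, ties broken by number."""
    return sorted(numbers, key=lambda n: (-cosine(query_vector, encode(PATENT_DB[n])), n))


def search(query: str, min_year: int) -> List[str]:
    query_vector = encode(query)               # Encoder
    candidates = INDEX.search(query_vector)    # Index
    kept = filter_by_year(candidates, min_year)  # Filter
    return rank(kept, query_vector)            # Ranker


INDEX = InvertedIndex(list(PATENT_DB))
results = search("solar cell", 2000)
```
      </preformat>
      <p>Replacing the toy encoder or ranker with a learned model would leave the search function unchanged, as long as the replacement keeps the same inputs and outputs.</p>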
      <p>Technology monitoring system: The input here is a specification
of a technology area and a time period of interest and the output
is a set of recent patents and published applications. A generic,
reconfigurable classifier is used here, which uses a textual description
of the technology or a white-list of patents to adjust its operation
(e.g., by training an internal ML model). Diagram 2 shows a possible
implementation using standard components.</p>
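      <p>A minimal Python sketch of such a monitoring pipeline follows. It assumes a date filter followed by a reconfigurable classifier; the classifier here merely stores keywords from the textual description, whereas a real system might train an internal ML model at that step. The names and the three toy records are hypothetical and made up for illustration.</p>
      <preformat>
```python
from typing import Dict, List, Set

# Made-up toy records standing in for the Patent Database.
ABSTRACTS: Dict[str, str] = {
    "P1": "perovskite solar cell",
    "P2": "diesel fuel injector",
    "P3": "solar cell encapsulation film",
}
YEARS: Dict[str, int] = {"P1": 2021, "P2": 2021, "P3": 2012}


class KeywordClassifier:
    """Reconfigurable classifier sketch: keyword overlap instead of a trained model."""

    def __init__(self) -> None:
        self._keywords: Set[str] = set()

    def configure(self, description: str) -> None:
        # A real component might train a model here; this toy version stores words.
        self._keywords = set(description.lower().split())

    def classify(self, numbers: List[str]) -> Dict[str, bool]:
        """Label each patent True if its abstract shares a word with the description."""
        return {
            n: bool(self._keywords.intersection(ABSTRACTS[n].lower().split()))
            for n in numbers
        }


def monitor(description: str, first_year: int, last_year: int) -> List[str]:
    """Return patents from the given period that the classifier labels as relevant."""
    in_period = [n for n in sorted(ABSTRACTS) if YEARS[n] in range(first_year, last_year + 1)]
    classifier = KeywordClassifier()
    classifier.configure(description)
    labels = classifier.classify(in_period)
    return [n for n in in_period if labels[n]]


alerts = monitor("solar cell technology", 2020, 2021)
```
      </preformat>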
      <p>Technology landscaping: The output of a technology landscaping
study, unlike the preceding two examples, is much more complex
than a list of patents. A typical output, however, can be broken down
into a number of insights that are reached by manually
analyzing the patent data, typically through charts on a dashboard.
Some of these charts are quite basic, such as the patent filing trend
over the years. Others are more sophisticated, such as a heat map
of patent portfolio sizes held by the major players in a number of
technology sub-domains. Most of these charts, irrespective of the
complexity of the underlying data, plot two categorical variables.
The data for these charts can therefore be obtained by making
use of two consolidators, each of which groups the data points into
discrete clusters that correspond to the plotted categories (see
Figure 3).</p>
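      <p>The two-consolidator idea can be sketched in Python as a cross-tabulation: each consolidator assigns one label per patent, and counting label pairs yields the cell values of a heat map. The consolidators below group by a stored assignee field and by the first word of the text, which are simplistic stand-ins for real clustering; the names and the four records are hypothetical and made up for illustration.</p>
      <preformat>
```python
from collections import Counter
from typing import Callable, Dict, List

# Made-up toy records standing in for the Patent Database.
RECORDS: Dict[str, Dict[str, str]] = {
    "P1": {"assignee": "Acme", "text": "solar cell"},
    "P2": {"assignee": "Acme", "text": "solar panel"},
    "P3": {"assignee": "Bolt", "text": "fuel injector"},
    "P4": {"assignee": "Acme", "text": "fuel pump"},
}

# A consolidator maps patent numbers to one cluster label each.
Consolidator = Callable[[List[str]], Dict[str, str]]


def by_assignee(numbers: List[str]) -> Dict[str, str]:
    return {n: RECORDS[n]["assignee"] for n in numbers}


def by_first_keyword(numbers: List[str]) -> Dict[str, str]:
    # Toy stand-in for a Technology Consolidator.
    return {n: RECORDS[n]["text"].split()[0] for n in numbers}


def cross_tabulate(numbers: List[str], rows: Consolidator, cols: Consolidator) -> Counter:
    """Count patents per (row label, column label) pair, i.e. one heat-map cell each."""
    row_of = rows(numbers)
    col_of = cols(numbers)
    return Counter((row_of[n], col_of[n]) for n in numbers)


table = cross_tabulate(sorted(RECORDS), by_assignee, by_first_keyword)
```
      </preformat>
      <p>Swapping either consolidator for another (for example, one that groups by filing year) would produce the data for a different chart without touching the counting step.</p>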
    </sec>
    <sec id="sec-6">
      <title>6 CONCLUSIONS AND FUTURE WORK</title>
      <p>We showed how the ongoing and future software development
efforts in the patent-AI space can be greatly facilitated by aligning the
development with a schema of modular and highly customizable
software components. Through examples we showed how the
components in such a schema can be interconnected like Lego bricks
to create varied patent data mining systems optimized for
different tasks. Following such a schema will enable faster
prototyping, smoother upgrades, and easier collaboration among research
groups. We therefore encourage researchers to consider fitting their
work in a common modular system to speed up research and
development in the patent-AI space. In future work, we aim to define and
release such a modular system and open-source implementations
of its constituent components under the PQAI initiative. We invite
industry experts and researchers for collaboration in defining and
standardizing this system.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <collab>AMPLIFIED</collab>
          .
          <article-title>Better prior art. Faster</article-title>
          ,
          <year>2020</year>
          . https://www.amplified.ai/ Last accessed 2021-05-09.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <collab>APACHE FOUNDATION</collab>
          .
          <article-title>Apache UIMA</article-title>
          ,
          <year>2013</year>
          . https://uima.apache.org/ Last accessed 2021-05-09.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <collab>CINTIAN</collab>
          .
          <article-title>Welcome to the Patent Data Revolution</article-title>
          ,
          <year>2020</year>
          . https://www.cintian.ai/ Last accessed 2021-05-09.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <collab>EXPLOSION</collab>
          .
          <article-title>Industrial-Strength Natural Language Processing</article-title>
          ,
          <year>2021</year>
          . https://spacy.io/ Last accessed 2021-05-09.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ji</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>Cheng</surname>
          </string-name>
          .
          <article-title>MatchZoo: A learning, practicing, and developing system for neural text matching</article-title>
          .
          <source>In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , pages
          <fpage>1297</fpage>
          -
          <lpage>1300</lpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>L.</given-names>
            <surname>Helmers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Horn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Biegler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Oppermann</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.-R.</given-names>
            <surname>Müller</surname>
          </string-name>
          .
          <article-title>Automating the search for a patent's prior art with a full text similarity search</article-title>
          .
          <source>PLoS ONE</source>
          ,
          <volume>14</volume>
          (
          <issue>3</issue>
          ): e0212103,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <collab>HUGGING FACE</collab>
          .
          <article-title>The AI community building the future</article-title>
          ,
          <year>2021</year>
          . https://huggingface.co/ Last accessed 2021-05-09.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <collab>INQUARTIK CORPORATION</collab>
          .
          <article-title>Find patents with ease</article-title>
          ,
          <year>2021</year>
          . https://www.inquartik.com/patentcloud/patent-search/ Last accessed 2021-05-09.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <collab>IPRALLY TECHNOLOGIES LTD</collab>
          .
          <article-title>Patent search done right</article-title>
          ,
          <year>2021</year>
          . https://www.iprally.com/ Last accessed 2021-05-09.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Klenner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bergmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zimmermann</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Romberg</surname>
          </string-name>
          .
          <article-title>Large scale chemical patent mining with UIMA and UNICORE</article-title>
          .
          <source>Journal of Cheminformatics</source>
          ,
          <volume>4</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>2</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Krishna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Feldman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Gabel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Beliveau</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Beach</surname>
          </string-name>
          .
          <article-title>Examiner assisted automated patents search</article-title>
          .
          <source>In 2016 AAAI Fall Symposium Series</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cui</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Hu</surname>
          </string-name>
          .
          <article-title>DeepPatent: patent classification with convolutional neural networks and word embedding</article-title>
          .
          <source>Scientometrics</source>
          ,
          <volume>117</volume>
          (
          <issue>2</issue>
          ):
          <fpage>721</fpage>
          -
          <lpage>744</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhao</surname>
          </string-name>
          .
          <article-title>Research on classification and similarity of patent citation based on deep learning</article-title>
          .
          <source>Scientometrics</source>
          ,
          <volume>123</volume>
          (
          <issue>2</issue>
          ):
          <fpage>813</fpage>
          -
          <lpage>839</lpage>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <collab>PROJECT PQAI</collab>
          .
          <article-title>Prior-Art Search for Everyone</article-title>
          ,
          <year>2020</year>
          . https://projectpq.ai/ Last accessed 2021-05-09.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>S.</given-names>
            <surname>Sarica</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Low</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Luo</surname>
          </string-name>
          .
          <article-title>Engineering knowledge graph for keyword discovery in patent search</article-title>
          .
          <source>In Proceedings of the Design Society: International Conference on Engineering Design</source>
          , volume
          <volume>1</volume>
          , pages
          <fpage>2249</fpage>
          -
          <lpage>2258</lpage>
          . Cambridge University Press,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <collab>SIMILARI</collab>
          .
          <article-title>Supercharge your IP</article-title>
          ,
          <year>2020</year>
          . https://similari.com/ Last accessed 2021-05-09.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <collab>THREE10 SOLUTIONS, INC.</collab>
          <article-title>The End of Keyword Searching</article-title>
          ,
          <year>2020</year>
          . https://www.dorothyai.com/platform Last accessed 2021-05-09.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>