Modular Development in Patent AI Space: A Case Study
                  Mahesh Maan                                               Sam Zellner                                    Anirudh Sanutra
                     GreyB                                                     PQAI                                              GreyB
                  Mohali, India                                            Atlanta, USA                                       Mohali, India
               mahesh@projectpq.ai                                       sam@projectpq.ai                                 anirudh@projectpq.ai

PatentSemTech, July 15th, 2021, online
© 2021 for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

ABSTRACT
Numerous development efforts are underway that aim to apply the recent advancements in the
field of artificial intelligence to a variety of patent-related tasks such as prior-art
searching, technology landscaping, patent classification, etc. In this paper, we advocate for
researchers to align their work with a modular system of software components. We show how such
alignment will make it easier for researchers to prototype new systems, upgrade existing
systems, collaborate, and build upon each other’s work. We also present such a system of
modular components that we created while developing PQAI, an AI-based prior-art search
platform.

KEYWORDS
Patent AI, Patent Retrieval, Prior Art Search, Open Source, Modular Software Architecture

1    INTRODUCTION
As a result of the rapid advancements in the field of machine learning, the prevalent optimism
surrounding AI in general, and the ongoing efforts in the IP industry [1, 3, 8, 9, 16, 17] and
academia [6, 12, 13, 15], we foresee the development of many AI-centered patent data mining
software platforms in the coming years. If the current trend persists, the field of AI will
continue to evolve rapidly during this time. As a result, to stay relevant, these platforms
will need to continuously experiment with and adopt the latest and best AI tools as they become
available. In this paper, we show that by designing modular components, researchers involved in
the development and refinement of such platforms can ensure that their systems are flexible
enough to evolve quickly and versatile enough to leverage a spectrum of AI techniques. We
advocate for more researchers to consider fitting their work within a modular system. This
would enable the community of patent data mining researchers to more easily collaborate and
build upon each other’s work. We also present an example of such a modular system, designed
during the development of PQAI [14], an AI-based prior-art search platform.

The structure of this paper is as follows: Section 1 sets the context of the paper. Section 2
highlights a number of existing problems associated with the development of AI-centric
patent-data mining platforms (and which might become more critical in the future as these
platforms continue to develop in isolation). Section 3 describes how a common, open-source
schema of modular components could mitigate these problems. Section 4 delves into the details
of the modular system used in PQAI and describes some of its components. Section 5 presents a
few examples that show how such components could be interlinked to build modular systems to
target specific patent data mining problems.

2    PROBLEMS
Currently, most research groups developing patent-AI platforms work in isolation. This leads to
a number of inefficiencies in the development efforts and creates limitations for the users of
such tools. We highlight them below:

    (1) Although each platform has a unique proposition that focuses on a specific AI
        capability, rendering it usable in the form of a software tool also requires
        developing a number of ‘commodity’ software components for handling patent data. Such
        components include patent number parsers, patent data repository wrappers, patent
        reading interfaces, exporting and reporting functions, etc. Each development group has
        to spend considerable resources building commodity components that do not add to their
        core value proposition.
    (2) The fact that there is no standardization of patent-software components makes it
        difficult for various groups to collaborate, upgrade, license, or sell their components
        for reuse by others. Even though some of the platforms make a few API services
        available, these are mostly incompatible with each other, making it difficult to merge
        the data or functionalities of two systems.
    (3) The closed-source implementation of most platforms means that users such as corporate
        IP teams and law firms cannot add a custom UI or custom functionality on top of what
        the platform offers by default.
    (4) Since most platforms do not provide access to their inner components or APIs, the entry
        effort for AI researchers who want to explore and experiment with patent data (e.g., in
        universities) is high. They have to arrange their own databases and create a base
        functionality layer as a prerequisite. This takes considerable time and effort.

3    SOLUTION
To mitigate the problems described in the preceding section, we propose a solution centered on
development, dissemination, and adoption of an open-source schema of components with
standardized interfaces and their reference implementations by the research community working
in the patent-AI space. Our solution is inspired partly by a few initiatives [2, 4, 5, 7] that
facilitate modular and efficient development in neighboring fields and partly by our own
exploration of the design space of a patent-AI system while developing a prior-art search
engine. Although modular development in patent information retrieval has been attempted before
[10, 11], in practice, there is a general lack of open-source frameworks and
resources critical for driving industry-wide adoption and efficiency. We believe that the
availability of such resources would help researchers to develop experimental prototypes
faster, upgrade existing software platforms with ease, and smoothly build upon each other’s
work. These resources include:

    (1) An open-source schema of standardized software components: This schema acts as a
        blueprint of a family of highly customizable and frequently used software components
        that can be interconnected like Lego pieces. These components should be standard in the
        sense that they have well-defined input-output characteristics, although the schema
        imposes no restriction on their implementation. Following such a schema can help in
        ensuring that components created by groups completely isolated from each other are
        still inter-compatible. This schema should be defined at a high level to be able to
        accommodate a variety of AI techniques (a few examples are presented in the next
        section).
    (2) Open-source implementations of software components: Access to a library of such
        components will enable researchers and developers to avoid spending time creating their
        own ‘base layer’ functionalities (such as standard searching and filtering operations,
        patent number parsing, patent rendering, data management, etc.) and reference
        implementations for comparison. Instead, maximum effort could then be invested in
        creating new and improved components or building new capabilities by combining existing
        components in new ways.
    (3) Remote API access to software components and datasets: API access will facilitate and
        encourage small-scale development and experimentation with patent data. This would be
        useful for resource-constrained efforts, such as where a small in-house IP team can
        hire a freelance developer to create a lightweight dashboard for accessing and
        exploring patent data of their own specific field, or where a small team of students is
        carrying out a university project involving patent data analysis. The API access would
        obviate the need for such groups to set up a heavy system to get a little done.

4    COMPONENTS
In this section, we present a non-exhaustive list of versatile components from the PQAI library
[14], which offer functionalities frequently required in a range of patent data mining
operations. Under the PQAI initiative, the authors aim to define a standardized schema of these
components and release their concrete implementations as open-source code. For some of the
components, we first describe their abstract versions and then the patent-specific version. The
abstract version is unlikely to be used in real-world development, but knowing its behavior
helps in understanding the behavior of the family of components that can be derived from it.

Patent Database: This is an instance of the abstract component Storage (described at the end of
this list). Patent Database is a special component in the sense that all components can be
configured to access it. A major benefit of this approach is that components can pass around
references within the Patent Database (e.g., patent numbers) instead of patent data itself.
This keeps the component interfaces clean and lightweight.

Encoder: An encoder takes in an entity and returns its representation. The input entities and
output representations can both take many forms, making this component very versatile. For
instance, one instantiation of an encoder can be in the form of a Patent Vectorizer, which
accepts a patent number (as described earlier, all components can retrieve patent data given
the patent number) and returns a vector embedding in a high-dimensional space that corresponds
to the given patent. A bag-of-words encoder can be another example of this component.

Index: An index is a data structure optimized for searching among entity representations. It
differs from a Storage in that it may not necessarily be able to return the original
representation. It accepts a compatible query and returns a set of entity pointers. A Patent
Vector Index, for example, might accept a query vector and return a set of patent numbers as
top matches for the query. Note that the Index accepts query representations and not raw
queries; therefore, it has to be plugged into a suitable Encoder to turn the raw query into a
compatible representation.

Ranker: It accepts a set of entities and returns a list of the same entities, the order of
which is determined by a ranking criterion. A Patent Ranker, for instance, would accept a set
of patents and a user query as input and order those patents in decreasing order of relevancy
to the given query.

Classifier: A classifier associates one of a finite set of predefined labels with a patent,
where the labels have unique meanings associated with them. A Patent Classifier, for instance,
could take as input a set of patent numbers and associate, with each patent number, a label,
which may indicate, for example, whether this patent is related to solar cell technology or
not. Internally, classifiers can make use of configurable classifier models, which can be
initialized with inputs such as (patent-number, label) pairs or a textual description.

Consolidator: A consolidator accepts a set of patents and associates one of a finite set of
arbitrary labels with each patent. Essentially, it creates clusters of patents where each
cluster’s patents have some common characteristics. A Technology Consolidator, for instance,
can accept a set of patent numbers and then group them into, say, 3 groups, depending on the
technologies they relate to.

Filter: A filter accepts a set of entities and, depending on a filter criterion, returns a
subset of them. A Patent Filter, for instance, would filter out patents satisfying a condition
such as a publication date criterion. Filters can be cascaded to create a Filter Sequence. For
instance, a date period filter can be created by cascading a before-date filter and an
after-date filter.

Sorter: The input-output characteristics of a sorter are similar to those of a ranker, but in
its output, only the relative positions of the entities matter. A Patent Sorter can, for
instance, accept a list of patent numbers and arrange them such that any patent in the list is
succeeded by the most similar patent to it in the rest of the list. Such a sorter can be useful
during a manual review of patents (all related patents come in sequence and the reviewer can
make use of insights still available in their short-term memory).

Patent Number Parser: It accepts plain text as input, then detects and extracts any patent
numbers in it, translates them into a standard format (e.g., by truncating or adding zeros),
and then outputs a list of patent numbers that can be directly inputted to the Patent Database
component. This component, when used at the boundary of a patent data mining system, can
eliminate all issues that arise due to patent number format mismatching.

Storage: Storage is an abstract wrapper around a data source. It stores entities that are all
of the same type but, other than this, it makes no assumption about how the data is stored
(e.g., whether it is stored in a local database, in the primary memory, or on a remote server).
A Storage component performs two operations: it saves and retrieves entities. In the saving
operation, it accepts an entity and returns its entity identifier. The retrieval operation is
the opposite of the saving operation: an entity is returned in response to a supplied
identifier. Storages can be configured to be read-only too.
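
As an illustrative sketch only (not code from the PQAI library), the following Python shows one way a few of the components described in this section might be expressed and connected. All class and method names here (PatentDatabase, BagOfWordsEncoder, BagOfWordsIndex, DateFilter, FilterSequence, save, retrieve, encode, search, apply), the word-overlap scoring, and the sample patent numbers are assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from typing import Dict, List


@dataclass(frozen=True)
class Patent:
    number: str
    text: str
    published: date


class PatentDatabase:
    """Storage instance (hypothetical): saves patents, retrieves them by number."""

    def __init__(self) -> None:
        self._records: Dict[str, Patent] = {}

    def save(self, patent: Patent) -> str:
        self._records[patent.number] = patent
        return patent.number  # the entity identifier

    def retrieve(self, number: str) -> Patent:
        return self._records[number]


class BagOfWordsEncoder:
    """Encoder (hypothetical): turns raw text into a bag-of-words representation."""

    def encode(self, text: str) -> Counter:
        return Counter(text.lower().split())


class BagOfWordsIndex:
    """Index (hypothetical): accepts a query representation, returns patent numbers."""

    def __init__(self) -> None:
        self._bags: Dict[str, Counter] = {}

    def add(self, number: str, bag: Counter) -> None:
        self._bags[number] = bag

    def search(self, query: Counter, n: int = 10) -> List[str]:
        # Assumed scoring: count of word occurrences shared with the query.
        scores = {num: sum((bag & query).values()) for num, bag in self._bags.items()}
        hits = [num for num, score in scores.items() if score > 0]
        return sorted(hits, key=lambda num: (-scores[num], num))[:n]


class DateFilter:
    """Filter (hypothetical): keeps patents published strictly before/after a cutoff."""

    def __init__(self, db: PatentDatabase, cutoff: date, before: bool) -> None:
        self._db, self._cutoff, self._before = db, cutoff, before

    def apply(self, numbers: List[str]) -> List[str]:
        def keep(number: str) -> bool:
            published = self._db.retrieve(number).published
            return published < self._cutoff if self._before else published > self._cutoff

        return [number for number in numbers if keep(number)]


class FilterSequence:
    """Cascade of filters, applied one after another."""

    def __init__(self, *filters: DateFilter) -> None:
        self._filters = filters

    def apply(self, numbers: List[str]) -> List[str]:
        for f in self._filters:
            numbers = f.apply(numbers)
        return numbers


# Usage with made-up patents: components exchange patent numbers, not patent data.
db, encoder, index = PatentDatabase(), BagOfWordsEncoder(), BagOfWordsIndex()
for p in [
    Patent("US0000001A", "solar cell with perovskite layer", date(2015, 3, 1)),
    Patent("US0000002A", "battery electrode coating", date(2018, 6, 1)),
    Patent("US0000003A", "thin film solar cell module", date(2020, 1, 15)),
]:
    index.add(db.save(p), encoder.encode(p.text))

# A date period filter built by cascading an after-date and a before-date filter.
period = FilterSequence(
    DateFilter(db, date(2014, 1, 1), before=False),
    DateFilter(db, date(2019, 1, 1), before=True),
)
hits = index.search(encoder.encode("solar cell"))  # raw query -> Encoder -> Index
results = period.apply(hits)
```

In this sketch the index and the filters never receive patent text directly; they look it up through the shared database when needed, which is the property the Patent Database description above relies on to keep interfaces lightweight. Swapping the bag-of-words encoder and index for a vector-based pair would leave the filter code unchanged.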

5    EXEMPLARY SYSTEM
In this section we show a few examples to demonstrate how the components described in the last
section can be interconnected to form systems that carry out specific patent-related tasks.

   Patent search engine: The input here is a user query and possibly one or more filters (such
as date restrictions) and the output is a ranked list of patents. Figure 1 shows a possible
implementation using standard components.

Figure 1: A patent search engine designed with standard components

   Technology monitoring system: The input here is a specification of a technology area and a
time period of interest and the output is a set of recent patents and published applications. A
generic, reconfigurable classifier is used here, which uses a textual description of the
technology or a whitelist of patents to adjust its operation (e.g., by training an internal ML
model). Figure 2 shows a possible implementation using standard components.

Figure 2: A technology monitoring system designed with standard components

   Technology landscaping: The output of technology landscaping studies, unlike the preceding
two examples, is much more complex than a list of patents. A typical output, however, can be
broken down into a number of insights that are reached by manually analyzing the patent data,
typically through charts on a dashboard. Some of these charts are quite basic, such as the
patent filing trend over the years. Others are more sophisticated, such as a heat map of patent
portfolio sizes held by the major players in a number of technology sub-domains. Most of these
charts, irrespective of the complexity of the underlying data, plot two categorical variables.
The data for these charts can therefore be arrived at by making use of two consolidators, each
of which clusters the data points into discrete clusters that correspond to the plotted
categories (see Figure 3).

Figure 3: Part of a technology landscaping system designed with standard components

6    CONCLUSIONS AND FUTURE WORK
We showed how the ongoing and future software development efforts in the patent-AI space can be
greatly facilitated by aligning the development with a schema of modular and highly
customizable software components. Through examples we showed how the components in such a
schema can be interconnected like Lego bricks to create varied patent data mining systems
optimized for different tasks. Following such a schema will enable faster prototyping, smoother
upgrades, and easier collaboration among research groups. We therefore encourage researchers to
consider fitting their work in a common modular system to speed up research and development in
the patent-AI space. In future work, we aim to define and release such a modular system and
open-source implementations of its constituent components under the PQAI initiative. We invite
industry experts and researchers to collaborate in defining and standardizing this system.

REFERENCES
 [1] AMPLIFIED. Better prior art. Faster, 2020. https://www.amplified.ai/ Last accessed
     2021-05-09.
 [2] APACHE FOUNDATION. Apache UIMA, 2013. https://uima.apache.org/ Last accessed 2021-05-09.
 [3] CINTIAN. Welcome to the Patent Data Revolution, 2020. https://www.cintian.ai/ Last
     accessed 2021-05-09.
 [4] EXPLOSION. Industrial-Strength Natural Language Processing, 2021. https://spacy.io/ Last
     accessed 2021-05-09.
 [5] J. Guo, Y. Fan, X. Ji, and X. Cheng. MatchZoo: A learning, practicing, and developing
     system for neural text matching. In Proceedings of the 42nd International ACM
     SIGIR Conference on Research and Development in Information Retrieval, pages 1297–1300,
     2019.
 [6] L. Helmers, F. Horn, F. Biegler, T. Oppermann, and K.-R. Müller. Automating the search
     for a patent’s prior art with a full text similarity search. PLoS ONE, 14(3):e0212103,
     2019.
 [7] HUGGING FACE. The AI community building the future, 2021. https://huggingface.co/ Last
     accessed 2021-05-09.
 [8] INQUARTIK CORPORATION. Find patents with ease, 2021.
     https://www.inquartik.com/patentcloud/patent-search/ Last accessed 2021-05-09.
 [9] IPRALLY TECHNOLOGIES LTD. Patent search done right, 2021. https://www.iprally.com/ Last
     accessed 2021-05-09.
[10] A. Klenner, S. Bergmann, M. Zimmermann, and M. Romberg. Large scale chemical patent
     mining with UIMA and UNICORE. Journal of Cheminformatics, 4(1):1–2, 2012.
[11] A. Krishna, B. Feldman, J. Wolf, G. Gabel, S. Beliveau, and T. Beach. Examiner assisted
     automated patents search. In 2016 AAAI Fall Symposium Series, 2016.
[12] S. Li, J. Hu, Y. Cui, and J. Hu. DeepPatent: patent classification with convolutional
     neural networks and word embedding. Scientometrics, 117(2):721–744, 2018.
[13] Y. Lu, X. Xiong, W. Zhang, J. Liu, and R. Zhao. Research on classification and
     similarity of patent citation based on deep learning. Scientometrics, 123(2):813–839,
     2020.
[14] PROJECT PQAI. Prior-Art Search for Everyone, 2020. https://projectpq.ai/ Last accessed
     2021-05-09.
[15] S. Sarica, B. Song, E. Low, and J. Luo. Engineering knowledge graph for keyword
     discovery in patent search. In Proceedings of the Design Society: International
     Conference on Engineering Design, volume 1, pages 2249–2258. Cambridge University Press,
     2019.
[16] SIMILARI. Supercharge your IP, 2020. https://similari.com/ Last accessed 2021-05-09.
[17] THREE10 SOLUTIONS, INC. The End of Keyword Searching, 2020.
     https://www.dorothyai.com/platform Last accessed 2021-05-09.