Modular Development in Patent AI Space: A Case Study

Mahesh Maan, GreyB, Mohali, India, mahesh@projectpq.ai
Sam Zellner, PQAI, Atlanta, USA, sam@projectpq.ai
Anirudh Sanutra, GreyB, Mohali, India, anirudh@projectpq.ai

PatentSemTech, July 15th, 2021, online. © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

ABSTRACT
Numerous development efforts are underway that aim to apply the recent advancements in the field of artificial intelligence to a variety of patent-related tasks such as prior-art searching, technology landscaping, patent classification, etc. In this paper, we advocate for researchers to align their work with a modular system of software components. We show how such alignment will make it easier for researchers to prototype new systems, upgrade existing systems, collaborate, and build upon each other’s work. We also present such a system of modular components that we created while developing PQAI, an AI-based prior-art search platform.

KEYWORDS
Patent AI, Patent Retrieval, Prior Art Search, Open Source, Modular Software Architecture

1 INTRODUCTION
As a result of the rapid advancements in the field of machine learning, the prevalent optimism surrounding AI in general, and the ongoing efforts in the IP industry [1, 3, 8, 9, 16, 17] and academia [6, 12, 13, 15], we foresee the development of many AI-centered patent data mining software platforms in the coming years. If the current trend persists, the field of AI will continue to evolve rapidly during this time. As a result, to stay relevant, these platforms will need to continuously experiment with and adopt the latest and best AI tools as they become available. In this paper, we show that by designing modular components, researchers involved in the development and refinement of such platforms can ensure that their systems are flexible enough to evolve quickly and versatile enough to leverage a spectrum of AI techniques. We advocate for more researchers to consider fitting their work within a modular system. This would enable the community of patent data mining researchers to more easily collaborate and build upon each other’s work. We also present an example of such a modular system, designed during the development of PQAI [14], an AI-based prior-art search platform.

The structure of this paper is as follows: Section 1 sets the context of the paper. Section 2 highlights a number of existing problems associated with the development of AI-centric patent-data mining platforms (and which might become more critical in future as these platforms continue to develop in isolation). Section 3 describes how a common, open-source schema of modular components could mitigate these problems. Section 4 delves into the details of the modular system used in PQAI and describes some of its components. Section 5 presents a few examples that show how such components could be interlinked to build modular systems to target specific patent data mining problems.

2 PROBLEMS
Currently, most research groups developing patent-AI platforms work in isolation. This leads to a number of inefficiencies in the development efforts and creates limitations for the users of such tools too. We highlight them below:

(1) Although each platform has a unique proposition that focuses on a specific AI capability, to render it usable in the form of a software tool, a number of ‘commodity’ software components for handling patent data also need to be developed. Such components include patent number parsers, patent data repository wrappers, patent reading interfaces, exporting and reporting functions, etc. Each development group has to spend considerable resources building commodity components that do not add to their core value proposition.
(2) The fact that there is no standardization of patent-software components makes it difficult for various groups to collaborate, upgrade, license, or sell their components for reuse by others. Even though some of the platforms make a few API services available, these are mostly incompatible with each other, making it difficult to merge the data/functionalities of two systems.
(3) The closed source implementation of most platforms means that customization is not feasible for users, such as corporate IP teams and law firms, who would like to have a custom UI or some custom functionality on top of what is available on the platform by default.
(4) Since most platforms do not provide access to their inner components or APIs, the entry effort for AI researchers who want to explore and experiment with patent data (e.g., in universities) is high. They have to arrange their own databases and create a base functionality layer as a prerequisite. This takes considerable time and effort.

3 SOLUTION
To mitigate the problems described in the preceding section, we propose a solution centered on development, dissemination, and adoption of an open-source schema of components with standardized interfaces and their reference implementations by the research community working in the patent-AI space. Our solution is inspired partly by a few initiatives [2, 4, 5, 7] that facilitate modular and efficient development in neighboring fields and partly by our own exploration of the design space of a patent-AI system while developing a prior-art search engine. Although modular development in patent information retrieval has been attempted before [10, 11], in practice, there is a general lack of open-source frameworks and resources critical for driving industry-wide adoption and efficiency. We believe that the availability of such resources would help researchers to develop experimental prototypes faster, upgrade existing software platforms with ease, and smoothly build upon each other’s work. These resources include:

(1) An open-source schema of standardized software components: This schema acts as a blueprint of a family of highly customizable and frequently used software components that can be inter-connected like Lego pieces. These components should be standard in the sense that they have well-defined input-output characteristics, although the schema imposes no restriction on their implementation. Following such a schema can help in ensuring that components created by groups completely isolated from each other are still inter-compatible. This schema should be defined at a high level to be able to accommodate a variety of AI techniques (a few examples are presented in the next section).
(2) Open-source implementations of software components: Access to a library of such components will enable researchers and developers to avoid spending time creating their own ‘base layer’ functionalities (such as standard searching and filtering operations, patent number parsing, patent rendering, data management, etc.) and reference implementations for comparison. Instead, maximum effort could then be invested in creating new and improved components or building new capabilities by combining existing components in new ways.
(3) Remote API access to software components and datasets: API access will facilitate and encourage small-scale development and experimentation with patent data. This would be useful for resource-constrained efforts, such as where a small in-house IP team can hire a freelance developer to create a lightweight dashboard for accessing and exploring patent data of their own specific field, or where a small team of students is carrying out a university project involving patent data analysis. The API access would obviate the need for such groups to set up a heavy system to get a little done.

4 COMPONENTS
In this section, we present a non-exhaustive list of versatile components from the PQAI library [14], which offer functionalities frequently required in a range of patent data mining operations. Under the PQAI initiative, the authors aim to define a standardized schema of these components and release their concrete implementations as open-source code. For some of the components, we first describe their abstract versions and then the patent-specific version. The abstract version is unlikely to be used in real-world development, but knowing its behavior helps in understanding the behavior of the family of components that can be derived from it.

Patent Database: This is an instance of the abstract component Storage (described at the end of this list). Patent Database is a special component in the sense that all components can be configured to access it. A major benefit of this approach is that components can pass around references within the Patent Database (e.g., patent numbers) instead of patent data itself. This keeps the component interfaces clean and lightweight.

Encoder: An encoder takes in an entity and returns its representation. The input entities and output representations can both take many forms, making this component very versatile. For instance, one instantiation of an encoder can be in the form of a Patent Vectorizer, which accepts a patent number (as described earlier, all components can retrieve patent data given the patent number) and returns a vector embedding in a high dimensional space that corresponds to the given patent. A bag-of-words encoder can be another example of this component.

Index: An index is a data structure optimized for searching among entity representations. It differs from a Storage in that it may not necessarily be able to return the original representation. It accepts a compatible query and returns a set of entity pointers. A Patent Vector Index, for example, might accept a query vector and return a set of patent numbers as top matches for the query. Note that the Index accepts query representations and not raw queries; therefore, it has to be plugged into a suitable Encoder to turn the raw query into a compatible representation.

Ranker: It accepts a set of entities and returns a list of the same entities, the order of which is determined by a ranking criterion. A Patent Ranker, for instance, would accept a set of patents and a user query as input and order those patents in decreasing order of relevance to the given query.

Classifier: A classifier associates one of a finite set of predefined labels to a patent, where the labels have unique meanings associated with them. A Patent Classifier, for instance, could take as input a set of patent numbers and associate, with each patent number, a label, which may indicate, for example, whether this patent is related to solar cell technology or not. Internally, classifiers can make use of configurable classifier models, which can be initialized with inputs such as (patent-number, label) pairs or a textual description.

Consolidator: A consolidator accepts a set of patents and associates one of a finite set of arbitrary labels to each patent. Essentially, it creates clusters of patents where each cluster’s patents have some common characteristics. A Technology Consolidator, for instance, can accept a set of patent numbers and then group them into, say, 3 groups, depending on the technologies they relate to.

Filter: A filter accepts a set of entities and, depending on a filter criterion, returns a subset of them. A Patent Filter, for instance, would filter out patents satisfying a condition such as a publication date criterion. Filters can be cascaded to create a Filter Sequence. For instance, a date period filter can be created by cascading a before-date filter and an after-date filter.

Sorter: The input-output characteristics of a sorter are similar to those of a ranker, but in its output only the relative positions of the entities matter. A Patent Sorter can, for instance, accept a list of patent numbers and arrange them such that any patent in the list is succeeded by the most similar patent to it in the rest of the list. Such a sorter can be useful during a manual review of patents (all related patents come in sequence and the reviewer can make use of insights still available in their short term memory).

Patent Number Parser: It accepts plain text as input, then detects and extracts any patent numbers in it, translates them into a standard format (e.g., by truncating or adding zeros) and then outputs a list of patent numbers that can be directly inputted to the Patent Database component. This component, when used at the boundary of a patent data mining system, can eliminate all issues that arise due to patent number format mismatching.

Storage: Storage is an abstract wrapper around a data source. It stores entities that are all of the same type but, other than this, it makes no assumption about how the data is stored (e.g., whether it is stored in a local database, in the primary memory, or on a remote server). A Storage component performs two operations: it saves and retrieves entities. In the saving operation, it accepts an entity and returns its entity identifier. The retrieval operation is the opposite of the saving operation: an entity is returned in response to a supplied identifier. Storages can be configured to be read-only too.

5 EXEMPLARY SYSTEM
In this section we show a few examples to demonstrate how the components described in the last section can be interconnected to form systems that carry out specific patent-related tasks.
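
To make the idea of interconnection more concrete, the following Python sketch wires a handful of toy components into a small search pipeline. It is only an illustration under stated assumptions: the class and method names (PatentDatabase, BagOfWordsEncoder, VectorIndex, DateFilter, FilterSequence, SimilarityRanker, search), the cosine-similarity scoring, and the patent numbers and texts are all hypothetical and are not taken from the PQAI library, nor is the wiring claimed to match the figures in this paper.

```python
import math
from dataclasses import dataclass
from datetime import date
from typing import Callable, Dict, List, Protocol


@dataclass(frozen=True)
class Patent:
    number: str
    text: str
    published: date


class PatentDatabase:
    """Read-only Storage: returns a patent for a supplied patent number."""
    def __init__(self, patents: List[Patent]):
        self._by_number = {p.number: p for p in patents}

    def get(self, number: str) -> Patent:
        return self._by_number[number]

    def numbers(self) -> List[str]:
        return list(self._by_number)


class BagOfWordsEncoder:
    """Encoder: turns raw text into a sparse term-count representation."""
    def encode(self, text: str) -> Dict[str, float]:
        counts: Dict[str, float] = {}
        for token in text.lower().split():
            counts[token] = counts.get(token, 0.0) + 1.0
        return counts


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


class VectorIndex:
    """Index: accepts a query representation and returns patent numbers."""
    def __init__(self, db: PatentDatabase, encoder: BagOfWordsEncoder):
        self._vectors = {n: encoder.encode(db.get(n).text) for n in db.numbers()}

    def search(self, query_vec: Dict[str, float], k: int) -> List[str]:
        key = lambda n: cosine(query_vec, self._vectors[n])
        return sorted(self._vectors, key=key, reverse=True)[:k]


class Filter(Protocol):
    """Shared interface: a set of patent numbers in, a subset out."""
    def apply(self, numbers: List[str]) -> List[str]: ...


class DateFilter:
    """Filter: keeps patents whose publication date satisfies a predicate."""
    def __init__(self, db: PatentDatabase, keep: Callable[[date], bool]):
        self._db, self._keep = db, keep

    def apply(self, numbers: List[str]) -> List[str]:
        return [n for n in numbers if self._keep(self._db.get(n).published)]


class FilterSequence:
    """Cascade of filters, applied one after another."""
    def __init__(self, filters: List[Filter]):
        self._filters = filters

    def apply(self, numbers: List[str]) -> List[str]:
        for f in self._filters:
            numbers = f.apply(numbers)
        return numbers


class SimilarityRanker:
    """Ranker: orders patents by decreasing similarity to the raw query."""
    def __init__(self, db: PatentDatabase, encoder: BagOfWordsEncoder):
        self._db, self._encoder = db, encoder

    def rank(self, numbers: List[str], query: str) -> List[str]:
        q = self._encoder.encode(query)
        key = lambda n: cosine(q, self._encoder.encode(self._db.get(n).text))
        return sorted(numbers, key=key, reverse=True)


def search(query, encoder, index, filters, ranker, k=10):
    # Encoder -> Index -> Filter Sequence -> Ranker; only patent numbers are passed around.
    candidates = index.search(encoder.encode(query), k)
    return ranker.rank(filters.apply(candidates), query)


# Made-up patents, used only to exercise the pipeline.
db = PatentDatabase([
    Patent("US1000001B2", "solar cell with perovskite absorber layer", date(2015, 3, 1)),
    Patent("US1000002B2", "solar panel mounting bracket", date(2018, 6, 1)),
    Patent("US1000003B2", "thin film solar cell module", date(2020, 1, 15)),
])
encoder = BagOfWordsEncoder()
period = FilterSequence([
    DateFilter(db, lambda d: d > date(2016, 1, 1)),   # after-date filter
    DateFilter(db, lambda d: d < date(2021, 1, 1)),   # before-date filter
])
results = search("solar cell", encoder, VectorIndex(db, encoder), period,
                 SimilarityRanker(db, encoder), k=3)
print(results)  # ['US1000003B2', 'US1000002B2']
```

Because every component exchanges only patent numbers and looks up the underlying data in the shared database, any one of them (for instance, the bag-of-words encoder) could in principle be swapped for a different implementation without touching the rest of the pipeline.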

Patent search engine: The input here is a user query and possibly one or more filters (such as date restrictions) and the output is a ranked list of patents. Figure 1 shows a possible implementation using standard components.

Figure 1: A patent search engine designed with standard components

Technology monitoring system: The input here is a specification of a technology area and a time period of interest and the output is a set of recent patents and published applications. A generic, reconfigurable classifier is used here, which uses a textual description of the technology or a white-list of patents to adjust its operation (e.g., by training an internal ML model). Figure 2 shows a possible implementation using standard components.

Figure 2: A technology monitoring system designed with standard components

Technology landscaping: The output of technology landscaping studies, unlike the preceding two examples, is much more complex than a list of patents. A typical output, however, can be broken down into a number of insights that are reached by manually analyzing the patent data, typically through charts on a dashboard. Some of these charts are quite basic, such as the patent filing trend over the years. Others are more sophisticated, such as a heat-map of patent portfolio sizes held by the major players in a number of technology sub-domains. Most of these charts, irrespective of the complexity of the underlying data, plot two categorical variables. The data for these charts can therefore be arrived at by making use of two consolidators, each of which clusters the data points into discrete clusters that correspond to the plotted categories (see Figure 3).

Figure 3: Part of a technology landscaping system designed with standard components

6 CONCLUSIONS AND FUTURE WORK
We showed how the ongoing and future software development efforts in the patent-AI space can be greatly facilitated by aligning the development with a schema of modular and highly-customizable software components. Through examples we showed how the components in such a schema can be interconnected like Lego bricks to create varied patent data mining systems optimized for different tasks. Following such a schema will enable faster prototyping, smoother upgrades, and easier collaboration among research groups. We therefore encourage researchers to consider fitting their work in a common modular system to speed up research and development in the patent-AI space. In future work, we aim to define and release such a modular system and open-source implementations of its constituent components under the PQAI initiative. We invite industry experts and researchers for collaboration in defining and standardizing this system.

REFERENCES
[1] AMPLIFIED. Better prior art. Faster, 2020. https://www.amplified.ai/ Last accessed 2021-05-09.
[2] APACHE FOUNDATION. Apache UIMA, 2013. https://uima.apache.org/ Last accessed 2021-05-09.
[3] CINTIAN. Welcome to the Patent Data Revolution, 2020. https://www.cintian.ai/ Last accessed 2021-05-09.
[4] EXPLOSION. Industrial-Strength Natural Language Processing, 2021. https://spacy.io/ Last accessed 2021-05-09.
[5] J. Guo, Y. Fan, X. Ji, and X. Cheng. Matchzoo: A learning, practicing, and developing system for neural text matching. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1297–1300, 2019.
[6] L. Helmers, F. Horn, F. Biegler, T. Oppermann, and K.-R. Müller. Automating the search for a patent’s prior art with a full text similarity search. PloS one, 14(3):e0212103, 2019.
[7] HUGGING FACE. The AI community building the future, 2021. https://huggingface.co/ Last accessed 2021-05-09.
[8] INQUARTIK CORPORATION. Find patents with ease, 2021. https://www.inquartik.com/patentcloud/patent-search/ Last accessed 2021-05-09.
[9] IPRALLY TECHNOLOGIES LTD. Patent search done right, 2021. https://www.iprally.com/ Last accessed 2021-05-09.
[10] A. Klenner, S. Bergmann, M. Zimmermann, and M. Romberg. Large scale chemical patent mining with uima and unicore. Journal of Cheminformatics, 4(1):1–2, 2012.
[11] A. Krishna, B. Feldman, J. Wolf, G. Gabel, S. Beliveau, and T. Beach. Examiner assisted automated patents search. In 2016 AAAI Fall Symposium Series, 2016.
[12] S. Li, J. Hu, Y. Cui, and J. Hu. Deeppatent: patent classification with convolutional neural networks and word embedding. Scientometrics, 117(2):721–744, 2018.
[13] Y. Lu, X. Xiong, W. Zhang, J. Liu, and R. Zhao. Research on classification and similarity of patent citation based on deep learning. Scientometrics, 123(2):813–839, 2020.
[14] PROJECT PQAI. Prior-Art Search for Everyone, 2020. https://projectpq.ai/ Last accessed 2021-05-09.
[15] S. Sarica, B. Song, E. Low, and J. Luo. Engineering knowledge graph for keyword discovery in patent search. In Proceedings of the Design Society: International Conference on Engineering Design, volume 1, pages 2249–2258. Cambridge University Press, 2019.
[16] SIMILARI. Supercharge your IP, 2020. https://similari.com/ Last accessed 2021-05-09.
[17] THREE10 SOLUTIONS, INC. The End of Keyword Searching, 2020. https://www.dorothyai.com/platform Last accessed 2021-05-09.