Developing Semantic Search for the Patent Domain

Daniel Eisinger, Jan Mönnich, Michael Schroeder
Technische Universität Dresden, BIOTEC
Tatzberg 47/49, 01307 Dresden, Germany
daniel.eisinger@biotec.tu-dresden.de, jan.moennich@biotec.tu-dresden.de, ms@biotec.tu-dresden.de

Copyright © 2014 for the individual papers by the papers’ authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. Published at Ceur-ws.org. Proceedings of the First International Workshop on Patent Mining and Its Applications (IPAMIN) 2014, Hildesheim, Oct. 7th, 2014. At KONVENS’14, October 8–10, 2014, Hildesheim, Germany.

ABSTRACT
The patent domain is a very important source of scientific information that is currently not used to its full potential. Issues such as high numbers of patents, complicated language style and inconsistently used vocabulary make the task of searching for relevant patents extremely complex. While this is already a problem for patent professionals who have to invest a lot of time and effort into their search, it is even more problematic for academic scientists with little experience in this domain.

Semantic search functionality has been demonstrated to provide large advantages for document search in other domains. As an example, the search engine GoPubMed offers advanced search functionality for the biomedical domain based on annotating documents with relevant concepts from various ontologies. In this paper, we report on our efforts to provide comparable advances for the patent domain. We introduce the patent search prototype GoPatents, and we describe the experiments that we performed during its development in the areas of term extraction, term and IPC class co-occurrence analysis, automated patent categorization, and automated annotation with ontology concepts.

1. INTRODUCTION
As evidenced by a growing number of reports about various high-profile patent trials in recent years, having the necessary information about all relevant competitor patents can be vital to a company’s interests. At the same time, patents can also be a valuable source for academic research, since current research results are often first published in a patent and only afterwards (or never) in a journal. Experts have estimated that only 10-15% of the patent content is also described in other publications, and that 80-90% of all scientific knowledge is contained in patents [2]. Despite that potential, most academic researchers are to our knowledge not using patents, presumably due to the high complexity of the domain.

The number of patent applications continues to rise, reaching 2.35 million worldwide in 2012 alone [11], only one year after surpassing two million for the first time ever in 2011 [10]. The number of patent grants is also at an all-time high, exceeding the one million mark for the first time in 2012 [11]. Additionally, the documents are not always available in English, which makes finding all relevant documents extremely difficult. But even for the documents with English-language versions, there are some unique challenges that separate the patent domain from most other document types.

While it is not unusual to rely mainly on keywords for searching most other document corpora, this approach does not return satisfactory results for many patent search tasks. Different sections of the patent text are written in completely different styles, patent authors don’t always use standard terminology (or it may not even exist), and many patents are written in very unspecific language. The problem has been summarized by the European Patent Office (EPO) in the following way, using the term “patentese” for the unconventional language style that is typically only used in patents: “Newcomers to intellectual property are often surprised or even shocked at the way words or phrases familiar in everyday language are used very differently in the world of patents. Grammatical constructions that would be unthinkable in everyday speech or writing are used routinely in patentese. Patentese has words which do not even exist in ordinary languages. Furthermore patentese exists in every conceivable natural language version” [1].

As a result of these problems, professional patent searches usually don’t rely exclusively on keywords. The most important way to improve pure keyword searches is through the use of the classification information that is provided by the patent offices. This information can also be used to filter or expand search results, but in order to make the most of these possibilities, the searcher must have detailed knowledge about the classification system. Unfortunately, this is not the case for many academic researchers. Even for professional patent searchers, the process of constructing and refining patent queries is quite complicated and time-consuming.

Consequently, it is desirable to offer a system that provides an easier option for scientists to perform high-quality patent searches and assists patent professionals in completing and refining their initial queries. In order to provide such assistance, it is important to have a clear understanding of the properties of patent classification systems. We therefore carry out an in-depth investigation of the most common patent classification system, the International Patent Classification (IPC).
Since the benefit of using existing annotations for semantic search has already been demonstrated in the biomedical domain, we use the controlled vocabulary “Medical Subject Headings” (MeSH) that is used to annotate all document abstracts on the biomedical literature database PubMed as a point of comparison. Following this analysis, we give a detailed description of multiple approaches we are proposing to improve patent search, and we introduce the patent retrieval prototype GoPatents that incorporates some of these proposals.

2. COMPARATIVE ANALYSIS OF MESH AND IPC
Our analysis of MeSH and IPC can be divided into three parts: The first two parts concern the respective hierarchies and terms of the systems themselves, while the third part examines their usage for document classification. We analyzed the latter by collecting classification information from a large patent corpus as well as the annotations to all PubMed documents published by early 2011. Table 1 summarizes some core results of our analysis.

Property                                                  MeSH        IPC
number of hierarchy entries                               54095       69487
number of unique entries                                  26581       69487
number of main trees                                      16          8
number of hierarchy levels                                13          14
occurrence of class labels in text                        frequent    very rare
average number of annotations per document                9           2
proportion of documents with multiple annotations         86%         53%
proportion of documents with related annotations          81%         46%
(i.e., same hierarchy tree)

Table 1: Comparative analysis MeSH vs. IPC. The hierarchical structures are similar, but MeSH terms are shorter and more likely to occur in text. The number of MeSH annotations per document far surpasses the number of classes per patent.

The number of unique MeSH entries is considerably smaller than the number for IPC, but since the hierarchy tree of MeSH allows for the same heading to appear more than once, the sizes are comparable, as are the hierarchies (cf. Figure 1).

Figure 1: IPC vs. MeSH - Terms/classes per hierarchy level. Both hierarchies expand in similar ways.

The comparison of the terms on the other hand shows some major differences. IPC is focused on alphanumeric class codes while MeSH emphasizes terms. IPC definitions are longer, more complicated and less self-contained than MeSH headings, and are therefore much less likely to appear in the text. As Figure 2 shows, there are also large differences between the numbers of MeSH annotations per document and the numbers of IPC annotations per patent: While most patents have fewer than five assigned classes, PubMed documents have around nine on average and often even considerably more. Additionally, we were able to show that the annotation sets for patents are much less diverse than for PubMed, leading us to question the completeness of the existing assignments.

Figure 2: Percentage of documents with number of annotations. The average number of MeSH annotations per PubMed document is much higher than the number of IPC classes per patent.

We therefore believe that the use of IPC for patent search comes with two serious disadvantages: First, the complexity of the system causes significant problems for non-professional patent searchers since it is very difficult to find the complete set of IPC classes that are relevant for the search task at hand. Second, the low number of class assignments may lead to unexpectedly low recall for classification-based patent searches that are often performed by professional searchers in order to overcome the problems of keyword search.

3. SEMANTIC SEARCH FOR THE PATENT DOMAIN
This section describes our attempts to solve the problems caused for patent search by incomplete class assignments and complex patent text. We automatically assign additional classes, expand initial queries, and annotate patent documents to make faceted search functionality possible.
3.1 Patent Categorization
The most straightforward way of dealing with the problem of incomplete class assignments would be the assignment of additional classes, but due to the high number of patents as well as the high complexity of the classification system, this can only be done automatically. Depending on the accuracy of the automatic assignment of relevant classes, the method can be useful for two related but different ways of dealing with the low number of assigned patent classes:

1. Given a class, find documents for this class.
   If the user knows that a particular class is highly relevant for their search, the automatic class assignments can be used to discover additional patents that should have been assigned to the class. The recall of the search can therefore be improved considerably.

2. Given a document, find classes for this document.
   If the user has already collected a small set of relevant documents, the automatically assigned classes for these documents can help them find the classes that are related to these documents, even if there is no classification data available or if there are missing assignments. These additional classes again enable them to refine their initial search query.

Previous approaches to automated patent categorization were usually restricted to higher levels of the hierarchy (e.g., [7, 9, 6]). The only prior effort to classify patents down to the lowest level of the IPC involved a complicated three-phase algorithm that is not well suited for application on a large corpus; in addition, it already removes large parts of the hierarchy in the first step, which we believe makes it too restrictive for our goal of finding new relevant but potentially very different classes that were not previously assigned [3]. We therefore based our system on an approach that has been used successfully for the automated assignment of MeSH terms to PubMed documents by Tsatsaronis et al. [8]. It is based on training a series of Maximum Entropy classifiers (one for each class) on existing class assignments and applying them to each document that is supposed to get additional class assignments.

In order to evaluate the results of our categorization efforts, we constructed two training corpora from the EPO dataset that was also the basis of our previous analysis. The first corpus (C73) follows strict quality requirements and contains 73 classes while the second one (C1205) has more relaxed requirements and is therefore much larger with 1205 classes. This size difference in connection with the expected higher quality of the documents due to the constraints we mentioned above should lead to better categorization results for C73 than for C1205. With our initial evaluation, we tested our method’s ability to retrieve the classes that were actually assigned to the patents. Therefore, all of these classes were considered correct while everything else was considered wrong. While this approach cannot evaluate our method’s suitability for our objective of assigning new classes, it is nevertheless valuable for determining the quality of the classifiers by comparing their results with the categorization decisions made by the experts at the patent offices.

Corpus    Precision    Recall    F1-measure
C73       0.88         0.90      0.89
C1205     0.88         0.84      0.86

Table 2: Evaluation results for confidence threshold 0.5. The precision values are identical for both corpora, but recall is considerably higher for the smaller corpus.

Table 2 shows the macro-average scores (precision, recall and F1-measure) of all classifiers using 10-fold cross-validation for the confidence threshold 0.5. The results are for the most part encouraging, with most values approaching 0.9. For the purpose of our first task, this means that we can retrieve additional documents with high confidence. The second task, finding additional classes for a given document, is more problematic however. Since we apply all classification models to all documents, a precision score of ≈ 0.9 leads to a high number of incorrect assignments. While using higher values for the confidence threshold has a positive effect on precision, it is accompanied by a severe drop in recall and therefore leads to a significantly lower F1-measure. This problem is caused by slower precision growth for the individual classifiers compared to the situation for PubMed/MeSH, making additional steps necessary. We propose two filtering options: Since most patent queries also include a keyword component, many of the incorrect assignments are filtered out automatically since they don’t include the required keywords. Additionally, we implemented a filter that accepts additional class assignments only if there is an existing patent that was assigned a similar combination of classes. The filter has multiple possible settings, from very restrictive (only allow classes that have previously co-occurred directly) to much less so (allow pairs of classes if their respective ancestors of a certain hierarchy level have been co-assigned). For a small set of example patents, this filter had the desired effect of filtering out unrelated classes while accepting related ones.

IPC code       Class definition (abbrev.)              Features 1 to 5
A61B 5/00      Measurement for diag. purposes          light, sensor, blood, patient, tissue
A61B 17/00     Surgical instruments                    tissue, suture, end, surgical, closure
A61B 17/70     Spinal positioners                      rod, bone, portion, member, screw
A61F 13/15     Absorbent pads                          absorbent, material, napkin, web, diaper
A61M 25/00     Catheter                                catheter, distal, end, tube, lumen
G01N 33/50     Chemical analysis of biol. materials    sample, test, cell, specimen, light

Table 3: Most influential positive classifier features. Features were extracted from binary Maximum-Entropy classifiers trained on IPC classes with biomedical significance. The positive features for the classifiers in the list are useful for identifying patents that belong to the class.

The quality of the trained classifiers can also intuitively be judged by looking at the features that make the largest difference in categorizing documents. Table 3 shows the five most influential positive features from binary Maximum-Entropy classifiers for a subset of IPC classes with biomedical significance, i.e., the features that were assigned the highest positive values by the Maximum Entropy method. The occurrence of these words in a document that is supposed to be classified increases the likelihood of positive classification; in other words, the document is more likely to be assigned the category represented by the classifier. Almost all features listed in the table appear to be well suited to making this distinction, since they are representative of their respective class. Although some of the class definitions are closely related, there is very little overlap in the most influential features. As an example, the five top features are completely disjoint for class A61B 17/00 about surgical instruments and its descendant A61B 17/70 about spinal positioners.

3.2 Guided Patent Search
The second part of our approach to address the problem of low numbers of patent class assignments and simplify patent search combines multiple systems intended to guide the user towards quickly and easily formulating patent queries that are as complete as possible. An initial user query is used to determine additional relevant query components. Since professional patent search queries are a combination of keywords and class codes in most cases, we investigated ways to expand both of these components. The discovered terms and classes are recommended to the user so they can decide which of the proposals should be included in the final query.

We demonstrated that additional relevant keywords can be extracted from a variety of sources including IPC class definitions and external resources such as MeSH. Most importantly, we extract keywords from existing patents using established natural language processing techniques after an initial evaluation showed the validity of this approach. Our method is based on analyzing patents from an IPC class that has been identified as relevant by the user. Since significant numbers of documents are available for most patent classes, this approach is able to deliver large numbers of keyword suggestions that are characteristic for the respective class. In a way, extracting relevant words from class patents is an expansion of our categorization efforts. Table 3 shows that this approach is able to discover useful keywords for search. Since we are also interested in relevant multi-word terms, we performed a more in-depth examination of different ranking algorithms for such extracted term candidates. Additionally, we investigated the influence of the background corpus on the result quality. The evaluation of the resulting term rankings was performed manually by four information professionals from the Scientific & Business Information Services department of Roche Diagnostics Penzberg. Interestingly, the experts often disagreed about the relevance of a term, indicating the high complexity of the problem.

We evaluated the established statistical term extraction measure tf-idf as well as previously published measures wf-idf and Log-Likelihood Ratio (LLR), and we introduced two new variants of tf-idf and wf-idf. In order to judge the quality of the resulting term lists based on the scores given by our experts, we calculated different quality measures such as the average “discounted cumulative gain” (DCG) of the different rankings. Figure 3 shows clear differences between the ranking methods we investigated: The frequently used tf-idf measure was clearly the worst option for the task, and wf-idf as well as LLR were consistently the best. The two new measures we proposed, majority-tf-idf and majority-wf-idf, were unable to reach the scores that were achieved by wf-idf and LLR, but they were also considerably better than tf-idf.

Figure 3: Influence of different ranking measures on the DCG value of extracted terms. Measure wf-idf performs best, followed by LLR and our proposed measures majority-tf-idf and majority-wf-idf. The DCG value is the lowest by far for tf-idf.

We experimented with background corpora that were either closely (“diagnostics”) or distantly (“pharma”) related to the class that we extracted the terms from, as well as a general corpus with no direct relation. Figure 4 shows the average term scores for the first 50 term ranks, demonstrating that for our purpose of extracting relevant terms for a very specific domain, there is a clear benefit from choosing a background corpus that is closely related to the domain: The scores are highest for the diagnostics background corpus, followed by the pharma corpus and the general corpus.

Figure 4: Influence of different background corpora on the average scores of extracted terms. On average, the extracted terms score highest with the closely related corpus (diagnostics) and lowest with the most distant corpus (general patents).

The identified terms that are relevant for certain classes can also be used in the opposite direction, for proposing classification components to add to keyword queries. If the user enters a keyword that has been mapped to an IPC class, this class can be suggested to the user for expanding their query. Consequently, even users unfamiliar with the IPC can profit from classification information without investing too much effort into getting to know the classification system. This is especially true for the biomedical domain, since the availability of detailed domain ontologies leads to very precise class suggestions.

Apart from mapping keywords to classes and vice versa as shown in the previous paragraphs, it is also possible to use the co-occurrence of either to retrieve more relevant components of the same type for the query. For keywords, we have already presented various possible sources for co-occurrence statistics; for patent classes, the existing patent data represents a more direct source. In order to find closely related classes to suggest to the user, we analyzed the class co-assignments in our patent corpus. We collected all pairs of classes that were assigned to the same patent and ranked them both on the absolute number of co-assignments and the relative number in the form of their Jaccard index. We hypothesize that pairs of classes with high ranks in either ranking are related closely enough that many searches for one of the classes will also have additional relevant results in the second class. Figure 5 shows one example of such a pair of classes, including their definition hierarchy. Although the left class is clearly more application-oriented than the right one, we argue that many searchers interested in patents from one class will also find relevant patents in the other one. For these example classes, searching for only the first class leads to over 50% missed possible results, and searching only for the second still leads to 25% missed results.

Figure 5: Example for semantically related IPC classes without any hierarchical relation, detected using co-assignment information.

3.3 Annotation of Patent Documents with Gene/Protein Names
The biomedical search engine GoPubMed (http://gopubmed.com/web/gopubmed/) offers its users faceted browsing of search results using the terms from Medical Subject Headings (MeSH) and Gene Ontology (GO) as well as a protein database. This means that the resulting documents can be filtered according to their annotation terms, allowing the user to quickly and easily reach a result set with very high relevance. This is especially useful if the annotation systems are hierarchically organized, since this adds the possibility of choosing more specific or more general filter terms in reaction to the results of the search.

In order to provide patent searchers with similar functionality, we need a system that can annotate patent documents with the relevant concepts from the ontological resources we intend to use. The protein/gene annotator that is used for GoPubMed provides excellent performance for the types of text it was developed for, namely biomedical abstracts. Its quality has been demonstrated at the BioCreative workshop, where it was the best-performing system for the task of gene normalization [5]. However, due to the special properties of patent text it is by no means trivial to transfer existing text mining systems to the patent domain. We therefore developed a new version of the annotator for patent text, based on the original pipeline described in [4].

In order to help us test the performance of our new annotator, professional patent searchers collected a small set of patents related to neoplasms and made it available to us. The set consisted of 50 patents in total, including a large number of USPTO patents and smaller numbers of WIPO and EPO patents. A team of master students with expertise in the field manually listed all genes and proteins mentioned in the text. Our gold standard was then created in two further steps in a semi-automated fashion, by first matching these lists to the patent text automatically and then manually curating the result of this process.

In order to evaluate our new gene annotator for patent text, we used it to assign gene names to this manually annotated test corpus of neoplasm patents. The results showed a very large variation between individual patents, as had to be expected from the equally large variation of text styles and structures of the patents. On average, we reached a somewhat satisfactory precision of 0.75, while the recall still shows a lot of room for improvement at 0.39. These values correspond to an F1-measure of 0.51. Although these results aren’t nearly as good as the ones achieved by the original BioCreative annotator, we believe that they represent a promising starting point given the inherent complexity of the patent domain. We hope that an analysis of common annotation errors will help us further adapt the system to these special requirements, leading to clear improvements especially concerning the recall of the method. Further analysis of patents with particularly good or particularly bad annotation results may also help in this process. The current version of the annotator is however already able to provide clear improvements for patent search. In preparation for the patent search prototype GoPatents, it has been applied to an EPO corpus of 1.8 million patents, to which it assigned 157 million annotations. The complex and long texts also result in high processing requirements; assigning the annotations to the aforementioned EPO corpus took approximately 6000 CPU hours.

While our corpus cannot be considered a representative sample, our analysis of its documents led to some interesting observations. With the publication years of our patents spread between 2001 and 2011, we were able to observe a significant growth in the average number of annotations per patent beginning in 2006. The highest number of annotations to a single patent surpassed 2,500 gene names. We hypothesize that the development and more widespread application of high-throughput techniques is at least partially responsible for this increase. We also kept track of which part of the patents individual annotations were assigned to. Unsurprisingly, the Description section was responsible for the largest number of annotations. However, a very large number of annotations is also contained in tables, which can cause problems for some automated extraction methods.

3.4 GoPatents - A Semantic Patent Search Prototype
In order to give a demonstration of some of our proposals, we implemented the patent retrieval prototype GoPatents that enables the user to filter the resulting patent documents using terms from MeSH, Gene Ontology and a protein database. This functionality is brought over from GoPubMed, but we added the possibility of using IPC classes for the same purpose. The user interface is divided into two columns, a main window on the right and a side column on the left; an overview is given in Figure 6, showing the following main components of the system:

• The term hierarchies (left column, second from top)
  GoPatents enables the user to refine their search using relevant concepts from different sources. The complete hierarchies of all annotation systems we used are shown continuously, each concept with an indication of how many of the retrieved documents were annotated with it. The user can expand lower levels of the hierarchies for more precise information. Since the IPC class codes are not informative for users without patent search experience, hovering the mouse over a code opens a pop-up window with the complete definition hierarchy of the class.

• The additional filtering options (left column, third to fifth from top)
  GoPatents offers additional possibilities for faceted browsing: Search queries can be refined further to filter for specific applicants or publication dates.

• The search field for entering queries (main window, top)
  Queries can consist of keywords, IPC classes, terms from the different included hierarchies as well as the previously described additional filtering options.

• The search results (main window, bottom)
  Snippets of the patents that fit the initial query as well as any additional requirements made by including or excluding other facets are displayed in the main part of the window, providing links to the full patents.

Figure 6: Overview of GoPatents patent retrieval system prototype. The query is entered in the box on top, result documents are shown below, and the faceted browsing functionality is available in the left column.

In addition to the described functionality, the user’s search history is made available, and the hierarchies can be searched for relevant concepts. Result statistics are calculated automatically and can be accessed instantly by the user as soon as the result set has been retrieved. These statistics cover multiple aspects of the result set, including the most frequently assigned terms from the different hierarchies (MeSH, GO and proteins), the most frequent patent classes and the top applicants.

4. CONCLUSION
We presented our approaches to some of the problems that have to be faced by patent searchers, e.g., complicated text, inconsistent vocabulary and incomplete class assignments. Our suggestions include the use of automated categorization for adding assignments and improving recall, different guided patent search strategies that help the user refine their queries, and the use of automated annotators to make faceted browsing possible in the patent domain. Our prototype GoPatents demonstrates some of the potential that semantic search can bring to the patent domain.

5. REFERENCES
[1] K. H. Atkinson. Toward a more rational patent search paradigm. In Proceedings of the 1st ACM workshop on Patent information retrieval, PaIR ’08, pages 37–40. ACM, 2008.
[2] S. Brügmann. PATEXPERT project deliverable 8.1 - State of the art in patent processing, 2006.
[3] Y.-L. Chen and Y.-C. Chang. A three-phase method for patent classification. Information Processing & Management, 48(6):1017–1030, 2012.
[4] J. Hakenberg, C. Plake, R. Leaman, M. Schroeder, and G. Gonzalez. Inter-species normalization of gene mentions with GNAT. Bioinformatics, 24(16), 2008.
[5] A. A. Morgan, Z. Lu, X. Wang, A. M. Cohen, J. Fluck, P. Ruch, A. Divoli, K. Fundel, R. Leaman, J. Hakenberg, et al. Overview of BioCreative II gene normalization. Genome biology, 9(Suppl 2):S3, 2008.
[6] D. Tikk, G. Biró, and A. Törcsvári. A hierarchical online classifier for patent categorization. Emerging Technologies of Text Mining: Techniques and Applications. IGI Global, 2008.
[7] A. Trappey, F. Hsu, C. Trappey, and C. Lin. Development of a patent document classification and search platform using a back-propagation network. Expert Systems with Applications, 31(4):755–765, 2006.
[8] G. Tsatsaronis, N. Macari, S. Torge, H. Dietze, and M. Schroeder. A maximum-entropy approach for accurate document annotation in the biomedical domain. Journal of Biomedical Semantics, 3:S2, 2012.
[9] S. Verberne, M. Vogel, and E. D’hondt. Patent classification experiments with the linguistic classification system LCS. In Proceedings of CLEF 2010, CLEF-IP Workshop, 2010.
[10] World Intellectual Property Organization. World intellectual property indicators - 2012 edition, 2012.
[11] World Intellectual Property Organization. World intellectual property indicators - 2013 edition, 2013.
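As an illustration of the class co-assignment analysis in Section 3.2, the following minimal sketch ranks class pairs by their absolute number of co-assignments and by their Jaccard index, and computes the share of results missed when only one class of a pair is searched. It is a sketch under stated assumptions, not the code used for the paper: the function names and the toy corpus are hypothetical, the class assignments are invented, and "missed results" is interpreted here as the share of patents in the union of both classes that the searched class does not cover.

```python
from collections import Counter
from itertools import combinations


def rank_class_pairs(patent_classes):
    """Rank co-assigned class pairs by count and by Jaccard index.

    patent_classes: dict mapping a patent id to an iterable of class codes.
    Returns (by_count, by_jaccard, class_freq); each ranking is a list of
    tuples (class_a, class_b, co_assignments, jaccard).
    """
    class_freq = Counter()
    pair_freq = Counter()
    for classes in patent_classes.values():
        unique = sorted(set(classes))
        class_freq.update(unique)
        pair_freq.update(combinations(unique, 2))

    rows = []
    for (a, b), both in pair_freq.items():
        union = class_freq[a] + class_freq[b] - both
        rows.append((a, b, both, both / union))

    by_count = sorted(rows, key=lambda r: (-r[2], r[0], r[1]))
    by_jaccard = sorted(rows, key=lambda r: (-r[3], r[0], r[1]))
    return by_count, by_jaccard, class_freq


def missed_share(searched, other, both, class_freq):
    """Share of patents in either class that a search for `searched` misses.

    Assumption: the result set of interest is the union of both classes.
    """
    union = class_freq[searched] + class_freq[other] - both
    return (class_freq[other] - both) / union


# Hypothetical toy corpus; the assignments are invented for illustration.
toy_corpus = {
    "P1": ["A61B 5/00", "G01N 33/50"],
    "P2": ["A61B 5/00", "G01N 33/50"],
    "P3": ["A61B 5/00"],
    "P4": ["G01N 33/50", "A61M 25/00"],
    "P5": ["G01N 33/50"],
}
by_count, by_jaccard, class_freq = rank_class_pairs(toy_corpus)
top_a, top_b, top_both, top_jaccard = by_jaccard[0]
missed_first = missed_share(top_a, top_b, top_both, class_freq)
missed_second = missed_share(top_b, top_a, top_both, class_freq)
print(by_jaccard[0], missed_first, missed_second)
```

Keeping both rankings is deliberate: the absolute count favors large classes, while the Jaccard index normalizes by class size, so a pair that ranks highly in either list is a candidate suggestion.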