Applications and Challenges of Text Mining with Patents

Hidir Aras, René Hackl-Sommer, Michael Schwantner and Mustafa Sofean
FIZ Karlsruhe
Hermann-von-Helmholtz-Platz 1, D-76344 Eggenstein-Leopoldshafen
firstname.lastname@fiz-karlsruhe.de

ABSTRACT
This paper gives insight into our current research on three text mining tools for patents designed for information professionals. The first tool identifies numeric properties in the patent text and normalises them, the second extracts a list of keywords that are relevant and reveal the invention in the patent text, and the third tool attempts to segment the patent's description into its sections. Our tools are used in industry and could be applied in research as well.

1. INTRODUCTION
Patents are a very complex type of text that is difficult to analyse. As described in [10], their linguistic structure differs greatly from common language. Patents are very heterogeneous, both as a corpus and as single documents. They belong to subject areas as diverse as chemistry, pharmacology, mining, and all areas of engineering, with the consequence that all kinds of terminology can be found in a patent corpus. A patent corpus usually covers a long time span, often from the 1950s to the present. Patents from the principal patent authorities amount to more than 70 million publications. Typographical errors are not uncommon, since many patents in their machine-readable form are derived from OCR processing and machine translation. Patents are on average two to five times longer than scientific articles. Their textual part is composed mainly of the detailed description of the invention and the claims. The former is often similar to scientific articles, whereas the latter is characterised by legal language.

Users of patent information are usually information professionals, who cooperate with the research departments or the legal department of their companies. They have very high requirements on the correctness and completeness of the data, on the efficiency of the search interface, and on the trustworthiness of the provider. The cause of their search is normally business critical; the endeavour compares to a search for a needle in a haystack. Their search strategy differs greatly from a typical Google search: it uses complex Boolean queries, the diligent usage of proximity operators, and vast lists of synonyms. New functionality that helps them in searching and analysing the result set is therefore greatly appreciated. Tools and methods for ordinary documents are manifold; the challenge is to adapt or re-design them in such a manner that they work with patents.

In this paper, we introduce three text mining tools specifically designed for patent texts which we have implemented or are currently investigating. Section 2 describes the numeric property extraction, which allows for recognising numbers, measurements, and intervals. This feature enables users to integrate a search for numeric properties, e.g. for temperature measurements ranging from 150 K to 200 K, into their queries to enhance precision. Section 3 shows the challenges of automatic keyword extraction with a focus on the invention, giving users the opportunity to get a quicker overview of the content of a single document or an answer set. Section 4 outlines the patent description segmentation, a tool for identifying the several parts which constitute a patent description. With that, users can limit their search to specific parts of the description, again for higher precision. Finally, we conclude this work with our main findings and future work.

2. NUMERIC PROPERTY EXTRACTION
In many technical fields, key information is provided in the form of figures and units of measurement. However, when these data appear in full text, they are almost certainly lost for search and retrieval purposes. The reason for this is that the full text is indexed in a way that makes it searchable with strings. In that manner, only the string representation of a numeric property would be searchable, which is, of course, wholly unsatisfactory.

2.1 Related Work
To date, some attempts have been made to extract such data automatically from text. A tentative approach in GATE, where the identification of numeric properties from patents was addressed as a sub-task, is described in [1]. [4] examine the detection of units of measurement in English and Croatian newspaper articles over a small sample of 1,745 articles per language using NooJ. [9] investigate the issue from a Belarussian/Russian perspective with many unique language-related challenges, relying on NooJ, too. These approaches either lack the generalisability to an extensive corpus or deal mainly with the Russian language. There is also a commercial tool available from quantalyze¹; however, this tool appears to identify a much more limited variety of units than ours, and it also lacks the identification of enumerations, which are abundant in patents and therefore indispensable.

¹ https://www.quantalyze.com/

Copyright © 2014 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. Published at Ceur-ws.org. Proceedings of the First International Workshop on Patent Mining and Its Applications (IPAMIN) 2014, Hildesheim, Oct. 7th, 2014. At KONVENS'14, October 8-10, 2014, Hildesheim, Germany.
2.2 Requirements and Tasks
The following sections describe the requirements and relevant tasks in numeric property extraction.

Identification of numbers
Clearly, a number consisting of digits only can be easily identified. For numbers with decimal points, we have observed that our data contains numbers following both the English and the German convention. Numbers also appear in scientific notation, and there is a range of characters used to denote multiplication or exponentiation. We also note the use of the HTML sup-tag indicating superscript. Examples of valid expressions therefore include:

1,300.5 (English convention); 1.300,5 (German);
3.6 x 10-4; 10^5; 4.5x10sup"5; 8.44 x 10 sup* 10

Frequently, numbers in patents are spelled out, as in ten mg instead of 10 mg. These instances are recognised and converted into their respective numerical values.

Identification of units of measurement
This task, though looking simple at first sight, requires some attention with respect to spelling (in particular uppercase vs. lowercase), spacing, and disambiguation.

• Upper/lower case: There are some instances in which capital letters and small letters refer to different entities, e.g. S stands for Siemens, the unit of electric conductance, whereas s stands for second.

• Spacing: There is some diversity regarding blank characters in spellings of units of measurement consisting of more than one word, e.g. J per mol-K. Therefore, the longest possible sequence in a series of tokens has to be matched.

• Ambiguity: For a few units, their abbreviated spellings might refer to different entities, e.g. C might stand for degrees Celsius or Coulomb; A might mean Ampere or Ångström (cf. Noise Reduction).

The vast majority of units appear after numbers; however, there are some units that only appear before numbers, like the pH value or the refractive index.

Identification of intervals
There are two main ways in which intervals can be construed. One relies on context words, in which the words surrounding the numeric entities indicate an interval, e.g. between 12 and 100 Watts. The other comprises the use of symbols, e.g. 5-6 mg or >12 hours. While only a few frequently encountered phrases indicate intervals with bounds on both sides, there are many more when it comes to intervals unbounded on one side. The latter can appear before or after the numeric entities to which they refer, e.g. more than 200 ml or 200 ml or more. Negated formulations like not more than have to be taken into account as well. Frequently, there are also adverbs present which add no specific information to the context but just need to be filtered out, e.g. about, around, roughly.

Enumerations and Ratios
Enumerations of numbers or even intervals are very common in patents. They usually follow a comma-separated pattern: a thickness of 1, 2, 3, 4, or 5 mm. The identification of enumerations is rather straightforward, as there is only a small number of variations that together cover >90% of occurrences.

Ratios are used to describe the proportionate relationship between two or more entities from a common physical dimension. A sample expression from an everyday background might be make sure the ratio between sugar and flour is 1:3. This being a simple example, the recognition of ratios is actually a difficult endeavour. The reason is the immense heterogeneity in which ratios can be expressed. Simple ratio formulations are typically separated by colons or slashes. They take general forms like "Number:Number" or "Number-to-Number". An approach relying solely on these patterns will invariably locate many false positives.
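A minimal sketch of the number identification in Python follows. The authors' system expresses these patterns in configurable FSA grammars inside a UIMA pipeline; the regular expressions below are illustrative assumptions covering only the conventions named above (English/German decimal conventions and simple scientific notation):

```python
import re

def parse_number(token: str) -> float:
    """Parse a numeric token following English (1,300.5) or German
    (1.300,5) convention, or simple scientific notation (3.6 x 10-4).
    Illustrative sketch only, not the authors' FSA grammar."""
    token = token.strip()
    # Scientific notation: optional mantissa, then 10 and an exponent,
    # possibly marked by '^' or a mangled HTML sup-tag.
    m = re.match(r'^(?:([\d.,]+)\s*[x*]\s*)?10\s*(?:\^|sup)?\s*([+-]?\d+)$',
                 token, re.IGNORECASE)
    if m:
        mantissa = parse_number(m.group(1)) if m.group(1) else 1.0
        return mantissa * 10.0 ** int(m.group(2))
    # German convention: dot as thousands separator, comma as decimal point.
    if ',' in token and re.match(r'^\d{1,3}(\.\d{3})*(,\d+)?$', token):
        return float(token.replace('.', '').replace(',', '.'))
    # English convention: comma as thousands separator, dot as decimal point.
    return float(token.replace(',', ''))
```

A tokenizer upstream would have to isolate such expressions first; the mangled sup-tag forms shown above (e.g. 4.5x10sup"5) would need additional cleanup rules.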
Noise Reduction
The aim of noise reduction is to eliminate false positives. This is a critical task especially for units of measurement consisting of only one letter, the most frequent being the aforementioned A and C.

Unit normalisation
Many measurements of physical properties can be expressed with various units. For example, 800 W is equivalent to 800 Joules/second, and 180 °C to approximately 453 K. For the measurement of pressure, the following non-exhaustive list of units can be used: kg/m2; N/m2; Pa; Torr; atm; cm Hg; ounces per square yard. Additionally, a great number of prefixes like nano, µ, kilo, tera and their abbreviations have to be considered. Hence, to get a hit with standard indexing, a user would need to include all sorts of variations in order to achieve even a modicum of accuracy and recall. Clearly, a superior way to address these issues is to define a common base unit for all units which describe the same physical property and to convert all instances from the full text to that base unit for indexing and searching. Therefore, all instances of units from the full text are converted into their corresponding base units as they are defined in the International System of Units (SI).

2.3 Implementation
We are using the Apache UIMA framework for the presented analysis of data. It provides a robust infrastructure for developing modular components and deploying them in a production environment. Finite State Automata (FSA) are used throughout for pattern matching. They perform much better than Java patterns and regular expressions, and even small improvements add up quickly when it comes to processing data in the terabyte range. For the identification of numbers, intervals, and enumerations, valid sequences of phrase parts and type-related placeholders (both configurable) are expressed in an FSA-based grammar.
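The base-unit conversion described above can be pictured as a lookup of a conversion factor (and, for temperatures, an offset) per unit variant. The table entries below are a tiny illustrative subset under assumed spellings, not the authors' configuration file:

```python
# Hypothetical excerpt of a unit-conversion table mapping a unit variant
# to (SI base unit, factor, offset). The real system keeps ~15,000
# variants for 80 base units in configuration files.
UNIT_TABLE = {
    "W":    ("W",  1.0,      0.0),   # Watt, i.e. Joule/second
    "J/s":  ("W",  1.0,      0.0),
    "kW":   ("W",  1000.0,   0.0),   # prefix folded into the factor
    "degC": ("K",  1.0,      273.15),
    "K":    ("K",  1.0,      0.0),
    "Torr": ("Pa", 133.322,  0.0),
    "atm":  ("Pa", 101325.0, 0.0),
}

def normalise(value: float, unit: str):
    """Convert a measurement to its SI base unit for indexing."""
    base, factor, offset = UNIT_TABLE[unit]
    return value * factor + offset, base
```

Keeping such tables in configuration rather than code matches the paper's design: new variants can be added without redeploying the software.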
Adapted to the English language, our system currently recognises more than 15,000 unit variants belonging to 80 base units. Included are all commonly used dimensions like time, temperature, or weight, but also many dimensions that are more relevant in professional use, e.g. dynamic viscosity, solubility, or thermal conductivity.

We are using a windowing technique for ratio recognition. From any occurrence of the word ratio in the text, up to five words to the left and 15 words to the right are evaluated. While this approach manages to identify many valid ratios, many cases still remain in which ratios are not recognised, like ratios of more than two entities or ratios in alternative formulations (e.g. 10 parts carbon black and 4 to 6 parts oil extender). These will be dealt with in future versions.

Conversion between units is a straightforward task. The units, their variants, and the conversion rules are kept in a configuration file. Three more configuration files are provided for the rules to recognise intervals and for the noise reduction, respectively. By this means, changes or extensions can be effected without the need to change the source code and redeploy the software.
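The windowing technique for ratio recognition described above can be sketched as follows. The window sizes follow the paper; the single colon/slash pattern is a simplified assumption of the kind of formulation the real system matches:

```python
import re

def find_ratios(text: str, left: int = 5, right: int = 15):
    """Around each occurrence of 'ratio', scan up to `left` words before
    and `right` words after for simple Number:Number formulations.
    Sketch of the windowing idea only."""
    ratio_pat = re.compile(r'\b(\d+(?:\.\d+)?)\s*[:/]\s*(\d+(?:\.\d+)?)\b')
    words = text.split()
    hits = []
    for i, w in enumerate(words):
        if w.lower().startswith('ratio'):
            window = ' '.join(words[max(0, i - left): i + right + 1])
            hits.extend(ratio_pat.findall(window))
    return hits
```

As the paper notes, such a pattern misses ratios of more than two entities and alternative wordings, which is exactly why the trigger-word window is only a first step.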
For the noise reduction task, two lists have been defined. The first list applies to all units. It contains terms like figure or example. If one of those global terms precedes a numeric entity, that entity is judged as noise and removed (examples: figure 1A or drawing 2C). The second list is specific to certain units only. If a term contained therein follows a numeric entity, this text passage is ignored as well (e.g. 13C NMR). Extracted and converted entities are added to our search engine indexes.

Regarding evaluation, we followed an iterative development cycle with many intellectual assessments. In the process, we have set up extensive JUnit tests for software development and continuous integration. When a test person or, later, one of our customers found a specific piece of text that required improvement, we included it. As a result, given the size of our data, it has over time become increasingly difficult to find text snippets that are not recognised or recognised faultily. We have not carried out extensive formal recall/precision evaluations, because the effort required to build a gold standard with a significant sample size and real-world data (as opposed to manually construed "difficult" data) is not offset by the projected gains. All our customer feedback indicates that our results are very good.

3. KEYWORD EXTRACTION
Keywords extracted from a document are of great benefit for search and content analysis. In the patent domain, important keywords can be utilised for searching as well as for getting an overview of the topics and the focus of a single patent document or an answer set. In both cases they can avoid unnecessary time-consuming and costly analysis, e.g. in prior art or freedom-to-operate scenarios. Existing methods for keyword extraction, be they automatic or supervised, either use statistical features for detecting keywords based on the distribution of words, sub-words and multi-words, or exploit linguistic information (e.g. part-of-speech) via lexical, syntactic or discourse analysis. Furthermore, hybrid approaches exist, which try to combine the various types of algorithms and apply additional heuristic rules, e.g. based on position, length or layout.

3.1 Related Work
[2] used term frequency, phrase frequency and the frequency of the head noun for identifying the relevant keywords from a candidate set. The phrase candidates are sorted according to the head noun frequency; afterwards, additional statistical filters are applied. [7] reported that technical terms mainly consist of multi-words, e.g. noun phrases with a noun, adjective and the preposition "of" in English texts. Single words in general are less appropriate for representing terminology. Most word combinations describing terminology are noun phrases with adjective-noun combinations. Experiments also indicate the impact of the term position, e.g. in the title or a special section. It was also shown that proper nouns rarely represent good keywords for representing terminology.

3.2 Challenges and Tasks
One main challenge in keyword extraction is related to the subjectivity of keywords for a particular user, whose expertise, common knowledge about the regarded technical domains, and focus of interest can vary in manifold respects. Besides that, patent full texts describe general aspects and state of the art that experts are familiar with, and they make use of expressions and terms that are rarely used in classic texts (neologisms). Hence, separating the wheat from the chaff can be difficult. Moreover, as the description part of a patent can be very heterogeneous (mixed with tables, figures, examples, mathematical or chemical formulas, etc.), identifying relevant sections that contain keywords directly related to the invention can be a tricky task as well. All these challenges call for a deeper analysis of the content, in order to better understand patent texts and improve searching for specific aspects or entities in the patent texts.
Figure 1: Phrase pattern distribution of top keywords from three experts (analysis of EPO patents).

Analyses show that most of the relevant linguistic phrases in patent texts are noun sequences and noun-adjective combinations (Figure 1). Besides these, depending on the domain of interest, complex noun phrases that are used to describe, e.g., a process, chemical entity or formula, as well as verbal phrases can be observed. The role of the verbal phrases seems to be debatable, as recent results [8] show.

Investigation of evaluation data from experts indicates that extracting phrases of length ≤ 5 is reasonable in the case of linguistic technical terms, which might be different when also considering domain-specific entities from the chemical, bio-pharma, or other domains. Figure 2 shows the frequency distribution of the phrase lengths up to 9 words in the annotated corpus. For example, in the descriptions part, the experts annotated phrases consisting of only two words more than 350 times.
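The observation behind Figure 1, that noun sequences and adjective-noun combinations dominate, suggests a simple POS-pattern filter for candidate phrases. A minimal sketch, assuming Penn-Treebank-style tags from an upstream tagger (the authors' pipeline is UIMA-based; this is illustrative only):

```python
# Select keyword candidates from POS-tagged tokens: maximal runs of
# nouns and adjectives, where a valid candidate must end in a noun.
def candidate_phrases(tagged):
    """tagged: list of (token, tag) pairs; returns candidate phrases."""
    keep = {'NN', 'NNS', 'NNP', 'JJ'}
    chunks, current = [], []
    for tok, tag in tagged:
        if tag in keep:
            current.append((tok, tag))
        else:
            if current:
                chunks.append(current)
                current = []
    if current:
        chunks.append(current)
    out = []
    for chunk in chunks:
        # Drop trailing adjectives so every phrase ends in a noun.
        while chunk and chunk[-1][1] == 'JJ':
            chunk = chunk[:-1]
        if chunk:
            out.append(' '.join(tok for tok, _ in chunk))
    return out
```

Stop-word filtering at phrase boundaries, as described in Section 3.3, would be layered on top of such a chunker.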
Focusing on automatic keyword extraction, a further prerequisite is to deal with similar phrases that differ in morphological and syntactic structure. For keyword search or for generating content overviews, these syntactic variations [5] must be normalised and mapped to one canonical form, for example: circular or rectangular patterns → circular pattern, rectangular pattern; method for combating spam → spam combating method; etc.

Another important task, which also concerns patent search in general, is semantic normalisation to aggregate semantically equivalent or similar phrases, which can vary considerably in wording. The recognition of specific entities, be they simple or complex forms, as well as the identification of taxonomic relations, synonyms, chemical entities, enumerations, etc., represent further challenges in the course of understanding a given patent text beyond general linguistic phrases or terms. In classic keyword extraction, keywords in the title or abstract are automatically regarded as important, while for patents a sophisticated weighting scheme based on analysing keyword occurrence and co-occurrence with respect to different sections is required. A further task is to decide how the final keyword set is presented to the user. While in classic keyword extraction rarely more than 10 keywords are returned to the user, in the patent domain information professionals indicate that displaying 50 or even 100 keywords would be desirable.

3.3 Implementation and Evaluation
A proof-of-concept prototype based on linguistic and statistical analysis was implemented in order to evaluate some of the described tasks. The general procedure comprised the steps of linguistic and statistical pre-processing, noun phrase extraction and analysis, and phrase weighting based on features such as length, position, TF-iDF weight or section. A typical linguistic pre-processing includes sentence detection, tokenisation, POS tagging and noun phrase chunking. The noun phrase extraction allows identifying basic patterns of important noun phrase chunks, while applying a filtering method for removing irrelevant (stop) words at the start and end. As many syntactic variations of the extracted keywords may occur, besides a syntactic normalisation method, linguistic and statistical analysis must be applied in order to reduce the candidate set for ranking. A candidate phrase is evaluated by means of a scoring formula that takes the respective parameters into account. In order to avoid loss of information, a conservative method is preferred over utilising harsh frequency thresholds. Rather, the overall ranking is affected by an elaborated weighting scheme considering, besides intra-section features, also field-based analysis for the sections title, abstract, claims and the description text.

3.3.1 Dataset and Evaluation
The implemented approach was evaluated based on a corpus of 20,000 documents from several domains, e.g. chemical, bio-pharma as well as engineering, from the European patent database, comprising granted patent documents having title, abstract, claims and description text. An expert-based study served to create a test corpus of 70 patent documents annotated with keywords in the aforementioned main sections of the patent text. For this, the two participating experts marked up to 20 most relevant keywords in a patent document that characterise the topic and the focus of the described invention. The main textual sections, comprising the combined title-abstract, the claims and the descriptions, were evaluated separately, i.e. keyword sets were not mixed. The created (annotated) datasets were used for evaluating the keyword extraction. For evaluating the implemented baselines based on the TF-iDF weighting scheme, the rank-based evaluation metrics precision@k, recall@k and F-score have been used.
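As an illustration, the rank-based metrics named above can be written as follows (exact-match variant; the gold set is the expert annotation for one document):

```python
# precision@k: fraction of the top-k extracted keywords that are gold
# keywords; recall@k: fraction of gold keywords found in the top k.
def precision_at_k(ranked, gold, k):
    return sum(1 for p in ranked[:k] if p in gold) / k

def recall_at_k(ranked, gold, k):
    return sum(1 for g in gold if g in ranked[:k]) / len(gold)
```

Averaging these per-document scores over the 70 annotated patents yields the corpus-level figures reported below.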
For the field combination title-abstract, the exact-match results for precision varied between 34% for the top 10 keywords and 20% up to 30% for the top 20. Looking at recall over a wider range of up to 50 keywords, a score of around 40% was calculated. As exact matching does not consider syntactic variations of the extracted key phrases, a fuzzy matching method was applied as well. Depending on the fuzziness parameter, false positives may also be returned, which can only be detected by manual expert-based inspection. The results after applying the fuzzy matching method were much better for precision (~75% for the top 10 keywords and 46% for the top 20 keywords) and recall (~87%). For the claims, the precision varied between 27% and 30% for the top 20 keywords in the case of exact matching, while again the recall for the extracted keywords increased from 27% to approx. 46% when taking a wider range of up to 50 keywords. With fuzzy matching, a precision score above 75% for the top 10 keywords and 70% for the top 20 was achieved; the recall for the top 50 keywords in the claims was about 92%. Due to the heterogeneity and the amount of text present in the descriptions part, the challenges there are much higher. For the TF-iDF baseline, the exact-match results for precision varied between 14% and 15%, while the recall for the top 10-50 keywords increased from 8% to 25%. Applying fuzzy matching, the precision scores were again much better: depending on the fuzziness parameter for the matching similarity, which varied between 0.5 (50% match) and 0.9 (90% match), the precision score was between 80% and 50% for the top 50 keywords on the regarded dataset.

Figure 2: Phrase length distribution of top keywords for abstract, claims and descriptions.
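The fuzzy matching used in the evaluation can be sketched as a token-overlap test against a threshold. The paper does not specify the similarity function, so Jaccard similarity over word sets is an assumption here, with the threshold playing the role of the fuzziness parameter:

```python
# A predicted phrase counts as a hit if its word-set similarity to some
# gold phrase reaches the fuzziness threshold (0.5-0.9 in the paper).
def similarity(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def fuzzy_match(predicted: str, gold_set, fuzziness: float = 0.5) -> bool:
    return any(similarity(predicted, g) >= fuzziness for g in gold_set)
```

Lower thresholds credit syntactic variants (e.g. "spam combating method" vs. "method for combating spam") but admit the false positives mentioned above.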
4. TEXT SEGMENTATION
Patent documents are lengthy, abundant, and full of details, such that topic analysis is hindered for humans and machines alike. One of the text mining techniques which can ease these intricacies is text segmentation [3]. The automatic structuring of patent texts into pre-defined sections will serve as a pre-processing step for patent retrieval and information extraction, as well as enable interested readers to easily understand the structure of a patent, leading to fast, efficient, and easy access to the specific information they are looking for. Furthermore, noun phrases of important sections in the patent texts could be used as main features for patent classification and clustering to achieve a good performance.

The textual part of a patent contains the title, abstract, claims, and the detailed description (DetD) of the invention. The latter includes the summary, embodiment, and the description of figures and drawings of the invention. Because of the amount of information in the DetD, there is a need for automated tools which can determine the document-level structure of the DetD, identify the different sections, and map them automatically to known section types. Previous work has shown that the semantics of the patent document structure is valuable in patent retrieval [6], but it only focused on structured patent text which is labelled by specific tags in the original text. The work in [1] presented a rule-based information extraction system to automatically annotate patents with relevant metadata including section titles. In this section, we describe our text segmentation method, which is used to recognise the structure of the DetD.

There are many challenges that arise in patent text segmentation. For example, measuring the similarity between sentences is difficult to exploit because the sentences share a lot of identical terms. Another challenge is that patents contain a lot of new technical terminology which is hard to collect when using a term matching technique. To meet these challenges, we are currently developing a patent text segmentation tool which automatically segments the patent text into semantic sections by discovering the headers inside the texts, identifying the text content which is related to each header, and determining the meaning of the header.

In cooperation with a patent expert, we identified segmentation guidelines. These guidelines help us to understand the section types (Table 1) in the DetD.

Table 1: A list of sections in the description text of a patent.

Section Type          | Examples
Detailed Description  | Best Mode of the Invention, Embodiments of the Invention
Background            | Background of Invention, Prior Art
Summary               | Summary of the Invention, Objectives of the Invention, Disclosures
Methods               | Procedures, Operations, Experiments
Drawing and Figures   | Detailed Description of the Drawing
Applicability         | Industrial Applications, Applications of the Invention
Technical Field       | Technical Field of the Invention, Field of Technology
Examples              | Embodiment Example, Experimental Example
Sequences             | List of Sequences, Numerical Sequence
References            | List of References, Literatures
Statements            | Statement of Government Rights, Acknowledgement

4.1 Dataset and Preprocessing
Our dataset consists of a random sample of 139,233 patents from the European Patent Office (EPO), converted by FIZ Karlsruhe² into a proprietary XML format with tagged paragraphs. Processing techniques have been applied to understand the type, style, and format of headings inside patent texts. We started by parsing the XML files to get a list of headings in the description part. The headers pass through a cleansing process, which removes undesired tokens from each header (e.g. numbers, special characters, words containing special symbols, words starting with numbers, math equations, and formulas) via a tokenisation process. Then, we created the positive-list, which contains terms that appear more than five times in all headings of the dataset, and the first-token list, which includes terms from the headers that appear more than five times as the first word of a header.

² http://www.fiz-karlsruhe.de

4.2 Header Detection and Meaning
In order to discover the headers inside the DetD, we need to determine the boundary of each header, i.e. the header's start and end. We call this operation Header Detection. Then, we identify the text content which is related to each header. The header meaning, on the other hand, is represented by assigning the header to an appropriate section type (e.g. summary, example, background, method, etc.). Here, a rule-based approach is more suitable, because in the patent domain there is not sufficient training data for a machine learning algorithm to be successful. To this end, we developed a rule-based algorithm to identify headers and their boundaries. The output consists of all headers and their positions inside the DetD. Our algorithm works as follows: as input we take the DetD as a sequence of paragraphs. Then, we test the following features to decide whether a paragraph is a header or not:

A. The number of words in the paragraph.
B. The number of characters in the paragraph.
C. True, if all letters in the current paragraph are in upper case; false otherwise.
D. True, if all words in the paragraph start with an upper case letter; false otherwise.
E. True, if the current paragraph contains words from the positive-list; false otherwise.
F. True, if in the current paragraph more words start with a capital letter than with a lowercase one; false otherwise.
G. True, if the current paragraph starts with a bullet; false otherwise.
H. True, if the previous or the next paragraph starts with a bullet; false otherwise.
I. True, if the first token in the paragraph appears in the first-token list; false otherwise.
J. True, if the current text paragraph contains a simple chemical text; false otherwise.
K. The average header length in the dataset's headers.
L. The average number of words in the dataset's headers.

We use these features on each input paragraph of the DetD to build decision rules for the header detection. Some of the decision rules are listed below:

i. C is true and G is false and A ≥ 1 and J is false
ii. D is true, E is true, A ≥ 1, G is false, and J is false
iii. G is true, H is false, A

… specific domain and use, e.g. treatment of diseases, medical substances, etc., than in an isolated manner. Possible enhanced methods for keyword context analysis could rely on semantic analysis based on the co-occurrence method, (latent) semantic analysis, or other dedicated semi-supervised and unsupervised machine learning techniques. Furthermore, a more enhanced method for semantic segmentation of patent text needs to deal with patents that do not have any heading inside their texts and to address the overlap problem between section types. Our final goal is to develop a flexible,
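As an illustration of the feature-based rules in Section 4.2, the sketch below implements a small subset (features A, C, D, E, G, J and rules i and ii). The positive-list contents and the omitted chemical-text test are placeholders, not the authors' actual configuration:

```python
# Hypothetical positive-list excerpt; the real list holds all terms
# appearing more than five times in the dataset's headings.
POSITIVE_LIST = {'background', 'summary', 'description', 'example', 'field'}

def features(paragraph: str) -> dict:
    """Compute a subset of the header-detection features for a paragraph."""
    words = paragraph.split()
    return {
        'A': len(words),                                   # word count
        'C': paragraph.isupper(),                          # all caps
        'D': bool(words) and all(w[0].isupper() for w in words),
        'E': any(w.lower().strip(':') in POSITIVE_LIST for w in words),
        'G': paragraph.lstrip().startswith(('-', '*', '•')),  # bullet
        'J': False,  # chemical-text detection omitted in this sketch
    }

def is_header(paragraph: str) -> bool:
    f = features(paragraph)
    # Rule i: all caps, no bullet, at least one word, no chemical text.
    if f['C'] and not f['G'] and f['A'] >= 1 and not f['J']:
        return True
    # Rule ii: capitalised words including a positive-list term, no bullet.
    if f['D'] and f['E'] and f['A'] >= 1 and not f['G'] and not f['J']:
        return True
    return False
```

Features H, I, K and L, which need the surrounding paragraphs and corpus statistics, would be supplied by the enclosing segmentation loop.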